Hierarchical faceting – Pivot facets in trunk

In many of the implementations I took part in, sooner or later the question arose: what can we do to get faceting results as a tree structure? Of course there were some tricks for that, but they required modifying the data and processing the results appropriately on the application side. That was neither particularly functional nor especially comfortable. However, a few days ago Solr 4.0 was enhanced with the code tracked as SOLR-792 in JIRA. Let’s see how to get faceting results as a tree.

Important note – at this point this functionality is only available in Solr 4.0, which is the development version. To use it you need to download the code from the trunk of the Lucene/Solr SVN repository.

A few words at the beginning

In many projects I had the opportunity to work on, there was a need for hierarchical faceting. One of the simplest examples is the requirement to show cities within provinces, together with the number of documents in each province and in each city. Until recently it was impossible to achieve this without changing the structure of the data. Now it is possible 😉
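
To make the requirement concrete, here is a minimal Python sketch of the counts that hierarchical faceting is expected to produce. It does not use Solr at all; the field names province and city, the sample documents and the pivot_counts helper are made up for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical documents; "province" and "city" are illustrative field names.
docs = [
    {"province": "Mazowieckie", "city": "Warszawa"},
    {"province": "Mazowieckie", "city": "Warszawa"},
    {"province": "Mazowieckie", "city": "Radom"},
    {"province": "Lubelskie", "city": "Lublin"},
]


def pivot_counts(documents, parent_field, child_field):
    """Count documents per parent value and per child value within each parent."""
    parents = Counter()
    children = defaultdict(Counter)
    for doc in documents:
        parents[doc[parent_field]] += 1
        children[doc[parent_field]][doc[child_field]] += 1
    return {
        parent: {"count": count, "pivot": dict(children[parent])}
        for parent, count in parents.items()
    }


tree = pivot_counts(docs, "province", "city")
for province, node in tree.items():
    print(province, node["count"], node["pivot"])
```

Each province carries its own document count plus a nested count per city, which is the shape we would like Solr to return for us.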

Indexing

To avoid unnecessarily complicating the description, I decided to use the sample XML documents available in the /exampledocs directory of the example deployment. I also didn’t modify schema.xml or solrconfig.xml, so the configuration is the standard one. That’s all when it comes to configuration, so we can start the indexing process (I ran the command from the $SOLR_HOME/exampledocs/ directory):

./post.sh *.xml

After several screens of information scroll by, our data is indexed.

The mechanism

Using hierarchical faceting is not difficult. The Solr developers gave us two parameters in addition to the ones we already know:

facet.pivot – a comma-separated list of fields that defines which fields are used to build the structure and in what order,
facet.pivot.mincount – the minimum number of documents a value must have to be included in the faceting results. The default value is 1.
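
As a rough illustration of how these two parameters fit into a request, the following Python sketch assembles a query URL with the standard library. The base URL assumes a local example Solr instance, and build_pivot_url is a hypothetical helper written for this post, not part of any Solr client.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL of a local example Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select/"


def build_pivot_url(fields, mincount=None, query="*:*"):
    """Build a select URL that asks for pivot faceting on the given fields."""
    params = [
        ("q", query),
        ("facet", "true"),
        # The order of the fields defines the order of the hierarchy levels.
        ("facet.pivot", ",".join(fields)),
    ]
    if mincount is not None:
        params.append(("facet.pivot.mincount", str(mincount)))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_pivot_url(["cat", "inStock"], mincount=2)
# Decode the query string again to show which parameters were sent.
params = parse_qs(urlsplit(url).query)
print(url)
print(params)
```

Note that urlencode percent-encodes characters such as the comma and the colon; Solr should decode them back, so the request is equivalent to one typed by hand.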

So let’s try it.

Queries

Let’s start with two fields. I query for all documents in the index and add the parameter facet.pivot=cat,inStock to tell Solr that I want hierarchical faceting results in which the first level of the hierarchy is the cat field and the second level is the inStock field. The query looks as follows:

http://localhost:8983/solr/select/?q=*:*&facet=true&facet.pivot=cat,inStock

To shorten the listing I omitted the response header and the search results.

<?xml version="1.0" encoding="UTF-8"?>
<response>
.
.
.
<result name="response" numFound="19" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
  <lst name="facet_pivot">
    <arr name="cat,inStock">
      <lst>
        <str name="field">cat</str>
        <str name="value">electronics</str>
        <int name="count">17</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">13</int>
          </lst>
          <lst>
            <str name="field">inStock</str>
            <bool name="value">false</bool>
            <int name="count">4</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">memory</str>
        <int name="count">6</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">6</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">connector</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">false</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">graphics card</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">false</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">hard drive</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">monitor</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">search</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">cat</str>
        <str name="value">software</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
    </arr>
  </lst>
</lst>
</response>

The presentation of faceting results is different in this case. For each entry of the main level we have elements defining the field (the tag with the attribute name="field"), the value (the tag with the attribute name="value") and the number of documents (the tag with the attribute name="count"). Next comes the second level of the hierarchy (the tag with the attribute name="pivot"). The second level contains the same elements as the first level: field name, value and the number of documents with the given value.
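
Because every level repeats the same field, value, count and pivot elements, the response can be walked recursively. Below is a minimal Python sketch that does so with the standard xml.etree.ElementTree module. The SAMPLE string is a shortened copy of the response shown above, and parse_pivot_entry and parse_pivots are hypothetical helper names; all values are kept as strings for simplicity.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the response above.
SAMPLE = """<response>
<lst name="facet_counts">
  <lst name="facet_pivot">
    <arr name="cat,inStock">
      <lst>
        <str name="field">cat</str>
        <str name="value">electronics</str>
        <int name="count">17</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">13</int>
          </lst>
          <lst>
            <str name="field">inStock</str>
            <bool name="value">false</bool>
            <int name="count">4</int>
          </lst>
        </arr>
      </lst>
    </arr>
  </lst>
</lst>
</response>"""


def parse_pivot_entry(lst):
    """Turn one <lst> pivot entry into a dict, recursing into nested pivots."""
    entry = {"pivot": []}
    for child in lst:
        name = child.get("name")
        if name == "pivot":
            entry["pivot"] = [parse_pivot_entry(sub) for sub in child]
        elif name == "count":
            entry["count"] = int(child.text)
        else:  # "field" or "value", kept as plain strings
            entry[name] = child.text
    return entry


def parse_pivots(xml_text, pivot_name):
    """Return the list of top-level entries for one facet.pivot definition."""
    root = ET.fromstring(xml_text)
    arr = root.find(".//lst[@name='facet_pivot']/arr[@name='%s']" % pivot_name)
    if arr is None:
        return []
    return [parse_pivot_entry(lst) for lst in arr]


pivots = parse_pivots(SAMPLE, "cat,inStock")
for top in pivots:
    print(top["field"], top["value"], top["count"])
    for sub in top["pivot"]:
        print("  ", sub["field"], sub["value"], sub["count"])
```

The same function handles deeper hierarchies without changes, because a nested pivot array is parsed with the same routine as the top level.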

Let’s see how this mechanism deals with more levels of depth. To check that, I ran the following query:

http://localhost:8983/solr/select/?q=*:*&facet=true&facet.pivot=cat,inStock,features

Again I omitted the response header and the search results, leaving only the faceting results. In addition, because the faceting results are long, I show only one first-level entry (the electronics category):

<?xml version="1.0" encoding="UTF-8"?>
<response>
.
.
.
<result name="response" numFound="19" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
  <lst name="facet_pivot">
    <arr name="cat,inStock,features">
      <lst>
        <str name="field">cat</str>
        <str name="value">electronics</str>
        <int name="count">17</int>
        <arr name="pivot">
          <lst>
            <str name="field">inStock</str>
            <bool name="value">true</bool>
            <int name="count">13</int>
            <arr name="pivot">
              <lst>
                <str name="field">features</str>
                <str name="value">2</str>
                <int name="count">7</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">3</str>
                <int name="count">7</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">lcd</str>
                <int name="count">5</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">x</str>
                <int name="count">5</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">ca</str>
                <int name="count">4</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">latenc</str>
                <int name="count">4</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">tft</str>
                <int name="count">4</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">v</str>
                <int name="count">4</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">0</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">1</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">25</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">30</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">5</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">7</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">8</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">time</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">up</str>
                <int name="count">3</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">000</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">19</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">20</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">2336</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">27</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">275</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">6</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">75</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">activ</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">built</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">cach</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">color</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">flash</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">heat</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">heatspread</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">matrix</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">mb</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">ms</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">photo</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">resolut</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">seek</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">speed</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">spreader</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">unbuff</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">usb</str>
                <int name="count">2</int>
              </lst>
            </arr>
          </lst>
          <lst>
            <str name="field">inStock</str>
            <bool name="value">false</bool>
            <int name="count">4</int>
            <arr name="pivot">
              <lst>
                <str name="field">features</str>
                <str name="value">0</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">1</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">16</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">2</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">20</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">3</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">9</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">90</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">adapt</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">car</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">clock</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">direct</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">directx</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">dual</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">dvi</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">express</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">gddr</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">ghz</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">gl</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">gpu</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">gpuvpu</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">hdtv</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">mb</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">mhz</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">open</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">opengl</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">out</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">pci</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">power</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">vpu</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">white</str>
                <int name="count">2</int>
              </lst>
              <lst>
                <str name="field">features</str>
                <str name="value">x</str>
                <int name="count">2</int>
              </lst>
            </arr>
          </lst>
        </arr>
      </lst>
    </arr>
  </lst>
</lst>
</response>

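As the example shows, Solr had no problem calculating the hierarchy correctly in this case either. In terms of the data it covers, the response is almost the same as in the previous example; it only contains one more level of depth. Note that the features values look like stemmed tokens (for example latenc or resolut), which suggests that this field is analyzed in the example schema, so the third level lists indexed terms rather than the original text.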
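
On the application side, a multi-level result like this is often easier to consume as flat rows, one per path in the tree. The sketch below assumes the response has already been turned into nested dictionaries with field, value, count and an optional pivot list; that structure and the flatten_pivots helper are assumptions made for illustration, and the sample holds only a few counts copied from the response above.

```python
def flatten_pivots(entries, prefix=()):
    """Yield (path, count) for every node in a nested pivot structure."""
    for entry in entries:
        path = prefix + (entry["value"],)
        yield path, entry["count"]
        # Recurse into the next level, if there is one.
        yield from flatten_pivots(entry.get("pivot", []), path)


# Small hand-written structure using a few counts from the response above.
pivots = [
    {"field": "cat", "value": "electronics", "count": 17, "pivot": [
        {"field": "inStock", "value": "true", "count": 13, "pivot": [
            {"field": "features", "value": "lcd", "count": 5},
            {"field": "features", "value": "tft", "count": 4},
        ]},
        {"field": "inStock", "value": "false", "count": 4, "pivot": [
            {"field": "features", "value": "dvi", "count": 2},
        ]},
    ]},
]

rows = list(flatten_pivots(pivots))
for path, count in rows:
    print(" > ".join(path), count)
```

Each row keeps the full path from the top level down, so the depth of the hierarchy does not matter to the code that renders it.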

A few words at the end

In my opinion this is one of the more useful features for the “ordinary” user. Unfortunately, so far it is only available in the development version of Solr. I have not found any information on whether there are plans to port this functionality to Solr 1.5, which lives in the branch_3x branch in SVN. However, the important thing is that this functionality has been committed, and sooner or later Solr users will be able to use it.

Solr.pl