Solr 6.0 and graph traversal support

One of the new features in the recently released Solr 6.0 is the graph traversal query, which lets us work with graphs. Given a root set and relations between documents (such as the parent identifier stored in a document), we can use a single query to follow multiple levels of joins in the same request. Let’s look at this new feature, which works both in the old-fashioned master-slave Solr setup and in SolrCloud.

For the purpose of this blog post we will use a very simple data set that can be indexed with a single command, and a configuration that can be downloaded from our GitHub account at: https://github.com/solrpl/blog.

Creating the collection and indexing the data

First we need to create the collection and index the data. We will start Solr using the following command:

bin/solr start -c

This will launch our Solr instance in cloud mode. Now we need to send our configuration files to ZooKeeper, which we will do by running the following command:

bin/solr zk -upconfig -n graph_test_config -z localhost:9983 -d graph_test/conf

Next we will create our graph collection by running the following command:

curl -XGET 'http://localhost:8983/solr/admin/collections?action=CREATE&name=graph&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=graph_test_config'

Now, after collection creation we can finally index the data by running the following command:

curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/graph/update' --data-binary '{
 "add" : { "doc" : { "id" : "1", "name" : "Root document one" } },
 "add" : { "doc" : { "id" : "2", "name" : "Root document two" } },
 "add" : { "doc" : { "id" : "3", "name" : "Root document three" } },
 "add" : { "doc" : { "id" : "11", "parent_id" : "1", "name" : "First level document 1, child one" } },
 "add" : { "doc" : { "id" : "12", "parent_id" : "1", "name" : "First level document 1, child two" } },
 "add" : { "doc" : { "id" : "13", "parent_id" : "1", "name" : "First level document 1, child three" } },
 "add" : { "doc" : { "id" : "21", "parent_id" : "2", "name" : "First level document 2, child one" } },
 "add" : { "doc" : { "id" : "22", "parent_id" : "2", "name" : "First level document 2, child two" } },
 "add" : { "doc" : { "id" : "121", "parent_id" : "12", "name" : "Second level document 12, child one" } },
 "add" : { "doc" : { "id" : "122", "parent_id" : "12", "name" : "Second level document 12, child two" } },
 "add" : { "doc" : { "id" : "131", "parent_id" : "13", "name" : "Second level document 13, child three" } },
 "commit" : {}
}'
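The command above repeats the "add" key inside one JSON object, which is awkward to produce from code because most JSON libraries keep only one value per key. As a rough sketch, and assuming we are free to send the documents as a plain JSON array instead (a form that Solr's JSON update handler also accepts), the same data could be generated like this. The ROWS table and the build_docs helper are our own names, not part of Solr:

```python
import json

# (id, parent_id, name) for every sample document; None marks a root document.
ROWS = [
    ("1", None, "Root document one"),
    ("2", None, "Root document two"),
    ("3", None, "Root document three"),
    ("11", "1", "First level document 1, child one"),
    ("12", "1", "First level document 1, child two"),
    ("13", "1", "First level document 1, child three"),
    ("21", "2", "First level document 2, child one"),
    ("22", "2", "First level document 2, child two"),
    ("121", "12", "Second level document 12, child one"),
    ("122", "12", "Second level document 12, child two"),
    ("131", "13", "Second level document 13, child three"),
]


def build_docs(rows):
    """Turn the table into Solr documents, leaving parent_id out for roots."""
    docs = []
    for doc_id, parent_id, name in rows:
        doc = {"id": doc_id, "name": name}
        if parent_id is not None:
            doc["parent_id"] = parent_id
        docs.append(doc)
    return docs


# JSON array of documents, ready to be used as the request body.
payload = json.dumps(build_docs(ROWS), indent=1)
print(payload)
```

The printed array can be posted to http://localhost:8983/solr/graph/update?commit=true with the Content-Type: application/json header; the commit=true parameter takes the place of the "commit" entry used in the curl command.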

So our data has the following relations (each document is listed under its parent):

1
  11
  12
    121
    122
  13
    131
2
  21
  22
3

Let’s try searching in the structure.

Basic graph traversal query usage

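In its basic form, the graph traversal query needs a root set of documents, the name of the identifier field and the name of the field holding the parent identifier. In the query, the from parameter names the parent identifier field, the to parameter names the identifier field, and the query that follows the closing brace selects the root set. For example, if we would like to find the root documents together with all the documents related to them, directly or through further levels of relations, we could run the following query: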

http://localhost:8983/solr/graph/select?q=*:*&fq={!graph from=parent_id to=id}name:"root document"
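The query above is shown in its readable form. When it is sent from code rather than pasted into a browser, the braces, quotes and spaces need to be URL-encoded. A minimal sketch using only the Python standard library follows; the graph_query_url helper and its arguments are our own naming, and the endpoint is the local collection used in this post:

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint: the local "graph" collection created earlier in the post.
BASE_URL = "http://localhost:8983/solr/graph/select"


def graph_query_url(root_query, from_field="parent_id", to_field="id", **local_params):
    """Build a /select URL whose fq parameter holds a {!graph ...} filter."""
    extras = "".join(f" {key}={value}" for key, value in local_params.items())
    fq = f"{{!graph from={from_field} to={to_field}{extras}}}{root_query}"
    # urlencode percent-encodes the braces, quotes, colons and spaces.
    return BASE_URL + "?" + urlencode({"q": "*:*", "fq": fq})


url = graph_query_url('name:"root document"')
print(url)
```

Additional local parameters can be passed as keyword arguments, for example graph_query_url('name:"root document"', maxDepth=1).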

The documents returned for such a query would look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">8</int>
  <lst name="params">
   <str name="q">*:*</str>
   <str name="fq">{!graph from=parent_id to=id}name:"root document"</str>
  </lst>
 </lst>
 <result name="response" numFound="8" start="0" maxScore="1.0">
 <doc>
  <str name="id">1</str>
  <str name="name">Root document one</str>
  <long name="_version_">1531331026113003520</long></doc>
 <doc>
  <str name="id">11</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child one</str>
  <long name="_version_">1531331026114052096</long></doc>
 <doc>
  <str name="id">12</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child two</str>
  <long name="_version_">1531331026115100672</long></doc>
 <doc>
  <str name="id">13</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child three</str>
  <long name="_version_">1531331026115100673</long></doc>
 <doc>
  <str name="id">122</str>
  <str name="parent_id">12</str>
  <str name="name">Second level document 12, child two</str>
  <long name="_version_">1531331026120343552</long></doc>
 <doc>
  <str name="id">2</str>
  <str name="name">Root document two</str>
  <long name="_version_">1531331026109857792</long></doc>
 <doc>
  <str name="id">3</str>
  <str name="name">Root document three</str>
  <long name="_version_">1531331026110906368</long></doc>
 <doc>
  <str name="id">21</str>
  <str name="parent_id">2</str>
  <str name="name">First level document 2, child one</str>
  <long name="_version_">1531331026111954944</long></doc>
 </result>
</response>

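As you can see, we’ve got our root documents returned (identifiers 1, 2 and 3) along with documents from the next levels of relations, which is very nice. Note, however, that only 8 of the 11 indexed documents came back; the Update section at the end of the post explains why.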
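To make the traversal easier to picture, here is a small in-memory model of it. This is only a sketch of the idea, not Solr's implementation: it assumes a single index, takes the root set as an explicit list instead of running the name:"root document" query, and follows the parent_id to id relation breadth-first. The PARENT table and the traverse function are our own names.

```python
from collections import deque

# id -> parent_id for the sample documents (None marks a root document).
PARENT = {
    "1": None, "2": None, "3": None,
    "11": "1", "12": "1", "13": "1",
    "21": "2", "22": "2",
    "121": "12", "122": "12", "131": "13",
}


def traverse(parent, roots):
    """Breadth-first walk: from every visited id, move to the documents
    whose parent_id equals that id."""
    visited = set(roots)
    frontier = deque(roots)
    while frontier:
        current = frontier.popleft()
        for doc_id, parent_id in parent.items():
            if parent_id == current and doc_id not in visited:
                visited.add(doc_id)
                frontier.append(doc_id)
    return visited


reached = traverse(PARENT, ["1", "2", "3"])
print(sorted(reached, key=int))
```

In this single-index model every one of the 11 documents is reached, which is why the numFound of 8 in the response above deserves the explanation given in the Update section.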

Filtering

The results can be filtered by using the traversalFilter property. The filter is applied at each join iteration, so it limits which documents the traversal is allowed to follow. For example, if we would like to narrow the resulting documents to only those that have the term one in the name field, we could run the following query:

http://localhost:8983/solr/graph/select?q=*:*&fq={!graph from=parent_id to=id traversalFilter=name:one}name:"root document"

The results would be as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">7</int>
  <lst name="params">
   <str name="q">*:*</str>
   <str name="fq">{!graph from=parent_id to=id traversalFilter=name:one}name:"root document"</str>
  </lst>
 </lst>
 <result name="response" numFound="5" start="0" maxScore="1.0">
 <doc>
  <str name="id">1</str>
  <str name="name">Root document one</str>
  <long name="_version_">1531331026113003520</long></doc>
 <doc>
  <str name="id">11</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child one</str>
  <long name="_version_">1531331026114052096</long></doc>
 <doc>
  <str name="id">2</str>
  <str name="name">Root document two</str>
  <long name="_version_">1531331026109857792</long></doc>
 <doc>
  <str name="id">3</str>
  <str name="name">Root document three</str>
  <long name="_version_">1531331026110906368</long></doc>
 <doc>
  <str name="id">21</str>
  <str name="parent_id">2</str>
  <str name="name">First level document 2, child one</str>
  <long name="_version_">1531331026111954944</long></doc>
 </result>
</response>

As you can see, apart from the root set documents, only the documents matching the filter were returned for each join. It seems that the filter is working 🙂
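The effect of the filter can be modelled in a few lines. The sketch below is our own simplification, not Solr code: it assumes that the name field is tokenized on whitespace and punctuation and lowercased, that the filter is checked only for documents reached through a join (the root set is kept as it is), and that the root set is given explicitly. DOCS, has_term_one and traverse_filtered are hypothetical names.

```python
# id -> (parent_id, name) for the sample documents.
DOCS = {
    "1": (None, "Root document one"),
    "2": (None, "Root document two"),
    "3": (None, "Root document three"),
    "11": ("1", "First level document 1, child one"),
    "12": ("1", "First level document 1, child two"),
    "13": ("1", "First level document 1, child three"),
    "21": ("2", "First level document 2, child one"),
    "22": ("2", "First level document 2, child two"),
    "121": ("12", "Second level document 12, child one"),
    "122": ("12", "Second level document 12, child two"),
    "131": ("13", "Second level document 13, child three"),
}


def has_term_one(name):
    """Rough stand-in for name:one on a tokenized, lowercased text field."""
    return "one" in name.lower().replace(",", " ").split()


def traverse_filtered(docs, roots, keep):
    """Level-by-level walk that follows only documents accepted by keep()."""
    visited = set(roots)
    frontier = list(roots)
    while frontier:
        next_frontier = []
        for doc_id, (parent_id, name) in docs.items():
            if parent_id in frontier and doc_id not in visited and keep(name):
                visited.add(doc_id)
                next_frontier.append(doc_id)
        frontier = next_frontier
    return visited


filtered = traverse_filtered(DOCS, ["1", "2", "3"], has_term_one)
print(sorted(filtered, key=int))
```

In this model document 121 is not reached even though its name contains the term one: its parent, document 12, is rejected by the filter, so the walk never gets that far.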

Returning the root or leaves

Apart from filtering, we can also tell Solr to return only leaves or to omit the root set documents. For example, to omit the root set documents we would add the returnRoot property set to false (it defaults to true) to our query:

http://localhost:8983/solr/graph/select?q=*:*&fq={!graph from=parent_id to=id returnRoot=false}name:"root document"

The results are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">10</int>
  <lst name="params">
   <str name="q">*:*</str>
   <str name="fq">{!graph from=parent_id to=id returnRoot=false}name:"root document"</str>
  </lst>
 </lst>
 <result name="response" numFound="5" start="0" maxScore="1.0">
 <doc>
  <str name="id">11</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child one</str>
  <long name="_version_">1531331026114052096</long></doc>
 <doc>
  <str name="id">12</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child two</str>
  <long name="_version_">1531331026115100672</long></doc>
 <doc>
  <str name="id">13</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child three</str>
  <long name="_version_">1531331026115100673</long></doc>
 <doc>
  <str name="id">122</str>
  <str name="parent_id">12</str>
  <str name="name">Second level document 12, child two</str>
  <long name="_version_">1531331026120343552</long></doc>
 <doc>
  <str name="id">21</str>
  <str name="parent_id">2</str>
  <str name="name">First level document 2, child one</str>
  <long name="_version_">1531331026111954944</long></doc>
 </result>
</response>

As we can see, the root documents are no longer included in the results.

If we are interested in leaves only, we should add the returnOnlyLeaf parameter and set it to true (it defaults to false).

Controlling maximum depth

Finally, using the maxDepth property we can control the maximum depth of the traversal. By default it is set to -1, which stands for unlimited. For example, if we are only interested in the first level of the graph, we could run the following query:

http://localhost:8983/solr/graph/select?q=*:*&fq={!graph from=parent_id to=id maxDepth=1}name:"root document"

The result includes only the root set documents and the documents that are one join away from them:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">10</int>
  <lst name="params">
   <str name="q">*:*</str>
   <str name="fq">{!graph from=parent_id to=id maxDepth=1}name:"root document"</str>
  </lst>
 </lst>
 <result name="response" numFound="7" start="0" maxScore="1.0">
 <doc>
  <str name="id">1</str>
  <str name="name">Root document one</str>
  <long name="_version_">1531331026113003520</long></doc>
 <doc>
  <str name="id">11</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child one</str>
  <long name="_version_">1531331026114052096</long></doc>
 <doc>
  <str name="id">12</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child two</str>
  <long name="_version_">1531331026115100672</long></doc>
 <doc>
  <str name="id">13</str>
  <str name="parent_id">1</str>
  <str name="name">First level document 1, child three</str>
  <long name="_version_">1531331026115100673</long></doc>
 <doc>
  <str name="id">2</str>
  <str name="name">Root document two</str>
  <long name="_version_">1531331026109857792</long></doc>
 <doc>
  <str name="id">3</str>
  <str name="name">Root document three</str>
  <long name="_version_">1531331026110906368</long></doc>
 <doc>
  <str name="id">21</str>
  <str name="parent_id">2</str>
  <str name="name">First level document 2, child one</str>
  <long name="_version_">1531331026111954944</long></doc>
 </result>
</response>
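The depth limit can be sketched with the same kind of in-memory model. Again, this is an illustration under our own assumptions (a single index, an explicit root set, a hypothetical traverse_to_depth function), not the way Solr implements it. Each pass of the loop corresponds to one join, and a negative max_depth means no limit:

```python
# id -> parent_id for the sample documents (None marks a root document).
PARENT = {
    "1": None, "2": None, "3": None,
    "11": "1", "12": "1", "13": "1",
    "21": "2", "22": "2",
    "121": "12", "122": "12", "131": "13",
}


def traverse_to_depth(parent, roots, max_depth=-1):
    """Return {id: depth} for every document reached within max_depth joins."""
    depth_of = {root: 0 for root in roots}
    frontier = list(roots)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        depth += 1
        # Documents whose parent sits in the current frontier form the next level.
        frontier = [
            doc_id for doc_id, parent_id in parent.items()
            if parent_id in frontier and doc_id not in depth_of
        ]
        for doc_id in frontier:
            depth_of[doc_id] = depth
    return depth_of


one_level = traverse_to_depth(PARENT, ["1", "2", "3"], max_depth=1)
print(sorted(one_level, key=int))
```

With max_depth=1 the model returns the three root documents and their five direct children. The response above has 7 documents instead of 8 because document 22 is missing, which again comes down to the reason given in the Update section.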

Summary

As you can see, in addition to the Parallel SQL over MapReduce functionality and cross data center replication, Solr 6.0 brings pretty neat graph traversal support. We haven’t had a chance to test the performance of the query on a larger data set yet, but we will try to prepare one, along with some sample queries, to see what we can expect from Solr when it comes to graph traversal query performance.

Update

We didn’t mention it, but you can see that not all documents from our sample data set were included in the results. This is because our collection was created with two shards and we ran a distributed query, while the graph query only follows relations between documents located in the same shard. To avoid that, we could just create the collection with a single shard and live with that until the graph query supports more 🙂
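The effect can be reproduced with a small model in which each shard runs the traversal on its own documents and the partial results are merged. The shard assignment below is hypothetical: we did not read it from the cluster, we only picked one that is consistent with the responses shown in this post. SHARDS and traverse_shard are our own names, and the root set is given explicitly.

```python
# id -> parent_id for the sample documents (None marks a root document).
PARENT = {
    "1": None, "2": None, "3": None,
    "11": "1", "12": "1", "13": "1",
    "21": "2", "22": "2",
    "121": "12", "122": "12", "131": "13",
}
ROOTS = {"1", "2", "3"}

# Hypothetical placement of documents on the two shards (an assumption
# chosen to match the responses above, not taken from the cluster).
SHARDS = [
    {"1", "11", "12", "13", "22", "122"},
    {"2", "3", "21", "121", "131"},
]


def traverse_shard(shard_ids):
    """Traverse using only the documents stored on one shard."""
    visited = ROOTS & shard_ids
    frontier = set(visited)
    while frontier:
        frontier = {
            doc_id for doc_id in shard_ids
            if PARENT[doc_id] in frontier and doc_id not in visited
        }
        visited |= frontier
    return visited


merged = set().union(*(traverse_shard(shard) for shard in SHARDS))
missing = set(PARENT) - merged
print(sorted(merged, key=int), sorted(missing, key=int))
```

Under this assumed placement the merged result holds 8 documents, and 22, 121 and 131 are left out because each of them sits on a different shard than its parent.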

Solr.pl