Solr 6.0 and graph traversal support

One of the new features in the recently released Solr 6.0 is the graph traversal query, which allows us to work with graphs. Given a root set and relations between documents (such as the parent identifier of a document), we can use a single query to get multiple levels of joins in the same request. Let's look at how this new feature works, both in the old-fashioned Solr master-slave setup and in SolrCloud.

For the purpose of this blog post we will use a very simple data set that can be indexed with a single command, and a configuration that can be downloaded from our GitHub account at https://github.com/solrpl/blog.

Creating the collection and indexing the data

First we need to create the collection and index the data. We will start Solr using the following command:

This will launch our Solr instance in cloud mode. Now we need to send our configuration files to ZooKeeper, which we will do by running the following command:

Next we will create our graph collection by running the following command:

Now that the collection has been created, we can finally index the data by running the following command:

Our data has the following relations:

Graph Documents Layout

Let’s try searching in the structure.

Basic graph traversal query usage

In its basic form, the graph traversal query requires us to provide a root set of documents and to specify which field holds the identifier and which field holds the parent identifier. For example, if we would like to find all the documents related to the root documents, either directly or through further levels of relations, we could run the following query:
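As a rough sketch, a request of this kind could be assembled in Python as shown below. The host, the collection name graph, the field names id and parent_id and the root query name:"root document" are all assumptions borrowed from the request quoted in the comments under this post, so adjust them to your own setup.

```python
from urllib.parse import urlencode

# Assumed values, not confirmed by the post itself: a local Solr node,
# a collection called "graph", and the id / parent_id field names.
SOLR_SELECT = "http://localhost:8983/solr/graph/select"


def basic_graph_url(root_query, from_field="parent_id", to_field="id"):
    """Build a select URL whose filter query is a graph traversal."""
    # Local params choose the graph parser and name the two join fields;
    # the text after the closing brace is the query for the root set.
    fq = "{!graph from=%s to=%s}%s" % (from_field, to_field, root_query)
    # urlencode percent-escapes braces, quotes and spaces for us.
    return SOLR_SELECT + "?" + urlencode({"q": "*:*", "fq": fq})


url = basic_graph_url('name:"root document"')
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client.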

The documents returned for such a query would look as follows:

As you can see, we’ve got our root documents returned (identifiers 1, 2 and 3) and we’ve got all the leaves for the whole data set, which is very nice.

Filtering

The results can be filtered by using the traversalFilter property. The defined filter will be applied to each join iteration. For example, if we would like to filter the resulting documents to only those that have the term one in the name field, we could run the following query:
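A filter query carrying traversalFilter could be put together along these lines. This is only a sketch: graph_filter is a hypothetical helper, and the field names and the root query are the same assumptions as in the request quoted in the comments under this post.

```python
from urllib.parse import urlencode


def graph_filter(root_query, **options):
    """Render a {!graph ...} filter query from keyword options."""
    parts = ["from=parent_id", "to=id"]  # field names are assumptions
    for key, value in options.items():
        # Quote each value so that spaces inside it stay in one parameter.
        # This sketch does not escape double quotes inside the value.
        parts.append('%s="%s"' % (key, value))
    return "{!graph %s}%s" % (" ".join(parts), root_query)


fq = graph_filter('name:"root document"', traversalFilter="name:one")
print(fq)
print(urlencode({"q": "*:*", "fq": fq}))
```

The first line printed is the readable filter query, the second is the same request encoded for use in a URL.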

The results would be as follows:

As you can see, for each join only the documents matching the filter were returned, along with the root document set. It seems that the filter is working 🙂

Returning the root or leaves

Apart from filtering, we can also tell Solr to return only leaves or to omit the root set documents. For example, to omit the root set documents we would add the returnRoot property set to false (it defaults to true) to our query:

The results are as follows:

As we can see, the results do not include the root documents.

If we are interested in leaves only, we should add the returnOnlyLeaf parameter and set it to true (it defaults to false).
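Both switches go into the local params of the graph filter. The sketch below only builds the two filter query strings; with_flag is a hypothetical helper, and the field names and the root query are assumptions taken from the request quoted in the comments under this post.

```python
def with_flag(flag, value):
    """Return the assumed basic graph filter with one boolean flag added."""
    # Booleans are written as the lowercase words true / false.
    return '{!graph from=parent_id to=id %s=%s}name:"root document"' % (
        flag,
        str(value).lower(),
    )


no_roots = with_flag("returnRoot", False)
only_leaves = with_flag("returnOnlyLeaf", True)

print(no_roots)
print(only_leaves)
```

Either string can be sent as the fq parameter of a regular select request.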

Controlling maximum depth

Finally, using the maxDepth property we can control the maximum depth of the traversal. By default it is set to -1, which stands for unlimited. For example, if we are only interested in the first level of the graph, we could run the following query:

The result includes only documents that are one join away from the documents in the root set:
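To make the effect of the depth limit concrete, here is a small in-memory model of the traversal. It is not Solr's implementation, only a sketch of the behaviour described above, and the documents in DOCS are hypothetical rather than the data set used in this post. We assume the root set is still part of the result, since returnRoot defaults to true.

```python
from collections import defaultdict

# Hypothetical documents as (id, parent_id); not the post's real data set.
DOCS = [
    ("1", None), ("11", "1"), ("12", "1"), ("121", "12"),
    ("2", None), ("21", "2"),
]


def traverse(docs, root_ids, max_depth=-1, return_root=True):
    """Breadth-first walk from parents to children, up to max_depth joins."""
    children = defaultdict(list)
    for doc_id, parent_id in docs:
        if parent_id is not None:
            children[parent_id].append(doc_id)

    seen = set(root_ids)
    frontier = list(root_ids)
    depth = 0
    # max_depth == -1 means unlimited, mirroring the default described above.
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = []
        for doc_id in frontier:
            for child in children[doc_id]:
                if child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
        depth += 1

    if not return_root:
        seen -= set(root_ids)
    return sorted(seen)


print(traverse(DOCS, ["1", "2"]))               # every reachable document
print(traverse(DOCS, ["1", "2"], max_depth=1))  # roots plus one join
```

With a depth of 1 the document 121 drops out of the result, because it is two joins away from its root.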

Summary

As you can see, in addition to the Parallel SQL over MapReduce functionality and the cross data center replication, we’ve got pretty neat graph traversal support in Solr 6.0. We haven’t had a chance to test the performance of the query on a larger data set yet, but we will try to come up with one, along with some sample queries, to see what we can expect from Solr when it comes to graph traversal query performance.

Update

We didn’t mention it, but you can see that not all documents from our sample data set were included in the results. This is because our collection was created with two shards and we ran a distributed query. To avoid that, we could just create the collection with a single shard and live with that until the graph query supports more 🙂
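A single-shard collection can be requested through the Collections API. The sketch below only builds the request URL; the host, the collection name graph, the replication factor and the configuration name are assumptions, so substitute the values from your own cluster.

```python
from urllib.parse import urlencode

params = {
    "action": "CREATE",
    "name": "graph",                   # assumed collection name
    "numShards": 1,                    # single shard, as suggested above
    "replicationFactor": 1,            # assumed
    "collection.configName": "graph",  # assumed name of the uploaded config
}

create_url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(create_url)
```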

9 thoughts on “Solr 6.0 and graph traversal support”

  • 20 April 2016 at 05:32

    When we execute the request "http://localhost:8983/solr/graph/select?q=*:*&fq={!graph from=parent_id to=id}name:"root document"", why are the documents with id=22, id=121, id=131 not in the result set?

  • 20 April 2016 at 05:38

    I guess that SolrCloud does not support graph traversal. Am I right?

    • 24 April 2016 at 09:10

      I haven’t dug into the code much yet, but it seems that this is the problem.

  • 9 August 2016 at 15:16

    I installed Solr-6.1.0 on my local VM. What are the commands to run this example?

    Thanks,
    Tony

    • 13 August 2016 at 21:42

      By default Solr should be available at the localhost of the VM on the 8983 port, so the same as in the blog post, of course if you run the commands from the VM. If you run them from your PC you just need to put the VM IP address instead of localhost.

  • 22 August 2016 at 18:27

    The current Graph Query parser only supports single-shard Solr indexes. I do have a version that I’ve been working on that uses Kafka as a broker to do a multi-node / distributed graph traversal.

    I posted a patch of the basic approach to a ticket. With some work, I suspect we could remove the Kafka dependency and maybe get it committed back. In the meantime, there is the work that Joel has done with Streaming Aggregations. The gatherNodes function handles the distributed case.

  • 20 October 2016 at 18:13

    Could you use this to manage security of version controlled documents? (Different users and groups have access to different versions of the same document)

    • 20 January 2017 at 14:54

      This is possible, but I think join would be better for that.
