Solr 6.5 and large stored fields – quick look

As we know Solr has a few caches, for example – filterCache for filters, queryResultCache for query results caching and of course the documentCache for caching documents for fast retrieval. Today we will focus on the last of the mentioned caches and what can be done to better utilize the cache if you use it.

The problem

When documentCache is present in solrconfig.xml after the first time a field is retrieved from Lucene Solr will cache its value along with the document and store it in the documentCache. This can be very expensive, especially for large stored fields – image a situation when you have the documents OCRed from a book and you show the content of the pages. If you don’t reuse such data, so basically a lot hit ration in the documentCache, will result in more garbage produced by Solr itself and thus JVM garbage collector having harder time to clean that up. That can lead to higher CPU usage and worse performance of Solr in general. Let’s look at what we can do with such large, stored fields.

Marking the field as large

Starting with Solr 6.5 we got the ability to add additional property to the field definition, one called large which takes a value of true or false by default being false. Field that we want to mark as large should be set as stored=”true” and multiValued=”false”. In such cases, setting the large=”true” property on the field definition will make the field value not cached inside the documentCache.

Noticing the difference

Because this is a quick look type of post, I don’t want to get into too much specifics, but I would like to compare two collections with the same data. Each collection have the same set of fields:

  • id – identifier of the document,
  • name – name of the document,
  • body – text of the document, which can be very, very large.

One collection will have the large=”true” for the body field and the other won’t have that property set. We will also index a few large documents and see how documentCache behaves.

So here are the commands to setup those two collections using Solr.pl Github account (https://github.com/solrpl/). First setup one collection and gather statistics and then remove all the files, restart Solr, create the second collection and gather statistics. The commands are as follows:

$ mkdir /tmp/solr
$ mkdir /tmp/solr/collection_with_large
$ mkdir /tmp/solr/collection_without_large
$ wget https://github.com/solrpl/blog/tree/master/posts/large_field/data.xml /tmp/solr/data.xml
$ wget https://github.com/solrpl/blog/tree/master/posts/large_field/collection_with_large/managed-schema /tmp/solr/collection_with_large/managed-schema
$ wget https://github.com/solrpl/blog/tree/master/posts/large_field/collection_with_large/solrconfig.xml /tmp/solr/collection_with_large/solrconfig.xml
$ wget https://github.com/solrpl/blog/tree/master/posts/large_field/collection_without_large/managed-schema /tmp/solr/collection_without_large/managed-schema
$ wget https://github.com/solrpl/blog/tree/master/posts/large_field/collection_without_large/solrconfig.xml /tmp/solr/collection_without_large/solrconfig.xml
$ bin/solr zk upconfig -z localhost:9983 -n config_with_large -d /tmp/collection_with_large
$ bin/solr create_collection -c collection_with_large -n config_with_large -shards 1 -replicationFactor 1
$ curl -XPOST 'localhost:8983/solr/collection_with_large/update?commit=true' -H 'Content-Type:application/xml' --data-binary @/tmp/solr/data.xml
$ curl 'localhost:8983/solr/collection_with_large/select?q=*:*'

And now let’s create the second collection using the downloaded data:

$ bin/solr zk upconfig -z localhost:9983 -n config_without_large -d /tmp/collection_without_large
$ bin/solr create_collection -c collection_without_large -n config_without_large -shards 1 -replicationFactor 1
$ curl -XPOST 'localhost:8983/solr/collection_without_large/update?commit=true' -H 'Content-Type:application/xml' --data-binary @/tmp/solr/data.xml
$ curl 'localhost:8983/solr/collection_without_large/select?q=*:*'

And now, let’s check the usage of the documentCache that we’ve gathered. So we have this for the collection with the body field marked as large=”true”:

And we have this for the collection with the body field without the large=”true” property:

As you can see, the field marked with large=”true” was not put into the documentCache directly, but only as a lazy loaded large field, which is what we were aiming for. This means, that we can still use the documentCache and not worry about Solr putting the large, stored fields there, which was the case in the second example.

Leave a Reply

Your email address will not be published. Required fields are marked *