Distributed IDF

When Lucene and Solr searches through the data, each document is assigned a score that is calculated on the basis of query terms statistics. When using SolrCloud and our data inside the collection is distributed among multiple shards we are hit by a problem of not exact inverse document frequency calculation. The problem can be defined in the following way – each shard stores the term statistics locally and doesn’t share that with other shards during query execution. Can we do something about it to have more precise IDF calculation? Let’s see what we can do about it.

IDF and Solr

With the release of Lucene and Solr 6.0 the default scoring algorithm was changed from the TF/IDF one to the BM25 one. The new algorithm is said to be better, because of the limitations in how the more frequent terms are affecting the final score of the document.

In BM25 the IDF part of the calculation is calculated in the following way:

log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5))

It means that the IDF is faster and more cut down for the high frequency terms comparing to the TF/IDF algorithm that was the default algorithm in Lucene and Solr before 6.0.

The problem is that this IDF calculation and the terms statistics is not by default distributed. Each shard keeps the local term statistics which lowers down the precision of the score calculation. In most cases this is not problematic, however there are use cases where this makes a lot of difference. Without distributed IDF even collection architecture changes, like having more or less shards will affect the score of the documents – which can provide additional obstacles when writing automatic tests that compare data between different environments.

Distributed IDF

With Solr 5.0 we got the ability to configure distributed IDF calculation. It is enough to add appropriate statsCache definition to the solrconfig.xml file and that’s all. For example:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache" />

As you can see it is really simple. And there are multiple implementations available:

LocalStatsCache – the default implementation when the statsCache element is not defined in the solrconfig.xml file. In this case the logic is very simple – no distributed IDF calculation is done.

ExactStatsCache – implementation that caches the term statistics requiring additional query step during query execution. Please be aware that because of additional round step through the shards is needed it may lower down the performance of the queries.

ExactSharedStatsCache – implementation similar to the ExactStatsCache with the difference that the term statistics are shared between the requests. It can result in higher performance compared to the ExactStatsCache, but will result in higher memory consumption.

LRUStatsCache – comparing to ExactStatsCache this implementation is using an LRU cache to store the term statistics. Based on this cache Solr is able to determine is if an additional step in the query processing is needed thus lowering the need of running this additional step effectively increasing the query performance when doing distributed IDF calculation. Terms are stored in maps – there is a one map per shard, so the more shards you have the more maps will be created.

Does it Work?

The answer to such question can be only one – let’s check that. To do that we will create two collections – distrib_idf and non_distrib_idf and we will index a very small set of example data (the configurations and data can be found on our Github account – https://github.com/solrpl/blog). Each collection will have two shards, one replica and the SolrCloud cluster will be built of two Solr instances.

After indexing the example documents we can see the difference in score calculation by running the following query to both of them:

.../select?q=title:solr&fl=*,score

When it comes to the collection that doesn’t have the statsCache configured the results look as follows:

{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":7,
     "params":{
       "q":"title:solr",
       "fl":"*,score"}},
   "response":{"numFound":4,"start":0,"maxScore":0.082873434,"docs":[
       {
         "id":"2",
         "title":["Solr document two"],
         "version":1633875263859195904,
         "score":0.082873434},
       {
         "id":"3",
         "title":["Solr document three"],
         "version":1633875263924207616,
         "score":0.082873434},
       {
         "id":"1",
         "title":["Solr document one"],
         "version":1633875263562448896,
         "score":0.082873434},
       {
         "id":"4",
         "title":["Solr document four"],
         "version":1633875263679889408,
         "score":0.082873434}]
   }}

The result of query execution on the collection that does have the statsCache turned on looks as follows:

{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":12,
     "params":{
       "q":"title:solr",
       "fl":"*,score"}},
   "response":{"numFound":4,"start":0,"maxScore":0.04789114,"docs":[
       {
         "id":"2",
         "title":["Solr document two"],
         "version":1633875264251363328,
         "score":0.04789114},
       {
         "id":"3",
         "title":["Solr document three"],
         "version":1633875264254509056,
         "score":0.04789114},
       {
         "id":"1",
         "title":["Solr document one"],
         "version":1633875264186351616,
         "score":0.04789114},
       {
         "id":"4",
         "title":["Solr document four"],
         "version":1633875264189497344,
         "score":0.04789114}]
   }}

It is of course a very simple example, but we already see the difference between those collections. The score calculated when using the collection without statsCache is more or less two times as high as the score in the second collection. It makes sense taking into consideration that solr term in the title field appears four times – two times in each shard. It just means that each of the shard is aware about its own local statistics when not using distributed IDF.

Summary

The lack of distributed IDF calculation can cause problems, especially in the cases where the quality of the results is very important. Luckily starting with Solr 5 we have a very simple way of turning on the distributed IDF calculation allowing us to completely overcome the problem. Of course you need to remember that turning on the statsCache can result and probably will result in lowering the performance of your queries because of the need of additional operations.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.