Imagine the following problem – we have an application that expects Solr to return results sorted on the basis of some field. Those results will then be paged in the GUI. However, if the person using the GUI application immediately jumps to the tenth, twentieth, or fiftieth page of search results, there is a problem – the wait time. Is there anything we can do about this? Yes, we can help Solr a bit.
A few numbers at the beginning
Let’s start with the query and statistics. Imagine that we have the following query, which is sent to Solr to get the five hundredth page of search results:
q=*:*&sort=price+asc&rows=100&start=50000
What must Solr do to retrieve and render that results list? Of course, it must read documents from the Lucene index. The question is how many documents have to be read from the index. Is it 100? Unfortunately, no. Solr must collect 50,100 sorted documents from the Lucene index, because we want 100 documents starting from the 50,000th one. Kind of scary. Now let's compare how long it takes Solr to return the first page of search results (the query q=*:*&sort=price+asc&rows=100&start=0) and how long it takes to return the last page (the query q=*:*&sort=price+asc&rows=100&start=999900). The test was performed on an index containing one million documents consisting of four fields: id (string), name (text), description (text) and price (long). Before every test iteration Solr was started, the query was run, and Solr was shut down. These steps were repeated 100 times for each of the queries, and the times seen in the table are the arithmetic mean of the query execution times. Here are the test results:
Typical solutions
Of course, we can try to tune the caches or increase queryResultWindowSize, but then there is the problem of how large to make them: there may be situations where the configured size is insufficient, or the relevant entry is not in Solr's memory, and then the wait time for the n-th results page will again be very long. We can also try adding warming queries, but we won't be able to prepare all the combinations, and even if we could, the cache would have to be huge. So none of these solutions lets us achieve the desired result.
Filters, filters and filters again
This behavior of Solr (and of other applications based on Lucene) is caused by the queue of documents, the so-called priority queue, which, in order to display the last page of results, must collect all the documents matching the query and then return the ones located on the requested page. Of course, in our case, if we want the first page of search results, the queue will have 100 entries. However, if we want the last page, Solr will have to put one million documents in the queue. Of course, what I have described is a big simplification.
The idea is to limit the number of documents Lucene has to put in the queue. How do we do it? Filters will help us, so in Solr we will use the fq parameter. Using a filter limits the number of search results. The ideal queue size would be the one passed in the rows parameter of the query. However, that ideal is not achievable in most situations. An additional problem is that when sending a query with a filter we cannot predict the number of results, because we do not know how many documents the filter will match. The solution to this problem is to make two queries instead of one – the first one to see how limiting our filter is, using rows=0 and start=0, and the second one with the start parameter calculated accordingly (example below).
The maximum price of a product in the test data is 10,000 and the minimum is 0. So to the first query (the first page) we will add the bracket <0; 1000>, and to the second query (the last page) we will add the bracket <9000; 10000>.
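To make the two-step approach concrete, here is a minimal sketch of the "count first, then fetch" pattern for the last-page case. The helper names, the core name and the assumption that the filter covers exactly the tail of the ascending sort are mine, not the article's:

import json
import urllib.parse
import urllib.request

# Assumed Solr location and core name - adjust to your own setup.
SOLR_URL = "http://localhost:8983/solr/products/select"
TOTAL_DOCS = 1_000_000                  # size of the whole index
GLOBAL_START = 999_900                  # offset of the page we really want
ROWS = 100
PRICE_FILTER = "price:[9000 TO 10000]"  # must cover the tail of the ascending sort

def solr_query(params):
    """Send a query to Solr and return the parsed JSON response."""
    url = SOLR_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Step 1: how many documents does the filter let through? (rows=0, start=0)
count_response = solr_query({
    "q": "*:*", "fq": PRICE_FILTER, "rows": 0, "start": 0, "wt": "json",
})
filtered_count = count_response["response"]["numFound"]

# Step 2: shift the start offset into the filtered result set and fetch the page.
# All documents removed by the filter sort before the ones we want, so:
filtered_start = GLOBAL_START - (TOTAL_DOCS - filtered_count)
page_response = solr_query({
    "q": "*:*", "sort": "price asc", "fq": PRICE_FILTER,
    "rows": ROWS, "start": filtered_start, "wt": "json",
})
print(len(page_response["response"]["docs"]), "documents fetched, start =", filtered_start)

With the numbers used later in the article, a filtered count of 100,137 documents would give a shifted start of 999,900 - (1,000,000 - 100,137) = 100,037.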
Disadvantages of the filter-based solution
There is one drawback to the filter-based solution, and it is quite a big one. It may happen that the number of results the filter limits the query to is too small. What then? We have to widen the chosen bracket for the filter. Of course, we could calculate the optimal brackets on the basis of our data, but that depends on the data and the queries, which is why I won't go into it at this point.
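One simple way to handle a too-narrow bracket is to widen it and repeat the count query until the filtered set is large enough to contain the requested page. The sketch below is my own illustration of that idea, not something the article prescribes; it reuses the solr_query helper and the TOTAL_DOCS, GLOBAL_START and ROWS constants from the previous sketch, and the step size of 1,000 is arbitrary:

def widen_bracket(lower, upper, needed, step=1000, min_price=0):
    """Widen the lower bound of the price bracket until the filtered
    result set contains at least `needed` documents."""
    while True:
        fq = "price:[%d TO %d]" % (lower, upper)
        response = solr_query({"q": "*:*", "fq": fq, "rows": 0, "start": 0, "wt": "json"})
        found = response["response"]["numFound"]
        if found >= needed or lower <= min_price:
            return lower, upper, found
        lower = max(min_price, lower - step)

# Example: for the deep request with GLOBAL_START = 999,900 on a 1,000,000 document
# index, the filtered tail must contain at least the last 100 documents.
needed = TOTAL_DOCS - GLOBAL_START
lower, upper, count = widen_bracket(9000, 10000, needed)
print("bracket <%d; %d> matches %d documents" % (lower, upper, count))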
What is the performance after the change?
So let's repeat the tests, but now with the filter-based approach. The first query just returns the first page of results (the query q=*:*&sort=price+asc&rows=100&start=0&fq=price:[0+TO+1000]). The second case (actually two queries) is used to check the number of results and then fetch them (the two queries: q=*:*&sort=price+asc&rows=0&start=0&fq=price:[9000+TO+10000] and q=*:*&sort=price+asc&rows=100&start=100037&fq=price:[9000+TO+10000]). It is worth noting the changed start parameter in the last query, which is due to the fact that the filter (the fq parameter) leaves fewer search results to page through. This test was carried out in the same way as the previous one – start Solr, run the query (or queries), and shut down Solr. The numbers seen in the table are the arithmetic mean of the query execution times.
As you can see, the query performance changed. We can therefore conclude that we succeeded. Of course, we could be tempted by further optimizations, but for now let's say that we are satisfied with the results. I suspect, however, that you may ask the following question:
How is this handled by other search engines?
Good question, but the answer is actually trivial – one of the methods is simply to prevent very deep paging. Google, among others, uses this approach. Try typing the word "ask" into the search box and going beyond the 91st page of search results. Google didn't let me 😉
Conclusions
As you can see, deep paging performance after our changes has increased significantly. We can now let the user page through the search results without worrying that it will kill our Solr instance and the machine it runs on. Of course, this method is not without flaws, because we need to know certain parameters (e.g. in our case, the maximum price, in order to build the filter for the deepest pages), but it is a solution that lets you serve search results with relatively low latency compared to the pre-optimization approach.