Solr 4.7 – efficient deep paging

A long time ago we described a problem called deep paging. In short: the deeper you go into the results, the slower the query becomes, because Solr has to prepare the data from the beginning for each query. Until Solr 4.7 there wasn’t a good solution to that problem. With the recently released Solr version we can use a so-called cursor to drastically improve the performance of deep paging.

The problem

The deep paging problem is quite easy to define. To return search results, Solr must prepare an in-memory structure and return part of it. Returning that part is cheap if it comes from the beginning of the structure. However, if we want to return page number 10,000 (with 20 results per page), Solr needs to prepare a structure containing at least 200,000 elements (10,000 * 20). As you can see, this costs not only time but also memory.
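The arithmetic behind that number can be sketched in a few lines of Python. The function name is made up for illustration, and the model is deliberately simplified: it only counts how many sorted entries have to be collected to serve a given page with classic start/rows paging.

```python
def entries_to_collect(page: int, rows: int) -> int:
    """Sorted entries needed to serve a 1-based `page` with start/rows paging."""
    start = (page - 1) * rows  # offset of the first document on the page
    return start + rows        # everything up to the end of that page


print(entries_to_collect(1, 20))       # 20
print(entries_to_collect(10_000, 20))  # 200000
```

The cost grows linearly with the page number, even though only 20 documents are ever sent back to the client.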

The good news is that with the release of Solr 4.7 the situation has changed: the cursor has been introduced. A cursor is a logical structure that doesn’t require its state to be stored on the server side. It contains information about the sorting and the last document returned in the results. Because of that, Solr doesn’t need to start the search from the beginning each time we want the next page of results. This gives a drastic performance improvement when going deep into the results.
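One way to picture such a stateless cursor is the toy model below. It is not Solr’s actual implementation; the names Doc, sort_key and next_page are made up for illustration. The cursor is simply the sort key of the last document already returned, so the "server" only has to keep `rows` candidates in memory, no matter how deep we are.

```python
import heapq
from typing import Iterable, List, Optional, Tuple

Doc = Tuple[float, str]  # (score, id): a toy stand-in for an indexed document
Key = Tuple[float, str]


def sort_key(doc: Doc) -> Key:
    # Mirrors "score desc, id asc": negate the score so higher scores sort first.
    score, doc_id = doc
    return (-score, doc_id)


def next_page(docs: Iterable[Doc], rows: int,
              cursor: Optional[Key] = None) -> Tuple[List[Doc], Optional[Key]]:
    """Return (page, new_cursor); `cursor` is the sort key of the last doc seen."""
    # Skip everything at or before the cursor position.
    remaining = (d for d in docs if cursor is None or sort_key(d) > cursor)
    # nsmallest keeps at most `rows` candidates while scanning.
    page = heapq.nsmallest(rows, remaining, key=sort_key)
    new_cursor = sort_key(page[-1]) if page else cursor
    return page, new_cursor


docs = [(1.0, "a"), (2.0, "b"), (2.0, "c"), (0.5, "d"), (1.0, "e")]
page, cursor = next_page(docs, rows=2)
while page:
    print([doc_id for _, doc_id in page])
    page, cursor = next_page(docs, rows=2, cursor=cursor)
```

Because the id is part of the sort key, two documents with the same score still have a well-defined order, which is what lets the cursor resume at an exact position.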

Usage

Cursor usage is very simple. To tell Solr to return a cursor, we pass an additional parameter in the first query: cursorMark=*. In the response, apart from the documents, we get a cursor identifier returned in the nextCursorMark parameter. Let’s look at an example.

The query

Let’s start with a very simple query:

There are four things here that we are interested in. First of all, we either omit the start parameter or set it to 0. The rows parameter can take whatever value we need; there is no limitation on it. Of course, we passed the cursorMark=* parameter to tell Solr that we want the cursor to be used. The final thing is the sort definition. The cursor needs a sort that tells it how to behave, and that sort has to include the unique key field as a tie-breaker. That’s why we needed to override the default sorting and sort not only by score but also by document identifier.
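The four points above can be put together in a short Python sketch that builds the URL of a first cursor request. The host, the collection name (collection1), the match-all query and the assumption that the unique key field is called id are all illustrative; adjust them to your setup.

```python
from urllib.parse import urlencode

# Assumed host and collection; not taken from the original post.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def first_page_url(query: str, rows: int) -> str:
    """Build the URL of the first request of a cursor-based iteration."""
    params = {
        "q": query,
        "rows": rows,                 # any page size we need
        "sort": "score desc,id asc",  # score first, unique key as tie-breaker
        "cursorMark": "*",            # "*" asks Solr to start a new cursor
        "wt": "json",
        # no "start" parameter: it must be omitted or 0 when a cursor is used
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(first_page_url("*:*", 20))
```

urlencode takes care of escaping the asterisk, the comma and the spaces in the parameter values.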

Search results

Our query returns the following search results:

As we can see, in addition to standard search results, we got the cursor identifier in the nextCursorMark section. Now, to get the next results bound to that cursor, we need to pass that identifier using the cursorMark parameter.
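Assuming the JSON response writer (wt=json), pulling the documents and the cursor identifier out of a response could look like the sketch below. The sample body is made up for illustration: real cursor marks are opaque strings generated by Solr, and the function name read_page is hypothetical.

```python
import json
from typing import List, Tuple

# Made-up response body; only the overall shape matters here.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {"numFound": 3, "start": 0,
               "docs": [{"id": "doc-1"}, {"id": "doc-2"}]},
  "nextCursorMark": "EXAMPLE-OPAQUE-MARK"
}
"""


def read_page(body: str) -> Tuple[List[dict], str]:
    """Return the documents and the mark to send with the next request."""
    data = json.loads(body)
    return data["response"]["docs"], data["nextCursorMark"]


docs, next_mark = read_page(SAMPLE_RESPONSE)
print([doc["id"] for doc in docs], next_mark)
```

The mark should be treated as an opaque token: we only store it and send it back unchanged.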

Next query

Our next query looks as follows (note the cursorMark parameter value):

The results were as follows:

As we can see, the returned nextCursorMark was different again.

Further queries

The logic for further queries is simple: we pass the value returned with the previous search results in the cursorMark parameter. So again, our next query would look as follows:
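The whole iteration can be sketched as a small Python loop. The host, collection name and the id unique key are assumptions, and http_fetch and iterate_all are hypothetical helper names. The stop condition relies on Solr returning a nextCursorMark equal to the cursorMark we sent once there is nothing more to fetch.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed host and collection; not taken from the original post.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def http_fetch(params: Dict[str, str]) -> dict:
    """Send one request to Solr and decode the JSON response."""
    with urlopen(SOLR_SELECT_URL + "?" + urlencode(params)) as resp:
        return json.load(resp)


def iterate_all(query: str, rows: int,
                fetch: Callable[[Dict[str, str]], dict] = http_fetch
                ) -> Iterator[dict]:
    """Yield every matching document, one cursor page at a time."""
    cursor_mark = "*"  # first request starts a new cursor
    while True:
        data = fetch({
            "q": query,
            "rows": str(rows),
            "sort": "score desc,id asc",
            "cursorMark": cursor_mark,
            "wt": "json",
        })
        yield from data["response"]["docs"]
        next_mark = data["nextCursorMark"]
        if next_mark == cursor_mark:  # same mark back: no more results
            break
        cursor_mark = next_mark       # continue from where this page ended
```

A call such as `for doc in iterate_all("*:*", 20): ...` then walks the full result set. The fetch function is passed in as a parameter so the loop can be exercised with a stub instead of a running Solr instance.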

Summary

A simple API and a massive performance gain for deep paging: that’s how I think the cursor introduced in Solr 4.7 could be summarized. I decided not to redo the performance tests, as Chris Hostetter has already run them in his entry about this functionality. If you are interested, please look at: http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/.

2 thoughts on “Solr 4.7 – efficient deep paging”

  • 23 December 2014 at 10:44

    In this case, aren’t we making 10000/20 server requests instead of 1? And doesn’t that take 5000x more time than using start=9980?

  • 23 January 2015 at 12:16

    Using cursorMark we can overcome deep paging, so far so good, but I have a question. If that is the case, how long does a mark live on the server? If there are many different requests with cursorMark=*, it might take up all the server RAM, right? Is there any way to destroy the mark once we are done with it, in case it is holding the results?

