At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately, the tool he mentioned was not available in Solr at the time, but that has changed.
At the time of writing, the described tool is available only in the branch_3x branch in SVN, but there are plans to migrate this functionality to version 4.x.
But what is it?
When a search is terminated after a predetermined time, regardless of how many results have been collected, we eventually run into a problem with the quality of the search results. Instead of receiving the best results for the query, we get them in a random fashion (or at least they may look random). This means that we cannot guarantee that the user of the system gets the best matching results. Of course, we are talking about the situation where the search is terminated after a predetermined period of time and Solr therefore can't gather all the documents that match the query.
Is it useful for me?
When might ending a search after a predetermined time be useful? There are many use cases for such a search. Imagine that our deployment is composed of many separate shards, each operating on a large amount of data. When making a distributed query, each of the shards in the search system must be queried for relevant documents, and then all the results must be gathered and displayed to the end user (who, of course, need not be a person; it may be an application). But what if each of the shards needs a very long time to process all the search results, and we are only interested in, for example, those added recently (e.g. in the last week)? This is where early termination of the search query becomes useful, assuming that we are more interested in documents added the day before than in those added two weeks ago.
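To make the scenario more concrete, here is a minimal sketch of what a time-limited distributed query could look like. The entry does not name the request parameters, so this assumes Solr's standard shards and timeAllowed (milliseconds) parameters; the host names, the query and the 500 ms limit are made up for illustration.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, shards, time_allowed_ms):
    """Build a Solr select URL for a distributed, time-limited query.

    Assumption: `shards` and `timeAllowed` are the parameters used for
    distributed search and for the time limit; the entry itself does not
    name them.
    """
    params = {
        "q": query,
        "shards": ",".join(shards),      # comma-separated list of shard addresses
        "timeAllowed": time_allowed_ms,  # stop searching after this many milliseconds
    }
    return base_url + "?" + urlencode(params)


# Hypothetical hosts and query, for illustration only
url = build_query_url(
    "http://localhost:8983/solr/select",
    "title:lucene",
    ["shard1:8983/solr", "shard2:8983/solr"],
    500,
)
print(url)
```

When the time limit is hit, each shard returns only what it managed to collect, which is exactly the situation described above.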
How to achieve it?
The example above illustrates a case where we can use a search that is terminated after a specified time. However, looking further into the search results, we run into a problem: to sort search results, Solr must collect them all. So when making a query with a sort parameter like
sort=added+desc
each of the shards would have to return all of its search results for the documents to be sorted correctly. Does this mean that we can't use early termination? Not really. To help us, Solr provides a tool, IndexSorter, which until now was available only in the Apache Nutch project but was recently committed to Lucene and Solr. With this tool, we can pre-sort the index by the field that we need. With an index sorted in descending order by the date a document was added, Solr would first get the documents that were added most recently, and thus we would be able to use early termination.
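The effect of pre-sorting can be illustrated with a small simulation. This is only a sketch of the idea, not how Lucene or IndexSorter work internally: the "index" is a plain list, the time limit is replaced by a budget of documents that may be examined (so the result is deterministic), and all names are hypothetical.

```python
from typing import List, Tuple

Doc = Tuple[int, int]  # (doc_id, added_day); a larger added_day means a newer document


def search_with_budget(index: List[Doc], budget: int, rows: int) -> List[Doc]:
    """Scan the index in stored order and stop after `budget` documents.

    The budget stands in for a time limit. Whatever was collected is then
    sorted by added_day descending and the top `rows` are returned.
    """
    collected = index[:budget]  # early termination: the rest is never seen
    return sorted(collected, key=lambda d: d[1], reverse=True)[:rows]


def presort(index: List[Doc]) -> List[Doc]:
    """Stand-in for index sorting: store the newest documents first."""
    return sorted(index, key=lambda d: d[1], reverse=True)


# Ten documents stored in arbitrary order
index = [(0, 3), (1, 9), (2, 1), (3, 7), (4, 10),
         (5, 2), (6, 8), (7, 5), (8, 4), (9, 6)]

unsorted_hits = search_with_budget(index, budget=4, rows=3)
sorted_hits = search_with_budget(presort(index), budget=4, rows=3)

print([d[1] for d in unsorted_hits])  # [9, 7, 3]: days 10 and 8 were never reached
print([d[1] for d in sorted_hits])    # [10, 9, 8]: the true three newest documents
```

With the same budget, the scan over the pre-sorted list returns the three newest documents, while the scan over the original order misses two of them.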
Using IndexSorter
What do you need to do to use the IndexSorter tool? To tell you the truth, it's not that complicated. Note, however, that at the time of publication of this entry the tool is only available in branch_3x of the Lucene/Solr project. To sort an index on the basis of a field, run the following command from the command line (keeping in mind, of course, the location of the lucene-misc-3.1.jar library, which after building the project can be found in the lucene/build/contrib/misc directory):
java IndexSorter SOURCE_DIRECTORY TARGET_DIRECTORY FIELD_NAME
The parameters mean:
- SOURCE_DIRECTORY – the directory containing the index that you want to sort,
- TARGET_DIRECTORY – the directory where the sorted index will be saved,
- FIELD_NAME – the field by which the index will be sorted.
If everything goes correctly, you should see something like this:
IndexSorter: done, 896 total milliseconds
The end
In my opinion, Lucene and Solr just got a very interesting feature, which can be used, for example, wherever the amount of data is very large, when response time cannot exceed a certain limit, or when the results beyond the first ones (the first 100 or 1000) are not significant. Anyone interested in the subject of index sorting and early termination techniques should look at the presentation titled “Munching and Crunching: Lucene Index Post-Processing” (slides), given by Andrzej Białecki at the Lucene Eurocon 2010 conference, which discusses these topics.