Sorting by function value in Solr (SOLR-1297)

Rafał Kuć — Mon, 28 Feb 2011 08:17:46 +0000

Solr 3.1 and later offer a very interesting feature: sorting by function value. What does that give us? Actually, a few interesting possibilities.

Let’s start

The first example that comes to mind, perhaps because of a project I worked on some time ago, is sorting by the distance between two geographical points. Until now, implementing such functionality required changes to Solr (for example, LocalSolr or LocalLucene). With Solr 3.1 and later, you can sort search results by the value returned by a function. For example, Solr has the dist function, which calculates the distance between two points. One variant of the function accepts five parameters: the algorithm and the coordinates of two points. If we would like to sort search results in ascending order of distance from the point with latitude and longitude 0.0, we should add the following sort parameter to the Solr query:

...sort=dist(2, geo_x, geo_y, 0, 0) asc

I suspect that the most commonly used values of the first parameter will be:

1 – calculation based on the Manhattan metric
2 – calculation of Euclidean distance
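
To make the difference between the two values concrete, here is a minimal Python sketch of the kind of calculation described above. It is not Solr's implementation; the dist helper below only mirrors the parameter order of the Solr function, and the sample documents and the sort_by_distance helper are made up for illustration.

```python
def dist(power, x1, y1, x2, y2):
    """Distance between (x1, y1) and (x2, y2) for the given power.

    power=1 gives the Manhattan distance, power=2 the Euclidean distance.
    This sketch deliberately handles only power >= 1.
    """
    if power < 1:
        raise ValueError("this sketch only handles power >= 1")
    dx, dy = abs(x1 - x2), abs(y1 - y2)
    if power == 1:
        return dx + dy
    return (dx ** power + dy ** power) ** (1.0 / power)


# Hypothetical documents using the field names from the sort example.
docs = [
    {"id": 1, "geo_x": 3.0, "geo_y": 4.0},
    {"id": 2, "geo_x": 1.0, "geo_y": 1.0},
    {"id": 3, "geo_x": 0.0, "geo_y": 6.0},
]


def sort_by_distance(docs, power, x, y):
    """Order documents by ascending distance from the point (x, y)."""
    return sorted(docs, key=lambda d: dist(power, d["geo_x"], d["geo_y"], x, y))


euclidean_order = [d["id"] for d in sort_by_distance(docs, 2, 0, 0)]
manhattan_order = [d["id"] for d in sort_by_distance(docs, 1, 0, 0)]
```

For these sample points the two metrics produce different orders, which is why the choice of the first parameter matters for the final ranking.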

A few words about performance

So far so good, but what does it look like in terms of performance? I ran two simple tests.

During the first test I indexed 200,000 documents, each consisting of four fields: an identifier (numeric field), a description (text field) and a location (two numeric fields). In order not to obscure the sorting results, I used one of the simplest functions currently available in Solr: the sum function, which adds its two arguments. I compared the query time of the default sort (by score) with that of sorting by the function value. The following table shows the results of the test:

[table “13” not found /]

The second test compared sorting by a string field with sorting by a function. The setup was almost identical to the first test: I indexed 200,000 documents (with an additional field, name_sort, of type string) and used the sum function. The following table shows the results of the test:

[table “15” not found /]

The tests above show that sorting by function value is much slower than the default sort order (which you’d expect). Sorting by function value is also slower than sorting by a string field, but the difference is not as significant as in the previous case.
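
For anyone who would like to repeat a comparison like this, the following Python sketch shows one way it could be set up. It is not the harness used for the tests above. The Solr address, the geo_x and geo_y field names passed to sum, and the build_url and average_millis helpers are all assumptions made for illustration; only name_sort and the sum function come from the test description.

```python
import time
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/select"

# The three sort variants compared in the tests (field names are assumptions).
SORTS = {
    "score": "score desc",
    "string field": "name_sort asc",
    "function": "sum(geo_x,geo_y) asc",
}


def build_url(query, sort, rows=10):
    """Build a select URL with a properly encoded sort parameter."""
    params = {"q": query, "sort": sort, "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


def average_millis(fetch, url, repeats=20):
    """Call fetch(url) `repeats` times and return the mean time in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fetch(url)
    return (time.perf_counter() - start) * 1000.0 / repeats
```

With a running Solr instance, fetch could be something like lambda u: urllib.request.urlopen(u).read(), called once per entry in SORTS. Keep in mind that Solr caches may favour whichever query runs later, so the numbers are only a rough guide.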

A few words at the end

Of course, the tests above only skim the topic of sorting efficiency with Solr functions; nevertheless, they show a clear relationship. Given that in most cases this will not be the default sort method, and that it gives us a really powerful tool, it seems to me that this is a feature worth remembering. It will definitely be worth using when the requirements say that we have to sort by a value that depends on both the query and the indexed values, as in the case of sorting by distance from a point specified by the user.

Quick look – IndexSorter

Rafał Kuć — Mon, 04 Oct 2010 12:14:20 +0000

At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately, the tool he mentioned was not available in Solr, but that has changed.

At the time of writing, the described tool is available only in the SVN branch named branch_3x, but there are plans to migrate this functionality to version 4.x.

But what is it?

When we terminate a search after a predetermined time, regardless of the number of results found, at some point we run into the problem of result quality. Instead of receiving the best results for the query, we get them in a seemingly random fashion. This means that we cannot guarantee that the user gets the best matching results. Of course, we are talking about the situation in which the search is terminated after a predetermined period of time, so that Solr cannot gather all the documents that match the query.

Is it useful for me?

When might ending a search after a predetermined time be useful? There are many use cases for such a search. Imagine that our deployment is composed of many separate shards, each operating on a large amount of data. When making a distributed query, each of the shards in the search system must be queried for relevant documents, and then all results must be gathered and displayed to the end user (of course, this need not be a person; it may be an application). But what if each of the shards needs a very long time to process all search results, and we are only interested in, for example, those added recently (e.g. in the last week)? This is where early termination of the search query comes in, assuming that we are more interested in documents added the day before than in those added two weeks ago.

How to achieve it?

The example above illustrates a case in which we can use a search that is terminated after a specified time. However, when we look further into it we come to a problem: to sort search results, Solr must collect them all. So when making a query with a sort parameter like sort=added+desc, each of the shards would have to return all search results to get the documents sorted correctly. Does this mean that we can’t use early termination? Not really. To help us, Solr provides a tool, IndexSorter, which until now was available only in the Apache Nutch project but was recently committed to Lucene and Solr. With this tool we can pre-sort the index by the field we need. Thus, with an index sorted in descending order by the date a document was added, Solr would first get the documents that were added most recently, and we would be able to use early termination.
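
The effect of pre-sorting can be illustrated with a small Python simulation. This is only a toy model and not how Lucene or Solr work internally: the index is a plain list, and the time limit is replaced by a budget of documents that may be examined. The search_with_budget function and the sample data are assumptions made for the sake of the example.

```python
import random


def search_with_budget(index, matches, budget):
    """Scan documents in index order and stop after `budget` documents."""
    hits = []
    for examined, doc in enumerate(index):
        if examined >= budget:
            break  # early termination: the budget stands in for the time limit
        if matches(doc):
            hits.append(doc)
    return hits


# 1000 hypothetical documents; a higher "added" value means a newer document.
docs = [
    {"id": i, "added": i, "text": "solr" if i % 2 == 0 else "lucene"}
    for i in range(1000)
]

# An index in arbitrary order versus an index pre-sorted by "added", newest first.
random.seed(42)
unsorted_index = docs[:]
random.shuffle(unsorted_index)
sorted_index = sorted(docs, key=lambda d: d["added"], reverse=True)


def is_match(doc):
    """Toy query: match documents whose text is 'solr'."""
    return doc["text"] == "solr"


newest_first = search_with_budget(sorted_index, is_match, budget=100)
arbitrary = search_with_budget(unsorted_index, is_match, budget=100)
```

With the pre-sorted index, the hits collected before termination are exactly the newest matching documents, already in the desired order. With the shuffled index, the same budget returns matches of arbitrary age.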

Using IndexSorter

What do you need to do to use the IndexSorter tool? To tell you the truth, it’s not that complicated. Note, however, that at the time of publication of this entry the tool is only available in branch_3x of the Lucene/Solr project. To sort an index by a field, run the following command from the command line (keeping in mind, of course, the location of the lucene-misc-3.1.jar library; after building the project we find it in the lucene/build/contrib/misc directory):

java IndexSorter SOURCE_DIRECTORY TARGET_DIRECTORY FIELD_NAME

The parameters mean:

SOURCE_DIRECTORY – the directory containing the index that you want to sort,
TARGET_DIRECTORY – the directory where the sorted index will be saved,
FIELD_NAME – the field by which the index will be sorted.

If everything goes correctly, you should see something like this:

IndexSorter: done, 896 total milliseconds

The end

In my opinion, Lucene and Solr just got a very interesting feature, which can be used, for example, wherever the amount of data is very large, when response time cannot exceed a certain limit, or when the results beyond the first ones (the first 100 or 1000) are not significant. Anyone interested in the subject of index sorting and early termination techniques should look at the slide presentation titled “Munching and Crunching: Lucene Index Post-Processing”, given by Andrzej Białecki at the Lucene Eurocon 2010 conference, in which he discussed these topics.
