One of the many new features in Lucene and Solr 3.1 is FastVectorHighlighter, which, as the change notes put it, is nothing less than improved highlighting functionality. The current highlighting mechanism is not particularly fast, and it can sometimes kill your Solr instance when dealing with a large amount of data or very long text fields. I thought it would be worthwhile to test the performance of the new functionality.
A few words at the beginning
First, some information about the capabilities of the new Lucene highlighter:
- supports N-gram based fields
- requires Java 5 or higher
- takes query boosts into account when scoring text fragments
- is very fast for large documents
It is also worth noting that the current highlighter is marked as deprecated, according to the SOLR-1696 Jira issue.
How was the test performed?
For testing purposes I used an index that contains approximately 1.2 million documents (I indexed the Polish Wikipedia, only the latest changes). For each of the following searches I highlighted on one of the biggest fields, once with the old highlighter (hl.useFastVectorHighlighter=false) and once with the new one (hl.useFastVectorHighlighter=true). Tests were performed with the caches turned off. The table contains response times, each being the average of 10 sequential queries with the largest and smallest values excluded. Solr was restarted after each query. Below are the results of this simple test:
|Query||Highlighter query time||FastVectorHighlighter query time||Documents returned|
Although the test is very simple, it shows a pattern – FastVectorHighlighter is faster than the current highlighter.
As for the quality of the highlighted fragments, I couldn't see any major differences, although this particular data set is not well suited to such observations.
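The measurement procedure described above (switching hl.useFastVectorHighlighter on and off, then averaging ten timings without the two extremes) could be sketched roughly as follows. This is only an illustration, not the script used for the test: the Solr URL, the field name `text` and both helper functions are assumptions; `hl`, `hl.fl` and `hl.useFastVectorHighlighter` are the actual Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and field name; adjust to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"
HIGHLIGHT_FIELD = "text"


def build_query_url(query, use_fvh):
    """Build a select URL with highlighting switched on for one field."""
    params = [
        ("q", query),
        ("hl", "true"),
        ("hl.fl", HIGHLIGHT_FIELD),
        # Toggle between the old highlighter and FastVectorHighlighter.
        ("hl.useFastVectorHighlighter", "true" if use_fvh else "false"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def trimmed_average(times_ms):
    """Average the timings after dropping one smallest and one largest value."""
    if len(times_ms) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(times_ms)[1:-1]
    return sum(kept) / len(kept)
```

For example, with made-up timings, `trimmed_average([10, 20, 30, 40, 1000])` drops 10 and 1000 and returns 30.0, so a single slow outlier does not distort the reported average.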
One thing to remember
Please note that FastVectorHighlighter requires the field it works on to be properly defined. The following attributes must be set on the field: term vectors (termVectors="true"), term positions (termPositions="true") and term offsets (termOffsets="true"). Otherwise, the old mechanism will continue to be used.
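A field definition in schema.xml that meets these requirements might look like the fragment below. The field name and type are assumptions chosen for illustration; only the three term* attributes come from the requirement above, and stored="true" is included because highlighting needs the stored text.

```xml
<!-- Hypothetical field: name and type are placeholders for your own schema. -->
<field name="text" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Keep in mind that term vectors are written at index time, so existing documents would have to be reindexed after such a change.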
To sum up
Please remember that this was not a detailed performance test of the new highlighting method. It was just a simulation of an environment that may closely resemble some production environments. However, based on the test, we can expect the new highlighting method to be faster than the old one.