Solr 4.0: DirectSolrSpellChecker

One of the new features that will be introduced with Solr 4.0 is a new SpellChecker implementation that doesn’t require its own index. I decided to take a quick look at it and share my thoughts.

What We Have Today

As of today (Solr 3.6), we can use the following SpellChecker implementations:

org.apache.solr.spelling.IndexBasedSpellChecker
org.apache.solr.spelling.FileBasedSpellChecker

With the upcoming Solr 4.0, we will get a new implementation:

org.apache.solr.spelling.DirectSolrSpellChecker

Current Problems

In most of the cases I worked with, the main problem with IndexBasedSpellChecker was the need to rebuild its index. In some cases the rebuild took a long time and it wasn’t possible to rebuild the index after every commit, which for some was a big issue. Of course, that wasn’t a problem with FileBasedSpellChecker, but in my case it was used only as a support mechanism for IndexBasedSpellChecker.
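For comparison, here is a minimal sketch of a typical IndexBasedSpellChecker setup in Solr 3.x. The field name and the index directory are assumptions made for illustration; the point is the separate spellcheckIndexDir and the buildOnCommit switch, which triggers the rebuild described above:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- field used as the source of terms (assumed name) -->
    <str name="field">title</str>
    <!-- separate index kept only for spell checking (assumed path) -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- rebuild the spell checking index after every commit -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

With buildOnCommit set to true the suggestions stay current, but every commit pays the cost of the rebuild. With it set to false you have to trigger the rebuild yourself, for example by sending spellcheck.build=true with a request.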

Configuration

DirectSolrSpellChecker configuration is similar to the one you are used to in Solr 3. Of course, there are some additional parameters. Below you can find a sample configuration:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textTitle</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.7</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>

And the meaning of each of the parameters:

queryAnalyzerFieldType – name of the field type whose analyzer will be used to analyze the SpellChecker query.
field – field whose contents will be used to build SpellChecker results.
classname – SpellChecker implementation class.
distanceMeasure – algorithm used to calculate the distance between terms; in our case we use the default one (internal, which is Levenshtein distance).
accuracy – minimum accuracy a suggestion must reach to be counted as a proper one.
maxEdits – maximum number of edits allowed during term enumeration. This property can be set to 1 or 2.
minPrefix – minimum length of the common prefix required during term enumeration.
maxInspections – maximum number of candidate matches inspected for each suggestion.
minQueryLength – minimum length of a query term for it to be considered for correction.
maxQueryFrequency – maximum percentage of documents in which a query word can appear for it to still be considered for correction (a value of 0.01 means 1%).
thresholdTokenFrequency – minimum percentage of documents in which a suggestion has to appear in order to be considered proper (a value of .01 means 1%).

The above configuration attributes show that DirectSolrSpellChecker gives us a large degree of control over its behavior.
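As an illustration of that flexibility, here is a hedged variant of the spellchecker entry. It is only a sketch: the name "strict" and all of the values are assumptions chosen for the example, and it assumes that distanceMeasure accepts the class name of a Lucene StringDistance implementation (here JaroWinklerDistance) in place of internal:

```xml
<lst name="spellchecker">
  <!-- hypothetical second dictionary, selected with spellcheck.dictionary=strict -->
  <str name="name">strict</str>
  <str name="field">title</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- assumed: a StringDistance class name instead of "internal" -->
  <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  <!-- require closer matches than in the sample above -->
  <float name="accuracy">0.8</float>
  <!-- allow only a single edit during term enumeration -->
  <int name="maxEdits">1</int>
  <!-- the first two characters must match -->
  <int name="minPrefix">2</int>
</lst>
```

Such an entry would sit inside the same searchComponent, next to the default one, so that a request can choose between the two dictionaries.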

Usage

DirectSolrSpellChecker is no different from other SpellChecker implementations when it comes to using it. As with the previous implementations, you can configure Solr to add SpellChecker results to every query response, or configure a new handler and decide when to query it for results. We wrote about how to use SpellChecker in the past – in the “Car sale application” example.
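The second approach, a dedicated handler, could look roughly like the sketch below. The handler name /spell and the default values are assumptions made for illustration; the component name spellcheck and the dictionary name default match the sample configuration shown earlier:

```xml
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- turn spell checking on for every request sent to this handler -->
    <str name="spellcheck">true</str>
    <!-- use the spellchecker registered under the name "default" -->
    <str name="spellcheck.dictionary">default</str>
    <!-- return up to 5 suggestions per term (assumed value) -->
    <str name="spellcheck.count">5</str>
    <!-- also return a corrected version of the whole query -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <!-- run the spellcheck search component after the standard ones -->
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

To get suggestions with every query instead, the same defaults and last-components entries can be added to the handler your application already uses.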

What Can We Expect?

According to the information in JIRA issue LUCENE-2507, DirectSolrSpellChecker will not only remove the need for a separate index, but will also improve the quality of suggestions. From what you can see in that issue, DirectSolrSpellChecker gives better results than the previous implementations, although it’s slightly slower. I think that won’t be an issue as long as you don’t use SpellChecker with every query.

Solr.pl