Solr and autocomplete (part 3)

In the previous parts of this series (part 1, part 2) we learned how to configure and query Solr to get autocomplete functionality. In today’s entry I will show you how to add a dictionary to the Suggester and thereby influence the generated suggestions.

Component configuration

Take the component configuration presented in the previous part of the series and add the following parameter:

<str name="sourceLocation">dict.txt</str>

Thus our configuration should look like this:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
 <lst name="spellchecker">
  <str name="name">suggest</str>
  <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
  <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
  <str name="field">name_autocomplete</str>
  <str name="sourceLocation">dict.txt</str>
 </lst>
</searchComponent>

With this parameter we tell the component to use the dictionary named dict.txt, which should be placed in the Solr configuration directory.

Handler configuration

The handler configuration also gets one additional parameter:

<str name="spellcheck.onlyMorePopular">true</str>

So our configuration should be as follows:

<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
 <lst name="defaults">
  <str name="spellcheck">true</str>
  <str name="spellcheck.dictionary">suggest</str>
  <str name="spellcheck.count">10</str>
  <str name="spellcheck.onlyMorePopular">true</str>
 </lst>
 <arr name="components">
  <str>suggest</str>
 </arr>
</requestHandler>

This parameter tells Solr to return only those suggestions for which the number of results is greater than the number of results for the current query.

Dictionary

We told Solr to use a dictionary, but what should this dictionary look like? For the purpose of this post I defined the following dictionary:

# sample dict
Hard disk hitachi
Hard disk wd    2.0
Hard disk jjdd    3.0

How is the dictionary constructed? Each phrase (or single word) is placed on a separate line. A line may end with the weight of the phrase, separated from the phrase by a TAB character. The weight is used together with the parameter spellcheck.onlyMorePopular=true (the higher the weight, the higher the suggestion is placed). If a line has no weight, the default value of 1.0 is used. The dictionary should be saved in UTF-8 encoding. Lines beginning with the # character are skipped.
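Since the TAB separator and the encoding are easy to get wrong in a text editor, it may be safer to generate the file with a script. Below is a minimal Python sketch that follows the rules described above. The helper names write_dictionary and read_dictionary are made up for this post, and the file is written to a temporary directory rather than to a real Solr configuration directory.

```python
import os
import tempfile


def write_dictionary(path, entries):
    """Write (phrase, weight) pairs; a weight of None means 'no weight on this line'."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("# sample dict\n")  # lines starting with # are skipped
        for phrase, weight in entries:
            if weight is None:
                fh.write(phrase + "\n")
            else:
                # a single TAB separates the phrase from its weight
                fh.write(f"{phrase}\t{weight}\n")


def read_dictionary(path, default_weight=1.0):
    """Parse the file back, applying the default weight where none is given."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            phrase, sep, weight = line.partition("\t")
            entries.append((phrase, float(weight) if sep else default_weight))
    return entries


entries = [
    ("Hard disk hitachi", None),
    ("Hard disk wd", 2.0),
    ("Hard disk jjdd", 3.0),
]
path = os.path.join(tempfile.mkdtemp(), "dict.txt")  # hypothetical location
write_dictionary(path, entries)
parsed = read_dictionary(path)
print(parsed)
```

Reading the file back is only a sanity check of the format. It does not claim to reproduce how Solr parses the dictionary internally.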

Data

In this case we don’t need data – we will only use the defined dictionary.

Let’s check how it works

To check how our mechanism behaves, I rebuilt the Suggester index and then sent the following query to Solr:

/suggest?q=Har
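If you want to script both steps, the URLs can be assembled as in the sketch below. It assumes a Solr instance at http://localhost:8983/solr (adjust it to your setup) and uses the standard spellcheck.build=true parameter of the SpellCheckComponent to trigger the rebuild. The function name suggest_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: address of a local Solr instance; change it to match your setup.
SOLR_BASE = "http://localhost:8983/solr"


def suggest_url(prefix, build=False, base=SOLR_BASE):
    """Build a URL for the /suggest handler configured earlier in this post."""
    params = [("q", prefix)]
    if build:
        # asks the SpellCheckComponent to (re)build the dictionary index
        params.append(("spellcheck.build", "true"))
    return f"{base}/suggest?{urlencode(params)}"


rebuild_url = suggest_url("Har", build=True)
query_url = suggest_url("Har")
print(rebuild_url)
print(query_url)
```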

As a result we get the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="Har">
      <int name="numFound">3</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>Hard disk jjdd</str>
        <str>Hard disk hitachi</str>
        <str>Hard disk wd</str>
     </arr>
    </lst>
  </lst>
</lst>
</response>
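On the client side you usually only need the list of phrases. Here is a minimal Python sketch that pulls them out of a response of this shape with the standard library. The sample is a trimmed copy of the response, with the entry named after the queried prefix, and the function name extract_suggestions is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample with the same structure as the response shown above.
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="Har">
      <int name="numFound">3</int>
      <arr name="suggestion">
        <str>Hard disk jjdd</str>
        <str>Hard disk hitachi</str>
        <str>Hard disk wd</str>
      </arr>
    </lst>
  </lst>
</lst>
</response>
"""


def extract_suggestions(xml_bytes):
    """Return the suggested phrases in the order in which Solr returned them."""
    root = ET.fromstring(xml_bytes)
    return [s.text for s in root.findall(".//arr[@name='suggestion']/str")]


suggestions = extract_suggestions(RESPONSE)
print(suggestions)
```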

A few words at the end

As you can see, the suggestions are sorted on the basis of weight, as expected. It is worth noting that the query was passed with a capital letter, which is important: a lowercased query will return an empty suggestion list.
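The case sensitivity is easy to picture with a plain prefix comparison. The sketch below is only an illustration in Python, not the code Solr runs inside TSTLookup, and it does not model the ordering by weight.

```python
# The phrases from the sample dictionary used in this post.
PHRASES = ["Hard disk hitachi", "Hard disk wd", "Hard disk jjdd"]


def prefix_matches(prefix, phrases=PHRASES):
    """Case-sensitive prefix match: 'Har' and 'har' are different prefixes."""
    return [p for p in phrases if p.startswith(prefix)]


upper = prefix_matches("Har")  # matches all three phrases
lower = prefix_matches("har")  # matches nothing
print(upper)
print(lower)
```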

What can be said about this method? If you have very good dictionaries, with weights generated from data such as customer behavior, this is the method for you and your customers will love it. I would not recommend it if you don’t have good dictionaries, because there is a very high chance that your suggestions will be of poor quality.

What will be next?

The number of tasks this week didn’t let me finish the performance tests. That’s why, in the next part of the series, I’ll try to show you how each method behaves with various index structures and sizes.

Solr.pl