Autocomplete on multivalued fields using highlighting

One of the topics I came across recently was building an autocomplete feature on top of Solr multivalued fields (for example, this question was asked on Stack Overflow). Let’s look at the possibilities we have.

Multiple cores vs single core

One of the first things to consider is whether we can use a dedicated core or collection for autocomplete. If we can, we should go that way. There are several reasons in favor of such an approach: for example, such a collection will be smaller than the one holding the data that needs to be searchable, the term count should be lower, and thus queries should be faster. Of course, we have to take care of the additional configuration and indexing, but that’s not too much of a problem, right? In this entry we will look at situations where having a separate core is not an option – for example, because of filtering that needs to be done.

Please also note that in this entry we assume we want whole phrases to be shown to the user.

Configuration

Let’s start from the configuration.

Index structure

Let’s assume that we want to suggest phrases from a multivalued field. Let’s call that field features. The configuration of all the fields in the index is as follows:

<fields>
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
 <field name="features" type="string" indexed="true" stored="true" multiValued="true"/>
 <field name="features_autocomplete" type="text_autocomplete" indexed="true" stored="true" multiValued="true"/>

 <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

As you can see, for the autocomplete feature we will use the field named features_autocomplete. The _version_ field is needed by some of the Solr 4.0 (and newer) features, which is why it is present in our index.

Field values copying

In addition to the above configuration, we also want to copy the data from the features field to the features_autocomplete one. To do that, we will use Solr’s copy field feature and add the following section to the schema.xml file:

<copyField source="features" dest="features_autocomplete"/>

Field type – text_autocomplete

Let’s have a look at the last piece of the configuration – the definition of the text_autocomplete type:

<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>

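As you can see, during indexing Solr keeps the whole value of the features_autocomplete field as a single token (KeywordTokenizerFactory), lowercases it, and then creates edge n-grams from it, that is, prefixes of the phrase, starting from the minimum length of 2 and ending at the maximum length of 50.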

During querying we only lowercase the query phrase; nothing else is needed in our case.
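To make the index-time analysis more tangible, here is a minimal Python sketch that approximates what the text_autocomplete chain produces for a single value. It is only an illustration, not Solr code: the function name edge_ngrams is made up for this example, and the real filters may differ in details.

```python
def edge_ngrams(value, min_gram=2, max_gram=50):
    """Approximate the index-time chain: keyword tokenizer, lowercase, edge n-grams."""
    # The keyword tokenizer keeps the whole value as one token; we then lowercase it.
    token = value.lower()
    # Edge n-grams are prefixes of the token, from min_gram up to max_gram characters.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("Single door")
print(grams)
```

Because the query side only lowercases the input, a query such as sing matches whenever one of these prefixes is exactly sing.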

Sample data

Our sample data looks like this:

<add>
 <doc>
  <field name="id">1</field>
  <field name="features">Multiple windows</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">2</field>
  <field name="features">Single window</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">3</field>
  <field name="features">Multiple windows</field>
  <field name="features">Multiple doors</field>
 </doc>
</add>

Initial query

Let’s look at the queries now.

In the beginning

Let’s start with a simple query that would return the data we need if we were using a single-valued field. The query looks as follows:

q=features_autocomplete:sing&fl=features_autocomplete

Query results

The results we would get from such a query, for our example data, should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">3</int>
  <lst name="params">
   <str name="fl">features_autocomplete</str>
   <str name="q">features_autocomplete:sing</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
 <doc>
  <arr name="features_autocomplete">
   <str>Single window</str>
   <str>Single door</str>
  </arr>
 </doc>
 <doc>
  <arr name="features_autocomplete">
   <str>Multiple windows</str>
   <str>Single door</str>
  </arr>
 </doc>
 </result>
</response>

A short comment

As we can see, the results are not satisfactory: in addition to the value we are querying for, we got all the values stored in the multivalued field. We would like to get only the one that we queried for. Is this possible? Yes, it is – with a little trick. Let’s modify our query to use highlighting.
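The reason for this behavior is that matching happens at the document level, while stored fields are returned as a whole. The following Python sketch imitates that with the sample data from this entry; it is a simplified in-memory model rather than anything Solr actually runs, and the names docs and matching_docs are made up for the illustration.

```python
# Sample documents from this entry: document id -> values of the multivalued field.
docs = {
    "1": ["Multiple windows", "Single door"],
    "2": ["Single window", "Single door"],
    "3": ["Multiple windows", "Multiple doors"],
}


def matching_docs(prefix):
    """Return all stored values of every document with at least one matching value."""
    prefix = prefix.lower()
    return {
        doc_id: values
        for doc_id, values in docs.items()
        # One matching value is enough for the whole document to match.
        if any(value.lower().startswith(prefix) for value in values)
    }


result = matching_docs("sing")
print(result)
```

Document 1 is returned together with Multiple windows, even though only Single door matched the query.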

Query with highlighting

So now we will make use of the Apache Solr highlighting module.

Changed query

What we will do is add the following part to our previous query:

hl=true&hl.fl=features_autocomplete&hl.simple.pre=&hl.simple.post=

So the whole query looks like this:

q=features_autocomplete:sing&fl=features_autocomplete&hl=true&hl.fl=features_autocomplete&hl.simple.pre=&hl.simple.post=

A few words about the parameters that were used:

  • hl=true – we inform Solr that we want to use highlighting,
  • hl.fl=features_autocomplete – we tell Solr which field should be used for highlighting,
  • hl.simple.pre= – setting hl.simple.pre to an empty value tells Solr that we don’t want to mark the beginning of the highlighted fragment,
  • hl.simple.post= – setting hl.simple.post to an empty value tells Solr that we don’t want to mark the end of the highlighted fragment.
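If the query is sent from application code, the parameters described above can be assembled as in the Python sketch below. The base URL (host, port and the core name collection1) and the function name build_autocomplete_url are assumptions made for the example, so adjust them to your installation. The sketch also does not escape query parser special characters in the user input.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_autocomplete_url(prefix, field="features_autocomplete"):
    """Build the highlighting-based autocomplete query from this entry."""
    params = [
        ("q", f"{field}:{prefix}"),
        ("fl", field),
        ("hl", "true"),          # turn highlighting on
        ("hl.fl", field),        # field used for highlighting
        ("hl.simple.pre", ""),   # no marker before the highlighted fragment
        ("hl.simple.post", ""),  # no marker after the highlighted fragment
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_autocomplete_url("sing")
print(url)
```

The colon in the q parameter gets percent-encoded by urlencode, which is fine, as Solr decodes it back when reading the request.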

Modified query results

After querying Solr with the modified query, the following results were returned:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <lst name="params">
   <str name="fl">features_autocomplete</str>
   <str name="q">features_autocomplete:sing</str>
   <str name="hl.simple.pre"/>
   <str name="hl.simple.post"/>
   <str name="hl.fl">features_autocomplete</str>
   <str name="hl">true</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
 <doc>
  <arr name="features_autocomplete">
   <str>Single window</str>
   <str>Single door</str>
  </arr>
 </doc>
 <doc>
  <arr name="features_autocomplete">
   <str>Multiple windows</str>
   <str>Single door</str>
  </arr>
 </doc>
 </result>
 <lst name="highlighting">
  <lst name="2">
   <arr name="features_autocomplete">
    <str>Single window</str>
   </arr>
  </lst>
  <lst name="1">
   <arr name="features_autocomplete">
    <str>Single door</str>
   </arr>
  </lst>
 </lst>
</response>

As you can see, the section responsible for highlighting brings the information that we are interested in :)
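On the client side, the suggestions can be pulled out of the highlighting section with a few lines of code. Below is a minimal Python sketch that parses a shortened copy of the response above; the function name extract_suggestions is made up for this example, and removing duplicates is our assumption about what an autocomplete box would want.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response above, limited to the highlighting section.
SAMPLE_RESPONSE = """<response>
 <lst name="highlighting">
  <lst name="2">
   <arr name="features_autocomplete">
    <str>Single window</str>
   </arr>
  </lst>
  <lst name="1">
   <arr name="features_autocomplete">
    <str>Single door</str>
   </arr>
  </lst>
 </lst>
</response>"""


def extract_suggestions(xml_text, field="features_autocomplete"):
    """Collect unique highlighted fragments, keeping the order of the response."""
    root = ET.fromstring(xml_text)
    path = f"./lst[@name='highlighting']/lst/arr[@name='{field}']"
    suggestions = []
    for arr in root.findall(path):
        for fragment in arr.findall("str"):
            text = (fragment.text or "").strip()
            # Skip empty fragments and values we have already seen.
            if text and text not in suggestions:
                suggestions.append(text)
    return suggestions


suggestions = extract_suggestions(SAMPLE_RESPONSE)
print(suggestions)
```

Note that the response above contains a single fragment per document, which matches the default of 1 for Solr’s hl.snippets parameter; the loop also handles more fragments should that value be raised.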

Summary

Of course, we need to remember that the approach proposed in this entry is not the only way to have a working autocomplete feature with data in multivalued fields. In the next entry on this topic we will show how we can use faceting to get the same results, provided we can accept some small drawbacks.


This entry was posted on Monday, February 25th, 2013 at 09:07 and is filed under About Solr, Autocomplete.