Autocomplete With Special Characters

It's been a long time since the last real blog post here on solr.pl, but better late than never, right? So, today we get back to the topic of autocomplete functionality, which can be built with Solr Suggesters, facets or standard queries with the n-gram approach. It is the same functionality done with a different approach, depending on the needs and preferences.

Today we will look at a more sophisticated autocomplete functionality: one that suggests words with special characters whether or not the user types those special characters.

Imagine we have documents with only two fields: id, which is the document identifier, and name, which is the name of the document and the field on which we would like to run autocomplete. The name field can contain language-specific characters, like ż, ć or ę in Polish. We would like to support them for all our users, no matter what type of keyboard they have or what locale they have set in their system.

Let’s assume we have the following documents:

{"id":1, "name":"Pośrednictwo nieruchomości"}
{"id":2, "name":"Posadowienie budynków"}
{"id":3, "name":"Posocznica"}

The names are not important, the use case is. We would like to be able to get all three documents when the user enters pos into the search box. Is that possible? Yes, and let's see how.

Preparing Collection Structure

Let’s start with the schema.xml file and the definition of the fields and types. We will completely ignore search here and focus only on autocomplete. We also assume that we want the whole name to be displayed whenever the user input matches any of the terms in the name field.
We will start with the fields, which look like this:

<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="string" indexed="false" stored="true" multiValued="false" />
<field name="name_ac" type="text_ac" indexed="true" stored="false" multiValued="false" />

So, we have the id field, which is an integer, and the name field, which we will use only for display purposes. The name_ac field will be used for the actual autocomplete. To avoid filling the name_ac field manually, we will use a copyField definition in the schema.xml:

<copyField source="name" dest="name_ac" />

The name_ac definition will use edge n-grams for efficient prefix queries and a filter that folds all the funky, special, language-specific characters into their ASCII equivalents: the solr.ASCIIFoldingFilterFactory. We need to do that both at index time and at query time, so that both sides of the analysis work. The text_ac type definition looks as follows:

<fieldType name="text_ac" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

During index time, we lowercase the whole input, replace language-specific characters with their ASCII equivalents and finally create edge n-grams with a minimum size of 1 and a maximum size of 100, which means that the n-grams start from a single letter and grow up to 100 characters. During query time, we lowercase the whole input and fold the language-specific characters, and that’s all that is needed.
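To make the effect of the two analysis chains easier to see, here is a minimal Python sketch that imitates them outside of Solr. It is only an approximation and not how Solr implements the filters: the function names (fold_to_ascii, edge_ngrams, suggest) are made up for this illustration, and the folding is done with Unicode decomposition, which handles characters such as ś, ó or ę but, unlike solr.ASCIIFoldingFilterFactory, leaves characters without a decomposition (for example ł) untouched.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate ASCII folding: decompose characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=100):
    """Return the prefixes of the token, from min_gram up to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(value):
    """Index-time chain: whole value as one token, lowercase, fold, edge n-grams."""
    return set(edge_ngrams(fold_to_ascii(value.lower())))


def query_term(user_input):
    """Query-time chain: whole input as one token, lowercase, fold (no n-grams)."""
    return fold_to_ascii(user_input.lower())


names = ["Pośrednictwo nieruchomości", "Posadowienie budynków", "Posocznica"]
index = {name: index_terms(name) for name in names}


def suggest(user_input):
    """Return the names whose indexed prefixes contain the analyzed user input."""
    term = query_term(user_input)
    return [name for name in names if term in index[name]]


print(suggest("pos"))
print(suggest("poś"))
```

Because both sides are folded the same way, pos and poś end up as the same term and match the same set of indexed prefixes.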

Let’s Test

To test the changes we will start Solr in SolrCloud mode with the local ZooKeeper by running the following command from the Solr home directory:

$ bin/solr start -c

Next we will upload the configuration that includes all the changes to ZooKeeper by running the following command:

$ bin/solr zk upconfig -z localhost:9983 -n autocomplete -d /home/config/autocomplete/conf

The only thing you need to adjust for that command to work on your local machine is the directory with the configuration. The configuration itself can be downloaded from our GitHub account (config download).

Now we need to create the collection by running the following command:

$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=autocomplete&numShards=1&replicationFactor=1&collection.configName=autocomplete'

We have everything in place to index our documents, which we can do by running the following command:

$ curl "http://localhost:8983/solr/autocomplete/update?commit=true" -H 'Content-type:application/json' -d '[
  {"id":1, "name":"Pośrednictwo nieruchomości"},
  {"id":2, "name":"Posadowienie budynków"},
  {"id":3, "name":"Posocznica"}
]'

So now, let’s run two queries that should tell us if the above method really works. The first query looks as follows:

http://localhost:8983/solr/autocomplete/select?q.op=AND&defType=edismax&qf=name_ac&fl=id,name&q=pos

The second query looks as follows:

http://localhost:8983/solr/autocomplete/select?q.op=AND&defType=edismax&qf=name_ac&fl=id,name&q=poś

In both cases we are running a query using the Extended DisMax query parser, we are using AND as the Boolean operator (q.op parameter) and we are saying that we want to run the query on the name_ac field by using the qf parameter. We also say that we only want the id and name fields to be returned in the search results by specifying the fl parameter.
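When such a request is sent from application code rather than typed into a browser, the user input has to be URL-encoded, which matters for characters like ś. The following Python sketch shows one way the request URL could be assembled with the standard library; the autocomplete_url helper is a hypothetical name, and the host, port and collection are simply the ones used in this post.

```python
from urllib.parse import urlencode

# Select handler of the collection created earlier in this post.
BASE_URL = "http://localhost:8983/solr/autocomplete/select"


def autocomplete_url(user_input):
    """Build the autocomplete request URL for the given user input."""
    params = {
        "q.op": "AND",         # all terms must match
        "defType": "edismax",  # Extended DisMax query parser
        "qf": "name_ac",       # field used for autocomplete matching
        "fl": "id,name",       # fields returned in the results
        "q": user_input,       # percent-encoded as UTF-8 by urlencode
    }
    return BASE_URL + "?" + urlencode(params)


print(autocomplete_url("pos"))
print(autocomplete_url("poś"))
```

The only difference between the two generated URLs is the value of the q parameter, where ś is sent as its percent-encoded UTF-8 bytes.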

The results returned by Solr are identical in both cases and look as follows (the response shown here is the one for the second query):

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="q">poś</str>
    <str name="defType">edismax</str>
    <str name="qf">name_ac</str>
    <str name="fl">id,name</str>
    <str name="q.op">AND</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0">
  <doc>
    <int name="id">1</int>
    <str name="name">Pośrednictwo nieruchomości</str></doc>
  <doc>
    <int name="id">2</int>
    <str name="name">Posadowienie budynków</str></doc>
  <doc>
    <int name="id">3</int>
    <str name="name">Posocznica</str></doc>
</result>
</response>

So the method works 🙂
