Autocomplete With Special Characters

It's been a long time since the last real blog post here on solr.pl, but better late than never, right? In the past we covered autocomplete functionality in Solr using Suggesters, facets, and standard queries with the n-gram approach: the same functionality, done in different ways, depending on needs and preferences.

Today we will look at a more sophisticated autocomplete functionality, one that suggests words containing special characters regardless of whether the user types those characters or their plain ASCII equivalents.

Autocomplete With Special Characters

Imagine we have documents with only two fields: id, which is the document identifier, and name, which is the name of the document and the field on which we would like to run autocomplete. The name field is the one that can contain language-specific characters, like ż, ć, or ę in Polish. And we would like to support them for all our users, no matter what kind of keyboard they use or what locale they have set in their system.

Let’s assume we have the following documents:
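For illustration, assume three documents with hypothetical Polish names (made up for this sketch) that all share the ASCII-folded prefix pos:

```json
[
  { "id": 1, "name": "Poseł" },
  { "id": 2, "name": "Pośpiech" },
  { "id": 3, "name": "Postój" }
]
```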

The names themselves are not important, the use case is. We would like to be able to get all three documents when the user enters pos into the search box. Is that possible? Yes, it is, so let's see how.

Preparing Collection Structure

Let's start with the schema.xml file and the definition of the fields and types. We will completely ignore regular search here and focus only on autocomplete. We also assume that we want the whole name to be displayed whenever the user input matches any of the terms in the name field.
We will start with the fields, which look like this:
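A minimal sketch of such field definitions (the exact field type names, like int and string, are assumptions, not the original configuration):

```xml
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="name" type="string" indexed="false" stored="true"/>
<field name="name_ac" type="text_ac" indexed="true" stored="false"/>
```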

So, we have our id field, which is an integer, and the name field, which we will use only for display purposes. The name_ac field will be used for the actual autocomplete. To avoid filling the name_ac field manually, we will use a copyField definition in schema.xml:
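Such a copyField directive could look as follows:

```xml
<copyField source="name" dest="name_ac"/>
```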

The name_ac definition will use n-grams for efficient prefix queries and a filter that removes all the funky, special, language-specific characters: solr.ASCIIFoldingFilterFactory. We need to apply it both at index time and at query time, so that both sides of the analysis match. The text_ac type definition looks as follows:
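A sketch of a text_ac type matching that description (the choice of the whitespace tokenizer is an assumption):

```xml
<fieldType name="text_ac" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```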

At index time, we lowercase the whole input, turn language-specific characters into their ASCII equivalents, and finally create edge n-grams with a minimum size of 1, which means the n-grams start from a single letter, and a maximum size of 100. At query time, we only lowercase the input and fold the language-specific characters, and that's all that is needed.

Let’s Test

To test the changes we will start Solr in the SolrCloud mode with the local ZooKeeper by running the following command from Solr home directory:
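Assuming a recent Solr with the bin/solr script, something along these lines should do (the -c switch starts Solr in SolrCloud mode with an embedded ZooKeeper):

```shell
bin/solr start -c
```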

Next we will upload the configuration that includes all the changes to ZooKeeper by running the following command:
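For example, using the bin/solr zk upconfig command (the configuration name and directory below are placeholders):

```shell
bin/solr zk upconfig -z localhost:9983 -n autocomplete -d /path/to/configuration
```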

The only thing you need to care about for that command to work on your local machine is adjusting the directory that holds the configuration. The configuration itself can be downloaded from our GitHub account (config download).

Now we need to create the collection by running the following command:
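For example (the collection and configuration names are assumptions):

```shell
bin/solr create_collection -c autocomplete -n autocomplete -shards 1 -replicationFactor 1
```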

We have everything in place to index our documents, which we can do by running the following command:
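For example, by posting a JSON file with the documents to the update handler (the file name and collection name are placeholders):

```shell
curl -XPOST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/autocomplete/update?commit=true' \
  --data-binary @documents.json
```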

So now, let’s run two queries that should tell us if the above method really works. The first query looks as follows:
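A query using the plain ASCII prefix pos could look like this (the collection name is assumed):

```shell
curl 'http://localhost:8983/solr/autocomplete/select?q=pos&defType=edismax&q.op=AND&qf=name_ac&fl=id,name'
```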

The second query looks as follows:
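And a query using a prefix with a Polish character, poś, URL-encoded (again, the collection name is assumed):

```shell
curl 'http://localhost:8983/solr/autocomplete/select?q=po%C5%9B&defType=edismax&q.op=AND&qf=name_ac&fl=id,name'
```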

In both cases we are running a query using the Extended DisMax query parser, we are using AND as the Boolean operator (the q.op parameter), and we are saying that we want to run the query against the name_ac field by using the qf parameter. We also say that we only want the id and name fields to be returned in the search results by specifying the fl parameter.

The results returned by Solr in both cases are identical and look as follows:
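Assuming three hypothetical sample documents named Poseł, Pośpiech, and Postój, the JSON response would have roughly the following shape (QTime is, of course, made up):

```json
{
  "responseHeader": { "status": 0, "QTime": 1 },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      { "id": 1, "name": "Poseł" },
      { "id": 2, "name": "Pośpiech" },
      { "id": 3, "name": "Postój" }
    ]
  }
}
```

Note that the original names, diacritics included, are returned, because the stored name field is used for display while the folded name_ac field is only used for matching.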

So the method works 🙂
