autocomplete – Solr.pl

Autocomplete With Special Characters

Rafał Kuć — Sun, 29 Jan 2017 14:11:18 +0000

It’s been a long time since the last real blog post here on solr.pl, but better late than never, right? So, today we get back to the topic of autocomplete functionality, which can be built with Solr Suggesters, facets, or standard queries with the n-gram approach: the same functionality, done with a different approach depending on needs and preferences.

Today we will look at a more sophisticated autocomplete functionality: one that suggests words containing special characters whether the user types those characters or their plain ASCII equivalents.

Autocomplete With Special Characters

Imagine we have documents with only two fields: id, which is the document identifier, and name, which is the name of the document and the field on which we would like to run autocomplete. The name field can contain language-specific characters, like ż, ć or ę in Polish. We would like to support them for all our users, no matter what type of keyboard they have or what locale they have set in their system.

Let’s assume we have the following documents:

{"id":1, "name":"Pośrednictwo nieruchomości"}
{"id":2, "name":"Posadowienie budynków"}
{"id":3, "name":"Posocznica"}

The names are not important, the use case is. We would like to be able to get all three documents when the user enters pos into the search box. Is that possible? Yes and let’s see how.

Preparing Collection Structure

Let’s start with the schema.xml file and the definition of the fields and types. We will completely ignore search here and focus only on autocomplete. We also assume that we want the whole name to be displayed whenever the user input matches any of the terms in the name field.
We will start with the fields, which look like this:

So, we will have our id field, which is an integer, and the name field, which we will use only for display purposes. The name_ac field will be used for the actual autocomplete. To avoid filling the name_ac field manually, we will use a copyField definition in the schema.xml:

The name_ac definition will use n-grams for efficient prefix queries and a filter that will remove all the funky, special, language-specific characters: the solr.ASCIIFoldingFilterFactory. We need to do that both at index time and at query time, so that both sides of the analysis work together. The text_ac type definition looks as follows:

At index time, we lowercase the whole input, turn language-specific characters into their ASCII equivalents, and finally create edge n-grams with a minimum size of 1 and a maximum size of 100, which means that the n-grams start from a single letter. At query time, we lowercase the whole input and fold the language-specific characters in the same way, and that’s all that is needed.
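To make the effect of that analysis chain concrete, here is a minimal Python sketch that imitates it outside of Solr. It is only an approximation: Unicode NFKD decomposition stands in for solr.ASCIIFoldingFilterFactory (it handles ś and ó, but not letters such as ł that have no decomposition), and splitting on whitespace is an assumption, because the tokenizer of the text_ac type is not shown here. The function names are made up for this illustration.

```python
import unicodedata

NAMES = [
    "Pośrednictwo nieruchomości",
    "Posadowienie budynków",
    "Posocznica",
]

def fold(text):
    """Lowercase and strip diacritics (rough stand-in for lowercase + ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edge_ngrams(token, min_size=1, max_size=100):
    """All prefixes of the token whose length is between min_size and max_size."""
    longest = min(len(token), max_size)
    return {token[:n] for n in range(min_size, longest + 1)}

def index_terms(name):
    """Index-time side: fold, split on whitespace (assumed), then edge n-grams."""
    terms = set()
    for token in fold(name).split():
        terms |= edge_ngrams(token)
    return terms

def matches(name, user_input):
    """Query-time side: fold only; every typed token must be an indexed term (AND)."""
    terms = index_terms(name)
    return all(token in terms for token in fold(user_input).split())

if __name__ == "__main__":
    for typed in ("pos", "poś"):
        print(typed, [name for name in NAMES if matches(name, typed)])
```

Because both sides are folded the same way, pos and poś end up as the same term, which is why either input can find all three names.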

Let’s Test

To test the changes we will start Solr in the SolrCloud mode with the local ZooKeeper by running the following command from Solr home directory:

$ bin/solr start -c

Next we will upload the configuration that includes all the changes to ZooKeeper by running the following command:

$ bin/solr zk upconfig -z localhost:9983 -n autocomplete -d /home/config/autocomplete/conf

The only thing you need to care about for that command to work on your local machine is adjusting the directory with the configuration. The configuration itself can be downloaded from our Github account (config download).

Now we need to create the collection by running the following command:

$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=autocomplete&numShards=1&replicationFactor=1&collection.configName=autocomplete'

We have everything in place to index our documents, which we can do by running the following command:

$ curl "http://localhost:8983/solr/autocomplete/update?commit=true" -H 'Content-type:application/json' -d '[
  {"id":1, "name":"Pośrednictwo nieruchomości"},
  {"id":2, "name":"Posadowienie budynków"},
  {"id":3, "name":"Posocznica"}
]'

So now, let’s run two queries that should tell us if the above method really works. The first query looks as follows:

http://localhost:8983/solr/autocomplete/select?q.op=AND&defType=edismax&qf=name_ac&fl=id,name&q=pos

The second query looks as follows:

http://localhost:8983/solr/autocomplete/select?q.op=AND&defType=edismax&qf=name_ac&fl=id,name&q=poś

In both cases we are running a query using the Extended DisMax query parser, we are using AND as the Boolean operator (q.op parameter) and we are saying that we want to run the query on the name_ac field by using the qf parameter. We also say that we only want the id and name field to be returned in the search results by specifying the fl parameter.
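When the query is sent from application code rather than typed into a browser, the special characters have to be URL-encoded. The following Python sketch builds the request URL with the parameters described above; the helper name autocomplete_url is hypothetical, and the host and collection name are simply the ones used in this post.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/autocomplete/select"

def autocomplete_url(user_input):
    """Build the select URL for the text the user has typed so far."""
    params = {
        "q": user_input,       # what the user typed, e.g. "pos" or "poś"
        "defType": "edismax",  # Extended DisMax query parser
        "qf": "name_ac",       # run the query on the autocomplete field
        "fl": "id,name",       # return only the fields needed for display
        "q.op": "AND",         # every typed word has to match
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    return SOLR_SELECT + "?" + urlencode(params)

if __name__ == "__main__":
    print(autocomplete_url("pos"))
    print(autocomplete_url("poś"))
```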

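The results returned by Solr are identical in both cases and look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="q">poś</str>
    <str name="defType">edismax</str>
    <str name="qf">name_ac</str>
    <str name="fl">id,name</str>
    <str name="q.op">AND</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0">
  <doc>
    <int name="id">1</int>
    <str name="name">Pośrednictwo nieruchomości</str></doc>
  <doc>
    <int name="id">2</int>
    <str name="name">Posadowienie budynków</str></doc>
  <doc>
    <int name="id">3</int>
    <str name="name">Posocznica</str></doc>
</result>
</response>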

So the method works.

Autocomplete on multivalued fields using faceting

Rafał Kuć — Mon, 25 Mar 2013 12:54:50 +0000

In the previous blog post about auto complete on multi-valued field we discussed how highlighting can help us get the information we are interested in. We also promised that we will get back to the topic and we will show how to achieve a similar functionality with the use of Solr faceting capabilities. So, let’s do it.

Before we start

Because this post is more or less a continuation of what we wrote earlier about autocomplete on multi-valued fields, we recommend reading “Autocomplete on multivalued field using highlighting” before reading the rest of this entry. We would also like to note that the method shown in this entry is very similar to the one shown in the “Solr and autocomplete (part 1)” post, but we wanted to refresh that topic and show the example using multi-valued fields.

Configuration

Similar to the previous post we will start with Solr configuration.

Index structure

The structure of our index is exactly the same as the one previously shown, but let’s recall it. One thing – please remember that we want to have auto complete working on multi-valued field. This field is called features and the whole index fields configuration looks like this:

For getting values for auto complete we will use the features_autocomplete field.

Copy field

Of course we don’t want to change our indexer and we want Solr to automatically copy the data from features field to the features_autocomplete one. Because of that we will add the copyField definition to the schema.xml file, so it looks like this:

Our text_autocomplete field type

And we’ve come to the first difference – the text_autocomplete field type. This time it looks like this:

Because of the fact that we will use faceting we use the solr.KeywordTokenizerFactory with the solr.LowerCaseFilterFactory to have the data in our field as a single, lowercased token.

Example data

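Our example data is identical to what we had before, but let’s recall it to keep things clear:

<add>
 <doc>
  <field name="id">1</field>
  <field name="features">Multiple windows</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">2</field>
  <field name="features">Single window</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">3</field>
  <field name="features">Multiple windows</field>
  <field name="features">Multiple doors</field>
 </doc>
</add>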

Query with faceting

Let’s see what our query looks like when we use faceting instead of highlighting.

Full query

When using faceting our query should look more or less like the following one:

q=*:*&rows=0&facet=true&facet.field=features_autocomplete&facet.prefix=sing

A few words about the parameters:

rows=0 – we tell Solr that we don’t want the documents that matched the query in the results,
facet=true – we inform Solr that we want to use faceting,
facet.field=features_autocomplete – we say which field will be used to calculate faceting,
facet.prefix=sing – with the use of this parameter we provide the text typed by the user, for which Solr should return autocomplete suggestions.

Query results

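Query results returned by Solr for the above query are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="facet">true</str>
    <str name="q">*:*</str>
    <str name="facet.prefix">sing</str>
    <str name="facet.field">features_autocomplete</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="features_autocomplete">
      <int name="single door">2</int>
      <int name="single window">1</int>
    </lst>
  </lst>
</lst>
</response>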

As you can see in the field faceting section we got the phrases we were interested in along with the number of documents they appear in.

What to remember

The crucial thing to remember is that the value provided to the facet.prefix parameter is not analyzed. Because of that, if we provided the value Sing instead of sing, we wouldn’t get any results, since the indexed terms are lowercased.
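To make both the counting and the case sensitivity concrete, here is a small Python sketch that imitates prefix faceting over the example data. It is an illustration only, not how Solr implements faceting: each value is turned into a single lowercased term (as the keyword tokenizer with the lowercase filter would do), and the prefix is compared exactly as typed. The function name is made up for this example.

```python
from collections import Counter

DOCS = [
    {"id": 1, "features": ["Multiple windows", "Single door"]},
    {"id": 2, "features": ["Single window", "Single door"]},
    {"id": 3, "features": ["Multiple windows", "Multiple doors"]},
]

def facet_prefix_counts(docs, prefix):
    """Count, per indexed term, the documents whose term starts with the prefix."""
    counts = Counter()
    for doc in docs:
        # One lowercased term per value; a set so a document is counted once per term.
        terms = {value.lower() for value in doc["features"]}
        for term in terms:
            # The prefix is used as typed: no lowercasing, no analysis.
            if term.startswith(prefix):
                counts[term] += 1
    return counts.most_common()

if __name__ == "__main__":
    print(facet_prefix_counts(DOCS, "sing"))
    print(facet_prefix_counts(DOCS, "Sing"))
```

A practical consequence is that the application should lowercase the user input itself before putting it into facet.prefix.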

A short summary

This entry showed the second method of developing autocomplete functionality on multi-valued fields. Of course we didn’t say everything about the topic and we will get back to it someday, but for now that is all. We hope that someone will find it useful.

Autocomplete on multivalued fields using highlighting

Rafał Kuć — Mon, 25 Feb 2013 12:50:33 +0000

One of the recent topics I came across was an autocomplete feature based on Solr multi-valued fields (for example, this question was asked on Stack Overflow). Let’s look at what possibilities we have.

Multiple cores vs single core

One of the possibilities we should consider at the beginning is whether we can use a dedicated core or collection for autocomplete. If we can, we should go that way. There are multiple reasons in favor of such an approach: for example, such a collection will be smaller than the one with the data that needs to be searchable, and the term count should be smaller, so your queries will be faster. Of course we have to take care of the additional configuration and indexing, but that’s not too much of a problem, right? In this entry we will look at the situations where having a separate core is not an option, for example because of filtering that needs to be done.

Please also note, that in this entry we assume that we want whole phrases to be shown for the user.

Configuration

Let’s start from the configuration.

Index structure

Let’s assume that we want to suggest phrases from the multi valued fields. Let’s call that field features. Configuration of all the fields in the index is as follows:

As you can see, for the auto complete feature, we will use the field named features_autocomplete. The _version_ field is needed by some of the Solr 4.0 (and newer) features and because of that it is present in our index.

Field values copying

In addition to the above configuration we also want to copy the data from the features field to the features_autocomplete one. In order to do that we will use Solr copy field feature. To do that, we add the following section to the schema.xml file:

Field type – text_autocomplete

Let’s have a look at the last thing we have when it comes to configuration – the definition of the text_autocomplete type:

As you can see, during indexing, Solr will create n-grams from the phrase indexed in the features_autocomplete field. It will start from the minimum length of 2, ending on the maximum length of 50.

During querying we will only lowercase our query phrase, nothing else is needed in our case.

Sample data

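Our sample data looks like this:

<add>
 <doc>
  <field name="id">1</field>
  <field name="features">Multiple windows</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">2</field>
  <field name="features">Single window</field>
  <field name="features">Single door</field>
 </doc>
 <doc>
  <field name="id">3</field>
  <field name="features">Multiple windows</field>
  <field name="features">Multiple doors</field>
 </doc>
</add>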

Initial query

Let’s look at the queries now.

In the beginning

Let’s start with a simple query that would return the data we need if we used single-valued fields. The query looks as follows:

q=features_autocomplete:sing&fl=features_autocomplete

Query results

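The results we would get from such a query, for our example data, should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">3</int>
  <lst name="params">
   <str name="fl">features_autocomplete</str>
   <str name="q">features_autocomplete:sing</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
  <doc>
   <arr name="features_autocomplete">
    <str>Single window</str>
    <str>Single door</str>
   </arr>
  </doc>
  <doc>
   <arr name="features_autocomplete">
    <str>Multiple windows</str>
    <str>Single door</str>
   </arr>
  </doc>
 </result>
</response>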

A short comment

As we can see, the results are not satisfactory, because in addition to the value we are querying for, we got all the values that are stored in the multi-valued field. We would like to get only the one that we queried for. Is this possible? Yes it is, with a little trick. Let’s modify our query to use highlighting.

Query with highlighting

So now, we will make use of Apache Solr highlighting module.

Changed query

What we will do is add the following part to our previous query:

hl=true&hl.fl=features_autocomplete&hl.simple.pre=&hl.simple.post=

So the whole query looks like this:

q=features_autocomplete:sing&fl=features_autocomplete&hl=true&hl.fl=features_autocomplete&hl.simple.pre=&hl.simple.post=

A few words about the parameters that were used:

hl=true – we inform Solr that we want to use highlighting,
hl.fl=features_autocomplete – we tell Solr which field should be used for highlighting,
hl.simple.pre= – setting the hl.simple.pre to empty value tells Solr that we don’t want to mark the beginning of the highlighted fragment,
hl.simple.post= – setting the hl.simple.post to empty value tells Solr that we don’t want to mark the end of the highlighted fragment.

Modified query results

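After querying Solr with the modified query, the following results were returned:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <lst name="params">
   <str name="fl">features_autocomplete</str>
   <str name="q">features_autocomplete:sing</str>
   <str name="hl.simple.pre"/>
   <str name="hl.simple.post"/>
   <str name="hl.fl">features_autocomplete</str>
   <str name="hl">true</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
  <doc>
   <arr name="features_autocomplete">
    <str>Single window</str>
    <str>Single door</str>
   </arr>
  </doc>
  <doc>
   <arr name="features_autocomplete">
    <str>Multiple windows</str>
    <str>Single door</str>
   </arr>
  </doc>
 </result>
 <lst name="highlighting">
  <lst name="2">
   <arr name="features_autocomplete">
    <str>Single window</str>
   </arr>
  </lst>
  <lst name="1">
   <arr name="features_autocomplete">
    <str>Single door</str>
   </arr>
  </lst>
 </lst>
</response>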

As you can see, the section responsible for highlighting brings the information that we are interested in.
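On the application side, the suggestions can then be read from the highlighting section instead of from the documents. Below is a minimal Python sketch of that step. It assumes the response was requested as JSON (wt=json) and already parsed into a dictionary; the SAMPLE structure is hand-written to mirror the results above, with the document identifiers taken from the sample data, and the function name is hypothetical.

```python
FIELD = "features_autocomplete"

# Hand-written stand-in for a parsed JSON response (an assumption, not real output).
SAMPLE = {
    "response": {
        "numFound": 2,
        "docs": [
            {FIELD: ["Single window", "Single door"]},
            {FIELD: ["Multiple windows", "Single door"]},
        ],
    },
    "highlighting": {
        "2": {FIELD: ["Single window"]},
        "1": {FIELD: ["Single door"]},
    },
}

def suggestions_from_highlighting(response, field=FIELD):
    """Collect unique highlighted fragments, keeping the order Solr returned them in."""
    suggestions = []
    for per_document in response.get("highlighting", {}).values():
        for fragment in per_document.get(field, []):
            if fragment not in suggestions:
                suggestions.append(fragment)
    return suggestions

if __name__ == "__main__":
    print(suggestions_from_highlighting(SAMPLE))
```

Deduplication matters here because the same phrase can be highlighted in several documents, while the user should see it only once.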

Summary

Of course we need to remember that the approach proposed in this entry is not the only way to have a working autocomplete feature with data in multi-valued fields. In the next entry on this topic we will show how we can use faceting to get the same results, provided we can accept some small drawbacks.

Autocomplete, part 4 (Ngram and faceting)

Marek Rogoziński — Mon, 28 May 2012 21:47:24 +0000

In the previous parts of the autocomplete series we presented two methods of running autocomplete queries. Then we extended one of them with the ability to define the returned information. In today’s entry we are back to autocomplete with faceting and n-grams.

Requirements

Our autocomplete mechanism has the following requirements:

We return the whole phrase, not just a single word
Returned phrase can be present multiple times in the index
We want to know the number of results for the returned phrase
Common phrases should be shown higher than the less common ones
Order of words entered by the user doesn’t matter

Solution

The solution given in the first part of the series will not meet these requirements, because of the first requirement. Of course we could change the analysis type, but we still wouldn’t return the whole phrase.

The solution to the above requirements is a modified faceting method. Instead of searching all the elements and narrowing the results with the facet.prefix parameter, we can search only for those elements that contain the word fragment we are looking for. We don’t want a wildcard query to be used (because of performance), so we call n-grams to the rescue. This means we need to write the n-grams into the index (of course Solr will do that for us). The obvious flaw is the growth of the index size, but in this case we can live with that.

Schema.xml

We define an additional type:

We also define additional fields: one whose value we plan to return and one which will be used for searching:

And one copyField to make things easier:

Query

After indexing we are ready to test our queries:

We are narrowing results, only to those which have the interesting word fragment in the tag_autocomplete field, with: q=tag_autocomplete:(PHRASE)
We need all the fragments entered by the user, so we use AND as our logical operator: q.op=AND
We are not interested in the actual query results, we will use data returned by faceting, so we say: rows=0
We need faceting: facet=true
We need faceting on the field where we store the original phrase: facet.field=tag
We are not interested in empty tags: facet.mincount=1
We are only interested in 5 autocomplete values: facet.limit=5

And the final query:

?q=tag_autocomplete:(PHRASE)&q.op=AND&rows=0&facet=true&facet.field=tag&facet.mincount=1&facet.limit=5

If we configure our search handler to include all the constant parameters, we will have the following query:

?q=tag_autocomplete:(PHRASE)
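The idea of searching on one field and faceting on another can be sketched in a few lines of Python. This is only an illustration of the logic, not of Solr internals. It assumes that tag_autocomplete holds edge n-grams of each lowercased word of the tag (the exact n-gram settings are not shown here), and the TAGS list is made-up data in which each entry stands for the tag of one document.

```python
from collections import Counter

# Hypothetical data: one tag per document.
TAGS = ["hard disk", "hard disk", "hard drive", "disk cleaner", "keyboard"]

def edge_ngrams(word):
    """All prefixes of a word, from one letter up to the whole word."""
    return {word[:n] for n in range(1, len(word) + 1)}

def searchable_terms(tag):
    """What the tag_autocomplete field is assumed to contain for a tag."""
    terms = set()
    for word in tag.lower().split():
        terms |= edge_ngrams(word)
    return terms

def autocomplete(tags, user_input, limit=5):
    """Match all typed fragments (q.op=AND), then count whole tags (facet.field=tag)."""
    fragments = user_input.lower().split()
    matching = [
        tag for tag in tags
        if all(fragment in searchable_terms(tag) for fragment in fragments)
    ]
    # Only matching tags are counted, so every count is at least 1 (facet.mincount=1);
    # most_common keeps the most frequent phrases first (facet.limit=5).
    return Counter(matching).most_common(limit)

if __name__ == "__main__":
    print(autocomplete(TAGS, "di"))
    print(autocomplete(TAGS, "dis ha"))
```

Note that the order of the typed fragments does not matter, and the whole phrase comes back together with the number of documents it appears in.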

At the end

The basic virtue of the presented method is the ability to use one field for searching and other for returning results. Because of that, we were able to return the whole phrase instead of a single word.

Solr and autocomplete (part 3)

Rafał Kuć — Mon, 29 Nov 2010 22:38:55 +0000

In the previous parts (part 1, part 2) of the cycle, we learned how to configure and query Solr to get the autocomplete functionality. In today’s entry I will show you how to add a dictionary to the Suggester, and thus have an impact on the generated suggestions.

Component configuration

To configure the component presented in the previous part of the cycle add the following parameter:

<str name="sourceLocation">dict.txt</str>

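Thus our configuration should look like this:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
 <lst name="spellchecker">
  <str name="name">suggest</str>
  <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
  <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
  <str name="field">name_autocomplete</str>
  <str name="sourceLocation">dict.txt</str>
 </lst>
</searchComponent>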

With the parameter we informed the component to use the dictionary named dict.txt which should be placed in the Solr configuration directory.

Handler configuration

The handler configuration also gets one additional parameter which is:

<str name="spellcheck.onlyMorePopular">true</str>

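So our configuration should be as follows:

<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
 <lst name="defaults">
  <str name="spellcheck">true</str>
  <str name="spellcheck.dictionary">suggest</str>
  <str name="spellcheck.count">10</str>
  <str name="spellcheck.onlyMorePopular">true</str>
 </lst>
 <arr name="components">
  <str>suggest</str>
 </arr>
</requestHandler>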

This parameter tells Solr to return only those suggestions for which the number of results is greater than the number of results for the current query.

Dictionary

We told Solr to use the dictionary, but what should this dictionary look like? For the purpose of this post I defined the following dictionary:

# sample dict
Hard disk hitachi
Hard disk wd    2.0
Hard disk jjdd    3.0

How is the dictionary constructed? Each phrase (or single word) is placed on a separate line. A line can end with the weight of the phrase, separated from the phrase by a TAB character. The weight is used together with the spellcheck.onlyMorePopular=true parameter (the higher the weight, the higher the suggestion will appear). The default weight value is 1.0. The dictionary should be saved in UTF-8 encoding. Lines beginning with the # character are skipped.
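The format rules above are simple enough to check with a short script before the file is handed to Solr. The following Python sketch parses a dictionary according to those rules; it is a helper written for this illustration, not the parser Solr itself uses, and the function name is hypothetical.

```python
def parse_dictionary(text, default_weight=1.0):
    """Return (phrase, weight) pairs from the contents of a dictionary file."""
    entries = []
    for line in text.splitlines():
        # Skip comment lines and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        # The weight, if present, follows the phrase after a TAB character.
        phrase, _, weight = line.partition("\t")
        value = float(weight) if weight.strip() else default_weight
        entries.append((phrase.strip(), value))
    return entries

SAMPLE_DICT = (
    "# sample dict\n"
    "Hard disk hitachi\n"
    "Hard disk wd\t2.0\n"
    "Hard disk jjdd\t3.0\n"
)

if __name__ == "__main__":
    for phrase, weight in parse_dictionary(SAMPLE_DICT):
        print(phrase, weight)
```

Running such a check is a cheap way to catch a line where the separator is a run of spaces instead of a TAB, because the weight would then end up inside the phrase.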

Data

In this case we don’t need data – we will only use the defined dictionary.

Let’s check how it works

To check how our mechanism behaves, I sent the following query to Solr, of course after rebuilding the Suggester index:

/suggest?q=Har

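As a result we get the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="Har">
      <int name="numFound">3</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>Hard disk jjdd</str>
        <str>Hard disk hitachi</str>
        <str>Hard disk wd</str>
      </arr>
    </lst>
  </lst>
</lst>
</response>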

A few words at the end

As you can see, the suggestions are sorted on the basis of weight, as expected. It is worth noting that the query was passed with a capital letter, which is also important: the lowercased query will return an empty suggestion list.

What can be said about this method? If you have very good dictionaries, with weights generated from something like customer behavior, this is the method for you and your customers will love it. I would not recommend it if you don’t have good dictionaries, because there is a very high chance that your suggestions will be of poor quality.

What will be next?

The number of tasks this week didn’t let me finish the performance tests and that’s why, in the next part of the cycle, I’ll try to show you how each method behaves with various index structure and size.

Solr and autocomplete (part 1)

Rafał Kuć — Mon, 18 Oct 2010 12:16:39 +0000

Almost everyone has seen what the autocomplete feature looks like. No wonder, then, that Solr provides mechanisms with which we can build such functionality. In today’s entry I will show you how you can add an autocomplete mechanism using faceting.

Index

Suppose you want to show some hints to the user in an on-line store, for example product names. Suppose that our index is composed of the following fields:

A text type is defined as follows:

Configuration

To start, consider what you want to achieve – do we want to suggest only individual words that make up a name, or maybe full names that begin with the letters specified by the user. Depending on our choices we have to prepare the appropriate field on which we will build hints.

Prompting individual words that make up the name

In the case of single words, we should use a field that is tokenized. In our case, the field named name will be sufficient. However, note that if your field uses, for example, stemming, you should define another type which does not use it, because stemming changes the terms stored in the field and the hints would show stems instead of the original words.

Prompting full name

To suggest full product names we need a different field configuration: the best choice is an untokenized field. But we cannot use a string-based field for that. For this purpose, we define the field as follows:

This type is defined as follows:

To not modify the format of the data we also add the appropriate definition of the copy information:

How do we use it?

To use the data we prepared we use a fairly simple query:

q=*:*&facet=true&facet.field=FIELD&facet.mincount=1&facet.prefix=USER_QUERY

Where:

FIELD – field on the basis of which we intend to make suggestions. In our case the field named name_auto.
USER_QUERY – letters entered by the user.

It is worth noting that the rows=0 parameter can be added, as in the example below, to show only the faceting result without the query results. Of course, this is not a necessity.

An example query would look like that:

fl=id,name&rows=0&q=*:*&facet=true&facet.field=name_auto&facet.mincount=1&facet.prefix=har

The result of this query might look like this:

Additional features

It is worth mentioning the additional opportunities that come with this method.

The first possibility is to show the user additional information such as number of results that you get when you select an appropriate hint. If you want to show such information it will certainly be an interesting option.

The next thing is sorting with the use of the facet.sort parameter. Depending on our needs, we can sort the results by the number of documents (the default behavior, facet.sort=count, historically also written as true) or in index order, which is effectively alphabetical (facet.sort=index, historically written as false).

We may limit the suggestions to those which have more results than a specified number. To take advantage of this opportunity pass in a parameter facet.mincount with the appropriate number.

And as for me the biggest advantage of this method is the possibility of getting only those suggestions that not only match the letters that the user typed but also some other parameters, like category for example. For example, we want to show hints for the user who is in the household section of our store. We suspect that at this moment the user will not be interested in DVD-type products, and therefore we add a parameter fq=department:homeAppliances (assuming that we have such a department). After such a modified query, you do not get hints generated from the entire index, we only get those narrowed to the selected department.
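Putting the pieces of this method together, the request parameters can be assembled in application code. The Python sketch below does that, including the optional filter; the function name is hypothetical, and name_auto and department:homeAppliances are the example names used in this post.

```python
from urllib.parse import urlencode

def facet_autocomplete_query(user_query, field="name_auto", filter_query=None, min_count=1):
    """Build the query string for facet-based autocomplete."""
    params = [
        ("q", "*:*"),                         # match everything, the hints come from facets
        ("rows", "0"),                        # no documents in the response
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", str(min_count)),   # hide hints with fewer results
        ("facet.prefix", user_query),         # letters typed by the user, sent as typed
    ]
    if filter_query:
        # Narrow the hints to a part of the index, e.g. one department.
        params.append(("fq", filter_query))
    return urlencode(params)

if __name__ == "__main__":
    print(facet_autocomplete_query("har"))
    print(facet_autocomplete_query("har", filter_query="department:homeAppliances"))
```

Keeping the filter optional lets the same helper serve both the global search box and a search box shown inside a single department.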

A few words at the end

Like other methods, this one has its advantages and disadvantages. The advantages of this solution are its ease of use, no requirement for additional components, and the fact that the resulting hints can easily be narrowed so that they are generated only from those documents that match the query entered by the user. A big plus is that the method includes the number of results that will be shown after selecting the hint (of course with the same search parameters). The downsides are the need for additional types and fields, quite limited abilities, and the load caused by the use of the faceting mechanism.

The next entry about autocomplete will try to expand on this and show further methods of generating hints using Solr.