“Car sale application” – Unicode Collation, sorting text in a language-sensitive way (part 4)

In the third part of our “Car sale” application series we added location data, including the city associated with every car. Shortly afterwards we made it possible to sort by the city field by simply modifying the schema:
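As a rough reminder of what such a change might look like, here is a minimal sketch. It assumes the city is kept in a field named `city` and that `city_sort` uses the `lowercase` type from the stock Solr example schema; the attribute values are illustrative and not taken from the actual application schema.

```xml
<!-- Assumed sort field: the whole city name is kept as a single lowercased token -->
<field name="city_sort" type="lowercase" indexed="true" stored="false"/>

<!-- Assumed copy rule: fill the sort field from the (hypothetical) "city" field -->
<copyField source="city" dest="city_sort"/>
```

A field defined this way is ordered by the raw character values of its single token, so letters outside the basic Latin alphabet can end up in unexpected positions.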

It turned out that sorting on the city_sort field did not work as we expected, all because of the Polish characters appearing in the city names. What should we do about it?

Requirements specification

Let’s check whether sorting on the “city_sort” field really fails when Polish characters are involved. When we enter the query:

we get the following result:

That’s really not what we expect. We would like to have:

To make the sorting work correctly, we will use the solr.CollationKeyFilterFactory filter.

solr.CollationKeyFilter

The solr.CollationKeyFilterFactory filter is applied at index time and writes special “sort keys” into the sort field. It lets us choose a collator for the desired language and country. We can also choose the collation strength, which determines the minimum level of difference considered significant during comparison. For example:
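A configuration of this kind might look like the sketch below. The field type name `collatedSpanish` is hypothetical, and the `language` and `strength` values are chosen to match a Spanish collator with primary strength; treat it as an illustration rather than the exact listing.

```xml
<!-- Hypothetical field type name, used only for this illustration -->
<fieldType name="collatedSpanish" class="solr.TextField">
  <analyzer>
    <!-- Keep the whole value as one token so it can serve as a sort key -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Spanish collator, primary strength: case and accent differences are ignored -->
    <filter class="solr.CollationKeyFilterFactory" language="es" strength="primary"/>
  </analyzer>
</fieldType>
```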

The given example shows the configuration of the solr.CollationKeyFilterFactory, where we want to handle the Spanish language with primary strength.

Schema.xml changes

  1. New field type definition:

      As we can see, it is the definition of the existing “lowercase” type with the solr.CollationKeyFilterFactory filter added and configured for the Polish language. The type will be used for fields whose data contains Polish characters.

  2. New “city_sort” field definition:
    • let’s change the type of the “city_sort” field to “polishLowercase”:
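Both schema changes can be sketched together as follows. This is a minimal sketch: it assumes the existing “lowercase” type matches the one in the stock Solr example schema (keyword tokenizer plus lowercase filter), and the `country` and `strength` values, like the field attributes, are assumptions rather than the application's confirmed settings.

```xml
<!-- The stock "lowercase" analysis chain plus a Polish collation filter -->
<fieldType name="polishLowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language/country/strength values are assumptions for this sketch -->
    <filter class="solr.CollationKeyFilterFactory" language="pl" country="PL" strength="primary"/>
  </analyzer>
</fieldType>

<!-- city_sort switched from "lowercase" to the new type -->
<field name="city_sort" type="polishLowercase" indexed="true" stored="false"/>
```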

Functional tests

Before we check whether the field type change gives us what we need, we must remember that the collation filter is applied at index time, so we need to re-index all of the data.

Now let’s check our test query result:

It appears that the result is correct:

The end

Yet another reported problem has been solved. We improved the sorting mechanism for values containing Polish characters by adding the solr.CollationKeyFilterFactory filter, which fully met our needs. Now we can only wait for further reports and improvement requests 🙂
