language – Solr.pl

Solr 4.0 and Polish language analysis

Rafał Kuć — Mon, 02 Apr 2012 21:43:51 +0000

Polish language analysis has been available in Lucene (and Solr) for some time, so I decided to take a look and compare the available options using the upcoming Lucene and Solr 4.0.

Options

At the time of writing, the following options were available for analyzing Polish:

Use the Stempel library (available since Solr 3.1)
Use Hunspell with Polish dictionaries (available since Solr 3.5)
Use the Morfologik library (will be available in Solr 4.0, SOLR-3272).

Configuration

Let’s look at how to configure all of the above options in Solr (please remember that all of the following configuration examples are based on Solr 4.0).

Stempel

In order to add Polish stemming using the Stempel library, we just need to add the following filter to our type definition:
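A minimal sketch of such a type definition, assuming the filter factory is solr.StempelPolishStemFilterFactory from the analysis-extras module. The type name text_pl and the choice of tokenizer are illustrative assumptions, not taken from the original configuration:

```xml
<!-- Hypothetical field type; only the Stempel filter line is the essential part -->
<fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first, so the stemmer sees normalized input -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Polish stemming with Stempel -->
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>
```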

In addition, you need to add the lucene-analyzers-stempel-4.0.jar and apache-solr-analysis-extras-4.0.jar libraries to SOLR_HOME/lib. It’s also a good idea to place solr.LowerCaseFilterFactory before the Stempel filter.

Hunspell

As with the configuration above, to use Hunspell you need to add a new filter to your type definition, for example in the following way:
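A sketch of such a filter entry, assuming the solr.HunspellStemFilterFactory class. The file names pl_PL.dic and pl_PL.aff are assumptions about how the downloaded dictionary files are named, and they are expected to sit in the core's conf directory:

```xml
<!-- dictionary/affix file names are illustrative; use the names of the files you downloaded -->
<filter class="solr.HunspellStemFilterFactory"
        dictionary="pl_PL.dic"
        affix="pl_PL.aff"
        ignoreCase="true"/>
```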

The dictionary and affix parameters specify the dictionary and affix files we want to use. The ignoreCase parameter, when set to true, tells Hunspell to ignore character case. You can find Hunspell dictionaries at the following URL: http://wiki.services.openoffice.org/wiki/Dictionaries.

Morfologik

As in the two examples above, all you need to change in your schema.xml is to add a new filter, this time in the following way:
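A sketch of the filter entry, assuming the solr.MorfologikFilterFactory class as it appears in the Solr 4.0 analysis-extras module:

```xml
<!-- The dictionary attribute selects one of the dictionary variants described in the text -->
<filter class="solr.MorfologikFilterFactory" dictionary="MORFOLOGIK"/>
```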

The dictionary parameter tells Solr which dictionary you would like to use. You can choose one of the following three:

MORFOLOGIK
MORFEUSZ
COMBINED

In addition, you need to add the following libraries to SOLR_HOME/lib: lucene-analyzers-morfologik-4.0.jar, apache-solr-analysis-extras-4.0.jar, morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar and morfologik-stemming-1.5.2.jar.

Results Comparison

Of course, I wasn’t able to judge the results of analysis from the above three filters on the whole Polish language corpus, so I decided to choose four words to see how each of the filters behaves. Those words are: “urodzić urodzony urodzona urodzeni” (these words are variations of the Polish word for “born”). The results are as follows:

Stempel

The terms I got from Stempel were the following ones:

[urodzić] [urodzo] [urodzona] [urodzeni]

Not all of them are words, but you have to remember that Stempel is a stemmer, and because of that it produces stems which can differ from the actual words or their root forms. What matters is that the words we are interested in are reduced to the same tokens, which is what allows Lucene/Solr to match them. With that in mind, I have to say that the results of analysis using Stempel are not as good as I would like them to be. For example, by searching for the word urodzić you won’t be able to find documents with words like urodzona or urodzeni, because each of the four words produced a different token.

Hunspell

The results of Hunspell analysis were as follows:

[urodzić, urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony, urodzenie]

Comparing the results I got from Hunspell to those Stempel produced, we can see the difference. Our sample query for the word urodzić would find documents with words like urodzony, urodzona and urodzeni, which is quite nice. You can also notice that for three of the words we got more than one term at the same position. The results I got when using Hunspell are OK and I think they should satisfy most users (they do satisfy me), but let’s have a look at the newly introduced filter in Lucene and Solr: Morfologik.

Morfologik

The results of Morfologik analysis were as follows:

[urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony]

Again, if you compare these to the ones produced by Hunspell, you can hardly see a difference (in this particular case, of course). The only difference between Hunspell and Morfologik is in the last word, for which we got different terms. In my opinion, the results achieved with Morfologik are satisfying.

Performance

The performance test was done in a simple manner: for each filter I indexed 5 million documents, where all the text fields used Polish language analysis with the appropriate filter (in addition to some standard filters like stopwords, synonyms and so on). Each indexing run was done on a clean Solr 4.0 instance. Because I was using the Data Import Handler, I sent a commit every 100k documents. The index contained several fields, but the actual index structure was not crucial for the test, as I indexed the same set of documents every time. The test results follow:

[Performance results table missing in the source.]

Warning: At the time of writing, according to the SOLR-3245 JIRA issue, there is a problem with Hunspell performance with Polish dictionaries and Solr 4.0. I’m almost certain that this situation will be resolved by the time Solr 4.0 is released. But right now the performance of Hunspell with Polish dictionaries and Solr 4.0 may not be sufficient.

Short Summary

Despite not having performance results for Hunspell (because I don’t count the ones I have right now as correct), we can see that Hunspell and Morfologik are good candidates for Polish language analysis. Morfologik has similar performance to Stempel, but in my opinion its results are better, and that will make your users happier.

Document language identification

Rafał Kuć — Mon, 23 Jan 2012 20:59:03 +0000

One of the features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today’s entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.

At the beginning

You should remember that the described functionality was introduced in Solr 3.5.

Assumptions

We will be using two fields to identify the document language: title and body. We want to store information about the detected language in the lang field.

Index structure

The structure of our index is of course simplified and contains only the fields needed for the test. So the field definition part of the schema.xml file looks like this:
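A sketch of what those field definitions might look like. Only the field names title, body and lang come from the text; the id field and the type names string and text are assumptions and must match types defined elsewhere in your schema.xml:

```xml
<!-- id and the type names are assumed; title, body and lang are the fields used in this entry -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="body" type="text" indexed="true" stored="true"/>
<!-- lang receives the language code detected during indexing -->
<field name="lang" type="string" indexed="true" stored="true"/>
```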

All the fields are marked as stored="true" for simplicity.

Update request processor configuration

In order to use the language identification feature we need to configure a Solr update request processor. We will use the one based on Apache Tika (there is a second implementation based on http://code.google.com/p/language-detection/). To configure the processor, we add the following to the solrconfig.xml file:


  
    
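<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,body</str>
      <str name="langid.langField">lang</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>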

Other parameters of the TikaLanguageIdentifierUpdateProcessorFactory are described on the Apache Solr wiki at the following URL: http://wiki.apache.org/solr/LanguageDetection.
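As an illustration of those additional parameters, here is a sketch of a processor definition that restricts detection to the three languages used in this entry and sets a fallback. The langid.whitelist and langid.fallback parameters are documented on the wiki page above; the values chosen here are assumptions made for this example only:

```xml
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,body</str>
    <str name="langid.langField">lang</str>
    <!-- Only accept these language codes as detection results (example values) -->
    <str name="langid.whitelist">en,pl,de</str>
    <!-- Language code to store when no accepted language is detected (example value) -->
    <str name="langid.fallback">en</str>
  </lst>
</processor>
```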

Additional libraries

In order for the update request processor to work we need some additional libraries. From the dist directory of the Apache Solr distribution we copy apache-solr-langid-3.5.0.jar to a directory called tikaDir (for example), which we create on the same level as the webapps directory. Then we add the following line to the solrconfig.xml file:
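A sketch of such a line, using the lib directive of solrconfig.xml. The relative path assumes that tikaDir sits one level above the Solr home directory, which is an assumption about the layout; adjust it to your installation:

```xml
<!-- Path is relative to the Solr instance directory; "../tikaDir/" is an assumed location -->
<lib dir="../tikaDir/" regex="apache-solr-langid-\d.*\.jar" />
```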

The next library we need is the Tika jar with all the goodies (tika-app-1.0.jar), which we can download from the following URL: http://tika.apache.org/. We place it in the same tikaDir directory and then add the following entry to the solrconfig.xml file:
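A sketch of that entry, again assuming tikaDir is located one level above the Solr home directory:

```xml
<!-- Loads the Tika application jar from the assumed tikaDir location -->
<lib dir="../tikaDir/" regex="tika-app-1\.0\.jar" />
```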

Test documents

For testing purposes I decided to prepare three documents. The first was in English, the second in Polish and the third in German. Their content was taken from Wikipedia. They look as follows:

tika_en.xml



<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Water</field>
    <field name="body">Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.</field>
  </doc>
</add>

tika_pl.xml



<add>
  <doc>
    <field name="id">2</field>
    <field name="title">Woda</field>
    <field name="body">Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.</field>
  </doc>
</add>

tika_de.xml



<add>
  <doc>
    <field name="id">3</field>
    <field name="title">Wasser</field>
    <field name="body">Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.</field>
  </doc>
</add>

More testing

To index the data I used the following shell commands:

curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_pl.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_en.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_de.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary '<commit/>' -H 'Content-type:application/xml'

It is worth noticing the additional update.chain=langid parameter added to each request. This parameter tells Solr which update request processor chain to use when indexing the data. In the example we told Solr that it should use the chain we defined.
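If you would rather not pass the parameter with every request, the chain can be set as a default on the update handler. A sketch for solrconfig.xml, assuming the XML update handler is registered under /update with the solr.XmlUpdateRequestHandler class, as in the Solr 3.5 example configuration:

```xml
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <!-- Use the langid chain unless a request overrides update.chain -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

With this in place, the curl commands would work without the update.chain parameter in the URL.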

Indexed data

So let’s have a look at the indexed data. We will do that by running the following query: q=*:*&indent=true.




<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">*:*</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0">
  <doc>
    <str name="body">Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.</str>
    <str name="id">2</str>
    <str name="lang">pl</str>
    <str name="title">Woda</str>
  </doc>
  <doc>
    <str name="body">Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.</str>
    <str name="id">1</str>
    <str name="lang">en</str>
    <str name="title">Water</str>
  </doc>
  <doc>
    <str name="body">Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.</str>
    <str name="id">3</str>
    <str name="lang">de</str>
    <str name="title">Wasser</str>
  </doc>
</result>
</response>

As you can see, Solr, with the help of Tika, was able to identify the languages of the indexed documents. Of course, let’s not be too optimistic, because mistakes happen, especially when dealing with multi-language documents, but that’s understandable.

To sum up

You should remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the document, the better the identification will work. Of course, the problem is that we can’t use language identification at query time, but that problem is not specific to Solr and Tika. You can deal with it by using what you know about your user, such as their web browser or their location.