Document language identification

One of the new features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today's entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.

At the beginning

You should remember that the described functionality was introduced in Solr 3.5.


We will be using two fields to identify the document language: title and body. We want to store the detected language in the lang field.

Index structure

The structure of our index is of course simplified and contains only the fields needed for the test. So the field definition part of the schema.xml file looks like this:
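A minimal sketch of those field definitions; the text_ws type for body comes up in the comments below, the remaining types are assumptions:

```xml
<fields>
  <!-- unique document identifier -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <!-- fields whose content will be used for language detection -->
  <field name="title" type="text_ws" indexed="true" stored="true"/>
  <field name="body" type="text_ws" indexed="true" stored="true"/>
  <!-- the detected language will be written here -->
  <field name="lang" type="string" indexed="true" stored="true"/>
</fields>
```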

All the fields are marked as stored="true" for simplicity.

Update request processor configuration

In order to be able to use the language identification feature we need to configure a Solr update request processor. We will be using the one based on Apache Tika (there is a second implementation, LangDetectLanguageIdentifierUpdateProcessorFactory, based on the language detection library). In order to configure the processor we add the following to the solrconfig.xml file:
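A sketch of the chain definition; the chain name langid matches the update.chain parameter used later, and langid.fl / langid.langField point at our fields:

```xml
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <!-- fields whose content is examined during detection -->
    <str name="langid.fl">title,body</str>
    <!-- field that will hold the detected language code -->
    <str name="langid.langField">lang</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```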

Other parameters of the TikaLanguageIdentifierUpdateProcessorFactory are described on the Apache Solr wiki pages.

Additional libraries

In order for the update request processor to work we need some additional libraries. From the dist directory of the Apache Solr distribution we copy apache-solr-langid-3.5.0.jar to a directory we will call tikaDir (for example), which we create at the same level as the webapps directory. Then we add the following line to the solrconfig.xml file:
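For example (the relative path is an assumption and depends on where your Solr home is located):

```xml
<lib path="../../tikaDir/apache-solr-langid-3.5.0.jar" />
```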

The next library we will need is the Tika jar with all the goodies (tika-app-1.0.jar), which can be downloaded from the Apache Tika site. We place it in the same tikaDir directory and then we add the following entry to the solrconfig.xml file:
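Again, the relative path is an assumption:

```xml
<lib path="../../tikaDir/tika-app-1.0.jar" />
```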

Test documents

For testing purposes I decided to prepare three documents: the first in English, the second in Polish, and the third in German. Their content was downloaded from Wikipedia. They look as follows:
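The original Wikipedia excerpts are not reproduced here; hypothetical abbreviated documents showing the field layout could look like this:

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr</field>
    <field name="body">Apache Solr is an open source enterprise search platform written in Java.</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">Solr</field>
    <field name="body">Apache Solr to otwarta platforma wyszukiwania napisana w języku Java.</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">Solr</field>
    <field name="body">Apache Solr ist eine freie Suchplattform, die in Java geschrieben ist.</field>
  </doc>
</add>
```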




More testing

To index the data I used the following shell commands:
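A sketch of those commands, assuming the default Solr URL and hypothetical file names doc1.xml, doc2.xml, doc3.xml:

```shell
# send each document through our langid update chain
for f in doc1.xml doc2.xml doc3.xml; do
  curl 'http://localhost:8983/solr/update?update.chain=langid' \
       --data-binary @$f -H 'Content-Type: text/xml'
done
# commit so the documents become visible to searches
curl 'http://localhost:8983/solr/update' --data-binary '<commit/>' -H 'Content-Type: text/xml'
```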

It is worth noticing the additional update.chain=langid parameter added to the request. This parameter tells Solr which update processor chain to use when indexing the data. In the example we told Solr to use the chain we defined above.

Indexed data

So let’s have a look at the indexed data. We will do that by running the following query: q=*:*&indent=true.
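The relevant part of the response looked roughly like this (the document ids are hypothetical; the language codes match the expected results):

```xml
<result name="response" numFound="3" start="0">
  <doc>
    <str name="id">1</str>
    <str name="lang">en</str>
  </doc>
  <doc>
    <str name="id">2</str>
    <str name="lang">pl</str>
  </doc>
  <doc>
    <str name="id">3</str>
    <str name="lang">de</str>
  </doc>
</result>
```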

As you can see, Solr, with the use of Tika, was able to identify the languages of the indexed documents. Of course, let's not be too optimistic: mistakes happen, especially when dealing with multi-language documents, but that's understandable.

To sum up

You should remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the document, the better the detection will work. Of course, one problem is that we can't use language identification at query time, but that is not a limitation of Solr and Tika alone. You can deal with it by identifying your user, their web browser, or their location.

2 thoughts on “Document language identification”

  • 8 May 2012 at 16:42


    You have defined the field “body” with type “text_ws”. How can you apply different language analyzers in this case? Let's assume that we have defined different types like “text_de”, “text_pl”, “text_en” etc. How will the content of “body” be analyzed depending on the language?

    Yuliyan Fasev

  • 9 May 2012 at 22:43

    That’s the main problem. In Solr you would have to define proper fields like body_en, body_de, body_pl and so on. Unlike ElasticSearch, Solr doesn’t let you specify analyzers dynamically.
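    For example, such per-language fields in schema.xml might look like this (the text_en / text_de / text_pl type names are the ones assumed in the question above):

    ```xml
    <field name="body_en" type="text_en" indexed="true" stored="true"/>
    <field name="body_de" type="text_de" indexed="true" stored="true"/>
    <field name="body_pl" type="text_pl" indexed="true" stored="true"/>
    ```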

