One of the functionality of the latest Solr version (3.5) is the ability to identify the language of the document during its indexation. In todays entry we will see how Apache Solr work together with Apache Tika to identify the language of the documents.
Indexing the so-called “rich documents”, ie files like pdf, doc, rtf, and so on (or binary files) always required some additional work on the developer side, at least to get the contents of the file and prepare it in a format understood by the search engines, in this case for Solr. To minimize this job I decided to look at the Apache Tika and integration of this library with Solr.
Solr is not very friendly to novice users. Preparing good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of update ability.