One of the features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today's entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.
tika
Solr and Tika integration (part 1 – basics)
Indexing so-called “rich documents”, i.e. files such as PDF, DOC, RTF, and so on (or binary files), has always required some additional work on the developer's side, at least to extract the contents of the file and prepare them in a format that the search engine, in this case Solr, understands. To minimize this work I decided to look at Apache Tika and at how this library integrates with Solr.
Solr: data indexing for fun and profit
Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains is to share our data with the search server and to make sure that the data can be updated.