So far, in previous articles, we looked at importing data from SQL databases. Today it’s time to import from XML files.
Indexing files like doc, pdf – Solr and Tika integration
In the previous article we gave basic information about how to enable the indexing of binary files, i.e. MS Word, PDF or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since a new version of the Solr server (3.1) was released a few days ago, the following guidelines are based on that version. For the purpose of the article I used the “example” application; all of the changes relate to this application.
Data Import Handler – removing data from index
The Solr wiki treats deleting data from an index with DIH incremental indexing only marginally, as something that works much like updating records. I took the same shortcut in a previous article, all the more so because my example indexed Wikipedia data, which does not need deleting.
Having sample data of albums and performers at hand, I decided to show my way of dealing with such cases. For simplicity and clarity, I assume that after the first import the data can only shrink.
Data Import Handler – sharding
One of our readers (greetings!) reported a problem with getting DIH to work together with the sharding mechanism. The Solr project wiki does, in my opinion, discuss the solution to this issue, but only in a roundabout way and in passing.