When indexing so-called “rich documents”, we should think about where we want those documents to be processed. Should we send them to Apache Solr (or another search engine, like ElasticSearch) and forget about them, or should we run Apache Tika ourselves and send the extracted content, along with other information, for indexing?
As I wrote a few lines above, we have two options. The first is to send the binaries to the search engine and, in the case of Solr, use the ExtractingRequestHandler (information about integrating Solr with Apache Tika can be found here), so that it does all the work for us. The second is to use the same (or almost the same) functionality to parse the binary documents and get their contents before sending them to Solr. Of course there is a third option, not possible in most cases: get the documents you want to index in a format Solr already understands.
Processing on the Search Server Side
The simplest approach is to process your “rich documents” on the search server side. Let's assume it's Apache Solr. We configure the ExtractingRequestHandler to work the way we want and forget about everything else. But it's not the right approach every time. Imagine a situation where your indexing server is almost 100% utilized. If you added another source of load, such as document parsing, you would probably suffer from performance problems. In such cases you will probably want to do it the other way.
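To make the server-side variant a bit more concrete, here is a minimal sketch of what the client has to do: build a POST request with the raw file bytes and let Solr do the parsing. It assumes the ExtractingRequestHandler is mapped to /update/extract, as in the Solr example configuration. The base URL, the core name "docs", the document id and the helper name build_extract_request are made up for illustration. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, core, doc_id, payload,
                          content_type="application/pdf"):
    """Build (but do not send) a POST request carrying a raw binary document.

    Assumes the ExtractingRequestHandler is mapped to /update/extract.
    `solr_base` and `core` are hypothetical values for this sketch.
    """
    # literal.id sets the unique key of the resulting document;
    # commit=true asks Solr to commit right after indexing.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base}/{core}/update/extract?{params}"
    return Request(url, data=payload,
                   headers={"Content-Type": content_type},
                   method="POST")


# Hypothetical usage: the bytes would normally be read from a file.
req = build_extract_request("http://localhost:8983/solr", "docs",
                            "report-1", b"%PDF-1.4 ...")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it. All the parsing cost then lands on the Solr side, which is exactly the load discussed above.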
Processing Outside of the Search Server
If the number of rich documents is huge, or your indexing server is almost completely utilized, then it may be a good idea to process your binary files before sending them to your indexing server. Using Apache Tika, for example, we can build (quite easily) a good and reliable solution for processing rich documents in our own application. Of course, such an approach requires a bit of knowledge of Java (or any other language you use for content extraction). In return, it can save us from a situation where the indexing server is overloaded and, because of the amount of data, we can't do anything about it.
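A rough sketch of the application-side flow could look like the code below: extract the text first, then send Solr only plain fields in a JSON update body. The extraction step is passed in as a function. In a real application it would call Apache Tika, but here fake_extract is only a stand-in, so the example runs without any extra libraries. The field names "id", "content" and "title" and the helper name build_update_body are assumptions about a hypothetical schema, not something Solr prescribes.

```python
import json


def build_update_body(doc_id, raw, extract, extra_fields=None):
    """Extract text on the application side and build a JSON update body.

    `extract` is any callable turning raw bytes into text; a real one
    would delegate to Apache Tika. Field names are hypothetical.
    """
    doc = {"id": doc_id, "content": extract(raw)}
    doc.update(extra_fields or {})
    # A JSON array of documents, as accepted by Solr's JSON update handler.
    return json.dumps([doc]).encode("utf-8")


def fake_extract(raw):
    """Stand-in for a real extractor: treats the bytes as UTF-8 text."""
    return raw.decode("utf-8", errors="replace").strip()


body = build_update_body("report-1", b"  Quarterly results  ",
                         fake_extract, {"title": "Q3 report"})
print(body.decode("utf-8"))
```

With this split, the search server only indexes ready-made text, and the heavy parsing can run on whatever machine has spare capacity.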
A Few Words at the End
Once every few weeks we will publish posts that don't cover a particular Apache Solr feature, but instead discuss a general search problem or describe the architecture of a system that has search as one of its parts. We hope that such posts will allow us, and you, to look at search topics more broadly than from the Apache Solr point of view alone.