Indexing so-called “rich documents”, i.e. files such as PDF, DOC, or RTF (or binary files in general), has always required additional work on the developer's side, at the very least to extract the contents of the file and prepare them in a format that the search engine, in this case Solr, understands. To minimize this work, I decided to look at Apache Tika and its integration with Solr.
First, a few words about what Apache Tika offers. Apache Tika is a framework designed to extract information from so-called “rich documents”: PDF files, Microsoft Office files, RTF, and more. With Apache Tika we can also extract information from compressed documents, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAVE), and compiled Java bytecode files. In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents. It is worth mentioning that the framework builds on libraries such as PDFBox, Apache POI, and NekoHTML, which contributes to the good quality of the extracted data.
Sample index structure
I'll skip how to run the extraction manually in Apache Tika and focus on integrating the framework with Solr, which turns out to be trivial. Assume that we are interested in the ID, title, and contents of the documents we want to index. We therefore create a simple schema.xml file describing the index structure, which could look like this (the field names tytul and zawartosc are Polish for “title” and “contents”):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="tytul" type="text" indexed="true" stored="true"/>
<field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/>
In the solrconfig.xml file we add the following entry, which defines a handler that indexes documents using Apache Tika:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
All update requests sent to the /update/extract address will be handled by Apache Tika. Remember to send a commit command after sending documents to this handler; otherwise, your documents won't be visible in search results. In a standard Solr deployment you should send the commit command to the handler located under /update.
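To make the request flow more concrete, here is a minimal Python sketch of sending a file to the extraction handler and committing afterwards. It is an illustration under assumptions, not a definitive client: the base URL http://localhost:8983/solr, the helper names, and the fmap.title and fmap.content mappings onto the tytul and zawartosc fields are my own choices, and a deployment with named cores would need the core name in the URL.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of a local Solr instance; adjust to your deployment.
SOLR_URL = "http://localhost:8983/solr"


def build_extract_url(base_url, doc_id, commit=False):
    """Build the URL for posting one file to the /update/extract handler."""
    params = [
        ("literal.id", doc_id),         # set the id field explicitly
        ("fmap.title", "tytul"),        # assumed mapping: extracted title -> tytul
        ("fmap.content", "zawartosc"),  # assumed mapping: extracted text -> zawartosc
    ]
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_file(path, doc_id, base_url=SOLR_URL):
    """POST the raw file bytes as the request body (needs a running Solr)."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(base_url, doc_id),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urlopen(request) as response:
        return response.status


def commit(base_url=SOLR_URL):
    """Send the commit to the regular /update handler (needs a running Solr)."""
    with urlopen(base_url.rstrip("/") + "/update?commit=true") as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; no request is sent.
    print(build_extract_url(SOLR_URL, "doc1"))
```

With a running Solr instance, calling index_file("report.pdf", "doc1") followed by commit() would correspond to the two steps described above; report.pdf is a hypothetical file name.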
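In the configuration we told the extraction handler to map the Last-Modified attribute to the last_modified field and to add the ignored_ prefix to every field that is not defined in the schema. Such fields are effectively ignored, provided the schema defines a matching ignored_* dynamic field that is neither indexed nor stored, as the example Solr schema typically does.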
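The effect of the fmap and uprefix settings can be illustrated with a small Python model. This is only a simplified sketch of the renaming logic, not Solr's actual implementation: the map_fields function is hypothetical, the metadata values are made up, and the schema is assumed to contain a last_modified field in addition to the three fields defined earlier.

```python
def map_fields(extracted, schema_fields, fmap, uprefix=""):
    """Simplified model: rename via fmap first, then prefix unknown names."""
    document = {}
    for name, value in extracted.items():
        target = fmap.get(name, name)  # apply the fmap.<source> rule, if any
        if target not in schema_fields and uprefix:
            target = uprefix + target  # unknown field: add the uprefix
        document[target] = value
    return document


# Made-up metadata, standing in for what Tika might extract from a file.
extracted = {
    "Last-Modified": "2011-04-04T10:00:00Z",
    "Author": "jdoe",
}

# last_modified is assumed to be defined in the schema for this illustration.
schema_fields = {"id", "tytul", "zawartosc", "last_modified"}

result = map_fields(
    extracted,
    schema_fields,
    fmap={"Last-Modified": "last_modified"},
    uprefix="ignored_",
)
print(result)
```

In this model, Last-Modified ends up in last_modified, while Author, which has no mapping and no schema field, becomes ignored_Author.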
If you are going to index large binary files, remember to raise the upload size limit. To do that, change the following values in the solrconfig.xml file (multipartUploadLimitInKB is expressed in kilobytes, so 10240 allows uploads of up to 10 MB):
<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="10240" />
</requestDispatcher>
All parameters defining the ExtractingRequestHandler can be found at: http://wiki.apache.org/solr/ExtractingRequestHandler.