Solr and Tika integration (part 1 – basics)

Indexing so-called “rich documents”, i.e. files such as PDF, DOC, RTF and other binary formats, has always required extra work from the developer, at the very least to extract the contents of the file and convert them into a format the search engine (here, Solr) understands. To reduce that work, I decided to look at Apache Tika and its integration with Solr.

Introduction

First, a few words about what Apache Tika offers. Apache Tika is a framework designed to extract information from so-called “rich documents”, such as PDF files, Microsoft Office files and RTF, but not only those. It can also extract information from compressed documents, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAVE) and compiled Java bytecode files. In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents. The framework builds on libraries such as PDFBox, Apache POI and NekoHTML, which is a large part of why the extracted data is usually of good quality.

Sample index structure

I’ll skip how to run content extraction manually with Apache Tika and focus on integrating the framework with Solr, which turns out to be trivial. Assume that we are interested in the ID, title and contents of the documents we index. We therefore create a simple schema.xml file describing the index structure, which could look like this (tytul and zawartosc are Polish for “title” and “contents”):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="tytul" type="text" indexed="true" stored="true"/>
<field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/>

Configuration

We add the following entry to the solrconfig.xml file; it defines a handler that indexes documents using Apache Tika:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
   <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <str name="uprefix">ignored_</str>
   </lst>
</requestHandler>

All update requests sent to the /update/extract address will be handled by Apache Tika. Remember to send a commit command after sending documents to the extracting handler; otherwise, your documents won’t be visible in search results. In a standard Solr deployment you should send the commit command to the update handler located under /update.

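In this configuration we told the extraction handler to map the Last-Modified attribute to the last_modified field, and to prefix the name of every extracted field that is not defined in the schema with ignored_, so that such fields can be ignored. Note that for this to work, the schema also needs a last_modified field and a dynamic field matching ignored_*, neither of which appears in the minimal schema shown earlier.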
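As a rough illustration of the request flow described above, the following Python sketch builds an /update/extract request and posts a file to it. Several things here are assumptions and not part of the original setup: the Solr address http://localhost:8983/solr, the helper names build_extract_url and index_file, and the decision to map the extracted title and content fields onto the tytul and zawartosc schema fields using the handler's fmap.title and fmap.content parameters. The literal.id parameter supplies the required id field, and commit=true asks Solr to commit as part of the same request instead of sending a separate commit to /update.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Solr instance at this address; adjust to your deployment.
SOLR_URL = "http://localhost:8983/solr"


def build_extract_url(doc_id, solr_url=SOLR_URL, commit=True):
    """Build the /update/extract URL for a single document."""
    params = [
        ("literal.id", doc_id),         # value for the required "id" field
        ("fmap.title", "tytul"),        # extracted title -> schema field "tytul"
        ("fmap.content", "zawartosc"),  # extracted body -> schema field "zawartosc"
    ]
    if commit:
        # Commit in the same request so the document becomes searchable.
        params.append(("commit", "true"))
    return solr_url + "/update/extract?" + urlencode(params)


def index_file(path, doc_id, content_type="application/pdf"):
    """POST the raw file body to the extracting handler and return Solr's reply."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(doc_id),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a running Solr instance, a call such as index_file("manual.pdf", "doc1") (a hypothetical file name) would send the PDF for extraction and indexing. Without the two fmap mappings, the extracted title and content would not land in tytul and zawartosc, which is a common reason for an indexed document not being findable by its contents.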

Additional notes

If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:

<requestDispatcher handleSelect="true">
   <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="10240" />
</requestDispatcher>
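To make the unit of that setting concrete: the limit is expressed in kilobytes, so 10240 corresponds to 10 MB. The short Python sketch below shows a rough pre-upload check against that value. The function names are hypothetical, and the check ignores the small overhead a multipart request adds, so treat it as an approximation and not as the exact rule Solr applies.

```python
import os

# Value of multipartUploadLimitInKB taken from the solrconfig.xml fragment above.
MULTIPART_UPLOAD_LIMIT_KB = 10240


def fits_upload_limit(size_bytes, limit_kb=MULTIPART_UPLOAD_LIMIT_KB):
    """Return True if a payload of size_bytes is within a limit given in KB."""
    return size_bytes <= limit_kb * 1024


def file_fits_upload_limit(path, limit_kb=MULTIPART_UPLOAD_LIMIT_KB):
    """Check the size of the file at path against the configured limit."""
    return fits_upload_limit(os.path.getsize(path), limit_kb)
```

If a file does not fit, either raise multipartUploadLimitInKB in solrconfig.xml or skip the file before sending it.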

The end

All parameters accepted by the ExtractingRequestHandler are described at: http://wiki.apache.org/solr/ExtractingRequestHandler.

This entry was posted on Monday, March 21st, 2011 at 09:17 and is filed under Uncategorized.

5 Responses to “Solr and Tika integration (part 1 – basics)”

  1. lars Says:

    Hi,

    I was wondering if you could send me or publish a complete example of how to set up Solr in order to use pdf/word etc. indexing and querying. So far I just found bits and pieces of information everywhere, clearly indicating that it’s possible to do such things but unfortunately no complete walkthrough. When I add a pdf to Solr all seems fine but I can’t find / search for its content and don’t know what goes wrong. Any help is greatly appreciated.

    Regards,
    Lars

    P.S: I use SolrJ as interface

  2. gr0 Says:

    We will try to write something more about the topic, but this will probably happen after Christmas or after New Year’s Eve.

  3. Dan Says:

    Great article. Also, I second gr0’s request. A tutorial showing Tika integration into Solr to index the content would be very helpful!

  4. gr0 Says:

    Dan, expect a use case example on how to use Tika to index data from JPEG files next Monday :)

  5. Dan Says:

    Awesome! Thanks!

    Oh, and I meant lars. :)