Solr and Tika integration (part 1 – basics)

Indexing so-called “rich documents”, i.e. binary files like PDF, DOC, or RTF, has always required some additional work on the developer side, at least to extract the contents of the file and prepare them in a format understood by the search engine, in this case Solr. To minimize this work I decided to look at Apache Tika and the integration of this library with Solr.

Introduction

First, a few words about what we gain by choosing Apache Tika. Apache Tika is a framework designed to extract information from so-called “rich documents”: PDF files, files in Microsoft Office formats, RTF, and more. Using Apache Tika we can also extract information from compressed documents, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAV), and compiled Java bytecode files. In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents. It is worth mentioning that the framework is based on libraries such as PDFBox, Apache POI, and NekoHTML, which indirectly guarantees the very good quality of the extracted data.

Sample index structure

I’ll skip how to run content extraction with Apache Tika by hand and focus on the integration of this framework with Solr and on how trivial it is. Assume that we are interested in the ID, title (tytul), and contents (zawartosc) of the documents we have to index. We therefore create a simple schema.xml file describing the index structure, which could look like this (the string and text field types are assumed to be defined in the types section of schema.xml, as in the Solr example schema):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="tytul" type="text" indexed="true" stored="true"/>
<field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/>

Configuration

To the solrconfig.xml file we add the following entry, which defines a handler that will handle the indexing of documents using Apache Tika:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
   <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <str name="uprefix">ignored_</str>
   </lst>
</requestHandler>

All update requests sent to the /update/extract address will be handled by Apache Tika. Of course, remember to send the commit command to the update handler after sending the documents to the handler using Apache Tika; otherwise, your documents won’t be visible. In the standard Solr deployment you should send the commit command to the handler located under /update.
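For illustration, here is a minimal sketch of indexing a single PDF with curl (the host, port, file name, and document identifier are assumptions, not part of the setup described above):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.title=tytul&fmap.content=zawartosc&commit=true" \
     -F "file=@document.pdf"

Here literal.id sets the value of the id field, the fmap.* parameters map the title and content fields extracted by Apache Tika to our tytul and zawartosc fields, and commit=true issues the commit in the same request, so no separate commit is needed in this case.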

In the configuration we told the extraction handler to assign the Last-Modified attribute to the last_modified field and to prefix the fields that are not specified in the schema with ignored_.
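For this to work, the schema must also define the last_modified field and a dynamic field matching the ignored_ prefix; without the latter, indexing fails with errors like “unknown field ignored_stream_source_info”. A sketch of the additions to schema.xml (the date and ignored types are assumptions, modeled on the Solr example schema):

<!-- in the fields section -->
<field name="last_modified" type="date" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>

<!-- in the types section: a field type that is neither indexed nor stored, so matching fields are effectively dropped -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>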

Additional notes

If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:

<requestDispatcher handleSelect="true">
   <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="10240" />
</requestDispatcher>


The end

All parameters defining the ExtractingRequestHandler can be found at: http://wiki.apache.org/solr/ExtractingRequestHandler.


10 Responses to “Solr and Tika integration (part 1 – basics)”

  1. lars Says:

    Hi,

    I was wondering if you could send me or publish a complete example of how to set up Solr in order to use PDF/Word etc. indexing and querying. So far I have just found bits and pieces of information everywhere, clearly indicating that it’s possible to do such things, but unfortunately no complete walkthrough. When I add a PDF to Solr all seems fine, but I can’t find / search for its content and don’t know what goes wrong. Any help is greatly appreciated.

    Regards,
    Lars

    P.S: I use SolrJ as interface

  2. gr0 Says:

    We will try to write something more about the topic, but this will probably happen after Christmas or after New Year’s Eve.

  3. Dan Says:

    Great article. Also, I second gr0’s request. A tutorial showing Tika integration into Solr to index the content would be very helpful!

  4. gr0 Says:

    Dan, expect a use case example of how to use Tika to index data from JPEG files next Monday :)

  5. Dan Says:

    Awesome! Thanks!

    Oh, and I meant lars. :)

  6. xan Says:

    Hey, could you demonstrate how it could be done in php? I’m using the php solr client.

  7. gr0 Says:

    I’m afraid I’m not too familiar with Solr PHP client libraries, but you should be able to just push a binary document to the appropriate handler using PHP’s cURL functions.

  8. Nutan Says:

    I get the error: unknown field ignored_stream_source_info, though I have defined the dynamic field tag in schema.xml as:

    curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true"
    -F "myfile=@abc.txt"

    I am working on Solr 4.2 on Windows.
    Please help me to resolve this error.
    Also, please suggest books or links on working with Solr on Windows.
    Thanks a lot.

  9. gr0 Says:

    Could you post your schema to blog(at)solr.pl, so we can see it?

  10. vardhan Says:

    I followed the above process and I am getting the error: org.apache.solr.common.SolrException: org.apache.solr.common.SolrException: Unknown fieldtype 'text' specified on field tytul
