Solr and Tika integration (part 1 – basics)

Indexing so-called “rich documents”, i.e. binary files such as PDF, DOC or RTF, has always required some additional work on the developer’s side, at least to extract the contents of the file and prepare them in a format understood by the search engine, in this case Solr. To minimize this work I decided to look at Apache Tika and the integration of this library with Solr.

Introduction

First, a few words about what we get when we choose Apache Tika. Apache Tika is a framework designed to extract information from so-called “rich documents”: documents such as PDF files, files in Microsoft Office formats, RTF, and more. Using Apache Tika we can also extract information from compressed archives, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAV), and even compiled Java bytecode files. In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents. It is worth mentioning that the framework is built on top of libraries such as PDFBox, Apache POI, and NekoHTML, which translates into very good quality of the extracted data.

Sample index structure

I’ll skip how to run content extraction manually with Apache Tika and focus on the integration of this framework with Solr, and on how trivial that integration is. Assume that we are interested in the ID, the title, and the contents of the documents we have to index. We therefore create a simple schema.xml file describing the index structure, which could look like this:
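A minimal sketch of the fields section of such a schema.xml is shown below. The field names and the field types (string, text, date and ignored, as defined in the example schema shipped with Solr) are assumptions here; adjust them to your own data.

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text" indexed="true" stored="true" />
<field name="content" type="text" indexed="true" stored="true" />
<field name="last_modified" type="date" indexed="true" stored="true" />
<!-- anything mapped to an ignored_* field will be silently dropped -->
<dynamicField name="ignored_*" type="ignored" />

<uniqueKey>id</uniqueKey>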

Configuration

To the solrconfig.xml file we add the following entry, which defines a handler responsible for indexing documents using Apache Tika:
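For example, a minimal definition could look like this (a sketch; the exact defaults used here are an assumption based on the description below):

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- store the Last-Modified attribute extracted by Tika in the last_modified field -->
    <str name="fmap.Last-Modified">last_modified</str>
    <!-- prefix fields unknown to the schema with ignored_ so they match the ignored_* dynamic field -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>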

All update requests sent to the /update/extract address will be handled by Apache Tika. Of course, remember to send the commit command after sending the documents to the handler that uses Apache Tika; otherwise your documents won’t be visible. In the standard Solr deployment you should send the commit command to the handler located under /update.
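For example, with curl and a default single-core Solr deployment the whole sequence could look like this (the port, the literal.id value, and the file name are only examples):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1" -F "myfile=@sample.pdf"
curl "http://localhost:8983/solr/update" -H "Content-Type: text/xml" --data-binary "<commit/>"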

In the configuration we told the extraction handler to put the Last-Modified attribute into the last_modified field and to ignore fields that are not defined in the index structure.

Additional notes

If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:

<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="10240" />
</requestDispatcher>

The end

All parameters supported by the ExtractingRequestHandler can be found at: http://wiki.apache.org/solr/ExtractingRequestHandler.

12 thoughts on “Solr and Tika integration (part 1 – basics)”

  • 21 December 2011 at 21:01

    Hi,

    I was wondering if you could send me or publish a complete example of how to set up Solr for indexing and querying PDF/Word documents. So far I have only found bits and pieces of information everywhere, clearly indicating that it’s possible to do such things, but unfortunately no complete walkthrough. When I add a PDF to Solr all seems fine, but I can’t find/search its content and don’t know what goes wrong. Any help is greatly appreciated.

    Regards,
    Lars

    P.S.: I use SolrJ as the interface

    • 22 December 2011 at 13:25

      We will try to write something more about the topic, but this will probably happen after Christmas or after New Year’s Eve.

  • 14 February 2012 at 17:52

    Great article. Also, I second gr0’s request. A tutorial showing Tika integration into Solr to index the content would be very helpful!

  • 14 February 2012 at 17:55

    Dan, expect a use case example showing how to use Tika to index data from JPEG files next Monday 🙂

  • 16 February 2012 at 18:44

    Awesome! Thanks!

    Oh, and I meant Lars. 🙂

  • 14 July 2013 at 14:22

    Hey, could you demonstrate how it could be done in PHP? I’m using the PHP Solr client.

    • 17 July 2013 at 18:44

      I’m afraid I’m not too familiar with Solr PHP client libraries, but you should be able to just push a binary document to the appropriate handler using a PHP curl call.

  • 1 September 2013 at 14:53

    I get the error: unknown field ignored stream_source_info, even though I have defined a dynamic field tag for it in schema.xml. The command I am using is:

    curl “http://localhost:8080/solr/update/extract?literal.id=1&commit=true”
    -F “myfile=@abc.txt”

    I am working with Solr 4.2 on Windows.
    Please help me to resolve this error.
    Also, could you suggest books or links for working with Solr on Windows?
    Thanks a lot.

    • 10 September 2013 at 11:22

      Could you post your schema to blog(at)solr.pl, so we can see it?

  • 13 November 2013 at 07:47

    I followed the process above and I am getting this error: org.apache.solr.common.SolrException: org.apache.solr.common.SolrException: Unknown fieldtype ‘text’ specified on field tytul

    My schema.xml follows, but the XML markup was stripped when the comment was posted; only the stock comments from the example schema remain.

  • 24 January 2016 at 14:09

    I have a problem with integrating Solr on an Ubuntu server. Before using Solr on the Ubuntu server I tested it on my Mac and it worked perfectly: it indexed my PDF, DOC and DOCX documents. After installing Solr on the Ubuntu server and using the same configuration files and libraries, I found out that Solr doesn’t index PDF documents, but I can search over .doc and .docx documents.
    Here are some parts of my solrconfig.xml contents (the XML tags were stripped when posting; only the values true, ignored_ and _text_ remain).

    • 6 February 2016 at 09:25

      Without seeing the configuration and libraries there is not much I can tell. I would suggest checking the paths to libraries in solrconfig.xml and checking Solr logs for issues related to loading libraries needed for rich document parsing.
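      For example, the extraction libraries are usually loaded with <lib> entries along these lines (the directories are only examples and depend on where Solr and its contrib modules are installed):

      <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
      <lib dir="../../dist/" regex="solr-cell-\d.*\.jar" />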

