Indexing files like doc, pdf – Solr and Tika integration

In the previous article we gave basic information about how to enable indexing of binary files, e.g. MS Word files, PDF files, or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since a new version of the Solr server (3.1) was released a few days ago, the following guidelines are based on that version. For the purposes of this article I used the “example” application – all of the changes relate to that application.

Assumptions

We assume that the data is available in XML format and contains basic information about each document along with the name of the file where the document content is located. The files are located in a defined directory. An example file looks like this:
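A minimal sketch of such a metadata file (the element names, titles, and file names here are assumptions used for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical metadata file: each record carries basic document
     information plus the name of the binary file with the contents. -->
<records>
  <rec>
    <title>Annual report</title>
    <author>John Doe</author>
    <path>report.pdf</path>
  </rec>
  <rec>
    <title>Meeting notes</title>
    <author>Jane Doe</author>
    <path>notes.doc</path>
  </rec>
</records>
```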

As you can see, the data is characterized by the fact that individual elements do not have a unique identifier. But we can handle that 🙂
First we modify the schema by adding a definition of a field that holds the contents of the file:
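In schema.xml this could look like the following (the field name `contents` and the `text` field type are assumptions based on the example schema):

```xml
<!-- Field holding the text extracted from the binary file by Tika -->
<field name="contents" type="text" indexed="true" stored="true" multiValued="false"/>
```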

Next we modify the solrconfig.xml and add DIH configuration:
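The handler registration in solrconfig.xml typically looks like this (the handler path and configuration file name shown are the usual defaults):

```xml
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- DIH configuration file, relative to the conf/ directory -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```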

Since we will use the entity processor located in the extras (TikaEntityProcessor), we need to modify the line loading the DIH library:
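Assuming the standard example directory layout, a regex wide enough to pick up both the core DIH jar and the extras jar (which contains TikaEntityProcessor) could look like this:

```xml
<!-- Load the DIH jars, including the "extras" jar, from dist/ -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```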

The next step is to create a data-config.xml file. In our case it should look like this:
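A sketch of what such a data-config.xml could look like (the URLs, XPath expressions, and field names are assumptions for illustration):

```xml
<dataConfig>
  <!-- Source for the XML metadata file -->
  <dataSource type="URLDataSource" name="main" encoding="UTF-8"/>
  <!-- Source for the binary attachments processed by Tika -->
  <dataSource type="BinURLDataSource" name="bin"/>
  <script><![CDATA[
    var counter = 0;
    // Assign a consecutive number to each record passing through DIH
    function GenerateId(row) {
      counter++;
      row.put('id', counter.toFixed());
      return row;
    }
  ]]></script>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="http://localhost/data.xml" forEach="/records/rec"
            dataSource="main" transformer="script:GenerateId">
      <field column="title" xpath="/records/rec/title"/>
      <field column="path" xpath="/records/rec/path"/>
      <!-- Child entity: fetch and parse the binary file for each record -->
      <entity name="file" processor="TikaEntityProcessor"
              url="http://localhost/files/${rec.path}"
              dataSource="bin" format="text">
        <field column="text" name="contents"/>
        <field column="title" name="title" meta="true"/>
        <field column="Author" name="author" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```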

Generating record identifier – scripts

The first interesting feature is the use of the standard ScriptTransformer to generate document identifiers. With the JavaScript “GenerateId” method and a reference to it (transformer=”script:GenerateId”), each record is assigned a number. Frankly, this is not a very good way of dealing with the lack of identifiers, because it does not allow incremental indexing (we are not able to distinguish between different versions of a record) – it is used here only to show how easy it is to modify records. If you do not like JavaScript, you can use any scripting language supported by Java 6.
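In isolation, that wiring could be sketched like this (the `id` field name is an assumption):

```xml
<!-- Script transformers are declared once per data-config.xml -->
<script><![CDATA[
  var counter = 0;
  // Called once per row; puts a consecutive number into the "id" column
  function GenerateId(row) {
    counter++;
    row.put('id', counter.toFixed());
    return row;
  }
]]></script>
```

The entity then references the function by name via `transformer="script:GenerateId"`.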

The use of multiple data sources

Another interesting element is the use of several data sources. Since our metadata is available in an XML file, we first need to download this file. We use the standard approach: we define a URLDataSource and then use the XPathEntityProcessor to analyze the incoming data. Since we also have to download the binary attachment for each record, we define an additional data source, BinURLDataSource, and an additional entity using the TikaEntityProcessor. Now we only need to tell that entity where to download the file from (the url attribute, with a reference to the parent entity) and which data source it should use (the dataSource attribute). The whole thing is complemented by a list of fields to be indexed (the additional meta attribute means that the data is retrieved from the file’s metadata).
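Stripped down to the data-source plumbing, the idea is the following (names and URLs are assumed):

```xml
<dataSource type="URLDataSource" name="main"/>
<dataSource type="BinURLDataSource" name="bin"/>

<entity name="rec" dataSource="main" processor="XPathEntityProcessor"
        url="http://localhost/data.xml" forEach="/records/rec">
  <!-- The child entity's url is built from a field of the parent
       record via the ${rec.path} placeholder -->
  <entity name="file" dataSource="bin" processor="TikaEntityProcessor"
          url="http://localhost/files/${rec.path}" format="text"/>
</entity>
```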

Available fields

Apache Tika allows you to extract a number of additional fields from a document. In the example above we used only the title, the author, and the contents of the document. Complete information about the available fields can be found in the interfaces implemented by the Metadata class (http://tika.apache.org/0.9/api/org/apache/tika/metadata/Metadata.html), specifically in the constants they define. DublinCore and MSOffice are particularly interesting.
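Mapping a few of those metadata fields inside the Tika entity could look like this (the Solr field names are assumptions; the column names correspond to constants from the Metadata interfaces, e.g. MSOffice defines “Last-Author”):

```xml
<!-- Columns taken from Tika document metadata (meta="true") -->
<field column="title"        name="title"        meta="true"/>
<field column="Author"       name="author"       meta="true"/>
<field column="Last-Author"  name="last_author"  meta="true"/>
<field column="Content-Type" name="content_type" meta="true"/>
```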

The end

A short time after starting Solr and running the import process (by calling http://localhost:8983/solr/dataimport?command=full-import), the documents are indexed, which should be visible after sending the following query to Solr: http://localhost:8983/solr/select?q=*:*
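The whole cycle from the command line (assuming the default example port):

```shell
# Trigger a full import through the Data Import Handler
curl "http://localhost:8983/solr/dataimport?command=full-import"

# Check the import status (documents processed, errors)
curl "http://localhost:8983/solr/dataimport?command=status"

# Query for all indexed documents
curl "http://localhost:8983/solr/select?q=*:*"
```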

22 thoughts on “Indexing files like doc, pdf – Solr and Tika integration”

  • 29 July 2011 at 13:16

    I have done everything according to your instructions, but when I finally tried to index by issuing the command:

    java -jar post.jar a.xml

    (a.xml is the file I want to index)

    I got this error:
    SimplePostTool: version 1.2
    SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
    SimplePostTool: POSTing files to http://localhost:8983/solr/update..
    SimplePostTool: POSTing file a.xml
    SimplePostTool: FATAL: Solr returned an error: Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors change: <abortOnConfigurationError>false</abortOnConfigurationError> in solrconfig.xml. org.apache.solr.common.SolrException: schema.xml: Duplicate field definition for 'title' ignoring: title{type=text,properties=indexed,tokenized,stored,multiValued}
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:493)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    …

    Please help me out.

    • 29 July 2011 at 15:28

      My guess: you have a duplicate entry for the “title” field in schema.xml.

  • 29 July 2011 at 15:34

    It seems that you have a configuration error in your Solr instance. When starting up Solr, please check whether Solr throws any exceptions.

  • 1 August 2011 at 07:11

    I have checked once again and my Solr admin page works properly. I want to add my own doc files, index them, and also search them from the Solr admin page.

    Please give me some guidance on how to index them.

  • 1 August 2011 at 08:33

    If you still get the “Severe_errors_in_solr_configuration__” error, my advice remains the same: please check your schema.xml. The error message says exactly what the problem is.

  • 23 November 2012 at 08:37

    How do I index PDF files? Can anyone provide full instructions?

  • 17 December 2012 at 10:38

    Ah I love assumptions in such articles. It would be very useful if you also explained how to get the data from the PDF, Word etc. files into XML format, rather than just assume everyone knows.

  • 17 December 2012 at 10:45

    Ian, there is no conversion from PDF to XML assumed here. This post describes how to use both XML and PDF files in order to index them as a single document.

  • 8 January 2013 at 13:10

    Ahh… right – my comment in [9] is misleading. This post assumes that you already have your PDF content in XML.

  • 10 July 2013 at 14:06

    My Solr 3.5 has the following libs loaded: tagsoup 1.2, tika-parsers 0.8, and tika-core 0.8.

    When I start the import, I get the following error in my log:

    SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.util.ServiceConfigurationError: org.apache.tika.parser.Parser: Provider org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.util.ServiceConfigurationError: org.apache.tika.parser.Parser: Provider org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at java.util.ServiceLoader.fail(ServiceLoader.java:224)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
    at org.apache.tika.parser.DefaultParser.loadParsers(DefaultParser.java:70)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:77)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:179)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.init(EntityProcessorBase.java:58)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:69)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:557)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    … 5 more
    Caused by: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2413)
    at java.lang.Class.getConstructor0(Class.java:2723)
    at java.lang.Class.newInstance0(Class.java:345)
    at java.lang.Class.newInstance(Class.java:327)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
    … 14 more
    Caused by: java.lang.ClassNotFoundException: org.ccil.cowan.tagsoup.Schema
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    … 20 more

    Jul 10, 2013 3:02:29 PM org.apache.solr.update.DirectUpdateHandler2 rollback

    • 10 July 2013 at 14:10

      An exception like “org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema” shows that Solr doesn’t see the needed library. You need to look there.

  • 10 July 2013 at 19:33

    But as I said above, tagsoup 1.2, tika-parsers 0.8, and tika-core 0.8 are being loaded; it’s declared in the log file.

    example:
    INFO: Adding 'file:/C:/ColdFusion10/cfusion/jetty/solr/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
    Jul 10, 2013 3:01:35 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader

    • 11 July 2013 at 10:47

      Will try to check that, but I don’t have a PC with me right now, so it may take a longer period of time.

  • 11 July 2013 at 13:46

    Thanks gr0, I have downloaded the newest versions of tagsoup (1.2.1) and tika-app (1.4), and now I’m getting the following:

    SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={download/online/05_Mai.csv}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    … 6 more

    Jul 11, 2013 2:43:44 PM org.apache.solr.common.SolrException log
    SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    … 6 more

    Could this have something to do with external parsers?

  • 18 July 2013 at 17:30

    I have now changed some things and the import runs without errors. In schema.xml I don’t have the field “text” but “contentsExact”. Unfortunately the text isn’t indexed. What am I doing wrong?

    <!-- 2013-07-05T14:59:46.889Z -->

    ps: how do i import tsstamp, it’s a stice value?

  • 1 September 2013 at 10:55

    Hey, I’m working with Solr 4.2 on Windows.
    Please help me index .docx and text files on Windows; all the tutorials are mainly for Unix.
    My input XML file is:

    7
    abc1
    Cloud based searching,in documents
    F:\\doc_solr\\abc.txt

    schema is :

    solrconfig.xml :
    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">

    last_modified
    true
    attr_

    I did try indexing a .pdf file as follows:
    >curl "http://localhost:8080/solr/update/extract?literal.doc_id=8&commit=true" -F "myfile=@C:\solr\collection1\conf\solr-word.pdf"
    but I get "attr_stream_source_info not defined".

    Please help me.
    Thank you

  • 8 February 2015 at 20:27

    I wish babylon had specified what he changed to get it to work, because I’m getting the same error.

  • 2 April 2015 at 08:15

    Can anyone tell me how to index a PDF file in Solr?

