Indexing files like doc, pdf – Solr and Tika integration

In the previous article we gave basic information about how to enable the indexing of binary files, i.e. MS Word files, PDF files or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since a new version of the Solr server (3.1) was released a few days ago, the following guidelines are based on that version. For the purpose of this article I used the “example” application – all of the changes relate to that application.

Assumptions

We assume that the data is available in XML format and contains basic information about each document along with the name of the file holding the document contents. The files themselves are located in a defined directory. An example file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<albums>
    <album>
        <author>John F.</author>
        <title>Life in picture</title>
        <description>1.jpg</description>
    </album>
    <album>
        <author>Peter Z.</author>
        <title>Simple presentation</title>
        <description>2.pptx</description>
    </album>
</albums>

As you can see, the individual elements do not have a unique identifier. But we can handle that 🙂
First, we modify the schema by adding the definition of a field that will hold the contents of the file:

<field name="content" type="text" indexed="true" stored="true" multiValued="true"/>

Next, we modify solrconfig.xml and add the DIH configuration:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
      <str name="config">data-config.xml</str>
   </lst>
</requestHandler>

Since we will use an entity processor located in the extras package (TikaEntityProcessor), we need to modify the line that loads the DIH libraries so that it also matches the extras jar:

 <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
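
Note that TikaEntityProcessor also needs the Tika parser jars themselves on the classpath. A hedged sketch, assuming the standard layout of the Solr 3.1 distribution, where the extraction contrib ships Tika and its dependencies (e.g. tagsoup):

<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />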

The next step is to create a data-config.xml file. In our case it should look like this:

<dataConfig>
    <script><![CDATA[
        id = 1;
        function GenerateId(row) {
            row.put('id', (id++).toFixed());
            return row;
        }       
       ]]></script>
   <dataSource type="BinURLDataSource" name="data"/>
    <dataSource type="URLDataSource" baseUrl="http://localhost/tmp/bin/" name="main"/>
    <document>
        <entity name="rec" processor="XPathEntityProcessor" url="data.xml" forEach="/albums/album" dataSource="main" transformer="script:GenerateId">
            <field column="title" xpath="//title" />
            <field column="description" xpath="//description" />
            <entity processor="TikaEntityProcessor" url="http://localhost/tmp/bin/${rec.description}" dataSource="data">
                <field column="text" name="content" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
            </entity>
        </entity>
    </document>
</dataConfig>

Generating record identifiers – scripts

The first interesting feature is the use of the standard ScriptTransformer to generate document identifiers. With the JavaScript “GenerateId” function and a reference to it (transformer=”script:GenerateId”), each record gets a consecutive number. Frankly, this is not a very good way of dealing with the lack of identifiers, because it does not allow for incremental indexing (we are not able to distinguish between different versions of a record) – it is used here only to show how easy it is to modify records. If you do not like JavaScript, you can use any scripting language supported by Java 6.
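
If the source rows carry a naturally unique value, a more repeatable approach is to derive the identifier from the row itself. A minimal sketch, assuming (hypothetically) that the file name in the description field is unique per record:

function GenerateId(row) {
    // Reuse the unique file name as a stable identifier, so that
    // re-running the import updates documents instead of duplicating them.
    row.put('id', row.get('description'));
    return row;
}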

The use of multiple data sources

Another interesting element is the use of several data sources. Since our metadata is available in an XML file, we first need to download that file. We use the standard approach: we define a URLDataSource and then use the XPathEntityProcessor to analyze the incoming data. Since we also have to download the binary attachment for each record, we define an additional data source, BinURLDataSource, and an additional entity that uses the TikaEntityProcessor. Now we only need to tell this entity where to download the file from (the url attribute, with a reference to the parent entity: ${rec.description}) and which data source to use (the dataSource attribute). The whole thing is complemented by the list of fields to be indexed (the additional attribute meta="true" means that the value is retrieved from the file’s metadata).

Available fields

Apache Tika can extract a number of additional metadata fields from a document. In the example above we used only the title, author, and content of the document. Complete information about the available fields can be found in the constants defined by the interfaces implemented by the Metadata class (http://tika.apache.org/0.9/api/org/apache/tika/metadata/Metadata.html). DublinCore and MSOffice are particularly interesting.
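
For example, further metadata can be mapped in the Tika entity in the same way as Author and title above. A sketch assuming that matching fields exist in the schema; the content_type and last_modified field names are hypothetical choices here, while Content-Type and Last-Modified are keys defined by the Metadata constants:

<entity processor="TikaEntityProcessor" url="http://localhost/tmp/bin/${rec.description}" dataSource="data">
    <field column="text" name="content" />
    <field column="Content-Type" name="content_type" meta="true" />
    <field column="Last-Modified" name="last_modified" meta="true" />
</entity>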

The end

A short time after starting Solr and running the import process (by calling http://localhost:8983/solr/dataimport?command=full-import) the documents are indexed, which should be visible after sending the following query to Solr: http://localhost:8983/solr/select?q=*:*
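
From the command line the whole cycle can look as follows – a sketch assuming the default example port and that curl is available (command=status lets you watch the import progress):

curl 'http://localhost:8983/solr/dataimport?command=full-import'
curl 'http://localhost:8983/solr/dataimport?command=status'
curl 'http://localhost:8983/solr/select?q=*:*'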


22 Responses to “Indexing files like doc, pdf – Solr and Tika integration”

  1. rahul Says:

    I have done everything according to your instructions, but when I finally tried to index by issuing the command:

    java -jar post.jar a.xml

    a.xml is the file I want to index,

    but I got this error:
    SimplePostTool: version 1.2
    SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
    SimplePostTool: POSTing files to http://localhost:8983/solr/update..
    SimplePostTool: POSTing file a.xml
    SimplePostTool: FATAL: Solr returned an error: Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change <abortOnConfigurationError>false</abortOnConfigurationError> in solrconfig.xml. org.apache.solr.common.SolrException: schema.xml: Duplicate field definition for 'title' ignoring: title{type=text,properties=indexed,tokenized,stored,multiValued}
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:493)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
        ...

    please help me out…..

  2. negativ Says:

    My guess: you have a duplicate entry for the “title” field in schema.xml.

  3. gr0 Says:

    It seems that you have a configuration error in your Solr instance. When starting Solr, please check whether it throws any exceptions.

  4. rahul Says:

    I have checked once again and my Solr admin page works properly. I want to add my own doc files and then index them, and also search them from the Solr admin page.

    Please guide me on how I can index them.
    please help me…..

  5. negativ Says:

    If you still get the “Severe_errors_in_solr_configuration__” error, my advice is the same: please check your schema.xml. The error message says exactly what the problem is.

  6. VJ Says:

    How to index PDF files… Can anyone provide the full instructions…

  7. gr0 Says:

    In addition to the information in this post, you can look at http://wiki.apache.org/solr/ExtractingRequestHandler for more details.

  8. Ian Says:

    Ah, I love assumptions in such articles. It would be very useful if you also explained how to get the data from the PDF, Word etc. files into XML format, rather than just assuming everyone knows.

  9. gr0 Says:

    Ian, there is no conversion from PDF to XML assumed here. This post describes how to use both an XML file and a PDF file in order to index them as a single document.

  10. yep Says:

    it is only for XML !

  11. gr0 Says:

    Ahh… right – my comment in [9] is misleading. This post assumes that you already have your PDF content in XML.

  12. archana Says:

    this is very helpful…

  13. babylon Says:

    My Solr 3.5 has the following libs loaded: tagsoup 1.2, tika-parsers 0.8 and tika-core 0.8.

    When I start the import I get the following error in my log:

    SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.util.ServiceConfigurationError: org.apache.tika.parser.Parser: Provider org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.util.ServiceConfigurationError: org.apache.tika.parser.Parser: Provider org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at java.util.ServiceLoader.fail(ServiceLoader.java:224)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
    at org.apache.tika.parser.DefaultParser.loadParsers(DefaultParser.java:70)
    at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:77)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:179)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.init(EntityProcessorBase.java:58)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:69)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:557)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    … 5 more
    Caused by: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2413)
    at java.lang.Class.getConstructor0(Class.java:2723)
    at java.lang.Class.newInstance0(Class.java:345)
    at java.lang.Class.newInstance(Class.java:327)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
    … 14 more
    Caused by: java.lang.ClassNotFoundException: org.ccil.cowan.tagsoup.Schema
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    … 20 more

    Jul 10, 2013 3:02:29 PM org.apache.solr.update.DirectUpdateHandler2 rollback

  14. gr0 Says:

    An exception like org.apache.tika.parser.html.HtmlParser could not be instantiated: java.lang.NoClassDefFoundError: org/ccil/cowan/tagsoup/Schema shows that Solr doesn’t see the needed library. That is where you need to look.

  15. babylon Says:

    But as I said above, tagsoup 1.2, tika-parsers 0.8 and tika-core 0.8 are being loaded; it’s declared in the log file.

    example:
    INFO: Adding ‘file:/C:/ColdFusion10/cfusion/jetty/solr/contrib/extraction/lib/tagsoup-1.2.jar’ to classloader
    Jul 10, 2013 3:01:35 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader

  16. gr0 Says:

    I will try to check that, but I don’t have a PC with me right now, so it may take a longer period of time.

  17. babylon Says:

    Thanks gr0, I have downloaded the newest versions of tagsoup (1.2.1) and tika-app (1.4) and now I’m getting the following:

    SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={download/online/05_Mai.csv}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    … 6 more

    Jul 11, 2013 2:43:44 PM org.apache.solr.common.SolrException log
    SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
    Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    … 6 more

    Could this have something to do with external parsers?

  18. babylon Says:

    I have now changed some things and the import runs without error. In schema.xml I don’t have the field “text” but “contentsExact”. Unfortunately the text isn’t indexed. What am I doing wrong?

    <!-- 2013-07-05T14:59:46.889Z -->

    ps: how do i import tsstamp, it’s a stice value?

  19. Nutan Says:

    Hey, I’m working on Solr 4.2 on Windows.
    Please help me with indexing .docx and text files on Windows; all tutorials are mainly for Unix.
    my input xml file is :

    7
    abc1
    Cloud based searching,in documents
    F:\\doc_solr\\abc.txt

    schema is :

    solrconfig.xml :
    requestHandler name=”/update/extract” class=”solr.extraction.ExtractingRequestHandler” >

    last_modified
    true
    attr_

    I did try indexing a .pdf file as follows:
    >curl “http://localhost:8080/solr/update/extract?literal.doc_id=8&commit=true ” -F “myfile=@C:\solr\collection1\conf\solr-word.pdf”
    but I get “attr_stream_source_info not defined”.

    plz help me..
    thank you

  20. jim Says:

    I wish babylon had specified what he changed to get it to work, because I’m getting the same error.

  21. SaiRam Says:

    Can anyone tell me how to index a PDF file in Solr?

  22. gr0 Says:

    I think this may come in handy – https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika