Indexing files like doc, pdf – Solr and Tika integration

In the previous article we have given basic information about how to enable the indexing of binary files, ie MS Word files, PDF files or LibreOffice files. Today we will do the same thing, using the Data Import Handler. Since a few days ago a new version of the Solr server (3.1) have been released, the following guidelines are based on this version. For the purpose of the article I used the “example” application – all of the changes relate to this application.

Assumptions

We assume that the data is available in the XML format and contain basic information about the document along with the file name where the document contents is located. The files are located in a defined directory. Example file look like this:

<?xml version="1.0" encoding="utf-8"?>
<albums>
    <album>
        <author>John F.</author>
        <title>Life in picture</title>
        <description>1.jpg</description>
    </album>
    <album>
        <author>Peter Z.</author>
        <title>Simple presentation</title>
        <description>2.pptx</description>
    </album>
</abums>

As you can see the data is characterized by the fact that individual elements do not have a unique identifier. But we can handle it 🙂
First we modify the schema by adding a definition of a field that holds the contents of the file:

<field name="content" type="text" indexed="true" stored="true" multiValued="true"/>

Next we modify the solrconfig.xml and add DIH configuration:

   
<requestHandler name="/dataimport">
   <lst name="defaults">
      <str name="config">data-config.xml</str>
   </lst>
</requestHandler>

Since we will use the entity processor located in the extras (TikaEntityProcessor), we need to modify the line loading the DIH library:

 <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

The next step is to create a data-config.xml file. In our case it should look like this:

<dataConfig>
    <script><![CDATA[
        id = 1;
        function GenerateId(row) {
            row.put('id', (id ++).toFixed());
            return row;
        }       
       ]]></script>
   <dataSource type="BinURLDataSource" name="data"/>
    <dataSource type="URLDataSource" baseUrl="http://localhost/tmp/bin/" name="main"/>
    <document>
        <entity name="rec" processor="XPathEntityProcessor" url="data.xml" forEach="/albums/album" dataSource="main" transformer="script:GenerateId">
            <field column="title" xpath="//title" />
            <field column="description" xpath="//description" />
            <entity processor="TikaEntityProcessor" url="http://localhost/tmp/bin/
Generating record identifier - scripts
The first interesting feature is the use of standard ScriptTransformer in order to generate documents identifiers. With javascript "GenerateId" method and a references to it (transformer="script:GenerateId"), each record will be numbered. Frankly  it is not too good method of dealing with the lack of identifiers,  because it does not allow for incremental indexing (we are not able to  distinguish between the various versions of the record) - it is used  here only to show how easy it is to modify the records. If you do not like Javascript - you can use any scripting language supported by java6.
The use of multiple data sources
Another interesting element is the use of several data sources. Since our metadata is available in the XML file, you need to download this file. We  are using the standard approach: we define the UrlDataSource and then  we use the XpathEntityProcessor to analyze the incoming data. Since  we have to download binary attachments to each record, we define an  additional data source: BinURLDataSource and additional entity, using  the TikaEntityProcessor. Now  we only need to notify the entity where to download the file (the url attribute with the reference to the entity - a parent) and the  notification from which data source should be used (dataSource attribute). The  whole is complemented by a list of fields to be indexed (an additional  attribute meta means that the data are retrieved from the file's  metadata).
Available fields
Apache Tika allows you to download a number of additional data from the document. In the example above we used only the title, author, and content of the document. Complete  information about the available fields are included in the interfaces  that are implemented by the class Metadata  (http://tika.apache.org/0.9/api/org/apache/tika/metadata/Metadata.html)  exactly in the defined constants. In particular interesting are DublinCore and MSOffice.
The end
A short time after starting solr and running the import process (calling the address: http://localhost:8983/solr/dataimport?command=full-import) documents are indexed and that should be visible after sending the following query to Solr: http://localhost:8983/solr/select?q=*:*
{rec.description}" dataSource="data">
                <field column="text" name="content" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
            </entity>
        </entity>
    </document>
</dataConfig>

Generating record identifier – scripts

The first interesting feature is the use of standard ScriptTransformer in order to generate documents identifiers. With javascript “GenerateId” method and a references to it (transformer=”script:GenerateId”), each record will be numbered. Frankly it is not too good method of dealing with the lack of identifiers, because it does not allow for incremental indexing (we are not able to distinguish between the various versions of the record) – it is used here only to show how easy it is to modify the records. If you do not like Javascript – you can use any scripting language supported by java6.

The use of multiple data sources

Another interesting element is the use of several data sources. Since our metadata is available in the XML file, you need to download this file. We are using the standard approach: we define the UrlDataSource and then we use the XpathEntityProcessor to analyze the incoming data. Since we have to download binary attachments to each record, we define an additional data source: BinURLDataSource and additional entity, using the TikaEntityProcessor. Now we only need to notify the entity where to download the file (the url attribute with the reference to the entity – a parent) and the notification from which data source should be used (dataSource attribute). The whole is complemented by a list of fields to be indexed (an additional attribute meta means that the data are retrieved from the file’s metadata).

Available fields

Apache Tika allows you to download a number of additional data from the document. In the example above we used only the title, author, and content of the document. Complete information about the available fields are included in the interfaces that are implemented by the class Metadata (http://tika.apache.org/0.9/api/org/apache/tika/metadata/Metadata.html) exactly in the defined constants. In particular interesting are DublinCore and MSOffice.

The end

A short time after starting solr and running the import process (calling the address: http://localhost:8983/solr/dataimport?command=full-import) documents are indexed and that should be visible after sending the following query to Solr: http://localhost:8983/solr/select?q=*:*

Solr.pl

Indexing files like doc, pdf – Solr and Tika integration

Assumptions

Generating record identifier - scripts

The use of multiple data sources

Available fields

The end

Generating record identifier – scripts

The use of multiple data sources

Available fields

The end

Leave a Reply Cancel reply