{"id":256,"date":"2011-04-04T20:37:27","date_gmt":"2011-04-04T18:37:27","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=256"},"modified":"2020-11-11T20:38:12","modified_gmt":"2020-11-11T19:38:12","slug":"indexing-files-like-doc-pdf-solr-and-tika-integration","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/04\/04\/indexing-files-like-doc-pdf-solr-and-tika-integration\/","title":{"rendered":"Indexing files like doc, pdf &#8211; Solr and Tika integration"},"content":{"rendered":"<p>In the <a href=\"http:\/\/solr.pl\/en\/2011\/03\/21\/solr-and-tika-integration-part-1-basics\/\" target=\"_blank\" rel=\"noopener noreferrer\">previous article<\/a> we gave basic information about how to enable indexing of binary files, e.g. MS Word files, PDF files, or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since a new version of the Solr server (3.1) was released a few days ago, the following guidelines are based on this version. For the purpose of this article I used the &#8220;example&#8221; application &#8211; all of the changes relate to this application.<\/p>\n\n\n<!--more-->\n\n\n<h2>Assumptions<\/h2>\n<p>We assume that the data is available in XML format and contains basic information about the document along with the name of the file in which the document contents are located. The files are located in a defined directory. 
An example file looks like this:\n<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"utf-8\"?&gt;\n&lt;albums&gt;\n    &lt;album&gt;\n        &lt;author&gt;John F.&lt;\/author&gt;\n        &lt;title&gt;Life in picture&lt;\/title&gt;\n        &lt;description&gt;1.jpg&lt;\/description&gt;\n    &lt;\/album&gt;\n    &lt;album&gt;\n        &lt;author&gt;Peter Z.&lt;\/author&gt;\n        &lt;title&gt;Simple presentation&lt;\/title&gt;\n        &lt;description&gt;2.pptx&lt;\/description&gt;\n    &lt;\/album&gt;\n&lt;\/albums&gt;<\/pre>\n<p>As you can see, the data is characterized by the fact that the individual elements do not have a unique identifier. But we can handle that \ud83d\ude42<br>\nFirst we modify the schema by adding a definition of the field that will hold the contents of the file:\n<\/p>\n<pre class=\"brush:xml\">&lt;field name=\"content\" type=\"text\" indexed=\"true\" stored=\"true\" multiValued=\"true\"\/&gt;<\/pre>\n<p>Next we modify solrconfig.xml and add the DIH configuration:\n<\/p>\n<pre class=\"brush:xml\">&lt;requestHandler name=\"\/dataimport\" class=\"org.apache.solr.handler.dataimport.DataImportHandler\"&gt;\n    &lt;lst name=\"defaults\"&gt;\n        &lt;str name=\"config\"&gt;data-config.xml&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p>Since we will use an entity processor located in the extras package (<em>TikaEntityProcessor<\/em>), we need to modify the line loading the DIH libraries so that it also matches the extras jar:\n<\/p>\n<pre class=\"brush:xml\">&lt;lib dir=\"..\/..\/dist\/\" regex=\"apache-solr-dataimporthandler-.*\\.jar\" \/&gt;<\/pre>\n<p>The next step is to create a data-config.xml file. 
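<\/p>\n<p>One remark here: <em>TikaEntityProcessor<\/em> itself is only the glue &#8211; the actual parsing is done by the Apache Tika libraries, which in the Solr 3.1 distribution ship with the extraction contrib. Assuming the standard &#8220;example&#8221; application layout (the paths below are an assumption and may differ in your installation), they can be loaded with an additional lib entry, for example:\n<\/p>\n<pre class=\"brush:xml\">&lt;!-- assumed path, relative to the example application's solrconfig.xml --&gt;\n&lt;lib dir=\"..\/..\/contrib\/extraction\/lib\" regex=\".*\\.jar\" \/&gt;<\/pre>\n<p>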
In our case it should look like this:\n<\/p>\n<pre class=\"brush:xml\">&lt;dataConfig&gt;\n    &lt;script&gt;&lt;![CDATA[\n        id = 1;\n        function GenerateId(row) {\n            row.put('id', (id++).toFixed());\n            return row;\n        }\n    ]]&gt;&lt;\/script&gt;\n    &lt;dataSource type=\"BinURLDataSource\" name=\"data\"\/&gt;\n    &lt;dataSource type=\"URLDataSource\" baseUrl=\"http:\/\/localhost\/tmp\/bin\/\" name=\"main\"\/&gt;\n    &lt;document&gt;\n        &lt;entity name=\"rec\" processor=\"XPathEntityProcessor\" url=\"data.xml\" forEach=\"\/albums\/album\" dataSource=\"main\" transformer=\"script:GenerateId\"&gt;\n            &lt;field column=\"title\" xpath=\"\/\/title\" \/&gt;\n            &lt;field column=\"description\" xpath=\"\/\/description\" \/&gt;\n            &lt;entity processor=\"TikaEntityProcessor\" url=\"http:\/\/localhost\/tmp\/bin\/${rec.description}\" dataSource=\"data\"&gt;\n                &lt;field column=\"text\" name=\"content\" \/&gt;\n                &lt;field column=\"Author\" name=\"author\" meta=\"true\" \/&gt;\n                &lt;field column=\"title\" name=\"title\" meta=\"true\" \/&gt;\n            &lt;\/entity&gt;\n        &lt;\/entity&gt;\n    &lt;\/document&gt;\n&lt;\/dataConfig&gt;<\/pre>\n<h2>Generating record identifier &#8211; scripts<\/h2>\n<p>The first interesting feature is the use of the standard ScriptTransformer to generate document identifiers. With the JavaScript <em>&#8220;GenerateId&#8221;<\/em> method and a reference to it <em>(transformer=&#8221;script:GenerateId&#8221;)<\/em>, each record gets a sequential number. Frankly, this is not a very good way of dealing with the lack of identifiers, because it does not allow for incremental indexing (we are not able to distinguish between different versions of a record) &#8211; it is used here only to show how easy it is to modify records. If you do not like JavaScript, you can use any scripting language supported by Java 6.<\/p>\n<h2>The use of multiple data sources<\/h2>\n<p>Another interesting element is the use of several data sources. Since our metadata is available in an XML file, we first need to download that file. 
We use the standard approach: we define the <em>URLDataSource<\/em> and then use the <em>XPathEntityProcessor<\/em> to analyze the incoming data. Since we have to download a binary attachment for each record, we define an additional data source, <em>BinURLDataSource<\/em>, and an additional entity using the <em>TikaEntityProcessor<\/em>. Now we only need to tell the entity where to download the file from (the <em>url<\/em> attribute, with a reference to the parent entity) and which data source should be used (the <em>dataSource<\/em> attribute). The whole thing is complemented by a list of fields to be indexed (the additional attribute <em>meta<\/em> means that the data is retrieved from the file&#8217;s metadata).<\/p>\n<h2>Available fields<\/h2>\n<p>Apache Tika allows you to extract a number of additional pieces of data from a document. In the example above we used only the title, author, and content of the document. Complete information about the available fields is contained in the constants defined by the interfaces that the Metadata class implements (<a href=\"http:\/\/tika.apache.org\/0.9\/api\/org\/apache\/tika\/metadata\/Metadata.html\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/tika.apache.org\/0.9\/api\/org\/apache\/tika\/metadata\/Metadata.html<\/a>). 
Particularly interesting are <em>DublinCore<\/em> and <em>MSOffice<\/em>.<\/p>\n<h2>The end<\/h2>\n<p>A short time after starting Solr and running the import process (by calling <em>http:\/\/localhost:8983\/solr\/dataimport?command=full-import<\/em>), the documents are indexed, which should be visible after sending the following query to Solr: <em>http:\/\/localhost:8983\/solr\/select?q=*:*<\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>In the previous article we gave basic information about how to enable indexing of binary files, e.g. MS Word files, PDF files, or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[256,254],"class_list":["post-256","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-data-import-handler-2","tag-dih-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/256","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=256"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/256\/revisions"}],"predecessor-version":[{"id":257,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/256\/revisions\/257"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=256"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=256"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=256"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}