{"id":223,"date":"2011-03-21T09:19:50","date_gmt":"2011-03-21T08:19:50","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=223"},"modified":"2020-11-11T09:20:27","modified_gmt":"2020-11-11T08:20:27","slug":"solr-and-tika-integration-part-1-basics","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/03\/21\/solr-and-tika-integration-part-1-basics\/","title":{"rendered":"Solr and Tika integration (part 1 &#8211; basics)"},"content":{"rendered":"<p><span id=\"goog-gtc-unit-1\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">Indexing so-called &#8220;rich documents&#8221;, i.e. files such as PDF, DOC, or RTF (binary files in general), has always required additional work on the developer&#8217;s side, at least to extract the contents of the file and prepare them in a format the search engine understands, in this case Solr.<\/span><\/span> <span id=\"goog-gtc-unit-2\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-mt\" dir=\"ltr\">To minimize this work, I decided to look at <a title=\"http:\/\/tika.apache.org\/\" href=\"http:\/\/tika.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Tika<\/a> and at the integration of this library with Solr.<\/span><\/span><\/p>\n\n\n<!--more-->\n\n\n<h3>Introduction<\/h3>\n<p><span id=\"goog-gtc-unit-4\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">First, a few words about what Apache Tika offers.<\/span><\/span> <span id=\"goog-gtc-unit-5\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">Apache Tika is a framework designed to extract information from so-called &#8220;rich documents&#8221;: PDF files, Microsoft Office files, RTF, and more.<\/span><\/span> <span id=\"goog-gtc-unit-6\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" 
dir=\"ltr\">With Apache Tika we can also extract information from compressed files, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAV), and compiled Java bytecode files.<\/span><\/span> <span id=\"goog-gtc-unit-7\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents.<\/span><\/span> <span id=\"goog-gtc-unit-8\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">It is worth mentioning that the framework builds on libraries such as PDFBox, Apache POI, and Neko HTML, which largely accounts for the quality of the extracted data.<\/span><\/span><\/p>\n<h3>Sample index structure<\/h3>\n<p><span id=\"goog-gtc-unit-10\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">I&#8217;ll skip how to run content extraction manually with Apache Tika and focus on how this framework integrates with Solr, which turns out to be trivial.<\/span><\/span> <span id=\"goog-gtc-unit-11\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">Assume that we are interested in the ID, title, and contents of the documents we have to index.<\/span><\/span> <span id=\"goog-gtc-unit-12\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-mt\" dir=\"ltr\">We therefore create a simple schema.xml file describing the index structure, which could look like this:<\/span><\/span>\n<\/p>\n<pre class=\"brush:xml\">&lt;field name=\"id\" type=\"string\" indexed=\"true\" stored=\"true\" required=\"true\" \/&gt;\n&lt;field name=\"tytul\" type=\"text\" indexed=\"true\" stored=\"true\"\/&gt;\n&lt;field name=\"zawartosc\" type=\"text\" indexed=\"true\" stored=\"false\" multiValued=\"true\"\/&gt;<\/pre>\n<h3>Configuration<\/h3>\n<p>To the 
solrconfig.xml file we add the following entry, which defines a handler that indexes documents using Apache Tika:\n<\/p>\n<pre class=\"brush:xml\">&lt;requestHandler name=\"\/update\/extract\" class=\"org.apache.solr.handler.extraction.ExtractingRequestHandler\"&gt;\n   &lt;lst name=\"defaults\"&gt;\n      &lt;str name=\"fmap.Last-Modified\"&gt;last_modified&lt;\/str&gt;\n      &lt;str name=\"uprefix\"&gt;ignored_&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p><span id=\"goog-gtc-unit-26\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">All update requests sent to the <em>\/update\/extract<\/em> address will be processed with Apache Tika.<\/span><\/span> <span id=\"goog-gtc-unit-27\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-human\" dir=\"ltr\">Remember to send the commit command after sending documents to the extraction handler; otherwise, your documents won&#8217;t be visible in search results. In a standard Solr deployment you should send the commit command to the handler located under <em>\/update<\/em>.<\/span><\/span><\/p>\n<p>In the configuration we told the extraction handler to map the Last-Modified attribute to the <em>last_modified<\/em> field and to ignore fields that are not defined in the schema (they are prefixed with <em>ignored_<\/em>).<\/p>\n<h3>Additional notes<\/h3>\n<p>If you are going to index large binary files, remember to change the size limits. 
To do that, change the following values in the solrconfig.xml file:\n<\/p>\n<pre class=\"brush:xml\">&lt;requestDispatcher handleSelect=\"true\"&gt;\n   &lt;requestParsers enableRemoteStreaming=\"false\" multipartUploadLimitInKB=\"10240\" \/&gt;\n&lt;\/requestDispatcher&gt;<\/pre>\n<h3><span id=\"goog-gtc-unit-36\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-tm goog-gtc-from-tm-score-100\" dir=\"ltr\">The end<\/span><\/span><\/h3>\n<p><span id=\"goog-gtc-unit-37\" class=\"goog-gtc-unit\"><span class=\"goog-gtc-translatable goog-gtc-from-mt\" dir=\"ltr\">All parameters accepted by the ExtractingRequestHandler are described at: <a title=\"http:\/\/wiki.apache.org\/solr\/ExtractingRequestHandler\" href=\"http:\/\/wiki.apache.org\/solr\/ExtractingRequestHandler\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/wiki.apache.org\/solr\/ExtractingRequestHandler<\/a>.<\/span><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Indexing the so-called &#8220;rich documents&#8221;, ie files like pdf, doc, rtf, and so on (or binary files) always required some additional work on the developer side, at least to get the contents of the file and prepare it in 
a<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[164,206],"class_list":["post-223","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-solr-2","tag-tika-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=223"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/223\/revisions"}],"predecessor-version":[{"id":224,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/223\/revisions\/224"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}