{"id":84,"date":"2010-10-11T14:15:04","date_gmt":"2010-10-11T12:15:04","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=84"},"modified":"2020-11-10T14:15:43","modified_gmt":"2020-11-10T13:15:43","slug":"data-import-handler-how-to-import-data-from-sql-databases-part-1","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2010\/10\/11\/data-import-handler-how-to-import-data-from-sql-databases-part-1\/","title":{"rendered":"Data Import Handler \u2013 How to import data from SQL databases (part 1)"},"content":{"rendered":"<p>In the  article on how to import data  (<a title=\"Solr: data indexing for fun and profit\" href=\"http:\/\/solr.pl\/2010\/09\/06\/solr-data-indexing-for-fun-and-profit\/?lang=en\">http:\/\/solr.pl\/2010\/09\/06\/solr-data-indexing-for-fun-and-profit\/?lang=en<\/a>) I mentioned the <strong> Data Import Handler<\/strong> (DIH). The main advantage of  this method of data importing is no need for additional software  development and the rapid integration of the data source. This second advantage,  however, requires skill and practice. In this entry I&#8217;ll  show you the basics of DIH integration with SQL data source.<\/p>\n\n\n<!--more-->\n\n\n<p>In the examples  PostgeSQL database was used containing Polish Wikipedia data.  Test solr  instance is defined as the root of &#8220;wikipedia&#8221; and was available at:<\/p>\n<p><a href=\"http:\/\/localhost:8983\/solr\/wikipedia\/admin\/\">http:\/\/localhost:8983\/solr\/wikipedia\/admin\/<\/a><\/p>\n<p><strong>Solrconfig.xml  configuration<\/strong><\/p>\n<p>All what we need to do  is add an additional request handler. 
The config parameter determines the configuration file which provides the definition of the data sources.\n<\/p>\n<pre class=\"brush:xml\"> &lt;requestHandler name=\"\/dataimport\" class=\"org.apache.solr.handler.dataimport.DataImportHandler\"&gt;\n  &lt;lst name=\"defaults\"&gt;\n   &lt;str name=\"config\"&gt;db-data-config.xml&lt;\/str&gt;\n  &lt;\/lst&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p>With this definition we gain an HTTP address that supports import commands such as:<\/p>\n<p>\/dataimport &#8211; in order to obtain the current status<br>\n\/dataimport?command=reload-config &#8211; in order to re-read the configuration<br>\n\/dataimport?command=full-import &#8211; in order to start a full indexing<br>\n\/dataimport?command=delta-import &#8211; in order to start an incremental indexing<\/p>\n<p>(for my configuration, the full address is: <a href=\"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/localhost:8983\/solr\/wikipedia\/dataimport<\/a>)<\/p>\n<p>Above we see two ways to import data: full and incremental.<\/p>\n<p>A <strong>full import<\/strong> loads all the data every time, while an <strong>incremental import<\/strong> adds only the data that has changed since the last indexing. A full import usually takes much longer, hence a simple conclusion: if the full import time is not an issue for you, it is probably not worth bothering with the incremental indexing configuration, especially since it imposes additional requirements on the structure of the data source.<\/p>\n<p>It is worth noting that by default a full import begins with the removal of the existing index. You can avoid this behavior by adding the parameter <em>clean=false<\/em>.<\/p>\n<p><strong>Configuring data sources<\/strong><\/p>\n<p>The configuration defines the queries that allow Solr to fetch the data to be indexed. 
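<\/p>\n<p>The result columns of those queries are mapped to fields defined in schema.xml. The article does not show its schema, so the following is only a minimal sketch of field definitions that would match the mapping used here (the field names id, name and text come from the data source configuration; the types and attributes are assumptions):<\/p>\n<pre class=\"brush:xml\">&lt;field name=\"id\" type=\"string\" indexed=\"true\" stored=\"true\" required=\"true\" \/&gt;\n&lt;field name=\"name\" type=\"text\" indexed=\"true\" stored=\"true\" \/&gt;\n&lt;field name=\"text\" type=\"text\" indexed=\"true\" stored=\"true\" \/&gt;<\/pre>\n<p>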
It is included in the file specified in the handler definition (in our case: db-data-config.xml). In this part of the article we will define the full import; in the next part we will extend it with the ability to import incrementally.<\/p>\n<p><strong>Full import<\/strong>\n<\/p>\n<pre class=\"brush:xml\">&lt;dataConfig&gt;\n  &lt;dataSource driver=\"org.postgresql.Driver\"\n     url=\"jdbc:postgresql:\/\/localhost:5432\/wikipedia\"\n     user=\"wikipedia\"\n     password=\"secret\" \/&gt;\n  &lt;document&gt;\n    &lt;entity name=\"page\" query=\"SELECT page_id, page_title from page\"&gt;\n      &lt;field column=\"page_id\" name=\"id\" \/&gt;\n      &lt;field column=\"page_title\" name=\"name\" \/&gt;\n      &lt;entity name=\"revision\" query=\"select rev_id from revision where rev_page=${page.page_id}\"&gt;\n        &lt;entity name=\"pagecontent\" query=\"select old_text from pagecontent where old_id=${revision.rev_id}\"&gt;\n          &lt;field column=\"old_text\" name=\"text\" \/&gt;\n        &lt;\/entity&gt;\n      &lt;\/entity&gt;\n   &lt;\/entity&gt;\n  &lt;\/document&gt;\n&lt;\/dataConfig&gt;<\/pre>\n<p>As you can see, the source definition consists of a description of how to connect to the database and a description of the indexed document. The import proceeds according to the following algorithm:<\/p>\n<ol>\n<li>The old index is removed (unless the parameter clean=false is used)<\/li>\n<li>When the indexing command is called, Solr connects to the database<\/li>\n<li>A database cursor is created using the query specified in the &#8220;query&#8221; argument of the main entity<\/li>\n<li>A batch of data is fetched<\/li>\n<li>For each fetched record, variables named &lt;entity name&gt;.&lt;returned column&gt; are defined, so the returned values can be referenced in the nested entities<\/li>\n<li>The queries in the nested entities are executed<\/li>\n<li>An entity may contain field definitions. 
Thanks to these, Solr is able to map the columns of the result to the document fields defined in schema.xml<\/li>\n<li>A document built from the values returned by the queries is added to the index<\/li>\n<li>If the cursor has another result, we jump back to step 4<\/li>\n<li>Solr stores its state in the dataimport.properties file, the data is committed, and index optimization is performed<\/li>\n<\/ol>\n<p>After opening the page <a href=\"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/localhost:8983\/solr\/wikipedia\/dataimport<\/a> the following response appears:\n<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;response&gt;\n  &lt;lst name=\"responseHeader\"&gt;\n    &lt;int name=\"status\"&gt;0&lt;\/int&gt;\n    &lt;int name=\"QTime\"&gt;0&lt;\/int&gt;\n  &lt;\/lst&gt;\n  &lt;lst name=\"initArgs\"&gt;\n    &lt;lst name=\"defaults\"&gt;\n      &lt;str name=\"config\"&gt;db-data-config.xml&lt;\/str&gt;\n    &lt;\/lst&gt;\n  &lt;\/lst&gt;\n  &lt;str name=\"status\"&gt;idle&lt;\/str&gt;\n  &lt;str name=\"importResponse\"\/&gt;\n  &lt;lst name=\"statusMessages\"\/&gt;\n  &lt;str name=\"WARNING\"&gt;This response format is experimental.  It is likely to change in the future.&lt;\/str&gt;\n&lt;\/response&gt;<\/pre>\n<p>The entry worth noting is status: idle. It means that our importer is ready to work. Otherwise (e.g. when there is an error in the XML configuration file), we get a description of the exception. Unfortunately, at this stage we are not able to detect errors such as a failed database connection or a missing JDBC driver.<\/p>\n<p>So we go to the page: <a href=\"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport?command=full-import\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/localhost:8983\/solr\/wikipedia\/dataimport?command=full-import<\/a><\/p>\n<p>What we should get is similar to the XML above. 
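<\/p>\n<p>The import can also be triggered and monitored from the command line; for example with curl (a sketch assuming the instance address used throughout this article):<\/p>\n<pre class=\"brush:shell\"># start a full import\ncurl \"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport?command=full-import\"\n# then poll the status of the running import\ncurl \"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport\"<\/pre>\n<p>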
However, after re-entering the page:  <a href=\"http:\/\/localhost:8983\/solr\/wikipedia\/dataimport\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/localhost:8983\/solr\/wikipedia\/dataimport<\/a> we get a different  result.\n<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;response&gt;\n  &lt;lst name=\"responseHeader\"&gt;\n    &lt;int name=\"status\"&gt;0&lt;\/int&gt;\n    &lt;int name=\"QTime\"&gt;0&lt;\/int&gt;\n  &lt;\/lst&gt;\n  &lt;lst name=\"initArgs\"&gt;\n    &lt;lst name=\"defaults\"&gt;\n      &lt;str name=\"config\"&gt;db-data-config.xml&lt;\/str&gt;\n    &lt;\/lst&gt;\n  &lt;\/lst&gt;\n  &lt;str name=\"status\"&gt;busy&lt;\/str&gt;\n  &lt;str name=\"importResponse\"&gt;A command is still running...&lt;\/str&gt;\n  &lt;lst name=\"statusMessages\"&gt;\n    &lt;str name=\"Time Elapsed\"&gt;0:1:15.460&lt;\/str&gt;\n    &lt;str name=\"Total Requests made to DataSource\"&gt;39547&lt;\/str&gt;\n    &lt;str name=\"Total Rows Fetched\"&gt;59319&lt;\/str&gt;\n    &lt;str name=\"Total Documents Processed\"&gt;19772&lt;\/str&gt;\n    &lt;str name=\"Total Documents Skipped\"&gt;0&lt;\/str&gt;\n    &lt;str name=\"Full Dump Started\"&gt;2010-10-03 14:28:00&lt;\/str&gt;\n  &lt;\/lst&gt;\n  &lt;str name=\"WARNING\"&gt;This response format is experimental.  
It is likely to change in the future.&lt;\/str&gt;\n&lt;\/response&gt;<\/pre>\n<p>So the importer is working.<\/p>\n<p>After some time, depending on the amount of indexed data and the speed of your computer, you will receive:\n<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;response&gt;\n  &lt;lst name=\"responseHeader\"&gt;\n    &lt;int name=\"status\"&gt;0&lt;\/int&gt;\n    &lt;int name=\"QTime\"&gt;0&lt;\/int&gt;\n  &lt;\/lst&gt;\n  &lt;lst name=\"initArgs\"&gt;\n    &lt;lst name=\"defaults\"&gt;\n      &lt;str name=\"config\"&gt;db-data-config.xml&lt;\/str&gt;\n    &lt;\/lst&gt;\n  &lt;\/lst&gt;\n  &lt;str name=\"status\"&gt;idle&lt;\/str&gt;\n  &lt;str name=\"importResponse\"\/&gt;\n  &lt;lst name=\"statusMessages\"&gt;\n    &lt;str name=\"Total Requests made to DataSource\"&gt;2118645&lt;\/str&gt;\n    &lt;str name=\"Total Rows Fetched\"&gt;3177966&lt;\/str&gt;\n    &lt;str name=\"Total Documents Skipped\"&gt;0&lt;\/str&gt;\n    &lt;str name=\"Full Dump Started\"&gt;2010-10-03 14:28:00&lt;\/str&gt;\n    &lt;str name=\"\"&gt;Indexing completed. Added\/Updated: 1059322 documents. Deleted 0 documents.&lt;\/str&gt;\n    &lt;str name=\"Committed\"&gt;2010-10-03 14:55:20&lt;\/str&gt;\n    &lt;str name=\"Optimized\"&gt;2010-10-03 14:55:20&lt;\/str&gt;\n    &lt;str name=\"Total Documents Processed\"&gt;1059322&lt;\/str&gt;\n    &lt;str name=\"Time taken \"&gt;0:27:20.325&lt;\/str&gt;\n  &lt;\/lst&gt;\n  &lt;str name=\"WARNING\"&gt;This response format is experimental.  It is likely to change in the future.&lt;\/str&gt;\n&lt;\/response&gt;<\/pre>\n<p>As you can see, the indexing was successful.<\/p>\n<p>In the next part we will add the ability to import incrementally.<\/p>","protected":false},"excerpt":{"rendered":"<p>In the article on how to import data (http:\/\/solr.pl\/2010\/09\/06\/solr-data-indexing-for-fun-and-profit\/?lang=en) I mentioned the Data Import Handler (DIH). 
The main advantages of this method of data importing are that no additional software development is needed and that a data source can be integrated rapidly.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[],"class_list":["post-84","post","type-post","status-publish","format-standard","hentry","category-solr-en"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/84","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=84"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/84\/revisions"}],"predecessor-version":[{"id":85,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/84\/revisions\/85"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=84"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=84"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=84"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}