Data Import Handler – How to import data from SQL databases (part 1)

In the article on how to import data (http://solr.pl/2010/09/06/solr-data-indexing-for-fun-and-profit/?lang=en) I mentioned the Data Import Handler (DIH). The main advantages of this method of importing data are that no additional software development is needed and that a data source can be integrated quickly. The second advantage, however, requires some skill and practice. In this entry I'll show you the basics of DIH integration with an SQL data source.

In the examples a PostgreSQL database containing Polish Wikipedia data was used. The test Solr instance is defined under the “wikipedia” root and was available at:

http://localhost:8983/solr/wikipedia/admin/

Solrconfig.xml configuration

All we need to do is add an additional request handler. The config parameter determines the configuration file which provides the definition of the data sources.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
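
One caveat: the DataImportHandler classes and the PostgreSQL JDBC driver ship as separate jars, so both must be on Solr's classpath – either copied into the instance's lib directory or referenced with <lib> directives in solrconfig.xml. A minimal sketch of the latter approach (the paths and jar names here are assumptions, adjust them to your installation and Solr version):

<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="./lib/" regex="postgresql-.*\.jar" />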

With this definition we gain the ability to call HTTP addresses that support import commands, such as:

/dataimport – in order to obtain the current status
/dataimport?command=reload-config – in order to re-read the configuration
/dataimport?command=full-import – in order to start a full indexing
/dataimport?command=delta-import – in order to start an incremental indexing

(for my configuration, the full address is: http://localhost:8983/solr/wikipedia/dataimport)

Above we see two ways of importing data: a full and an incremental import.

A full import loads all the data every time, while an incremental import adds only the data that has changed since the last indexing. A full import usually takes much longer, which leads to a simple conclusion: if the full import time is not an issue for you, it is probably not worth bothering with the incremental indexing configuration, especially since it imposes additional requirements on the structure of the data source (for example, a way to tell which rows have changed since the last run).

It is worth noting that a full import by default begins with the removal of the existing index. You can avoid this behavior by adding the parameter clean=false, as shown below.
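
For example, to start a full import that leaves the existing documents in place:

/dataimport?command=full-import&clean=false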

Configuring data sources

Configuring a data source means defining the queries that allow Solr to fetch the data to be indexed. The configuration goes in the file specified in the handler definition (in our case: db-data-config.xml). In this part of the article we will define the full import; in the next part we will extend it with the ability to import incrementally.

Full import

<dataConfig>
  <dataSource driver="org.postgresql.Driver"
     url="jdbc:postgresql://localhost:5432/wikipedia"
     user="wikipedia"
     password="secret" />
  <document>
    <entity name="page" query="SELECT page_id, page_title from page">
      <field column="page_id" name="id" />
      <field column="page_title" name="name" />
      <entity name="revision" query="select rev_id from revision where rev_page=${page.page_id}">
        <entity name="pagecontent" query="select old_text from pagecontent where old_id=${revision.rev_id}">
          <field column="old_text" name="text" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

As you can see, the source definition consists of a description of how to connect to the database and a description of the indexed document. The import proceeds according to the following algorithm:

  1. The old index is removed (unless the clean=false parameter is used).
  2. After the import command is called, Solr connects to the database.
  3. A database cursor is created using the query given in the “query” attribute of the main entity.
  4. A batch of rows is fetched.
  5. For each fetched record, variables named <entity name>.<returned column> are defined, so the returned values can be referenced in nested entities.
  6. The queries in the nested entities are executed.
  7. An entity may contain field definitions, which tell Solr how to map the columns of the result to the document fields defined in schema.xml (see the sketch after this list).
  8. A document built from the returned values is added to the index.
  9. If the cursor has more results, we jump back to step 4.
  10. Solr stores import metadata in the dataimport.properties file, the changes are committed, and the index is optimized.
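
For the mapping in step 7 to work, the target fields have to exist in schema.xml. A minimal sketch of the field definitions this configuration assumes (the field types and indexed/stored flags here are assumptions – use whatever your schema actually defines):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" />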

After opening the page http://localhost:8983/solr/wikipedia/dataimport we get the following response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

Worth noting is the entry status: idle, which means that our importer is ready to work. Otherwise (e.g. on an error in the XML configuration file) we get a description of the exception. Unfortunately, at this stage we are not able to detect errors such as an invalid database connection or a missing JDBC driver.

So we go to the page: http://localhost:8983/solr/wikipedia/dataimport?command=full-import

What we get should be similar to the XML above. However, after re-opening the page http://localhost:8983/solr/wikipedia/dataimport we get a different result:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:1:15.460</str>
    <str name="Total Requests made to DataSource">39547</str>
    <str name="Total Rows Fetched">59319</str>
    <str name="Total Documents Processed">19772</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-10-03 14:28:00</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

So the importer is working.

After some time, depending on the amount of indexed data and the speed of your computer, you will receive:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">2118645</str>
    <str name="Total Rows Fetched">3177966</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-10-03 14:28:00</str>
    <str name="">Indexing completed. Added/Updated: 1059322 documents. Deleted 0 documents.</str>
    <str name="Committed">2010-10-03 14:55:20</str>
    <str name="Optimized">2010-10-03 14:55:20</str>
    <str name="Total Documents Processed">1059322</str>
    <str name="Time taken ">0:27:20.325</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

As you can see, indexing was successful.
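
One thing worth noticing in the statistics above: 2,118,645 requests were made to the data source for 1,059,322 documents. The two nested entities each fire a separate query for every page row, so the database is hit roughly twice per document (plus the single root query). If the nested data can be joined on the database side, a single flattened query removes that overhead. A sketch of such an alternative configuration, assuming (as the row counts above suggest) that each page has exactly one matching revision and content row:

<dataConfig>
  <dataSource driver="org.postgresql.Driver"
     url="jdbc:postgresql://localhost:5432/wikipedia"
     user="wikipedia"
     password="secret" />
  <document>
    <entity name="page"
        query="SELECT p.page_id, p.page_title, pc.old_text
               FROM page p
               JOIN revision r ON r.rev_page = p.page_id
               JOIN pagecontent pc ON pc.old_id = r.rev_id">
      <field column="page_id" name="id" />
      <field column="page_title" name="name" />
      <field column="old_text" name="text" />
    </entity>
  </document>
</dataConfig>

Whether this pays off depends on the database: the joined rows are wider, but it avoids millions of round trips.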

In the next part of this article we will add the ability to import incrementally.
