Data Import Handler – How to import data from SQL databases (part 1)

In the article on how to import data (http://solr.pl/2010/09/06/solr-data-indexing-for-fun-and-profit/?lang=en) I mentioned the Data Import Handler (DIH). The main advantage of this method of data importing is no need for additional software development and the rapid integration of the data source. This second advantage, however, requires skill and practice. In this entry I’ll show you the basics of DIH integration with SQL data source.
In the examples PostgeSQL database was used containing Polish Wikipedia data. Test solr instance is defined as the root of “wikipedia” and was available at:

http://localhost:8983/solr/wikipedia/admin/

Solrconfig.xml configuration

All what we need to do is add an additional request handler. Parameter: config determines configuration file which provides the definition of data sources.

 <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
   <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

With this definition we gain the ability to call an HTTP address that supports import commands like:

/dataimport – in order to obtain the current status
/dataimport?command=reload-config – in order to re-read the configuration
/dataimport?command=full-import – in order to start a full indexing
/dataimport?command=delta-import – in order to start an incremental indexing

(for my configuration, the full address is: http://localhost:8983/solr/wikipedia/dataimport)

Above we see two possibilities to import data: full and incremental import.

The full import of data rely on loading all data every time, while incremental import means only adding the data that changed since the last indexing. The full import usually takes much longer from here a simple conclusion: if the full import time is not an issue for use – it’s probably not worth to bother with the incremental indexing configuration, especially since it imposes additional requirements on the structure of the data source.

It is worth noting that the full import of default begins with the removal of the existing index. You can avoid this behavior by adding the parameter: clean=false.

Configuring data sources

Configuration is to define a queries allowing solr to download data to be indexed. Configuration is included in the file specified in the definition of a handler (in our case: db-data-config.xml). In this part of the article we will define the full import. Then, in another part of the article we extend it with the ability to import incrementally.

Full import

<dataConfig>
  <dataSource driver="org.postgresql.Driver"
     url="jdbc:postgresql://localhost:5432/wikipedia"
     user="wikipedia"
     password="secret" />
  <document>
    <entity name="page" query="SELECT page_id, page_title from page">
      <field column="page_id" name="id" />
      <field column="page_title" name="name" />
      <entity name="revision" query="select rev_id from revision where rev_page=${page.page_id}">
        <entity name="pagecontent" query="select old_text from pagecontent where old_id=${revision.rev_id}">
          <field column="old_text" name="text" />
        </entity>
      </entity>
   </entity>
  </document>
</dataConfig>

As you can see the definition of the source consists of a description of how to link to databases and a description of an indexed document. Import will proceed with the following algorithm:

  1. The old index is removed (if not used with parameter clean=false)
  2. After indexing command is called Solr connects to the database.
  3. Database cursor is defined using the database query specified in the argument “query” in the main entity
  4. Some amount of data is collected
  5. For each record downloaded there are variables defined as <entity name>.<returned column>, so the returned value can be appealed in nested entities
  6. Queries in nested entities are called
  7. Entity may contain fields definition. Thanks to that, Solr is able to identify the mapping of a columns of the result to the field of the document defined in the schema.xml
  8. Document created by the values returned by the query is added to the index
  9. If the cursor has another result we jump to step 4
  10. Solr stores data to a dataimport.properties file, the data are approved (commit) and index optimization is performed

After opening the page http://localhost:8983/solr/wikipedia/dataimport the following answer appeared:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

What is worth seeing is entry: status: idle. This means that our importer is ready to work. In another case (eg error in the XML configuration file), we get a description of the exception. Unfortunately, at this stage we are not able to detect such errors as improper connection to the database or lack of JDBC driver.

So we go to page: http://localhost:8983/solr/wikipedia/dataimport?command=full-import

What we should get, is similar to the above XML. However, after re-entering the page: http://localhost:8983/solr/wikipedia/dataimport we get a different result.

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:1:15.460</str>
    <str name="Total Requests made to DataSource">39547</str>
    <str name="Total Rows Fetched">59319</str>
    <str name="Total Documents Processed">19772</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-10-03 14:28:00</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

So the importer is working.

After some time, depending on the amount of indexed data and the speed of your computer you will receive:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">2118645</str>
    <str name="Total Rows Fetched">3177966</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-10-03 14:28:00</str>
    <str name="">Indexing completed. Added/Updated: 1059322 documents. Deleted 0 documents.</str>
    <str name="Committed">2010-10-03 14:55:20</str>
    <str name="Optimized">2010-10-03 14:55:20</str>
    <str name="Total Documents Processed">1059322</str>
    <str name="Time taken ">0:27:20.325</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

As you can see indexing was successful.

In the next section we will try to add the ability to import incrementally.

This post is also available in: Polish

This entry was posted on Monday, October 11th, 2010 at 06:54 and is filed under Bez kategorii. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

23 Responses to “Data Import Handler – How to import data from SQL databases (part 1)”

  1. jared Says:

    Hi, I followed your article, but when I try to do a full-import i get the status:

    “status”:”DataImportHandler started. Not Initialized. No commands can be run”

    Any idea how I can initialize the import handler? The log shows no error, just:

    Jan 6, 2011 4:38:52 PM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/dataimport params={indent=on&command=full-import&wt=json} status=0 QTime=0

  2. negativ Says:

    Please send us your DIH and schema configuration file (You can use email address from “information” page). Which solr version do you have?

  3. Rahul Says:

    Hi, I m a novice user of solr I’ve followed this article of your’s. But my need is to index database tables from mysql into solr rather than using the default integrated hsql db I want solr to point to mysql by default what all configuration changes do I need to make in order to get this working. Any guidance here would be highly appreciated.

  4. negativ Says:

    Hi,

    I’m not sure I understand you correctly but if you use data import handler with database, you will find driver and jdbc url definition in dataSource tag. Propably you need something like that:

  5. Martin Sánchez Says:

    Hi.

    in order to made a successful import we need to put the database driver in the lib folder.

    In my case, I work with MySQL
    Download the JDBC Driver for MySQL from: http://mysql.spd.co.il/Downloads/Connector-J/
    Put the “mysql-connector-java-version-bin.jar” in
    Solr SOLRFolder/example/lib

    It should work

  6. Milan Says:

    Hello,
    Whenver I do full-import, the indexing doucments gets created, but the index folder contains all the existing documents as well, it should overwrite the existing files, as I am doing a full-import.
    What is the way to clean the existing documents in case of full-import

  7. gr0 Says:

    Add the clean=true parameter to achieve that.

  8. Jameel Says:

    Hi,

    I am able to run as per the instruction. When I access, http://localhost:8983/solr/dataimport, I dont get any errors. In the status response it is idle (completed).

    In response, It shows: “Total Rows Fetched”=7
    7

    But final response is: “Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.”

    Where I missing… can u please help on the same.

    Thanks
    Jameel

  9. gr0 Says:

    That mean that no document was indexed to Solr. Check your data import handler configuration – it looks like 7 rows from DB was fetched, but they didn’t create a Solr document. Without looking on the configuration file it hard to say what is wrong.

  10. Jameel Says:

    When I execute http://localhost:8983/solr/dataimport (after it is completed),
    I dont see DB indexes are getting stored in the properties file (D:\apache-solr-3.6.1\apache-solr-3.6.1\example\solr\conf\dataimport.properties) as mentioned.

    Where the db indexes are actually stored ?

    Thanks
    Jameel

  11. gr0 Says:

    Was there any data indexed ?

  12. rajesh Says:

    I cannot able to download the sql connector. please provide me the correct website where i can download the mySQL connector.

  13. gr0 Says:

    First link when I search – http://www.mysql.com/products/connector/

  14. Binh Thanh Nguyen Says:

    Thanks, nice post

  15. muhammet Says:

    this is my data-config doc. there is a problem.So, name is column in articles table contains words which contains ‘ . Therefore this error occur Exception while processing: articles document : SolrInputDocument(articles.name=’A morte ‘e Carnevale)

  16. sandy Says:

    Hello friends,
    after following the toturial, i am getting two error, please have a look
    1)Exception while processing: empinfo document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select id,​name from empinfo Processing Document # 1
    and
    2)Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select id,​name from empinfo Processing Document # 1

    These errors generate during import from mysql to solr. plz tell me what the issue. where am wrong?

  17. gr0 Says:

    It seems that data import handler is not able to execute the query. Please look at Solr logs and it will give you more information on potential reasons behind it.

  18. james Says:

    Hi,

    I’m still very much a newbie in the realms of Solr.

    I m working on a web application in which i have used apache solr and mysql database

    i’m using data import handler to index data from my database

    and i ‘m wondering if you can help me to find a solution to my problem:

    my database schemas look like this

    one book is associated to one category (or discipline) which can have many subcategory (or subdiscipline) and each subcategory can have other subcategories and so on…

    1/ how can solr index each with it’s associated categories and subcategories and sub subcategories and so on…?

    2/ i tried to configure a data import handler in this way :

    but in this way i can only index book info with a discipline and a subdiscipline only without getting the subsubdiscipline indexed, what should i modify there in order to index all subdisciplines and sub subdisciplines and so on…

    thank you so much

  19. gr0 Says:

    Hi, our comment system cuts out non text parts. Could you please send the same, but to the contact mail?

  20. kratika Says:

    I am trying to import and index data from database, bt after executing it doesn’t show anything. Please help. new to solr.

  21. gr0 Says:

    How are you trying to do that? Do you see anything in Solr logs?

  22. kratika Says:

    And when I tried with xml, it showws indexing complete. added/updated 0 document. now what should I do.

  23. gr0 Says:

    It means that Data Import Handler didn’t index any documents. Are there any errors and have you checked the query you are running?