Data Import Handler – How to import data from SQL databases (part 1)

In the article on how to import data (http://solr.pl/2010/09/06/solr-data-indexing-for-fun-and-profit/?lang=en) I mentioned the Data Import Handler (DIH). The main advantage of this method of importing data is that no additional software development is needed and integration with the data source is rapid. Achieving this second advantage, however, requires skill and practice. In this entry I’ll show you the basics of DIH integration with an SQL data source.

In the examples a PostgreSQL database containing Polish Wikipedia data was used. The test Solr instance was defined under the “wikipedia” root and was available at:

http://localhost:8983/solr/wikipedia/admin/

Solrconfig.xml configuration

All we need to do is add an additional request handler. The config parameter determines the configuration file which provides the definition of the data sources.
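A minimal sketch of such a handler definition in solrconfig.xml (the config file name matches the db-data-config.xml used later in this article):

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>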

With this definition we gain the ability to call HTTP addresses that support import commands, such as:

/dataimport – in order to obtain the current status
/dataimport?command=reload-config – in order to re-read the configuration
/dataimport?command=full-import – in order to start a full indexing
/dataimport?command=delta-import – in order to start an incremental indexing

(for my configuration, the full address is: http://localhost:8983/solr/wikipedia/dataimport)

Above we see the two ways to import data: full and incremental import.

A full import loads all the data every time, while an incremental import adds only the data that has changed since the last indexing. A full import usually takes much longer, hence a simple conclusion: if the full import time is not an issue for you, it is probably not worth bothering with the incremental indexing configuration, especially since it imposes additional requirements on the structure of the data source.

It is worth noting that a full import by default begins with the removal of the existing index. You can avoid this behavior by adding the clean=false parameter (e.g. /dataimport?command=full-import&clean=false).

Configuring data sources

The configuration defines the queries that allow Solr to fetch the data to be indexed. It is included in the file specified in the handler definition (in our case: db-data-config.xml). In this part of the article we will define the full import. Then, in the next part, we will extend it with the ability to import incrementally.

Full import
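A sketch of a db-data-config.xml for a setup like ours (the pages table and its columns are illustrative assumptions; the JDBC URL, user, and password must match your own PostgreSQL instance):

<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/wikipedia"
              user="wiki" password="secret"/>
  <document>
    <entity name="page" query="SELECT id, title, text FROM pages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>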

As you can see, the definition of the source consists of a description of how to connect to the database and a description of the indexed document. The import proceeds according to the following algorithm:

  1. The old index is removed (unless the clean=false parameter is used).
  2. After the indexing command is called, Solr connects to the database.
  3. A database cursor is defined, using the query specified in the “query” attribute of the main entity.
  4. A batch of data is fetched from the cursor.
  5. For each fetched record, variables named <entity name>.<returned column> are defined, so the returned values can be referred to in nested entities.
  6. The queries in the nested entities are executed.
  7. An entity may contain field definitions. Thanks to them, Solr knows how to map the columns of the result to the fields of the document defined in schema.xml.
  8. The document created from the values returned by the queries is added to the index.
  9. If the cursor has another result, we jump back to step 4.
  10. Solr stores data in the dataimport.properties file (see the sketch below), the changes are committed, and index optimization is performed.
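A sketch of what dataimport.properties typically contains after a full import finishes (the timestamp below is illustrative; the last_index_time entry is what the incremental import in part 2 will rely on):

#Mon Oct 04 20:19:48 CEST 2010
last_index_time=2010-10-04 20\:19\:48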

After opening the page http://localhost:8983/solr/wikipedia/dataimport the following answer appeared:
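A sketch of that response (exact values will differ from instance to instance):

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages"/>
</response>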

What is worth noticing is the entry status: idle. This means that our importer is ready to work. In other cases (e.g. an error in the XML configuration file) we get a description of the exception. Unfortunately, at this stage we are not able to detect errors such as a faulty database connection or a missing JDBC driver.

So we go to the page: http://localhost:8983/solr/wikipedia/dataimport?command=full-import

What we should get is similar to the above XML. However, after re-entering the page http://localhost:8983/solr/wikipedia/dataimport we get a different result.
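While the import is running, the interesting fragment of the status response looks roughly like this (a sketch based on the standard DIH busy status):

<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>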

So the importer is working.

After some time, depending on the amount of indexed data and the speed of your computer, you will receive:
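A sketch of the final status (all counts and timestamps below are illustrative):

<str name="status">idle</str>
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">868050</str>
  <str name="Total Documents Processed">868050</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2010-10-04 20:00:00</str>
  <str name="">Indexing completed. Added/Updated: 868050 documents. Deleted 0 documents.</str>
  <str name="Committed">2010-10-04 20:27:14</str>
  <str name="Optimized">2010-10-04 20:27:14</str>
</lst>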

As you can see, the indexing was successful.

In the next part of this article we will try to add the ability to import incrementally.

31 thoughts on “Data Import Handler – How to import data from SQL databases (part 1)”

  • 7 January 2011 at 00:41

    Hi, I followed your article, but when I try to do a full-import I get the status:

    "status":"DataImportHandler started. Not Initialized. No commands can be run"

    Any idea how I can initialize the import handler? The log shows no error, just:

    Jan 6, 2011 4:38:52 PM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr path=/dataimport params={indent=on&command=full-import&wt=json} status=0 QTime=0

  • 7 January 2011 at 08:49

Please send us your DIH and schema configuration files (you can use the email address from the “information” page). Which Solr version do you have?

  • 6 May 2011 at 11:49

Hi, I’m a novice user of Solr and I’ve followed this article of yours. My need is to index database tables from MySQL into Solr: rather than using the default integrated HSQL DB, I want Solr to point to MySQL by default. What configuration changes do I need to make in order to get this working? Any guidance here would be highly appreciated.

  • 6 May 2011 at 16:15

    Hi,

I’m not sure I understand you correctly, but if you use the Data Import Handler with a database, you will find the driver and JDBC URL definition in the dataSource tag. Probably you need something like this:
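    A sketch of such a definition for MySQL (host, port, database name, and credentials below are placeholders; the driver class is the standard MySQL Connector/J one):

    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/mydb"
                user="user" password="pass"/>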

  • 5 October 2012 at 20:02

    Hello,
    Whenever I do a full-import the index documents get created, but the index folder contains all the existing documents as well. It should overwrite the existing files, as I am doing a full-import.
    What is the way to clean the existing documents in the case of a full-import?

    • 18 October 2012 at 09:24

      Add the clean=true parameter to achieve that.

  • 9 October 2012 at 16:13

    Hi,

    I am able to run as per the instructions. When I access http://localhost:8983/solr/dataimport, I don’t get any errors. In the status response it is idle (completed).

    In the response it shows: “Total Rows Fetched” = 7

    But the final response is: “Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.”

    Where am I going wrong... can you please help with the same?

    Thanks
    Jameel

    • 18 October 2012 at 09:26

That means that no documents were indexed into Solr. Check your Data Import Handler configuration – it looks like 7 rows were fetched from the DB, but they didn’t create any Solr documents. Without looking at the configuration file it is hard to say what is wrong.

  • 9 October 2012 at 16:17

    When I execute http://localhost:8983/solr/dataimport (after it is completed),
    I don’t see the DB indexes getting stored in the properties file (D:\apache-solr-3.6.1\apache-solr-3.6.1\example\solr\conf\dataimport.properties) as mentioned.

    Where are the DB indexes actually stored?

    Thanks
    Jameel

  • 10 December 2013 at 12:52

I am not able to download the SQL connector. Please point me to the correct website where I can download the MySQL connector.

  • 16 December 2014 at 07:52

    Thanks, nice post

  • 8 February 2015 at 21:10

This is my data-config doc and there is a problem: the name column in the articles table contains words which contain the ’ character. Therefore this error occurs: Exception while processing: articles document : SolrInputDocument(articles.name=’A morte ‘e Carnevale)

  • 4 October 2015 at 17:39

    Hello friends,
    after following the tutorial I am getting two errors, please have a look:
    1) Exception while processing: empinfo document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select id,name from empinfo Processing Document # 1
    and
    2) Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select id,name from empinfo Processing Document # 1

    These errors are generated during the import from MySQL to Solr. Please tell me what the issue is. Where am I wrong?

    • 13 October 2015 at 15:55

It seems that the Data Import Handler is not able to execute the query. Please look at the Solr logs – they will give you more information on the potential reasons behind it.

  • 1 July 2016 at 13:34

    Hi,

    I’m still very much a newbie in the realms of Solr.

    I’m working on a web application in which I use Apache Solr and a MySQL database.

    I’m using the Data Import Handler to index data from my database,

    and I’m wondering if you can help me find a solution to my problem:

    my database schema looks like this:

    one book is associated with one category (or discipline), which can have many subcategories (or subdisciplines), and each subcategory can have other subcategories, and so on…

    1/ How can Solr index each book with its associated categories, subcategories, sub-subcategories, and so on?

    2/ I tried to configure a Data Import Handler in this way:

    but this way I can only index book info with a discipline and a subdiscipline, without getting the sub-subdiscipline indexed. What should I modify there in order to index all subdisciplines and sub-subdisciplines and so on?

    Thank you so much

    • 6 July 2016 at 17:24

Hi, our comment system cuts out non-text parts. Could you please send the same, but to the contact email?

  • 18 October 2016 at 12:36

I am trying to import and index data from a database, but after executing it doesn’t show anything. Please help – I’m new to Solr.

    • 20 January 2017 at 14:52

      How are you trying to do that? Do you see anything in Solr logs?

  • 18 October 2016 at 12:38

And when I tried with XML, it shows: indexing completed, added/updated 0 documents. What should I do now?

    • 20 January 2017 at 14:53

      It means that Data Import Handler didn’t index any documents. Are there any errors and have you checked the query you are running?

  • 4 April 2017 at 15:35

    Greetings,
    I can execute the import by clicking the “execute” button after visiting the URL:

    http://x.x.x.x:8080/solr/#/samplecore/dataimport/dataimport_solrsearchview

    It does not automatically start indexing if I use the URL:
    http://x.x.x.x:8080/solr/#/samplecore/dataimport/dataimport_solrsearchview?command=full-import

    I’m not sure why I have to click “execute” to start the process.
This is on a Solr 6 server.

    Any thoughts?
    I can provide more details/configs as required.
    Thanks!

    • 17 July 2017 at 22:28

If I understand what you are doing correctly, this is normal behavior. The Data Import Handler doesn’t start automatically – you need to execute the full-import command to start a full import.

  • 8 August 2017 at 14:16

    Hi, I am using the following db-data-config.xml.

    In the Solr config I have configured the database importer.

    When I try to use the database importer from the Solr admin UI for the core I have created, the required data is fetched from the Oracle database, but the data is not getting indexed.

    I am getting the following verbose output:

    {
      "command": "full-import",
      "responseHeader": {
        "status": 0,
        "QTime": 2995
      },
      "initArgs": [
        "defaults",
        [
          "config",
          "db-data-config.xml"
        ]
      ],
      "mode": "debug",
      "documents": [],
      "verbose-output": [
        "entity:user",
        [
          "document#1",
          [
            "query",
            "select user_IID, USERNAME from usr",
            "time-taken",
            "0:0:2.485",
            null,
            "----------- row #1-------------",
            "USER_IID", "1258",
            "USERNAME", "pt098admin05",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1259",
            "USERNAME", "pt099admin01",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1260",
            "USERNAME", "pt099admin02",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1261",
            "USERNAME", "pt099admin03",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1262",
            "USERNAME", "pt099admin04",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1263",
            "USERNAME", "pt099admin05",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1264",
            "USERNAME", "pt100admin01",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1265",
            "USERNAME", "pt100admin02",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1266",
            "USERNAME", "pt100admin03",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          [
            null,
            "----------- row #1-------------",
            "USER_IID", "1267",
            "USERNAME", "pt100admin04",
            null,
            "---------------------------------------------"
          ],
          "document#1",
          []
        ]
      ],
      "status": "idle",
      "importResponse": "",
      "statusMessages": {
        "Total Requests made to DataSource": "1",
        "Total Rows Fetched": "10",
        "Total Documents Processed": "0",
        "Total Documents Skipped": "0",
        "Full Dump Started": "2017-08-08 13:04:44",
        "": "Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.",
        "Committed": "2017-08-08 13:04:46",
        "Time taken": "0:0:2.497"
      }
    }

    I have also set autocommit to true in solrconfig.xml.

    Please help me to resolve the issue.

  • 8 August 2017 at 14:46

    <dataConfig>
      <dataSource name="jdbc" driver="oracle.jdbc.OracleDriver"
                  url="jdbc:oracle:thin:@XXX:1521/YYYY" user="user" password="pass"/>
      <document>
        <entity name="user" query="select user_IID, USERNAME from usr">
          <field column="USER_IID" name="user_iid"/>
          <field column="USERNAME" name="username"/>
        </entity>
      </document>
    </dataConfig>

    • 2 September 2017 at 08:24

Check whether your schema contains the needed fields.

  • 3 November 2017 at 09:58

    <entity name="Jobs" transformer="TemplateTransformer" dataSource="JdbcDataSource"
    …………………….

    <entity name="Candidates" transformer="TemplateTransformer" dataSource="JdbcDataSource"
    …………………..

    Question: There is a Solr collection that contains Job and Candidate entity. When I do full-import all data gets indexed into the solr. When I do full-import of Job entity only, candidate entity data gets removed from the solr collection and When I do full-import of candidate entity only, then job entity data gets removed. Please suggest..

    Thanks

  • 3 January 2019 at 12:40

Hi, I have set up a SolrCloud cluster with 3 nodes and I am trying to index 500 MB of data with 9 million records in the table. When I try to do a full import it fails within 3 seconds of starting the import. I have tried setting batchSize="-1" but this throws a string-index-out-of-bounds error. Here is my db-data-config.xml:

    and I have solrconfig.xml and schema.xml in place.
    But the log is:
    Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT * from counter.series Processing Document # 1
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
    … 3 more
    Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT * from counter.series Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:318)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:279)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:54)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
    … 5 more
    Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.apache.cassandra.cql.jdbc.CassandraResultSet.createColumn(CassandraResultSet.java:1159)
    at org.apache.cassandra.cql.jdbc.CassandraResultSet.populateMetaData(CassandraResultSet.java:220)
    at org.apache.cassandra.cql.jdbc.CassandraResultSet.<init>(CassandraResultSet.java:190)
    at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:169)
    at org.apache.cassandra.cql.jdbc.CassandraStatement.execute(CassandraStatement.java:205)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.executeStatement(JdbcDataSource.java:338)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:313)
    … 12 more

I am stuck… any help is appreciated.

    • 11 January 2019 at 18:00

Hard to say what the main issue might be. What does the Solr log say?

