Data Import Handler – sharding

Our reader (greetings!) reported us a problem with the cooperation of DIH and sharding mechanism. The Solr project wiki, in my opinion, discuss the solution to this issue, but makes it a little around and on the occasion.

What is sharding?

Sharding means the division of data into several parts and the storage and processing of the data independently. The additional logic within the application allows you to select the appropriate part of data and/or pooling results from various sources. In the case of DIH and sarding we have to deal with the following case:

sharding on the side of the data source – this means multiple locations/tables with different parts of the data set
sharding on the SOLR side – that is, dividing the data from a source on many independent instances of SOLR
both of these simultaneously

In our case we have one set of data and we want to create a lot of sets (called shards) on the SOLR side.

When to use sharding?

A very important question: why we use sharding mechanism ? In my opinion sharding happens to be abused too often and thus generate lots of additional complications and limitations. The main reason o use sharding is the large volume of data that make SOLR index does not fall within one machine. If it does not – it often means that sharding is redundant. Another reason is performance. But sharding can help here only if other optimization fails and the queries are so complicated that the same addintional cost of sharding (forward queries to the individual Shards and combining their answers) is less than the profit performance that can be achieved.

Test data

Let’s assume that we need sharding. In the example below, I used data from the MusicBrainz creating a simple postgresql table:

Table "public.albums"

 Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------

 id     | integer | not null default nextval('albums_id_seq'::regclass)

 name   | text    | not null

 author | text    | not null

Indexes:

"albums_pk" PRIMARY KEY, btree (id)

The table contains 825,661 records. I stress here that both the structure and amount of data is so small that the practical usefulness of using sharding here is negligible.

Test instalation

For the tests we use three instances of SOLR. All instances are identical, the difference is related only to the number of ports (8983, 7872, 6761) – Tests will be performed on one physical machine.

Definition at schema.xml:

<fields>
 <field name="id" type="string" indexed="true" stored="true" required="true" />
 <field name="album" type="text" indexed="true" stored="true" multiValued="true"/>
 <field name="author" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>album</defaultSearchField>

Definition of DIH in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
  <str name="config">db-data-config.xml</str>
 </lst>
</requestHandler>

And the file DIH db-data-config.xml:

<dataConfig>
 <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" />
 <document>
  <entity name="album" query="SELECT * from albums">
   <field column="id" name="id" />
   <field column="name" name="album" />
   <field column="author" name="author" />
  </entity>
 </document>
</dataConfig>

At this point, each instance is unable to complete the data import.

So let’s setup sharding

Our goal is to modify the configuration such that each instance of DIH index only “their” part of the data. The easiest way to do this is by modifying the query retrieving data to the one like this:

SELECT * from albums where id % NUMBER_OF_INSTANCES = INSTANCE_NUMBER

where:

NUMBER_OF_INSTANCES – the number of Solr servers that store the number of unique parts of the data set
INSTANCE_NUMBER – instance number (starting from zero)

such query does not guarantee exactly and perfectly equal distribution but satisfies two necessary conditions:

the record will always go to a specific and always the same instance
single record will always go to only one instance

so the db-data-config.xml on each machine is different now and looks like this:

SELECT * from albums where id % 3 = 0
SELECT * from albums where id % 3 = 1
SELECT * from albums where id % 3 = 2

How it works

After starting up each of the Solr instances we run the following query on each of them:

/solr/dataimport?command=full-import

When DIH command ends we send the following command:

/solr/dataimport?command=status

We should get the following responses:

Added/Updated: 275220 documents.
Added/Updated: 275221 documents.
Added/Updated: 275220 documents.

Performing a simple insert operation, we see that in all instances we have a total of 825,661 documents, as much as there should be 🙂
Make another request – ask for all document. Using sharding we can send the following query to any instance:

/solr/select/?q=*:*&shards=localhost:6761/solr,localhost:7872/solr,localhost:8983/solr

Result: 825661.

It works! 🙂

Solr.pl