dih – Solr.pl

Data Import Handler – import from Solr XML files

Marek Rogoziński — Tue, 16 Aug 2011 19:46:51 +0000

So far, in the previous articles, we have looked at importing data from SQL databases. Today it's time to import from XML files.

Example

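Let's look at the following example. The configuration consists of a FileListEntityProcessor entity (recursive="false", rootEntity="false", dataSource="null") that wraps an XPathEntityProcessor entity (url="${document.fileAbsolutePath}", useSolrAddSchema="true", stream="true").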
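
A minimal sketch of such a data-config.xml is shown below. The values of recursive, rootEntity, dataSource, url, useSolrAddSchema and stream follow the example; the baseDir path, the fileName pattern and the name of the inner entity are assumptions made for illustration only.

```xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- Outer entity: only produces the list of files, it indexes nothing itself.
         baseDir and fileName are placeholder values. -->
    <entity name="document"
            processor="FileListEntityProcessor"
            baseDir="/data/import"
            fileName=".*\.xml"
            recursive="false"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity: reads each listed file, which is already in Solr XML format.
           The name "record" is hypothetical. -->
      <entity name="record"
              processor="XPathEntityProcessor"
              url="${document.fileAbsolutePath}"
              useSolrAddSchema="true"
              stream="true" />
    </entity>
  </document>
</dataConfig>
```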

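Explanation of the example

Compared with the examples from the earlier articles, a new data source type appears: FileDataSource. Its additional, optional attributes are:

basePath – the base directory against which relative paths used by the entity are resolved
encoding – the file encoding (default: the OS default encoding)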
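
A FileDataSource declaration that sets both optional attributes could look like the sketch below. The path and the encoding are placeholder values, not taken from the original example.

```xml
<!-- basePath and encoding values are illustrative assumptions -->
<dataSource type="FileDataSource"
            basePath="/data/import"
            encoding="UTF-8" />
```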

After the data source definition comes the document definition, with two nested entities.

The purpose of the outer entity is to generate the list of files. To do that, we use the FileListEntityProcessor. This entity is self-contained and doesn't need any data source (thus dataSource="null"). It uses the following attributes:

fileName (mandatory) – a regular expression that selects the files to process
recursive – whether subdirectories should be checked (default: no)
rootEntity – determines whether the data from this entity is treated as the source of documents. Because we don't want to index the file list that this entity provides, we set this attribute to false. The next entity is then treated as the main entity, and its data is indexed.
baseDir (mandatory) – the directory in which the files are located
dataSource – in this case we set this parameter to "null" because the entity doesn't use a data source (the parameter can be omitted in Solr versions later than 1.3)
excludes – a regular expression that selects the files to exclude from indexing
newerThan – only files newer than the parameter value are taken into consideration. The value can be a date in the format YYYY-MM-dd HH:mm:ss, a single-quoted string such as 'NOW-7DAYS', or a variable that contains the date, for example ${variable}
olderThan – the same as above, but for older files
biggerThan – only files bigger than the value of this parameter are taken into consideration
smallerThan – only files smaller than the value of this parameter are taken into consideration
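
To show how the optional filters combine, here is a sketch of a file-listing entity that uses several of them at once. Only the attribute names come from the list above; every value (paths, patterns, sizes) is a made-up example.

```xml
<!-- Hypothetical values throughout; only the attribute names come from the list above -->
<entity name="document"
        processor="FileListEntityProcessor"
        baseDir="/data/import"
        fileName=".*\.xml"
        excludes=".*\.bak\.xml"
        recursive="true"
        newerThan="'NOW-7DAYS'"
        biggerThan="1024"
        rootEntity="false"
        dataSource="null">
  <!-- the nested entity that reads each listed file goes here -->
</entity>
```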

Once we have the list of files, we can move on to the inner entity, whose task is to read the data contained in those files. The data is taken from the file indicated by the outer entity, using the data source. The XPathEntityProcessor, which is used for XML files, takes the following attributes:

url – the input data
useSolrAddSchema – indicates that the input data is in the Solr XML format
stream – whether the document should be processed as a stream. For large XML files it is worth setting stream="true", which uses far less memory because the whole XML file is not loaded into memory.

The remaining parameters are not needed in our case; we will describe them on another occasion :)
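
For reference, a file in the Solr XML format that useSolrAddSchema="true" expects could look like the sketch below. The "category" field is the one discussed in this article; the other field names and all values are assumptions.

```xml
<add>
  <doc>
    <!-- "id" and "name" are hypothetical fields -->
    <field name="id">1</field>
    <field name="name">Audi A4</field>
    <field name="category">Cars / Four seats / Audi</field>
  </doc>
</add>
```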

But why all this?

The example reads all XML files from the selected directory. We used exactly the same file format as in the "classical" method of sending documents to Solr over HTTP POST. So why use this method?

Push and Pull

The first argument is control over the connections between the Solr server and the system that generates the files to be indexed. When we do not have full control over the data source, it is better to pull the data from the source than to expose additional services that could become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I wanted to add a more advanced search and the ability to facet on the category tree. Each document contained a "category" field that stored a path such as "Cars / Four seats / Audi". To support the new queries, the index needs additional fields that hold the category name, the level of the category, and the total number of levels.

To add the required fields we used the ability to define scripts. In the import file quoted earlier, the inner entity (the one using processor="XPathEntityProcessor") now carries an additional attribute, transformer="script:CategoryPieces", which points to a script function defined in the same file.
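
The modified configuration might look like the following sketch. The transformer="script:CategoryPieces" reference comes from the example; the body of the CategoryPieces function, the target field names (category_level_N, category_levels), the entity name "record" and the paths are assumptions, and the script assumes a single-valued "category" field.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical implementation: only the function name comes from the example.
    function CategoryPieces(row) {
      var category = row.get('category');
      if (category != null) {
        // "Cars / Four seats / Audi" -> ["Cars ", " Four seats ", " Audi"]
        var pieces = String(category).split('/');
        for (var i = 0; i < pieces.length; i++) {
          // Strip surrounding whitespace and store one field per level
          var name = pieces[i].replace(/^\s+|\s+$/g, '');
          row.put('category_level_' + i, name);
        }
        // Total number of levels in the path
        row.put('category_levels', new java.lang.Integer(pieces.length));
      }
      return row;
    }
  ]]></script>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="document" processor="FileListEntityProcessor"
            baseDir="/data/import" fileName=".*\.xml"
            recursive="false" rootEntity="false" dataSource="null">
      <entity name="record" processor="XPathEntityProcessor"
              transformer="script:CategoryPieces"
              url="${document.fileAbsolutePath}"
              useSolrAddSchema="true" stream="true" />
    </entity>
  </document>
</dataConfig>
```

With this approach the new fields can be tried out without touching the system that generates the XML files; the schema only needs matching field (or dynamic field) definitions.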

Note at the end

When using DIH we need to be aware that it behaves a little differently. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH: the excess values are ignored. With the "classical" indexing method, Solr returns an error.


Indexing files like doc, pdf – Solr and Tika integration

Marek Rogoziński — Mon, 04 Apr 2011 18:37:27 +0000

In the previous article we gave basic information about how to enable the indexing of binary files, i.e. MS Word, PDF or LibreOffice files. Today we will do the same thing using the Data Import Handler. Since a new version of the Solr server (3.1) was released a few days ago, the following guidelines are based on that version. For the purposes of this article I used the "example" application; all of the changes relate to that application.

Assumptions

We assume that the data is available in XML format and contains basic information about each document, along with the name of the file in which the document's contents are located. The files are located in a defined directory. An example file looks like this:



    
        John F.
        Life in picture
        1.jpg
    
    
        Peter Z.
        Simple presentation
        2.pptx
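
As a sketch, the same two records could be marked up as follows. Only the values come from the listing above; the element names are assumptions (the name "description" for the file name is suggested by the ${rec.description} reference used later in the configuration).

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Element names are assumptions; only the values come from the article -->
<albums>
  <album>
    <author>John F.</author>
    <title>Life in picture</title>
    <description>1.jpg</description>
  </album>
  <album>
    <author>Peter Z.</author>
    <title>Simple presentation</title>
    <description>2.pptx</description>
  </album>
</albums>
```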

As you can see, the individual elements do not have a unique identifier. But we can handle that.
First we modify the schema by adding a definition of a field that holds the contents of the file:
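
Such a definition could look like the sketch below. The field name "content" and the "text" field type are assumptions based on the stock example schema, not the exact line from the original article.

```xml
<!-- Hypothetical field holding the text that Tika extracts from the file -->
<field name="content" type="text" indexed="true" stored="true" />
```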

Next we modify the solrconfig.xml and add DIH configuration:

   

   
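<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
      <str name="config">data-config.xml</str>
   </lst>
</requestHandler>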

Since we will use the entity processor located in the extras (TikaEntityProcessor), we need to modify the line loading the DIH library:
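
Assuming the directory layout of the stock example application, where the DIH jars sit in the dist directory, the change could amount to widening the regular expression so that it also matches the extras jar. The dir value and the jar naming are assumptions about that layout.

```xml
<!-- Matches both the core DIH jar and the extras jar (which contains TikaEntityProcessor) -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```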

The next step is to create a data-config.xml file. In our case it should look like this:
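
A sketch of such a file is shown below. The GenerateId function name, the entity name "rec", the data source name "data", the ${rec.description} reference and the processor and data source types come from the article; the host, paths, element names in the XPath expressions, the data source name "main", the script body and the target field names are assumptions.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical body: numbers the records in the order they are read
    var counter = 1;
    function GenerateId(row) {
      row.put('id', String(counter++));
      return row;
    }
  ]]></script>
  <!-- "main" fetches the XML metadata, "data" fetches the binary files -->
  <dataSource type="URLDataSource" name="main" baseUrl="http://localhost/files/" />
  <dataSource type="BinURLDataSource" name="data" />
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            dataSource="main" url="data.xml"
            forEach="/albums/album"
            transformer="script:GenerateId">
      <!-- The file name is only needed to build the URL of the binary file -->
      <field column="description" xpath="/albums/album/description" />
      <entity name="file" processor="TikaEntityProcessor"
              dataSource="data"
              url="http://localhost/files/${rec.description}">
        <field column="text" name="content" />
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```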


    
   
    
    
        
            
            

Generating record identifier – scripts

The first interesting feature is the use of the standard ScriptTransformer to generate document identifiers. With the JavaScript function "GenerateId" and a reference to it (transformer="script:GenerateId"), each record is numbered. Frankly, this is not a very good way of dealing with missing identifiers, because it rules out incremental indexing (we cannot distinguish between different versions of a record); it is used here only to show how easy it is to modify records. If you do not like JavaScript, you can use any scripting language supported by Java 6.

The use of multiple data sources

Another interesting element is the use of several data sources. Since our metadata is available in an XML file, we need to download that file. We use the standard approach: we define a URLDataSource and then use the XPathEntityProcessor to parse the incoming data. Since we also have to download the binary attachment for each record, we define an additional data source, BinURLDataSource, and an additional entity that uses the TikaEntityProcessor. Then we only need to tell that entity where to download the file from (the url attribute, which references the parent entity, here through ${rec.description}) and which data source to use (the dataSource attribute, here "data"). The whole is complemented by the list of fields to be indexed (the additional attribute meta means that the value is taken from the file's metadata).

Available fields

Apache Tika can extract a good deal of additional data from a document. In the example above we used only the title, author, and content of the document. Complete information about the available fields can be found in the interfaces implemented by the Metadata class (http://tika.apache.org/0.9/api/org/apache/tika/metadata/Metadata.html), specifically in the constants they define. DublinCore and MSOffice are particularly interesting.

The end

Shortly after starting Solr and running the import process (by calling http://localhost:8983/solr/dataimport?command=full-import), the documents are indexed. They should be visible after sending the following query to Solr: http://localhost:8983/solr/select?q=*:*

Data Import Handler – removing data from index

Rafał Kuć — Mon, 03 Jan 2011 08:05:13 +0000

On the Solr wiki, deleting data from an index with DIH incremental indexing is treated only in passing, as something that works much like updating records. I took the same shortcut in a previous article, all the more so because my example there indexed Wikipedia data, which never needs to be deleted.

Having some sample data on albums and performers at hand, I decided to show my way of dealing with such cases. For simplicity and clarity, I assume that after the first import the data can only shrink.

Test data

My test data is stored in a PostgreSQL database, in a table defined as follows:

           Table "public.albums"
 Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------
 id     | integer | not null default nextval('albums_id_seq'::regclass)
 name   | text    | not null
 author | text    | not null
Indexes:
    "albums_pk" PRIMARY KEY, btree (id)

The table has 825,661 records.

Test installation

For testing purposes I used a Solr instance with the following characteristics:

Definitions in schema.xml:


 
 
 

<uniqueKey>id</uniqueKey>
<defaultSearchField>album</defaultSearchField>

Definition of DIH in solrconfig.xml:


 
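<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
  <str name="config">db-data-config.xml</str>
 </lst>
</requestHandler>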

And the DIH file db-data-config.xml:
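
A minimal sketch of such a file, assuming a local PostgreSQL instance, is shown below. The column names come from the albums table above; the connection URL, database name, credentials, entity name and the Solr field names other than "id" and "album" are assumptions.

```xml
<dataConfig>
  <!-- Connection details are placeholders -->
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/music"
              user="solr"
              password="secret" />
  <document>
    <entity name="album" query="SELECT * FROM albums">
      <field column="id" name="id" />
      <field column="name" name="album" />
      <field column="author" name="author" />
    </entity>
  </document>
</dataConfig>
```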

Before our test, I imported all the data from the table albums.

Deleting Data

Looking at the table, we can see that when a record is removed, it is deleted without leaving a trace. The only way to update our index would be to compare the document identifiers in the index with the identifiers in the database and delete those that no longer exist in the database. Slow and cumbersome. Another way is to add a deleted_at column: instead of physically deleting the record, we only set this column. DIH can then retrieve all records whose date is later than the last crawl. The disadvantage of this solution is that the application may have to be modified to take this information into account.

I use a different solution, one that is transparent to applications. Let's create a new table:

CREATE TABLE deletes
(
id serial NOT NULL,
deleted_id bigint,
deleted_at timestamp without time zone NOT NULL,
CONSTRAINT deletes_pk PRIMARY KEY (id)
);

This table will automagically receive the identifiers of the items removed from the albums table, together with the time at which they were removed.

Now we add the function:

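CREATE OR REPLACE FUNCTION insert_after_delete()
RETURNS trigger AS
$BODY$BEGIN
    -- Record the id of the row being deleted, together with the deletion time
    IF tg_op = 'DELETE' THEN
        INSERT INTO deletes(deleted_id, deleted_at)
        VALUES (old.id, now());
        -- Returning the old row lets the DELETE proceed
        RETURN old;
    END IF;
    -- Not reached for a DELETE-only trigger; avoids ending without a RETURN
    RETURN NULL;
END$BODY$
LANGUAGE plpgsql VOLATILE;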

and a trigger:

CREATE TRIGGER deleted_trg
BEFORE DELETE
ON albums
FOR EACH ROW
EXECUTE PROCEDURE insert_after_delete();

How it works

Each entry deleted from the albums table should result in a new row in the deletes table. Let's check it out by removing a few records:

=> DELETE FROM albums where id < 37;
DELETE 2
=> SELECT * from deletes;
id | deleted_id |         deleted_at
----+------------+----------------------------
26 |         35 | 2010-12-23 13:53:18.034612
27 |         36 | 2010-12-23 13:53:18.034612
(2 rows)

So the database part works.

We extend the DIH configuration file so that the entity is defined as follows:
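
A sketch of such an entity definition is shown below. The deletedPkQuery attribute, the deletes table and its columns come from this article, and ${dataimporter.last_index_time} is the standard DIH variable holding the time of the last import; the entity name, the pk attribute value, the main query and the field mappings are assumptions. No deltaQuery is defined because we assumed that the data can only shrink.

```xml
<!-- Assumed entity name and queries; deletedPkQuery returns the ids to remove from the index -->
<entity name="album" pk="id"
        query="SELECT * FROM albums"
        deletedPkQuery="SELECT deleted_id AS id FROM deletes
                        WHERE deleted_at &gt; '${dataimporter.last_index_time}'">
  <field column="id" name="id" />
  <field column="name" name="album" />
  <field column="author" name="author" />
</entity>
```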

This allows the import DIH incremental import to use the deletedPkQuery attribute to get the identifiers of the documents which should be removed.

A clever reader will probably begin to wonder, are you sure we need the column with the date of deletion. We could delete all records that are found in the table deletes and then delete the contents of this table. Theoretically this is true, but in the event of a problem with the Solr indexing server we can easily replace it with another – the degree of synchronization with the database is not very important – just the next incremental imports will sync with the database. If we would delete the contents of the deletes table such possibility does not exist.

We can now do the incremental import by calling the following address: /solr/dataimport?command=delta-import
In the logs you should see a line similar to this:
INFO: {delete=[35, 36],optimize=} 0 2
This means that DIH properly removed from the index the documents that were previously removed from the database.
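If the incremental import is to be triggered from a script (for example from cron), the request can be built as in the sketch below. The base URL http://localhost:8983/solr and both function names are assumptions for illustration; only the /dataimport path and the command parameter come from the text above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url: str, command: str) -> str:
    """Build a DIH request URL such as .../solr/dataimport?command=delta-import."""
    return f"{base_url.rstrip('/')}/dataimport?{urlencode({'command': command})}"


def run_dataimport(base_url: str, command: str) -> str:
    """Send the command to a running Solr instance and return the raw response body."""
    with urlopen(dataimport_url(base_url, command)) as response:
        return response.read().decode("utf-8")


# Hypothetical base URL; adjust host, port and context path to your installation.
print(dataimport_url("http://localhost:8983/solr", "delta-import"))
# prints http://localhost:8983/solr/dataimport?command=delta-import
```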

Data Import Handler – sharding

Marek Rogoziński — Mon, 27 Dec 2010 08:04:17 +0000

One of our readers (greetings!) reported a problem with making DIH work together with the sharding mechanism. The Solr project wiki does, in my opinion, discuss the solution to this issue, but only indirectly and in passing.

What is sharding?

Sharding means dividing data into several parts that are stored and processed independently. Additional logic within the application selects the appropriate part of the data and/or merges the results from the various sources. In the case of DIH and sharding we may have to deal with the following cases:

sharding on the side of the data source – this means multiple locations/tables with different parts of the data set
sharding on the Solr side – that is, dividing the data from one source across many independent Solr instances
both of these simultaneously

In our case we have one set of data and we want to create several subsets (called shards) on the Solr side.

When to use sharding?

A very important question: why do we use the sharding mechanism? In my opinion sharding is abused too often and thus generates lots of additional complications and limitations. The main reason to use sharding is a volume of data so large that the Solr index does not fit on one machine. If it does fit, sharding is often redundant. Another reason is performance. But sharding can help here only if other optimizations fail and the queries are so complicated that the additional cost of sharding (forwarding queries to the individual shards and combining their answers) is smaller than the performance gain that can be achieved.

Test data

Let’s assume that we need sharding. In the example below, I used data from MusicBrainz to create a simple PostgreSQL table:

Table "public.albums"
 Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------
 id     | integer | not null default nextval('albums_id_seq'::regclass)
 name   | text    | not null
 author | text    | not null
Indexes:
    "albums_pk" PRIMARY KEY, btree (id)

The table contains 825,661 records. I stress here that both the structure and the amount of data are so small that sharding has negligible practical use here.

Test installation

For the tests we use three instances of Solr. All instances are identical; they differ only in port number (8983, 7872, 6761), because the tests will be performed on one physical machine.

Definition at schema.xml:


 
 
 

<uniqueKey>id</uniqueKey>
<defaultSearchField>album</defaultSearchField>

Definition of DIH in solrconfig.xml:


 
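<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>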

And the DIH configuration file, db-data-config.xml:

At this point, each instance is able to import the complete data set.

So let’s set up sharding

Our goal is to modify the configuration so that each DIH instance indexes only “its” part of the data. The easiest way to do this is to change the query that retrieves the data to something like this:

SELECT * from albums where id % NUMBER_OF_INSTANCES = INSTANCE_NUMBER

where:

NUMBER_OF_INSTANCES – the number of Solr servers, each of which stores a unique part of the data set
INSTANCE_NUMBER – instance number (starting from zero)

Such a query does not guarantee a perfectly equal distribution, but it satisfies two necessary conditions:

a given record always goes to the same instance
a single record goes to only one instance

So db-data-config.xml is now different on each instance; the queries look like this:

SELECT * from albums where id % 3 = 0
SELECT * from albums where id % 3 = 1
SELECT * from albums where id % 3 = 2
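The modulo rule is easy to check outside the database. The sketch below is only an illustration: the function names are made up, and it assumes positive integer ids (for negative numbers Python's % operator behaves differently from PostgreSQL's).

```python
from collections import Counter


def shard_for(record_id: int, number_of_instances: int) -> int:
    """Return the instance number (starting from zero) that owns record_id."""
    return record_id % number_of_instances


def shard_query(instance_number: int, number_of_instances: int) -> str:
    """Build the per-instance query in the same form as the ones listed above."""
    return (
        f"SELECT * from albums where id % {number_of_instances} = {instance_number}"
    )


def shard_sizes(ids, number_of_instances: int) -> list:
    """Count how many of the given ids land on each instance."""
    counts = Counter(shard_for(i, number_of_instances) for i in ids)
    return [counts[n] for n in range(number_of_instances)]


for instance in range(3):
    print(shard_query(instance, 3))

# Ten hypothetical ids: every id lands on exactly one instance.
print(shard_sizes(range(1, 11), 3))  # prints [3, 4, 3]
```

Because the instance is a pure function of the id, both conditions hold automatically. If the ids happened to be contiguous from 1 to 825,661 (an assumption, since nothing above says they are), shard_sizes(range(1, 825662), 3) would give [275220, 275221, 275220].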

How it works

After starting up each of the Solr instances, we send the following request to each of them:

/solr/dataimport?command=full-import

When the import finishes, we send the following command:

/solr/dataimport?command=status

We should get the following responses:

Added/Updated: 275220 documents.
Added/Updated: 275221 documents.
Added/Updated: 275220 documents.

Performing a simple addition, we see that all instances together hold 825,661 documents, exactly as many as there should be.
Let’s make another request and ask for all documents. Using sharding, we can send the following query to any instance:

/solr/select/?q=*:*&shards=localhost:6761/solr,localhost:7872/solr,localhost:8983/solr

Result: 825661.

It works!
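The distributed query used above can also be assembled programmatically. This is a hedged sketch: the function name is made up, plain http is assumed, and the host:port/solr addresses are simply the three test instances from this article.

```python
from urllib.parse import urlencode


def distributed_select_url(entry_point: str, shards, query: str = "*:*") -> str:
    """Build a select URL that asks one instance to query all listed shards.

    entry_point and each shard are 'host:port/solr' strings.
    """
    params = {"q": query, "shards": ",".join(shards)}
    # Keep '*', ':', ',' and '/' unescaped so the URL stays readable.
    return f"http://{entry_point}/select/?{urlencode(params, safe='*:,/')}"


shards = ["localhost:6761/solr", "localhost:7872/solr", "localhost:8983/solr"]
print(distributed_select_url("localhost:8983/solr", shards))
```

Any of the three instances can serve as the entry point; the shards parameter lists every instance that holds part of the data.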

Data Import Handler – How to import data from SQL databases (part 3)

Marek Rogoziński — Mon, 22 Nov 2010 22:38:15 +0000

In previous episodes (part 1 and part 2) we were able to import data from a database in both ways, full and incremental. Today it is time for a short summary.

Setting dataSource

Recall the line with our setup:

These are not all the attributes that can appear. For completeness, let’s list them all:

name – the name of the source; you can define many different sources and refer to them through the “dataSource” attribute of the “entity” tag
driver – JDBC driver class name
url – JDBC database url
user – database user name (if not defined or empty, the connection to the database is made without a user/password pair)
password – user password
jndiName – instead of giving the driver/url/user/password attributes, you can specify the JNDI name under which the data source implementation (javax.sql.DataSource) is made available by the container (e.g. Jetty/Tomcat)

Advanced arguments:

batchSize (default: 500) – sets the maximum number of records (or rather a suggestion for the driver) retrieved from the database in a single fetch. Changing this parameter can help when queries return too many results. It may also not help, since the implementation of this mechanism depends on the JDBC driver.
convertType (default: false) – applies an additional conversion from the field type returned by the database to the field type defined in schema.xml. The default value seems to be safer, because it does not cause extra, magical conversions. However, in special cases (e.g. BLOB fields), that conversion is one of the ways of solving the problem.
maxRows (default: 0 – no limit) – sets the maximum number of results returned by a query to the database.
readOnly – sets the database connection to read-only mode. In principle, this could mean that the driver will be able to perform additional optimizations. At the same time it changes the defaults (!): transactionIsolation becomes TRANSACTION_READ_UNCOMMITTED, holdability becomes CLOSE_CURSORS_AT_COMMIT and autoCommit becomes true.
autoCommit – commits the transaction automatically after each query.
transactionIsolation (TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ, TRANSACTION_SERIALIZABLE, TRANSACTION_NONE) – sets the transaction isolation (i.e. the visibility of the data changed within a transaction)
holdability (CLOSE_CURSORS_AT_COMMIT, HOLD_CURSORS_OVER_COMMIT) – defines whether result sets (ResultSet) are closed when the transaction is committed
… – importantly, any other attributes may appear. All of them are passed by DIH to the JDBC driver, which lets you configure behavior specific to a particular JDBC driver.
type – the type of source. The default value (JdbcDataSource) is sufficient, so the attribute can be omitted (I will come back to it when we define non-SQL sources)

The “entity” element

Let us now turn to the description of the “entity” element.

As a reminder:

And all the attributes:

Primary:

name – the name of the entity
query – SQL query used to retrieve data associated with that entity.
deltaQuery – query responsible for returning the IDs of those records that have changed since the last crawl (full or incremental) – the last crawl time is provided by DIH in the variable: ${dataimporter.last_index_time}. This query is used by Solr to find those records that have changed.
parentDeltaQuery – query requesting data for the parent entity record. With these queries Solr is able to retrieve all the data that make up the document, regardless of the entity from which they originate. This is necessary because the indexing engine is not able to modify the indexed data – so we need to index the entire document, regardless of the fact that some data has not changed.
deletedPkQuery – provides identifiers of deleted items.
deltaImportQuery – query requesting data for a given record, identified by an ID that is available as a DIH variable: ${dataimporter.delta.id}.
dataSource – the name of the source, used when several sources are defined (see dataSource.name)
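Conceptually, the ${...} variables in these queries are placeholders that get filled in before the SQL is sent to the database. The sketch below only illustrates that idea in Python; it is not how DIH is implemented. The resolve helper, the updated_at column and the sample values are hypothetical, and the variable names are used as I understand them from the DIH documentation.

```python
import re

# Matches ${some.dotted.name} placeholders.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, variables: dict) -> str:
    """Replace every ${name} in template with variables[name].

    Raises KeyError for an unknown name (a choice made for this sketch).
    """
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


# Hypothetical queries; the updated_at column is an assumption.
delta_query = (
    "SELECT id FROM albums "
    "WHERE updated_at > '${dataimporter.last_index_time}'"
)
delta_import_query = "SELECT * FROM albums WHERE id = ${dataimporter.delta.id}"

print(resolve(delta_query, {"dataimporter.last_index_time": "2010-11-22 22:38:15"}))
print(resolve(delta_import_query, {"dataimporter.delta.id": 35}))
```

Read this way, deltaQuery answers "which records changed since the last crawl?", and deltaImportQuery is then run once for each identifier it returned.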

and advanced:

processor – SqlEntityProcessor by default. The element whose job is to supply successive items from the data source for indexing. In the case of databases, the default implementation is usually sufficient
transformer – the data retrieved from the source can be further modified before being passed on for indexing. In particular, the transformer may return additional records, which makes it a very powerful tool
rootEntity – true by default for an entity element directly below the document element. It marks the entity that is treated as the root, that is, the one whose rows are used to create new documents in the index
threads – the number of threads used to process the entity
onError (abort, skip, continue) – how to respond to errors: stop the import (abort, the default behavior), skip the document (skip) or ignore the error (continue)
preImportDeleteQuery – used instead of “*:*” to delete data from the index (note: this is a query to the index, not to the database); makes sense only in the main entity element
postImportDeleteQuery – used after a full import (like preImportDeleteQuery, it is a query to the index); makes sense only in the main entity element
pk – the primary key (of the database, not to be confused with the unique key of the document); relevant only in incremental indexing, if we let DIH guess deltaImportQuery based on query

The word “guess” appeared in the text above. DIH tries to make things easier by adopting reasonable defaults. For example, as mentioned above, during an incremental import it can try to work out deltaImportQuery by itself. In fact, this was the only behavior in earlier versions, until it was realized that the generated queries do not always work. Hence, I suggest caution and only limited trust.

Another thing is the ability to omit field definitions when the column names returned by the query match the names of fields in schema.xml. (Hands up: who noticed that the above example is not a copy of the one from the second part, but uses that mechanism?)

Yet another example of DIH’s flexibility: given a structure like this:

${dataimporter.last_index_time}

we can write the full-import definition in such a way that, once an import has already been carried out, it behaves like an incremental import! I think this functionality “came about” a little by accident.