Data Import Handler – import from Solr XML files

So far, in the previous articles, we looked at importing data from SQL databases. Today it's time to import data from XML files.

Example

Let's look at the following example:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity
      name="document"
      processor="FileListEntityProcessor"
      baseDir="/home/import/data/2011-06-27"
      fileName=".*\.xml"
      recursive="false"
      rootEntity="false"
      dataSource="null">
      <entity
        processor="XPathEntityProcessor"
        url="${document.fileAbsolutePath}"
        useSolrAddSchema="true"
        stream="true">
      </entity>
    </entity>
  </document>
</dataConfig>

Explanation of the example

Compared with the examples from the earlier articles, a new data source type appears here: FileDataSource. An example of a complete definition:

<dataSource
  type="FileDataSource"
  basePath="/home/import/input"
  encoding="utf-8"/>

The additional, optional attributes are straightforward:

  • basePath – the directory used to resolve relative paths appearing in the "entity" definitions
  • encoding – the file encoding (default: the OS default encoding)

After the data source definition, we have the document definition with two nested entities.

The purpose of the outer entity is to generate the list of files. To do that, we use the FileListEntityProcessor. This entity is self-contained and doesn't need any data source (hence dataSource="null"). The attributes used:

  • fileName (mandatory) – a regular expression that selects the files to process
  • recursive – whether subdirectories should also be scanned (default: no)
  • rootEntity – determines whether the rows produced by this entity should be treated as source documents. Because we don't want to index the file list this entity provides, we set this attribute to false. With rootEntity="false" the next, nested entity is treated as the main entity and its data is indexed.
  • baseDir (mandatory) – the directory in which the files are located
  • dataSource – in this case we set this parameter to "null" because the entity doesn't use a data source (the parameter can be omitted in Solr > 1.3)
  • excludes – a regular expression that says which files to exclude from indexing
  • newerThan – only files newer than the parameter value will be taken into consideration. The value can be given in the format yyyy-MM-dd HH:mm:ss, as a single-quoted date expression such as 'NOW-7DAYS', or as a variable that contains the date, for example ${variable}
  • olderThan – the same as above, but for older files
  • biggerThan – only files bigger than the parameter value will be taken into consideration
  • smallerThan – only files smaller than the parameter value will be taken into consideration (a combined sketch of these optional filters follows this list)
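
To make the optional filters more concrete, here is a sketch of a FileListEntityProcessor entity that combines several of them; the directory, patterns, date expression and size below are made up purely for illustration:

<entity
  name="document"
  processor="FileListEntityProcessor"
  baseDir="/home/import/data"
  fileName=".*\.xml"
  excludes=".*\.bak"
  newerThan="'NOW-7DAYS'"
  smallerThan="10485760"
  recursive="true"
  rootEntity="false"
  dataSource="null">
  <!-- the inner XPathEntityProcessor entity goes here, as in the main example -->
</entity>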

Having the list of files, we can go further, to the inner entity. Its task is to read the actual data contained in the files. The data is read, using the data source, from the file pointed to by the outer entity. The processor used for XML files, XPathEntityProcessor, has the following attributes:

  • url – the location of the input data
  • useSolrAddSchema – indicates that the input data is in the Solr XML (add) format
  • stream – whether the document should be processed as a stream. For large XML files it's good to use stream="true", which uses far less memory because it doesn't try to load the whole XML file into memory.

The remaining parameters are not needed in our case; we will describe them on another occasion :)
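
For completeness, here is a sketch of how such a configuration is typically wired up, assuming the file above is saved as data-config.xml in the core's conf directory (the handler name, file name and port below are only examples):

<!-- fragment of solrconfig.xml -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The import can then be started by calling, for example, http://localhost:8983/solr/dataimport?command=full-import.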

But why all this?

The example reads all the XML files from the selected directory. We used exactly the same file format as in the "classical" method of sending documents to Solr over HTTP POST. So why use this method?

Push and Pull

The first argument is control over the connection between the Solr server and the system responsible for generating the files to be indexed. When we do not have full control over the data source, it's better to pull the data from the source than to expose additional services that could become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add more advanced search capabilities and the ability to facet on the category tree. The document contained a "category" field storing a path such as "Cars / Four seats / Audi". To build the new queries, we need additional fields in the index that hold the category name at each level, the level of the category, and how many levels there are.
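
The original configuration doesn't show the schema side, so as an assumption, the new fields could be covered in schema.xml by a single dynamic field (it matches category_level_0, category_level_1, ... as well as category_level_max):

<dynamicField name="category_level_*" type="string" indexed="true" stored="true"/>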

To add the required fields, we used the ability to define transformer scripts. The previously quoted import configuration now looks like this:

<dataConfig>
  <script><![CDATA[
    // split the "category" path into per-level fields and store the number of levels
    function CategoryPieces(row) {
      var pieces = row.get('category').split('/');
      var arr = new Array();
      for (var i = 0; i < pieces.length; i++) {
        row.put('category_level_' + i, pieces[i].trim());
        arr[i] = pieces[i].trim();
      }
      row.put('category_level_max', (pieces.length - 1).toFixed());
      row.put('category', arr.join('/'));
      return row;
    }
  ]]></script>
  <dataSource type="FileDataSource" />
  <document>
    <entity
      name="document"
      processor="FileListEntityProcessor"
      baseDir="/home/import/data/2011-06-27"
      fileName=".*\.xml"
      recursive="false"
      rootEntity="false"
      dataSource="null">
      <entity
        processor="XPathEntityProcessor"
        transformer="script:CategoryPieces"
        url="${document.fileAbsolutePath}"
        useSolrAddSchema="true"
        stream="true">
      </entity>
    </entity>
  </document>
</dataConfig>

Note at the end

When using DIH we need to be aware that it behaves a little differently. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH - the excess values are simply ignored. With the "classical" indexing method, Solr would return an error.
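
As a hypothetical illustration of that difference, assume schema.xml contains a single-valued field:

<field name="category" type="string" indexed="true" stored="true" multiValued="false"/>

If an imported document provides two values for "category", the classical HTTP POST update is rejected with an error, while the same document imported through DIH is indexed and the extra value is silently dropped.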

" recursive="false" rootEntity="false" dataSource="null"> <entity processor="XPathEntityProcessor" url="

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the "entity" tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn't need any data source (thus dataSource="null"). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) - the directory where the files should be located
  • dataSource – in this case we set this parameter to "null" because the entity doesn't use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it's good to use stream="true" which will  use far less memory and won't try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.

{document.fileAbsolutePath}"
useSolrAddSchema="true"
stream="true">
</entity>
</entity>
</document>
</dataConfig>

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the "entity" tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn't need any data source (thus dataSource="null"). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) - the directory where the files should be located
  • dataSource – in this case we set this parameter to "null" because the entity doesn't use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it's good to use stream="true" which will  use far less memory and won't try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.

"
recursive="false"
rootEntity="false"
dataSource="null">
<entity
processor="XPathEntityProcessor"
transformer=”script:CategoryPieces”
url="

Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.

"
recursive="false"
rootEntity="false"
dataSource="null">
<entity
processor="XPathEntityProcessor"
url="

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the "entity" tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn't need any data source (thus dataSource="null"). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) - the directory where the files should be located
  • dataSource – in this case we set this parameter to "null" because the entity doesn't use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it's good to use stream="true" which will  use far less memory and won't try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.

{document.fileAbsolutePath}"
useSolrAddSchema="true"
stream="true">
</entity>
</entity>
</document>
</dataConfig>

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared – the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the “entity” tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn’t need any data source (thus dataSource=”null”). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don’t want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) – the directory where the files should be located
  • dataSource – in this case we set this parameter to “null” because the entity doesn’t use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: ‘NOW – 7DAYS’ or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity – its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it’s good to use stream=”true” which will  use far less memory and won’t try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the “classical” method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it’s better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the “category” field, where it stored the path such as “Cars / Four sits / Audi”. To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds – the excess value will be ignored. In the case of “classical” indexing method Solr will return an error.

{document.fileAbsolutePath}”
useSolrAddSchema=”true”
stream=”true”>
</entity>
</entity>
</document>
</dataConfig>

Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds – the excess value will be ignored. In the case of “classical” indexing method Solr will return an error.


recursive=”false”
rootEntity=”false”
dataSource=”null”>
<entity
processor=”XPathEntityProcessor”
url=”

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared – the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the “entity” tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn’t need any data source (thus dataSource=”null”). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don’t want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) – the directory where the files should be located
  • dataSource – in this case we set this parameter to “null” because the entity doesn’t use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: ‘NOW – 7DAYS’ or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity – its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it’s good to use stream=”true” which will  use far less memory and won’t try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the “classical” method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it’s better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the “category” field, where it stored the path such as “Cars / Four sits / Audi”. To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds – the excess value will be ignored. In the case of “classical” indexing method Solr will return an error.

{document.fileAbsolutePath}”
useSolrAddSchema=”true”
stream=”true”>
</entity>
</entity>
</document>
</dataConfig>

Explanation of the example

In comparison with the examples from the earlier articles a new type appeared – the FileDataSource. Example of a complete call:


The additional, not mandatory attributes are obvious:

  • basePath – the directory which will be used to calculate the relative path of the “entity” tag
  • encoding – the file encoding (default: the OS default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the files list. To do that, we use the FileListEntityProcessor.This entity is self-supporting and doesn’t need any data source (thus dataSource=”null”). The used attributes:

  • fileName (mandatory) – regular expression that says which files to choose
  • recursive – should subdirectories be checked  (default: no)
  • rootEntity – says about if the data from the entity should be treated as documents source. Because we don’t want to index files list, which this entity provides, we need to set this attribute to false. After setting this attribute to false the next entity will be treated as main entity and its data will be indexed.
  • baseDir (mandatory) – the directory where the files should be located
  • dataSource – in this case we set this parameter to “null” because the entity doesn’t use data source  (we can ommit this parameter in Solr > 1.3)
  • excludes – regular expression which says which files to exclude from indexing
  • newerThan – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: ‘NOW – 7DAYS’ or an variable that contains the data, for example: ${variable}
  • olderThan – the same as above, but says about older files
  • biggerThan – only files bigger than the value of the parameter will be taken into consideration
  • smallerThan –only files smaller than the value of this parameter will be taken into consideration

If you already have a list of files we can go further, the inner entity – its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: XpathEntityProcessor used for XML files have the following attributes:

  • url – the input data
  • useSolrAddSchema – information, that the input data is in Solr XML format
  • stream – should we use stream for document processing. In case of large XML files, it’s good to use stream=”true” which will  use far less memory and won’t try to load the whole XML file into the memory.

Additional parameters are not useful in our case and we describe them on another occasion:)

But why all this?

The example allows to read all XML files from the selected directory. We used exactly the same file format as the “classical” method of sending documents to Solr using HTTP POST. So why use this method?

Push and Pull

The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it’s better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.

Prototyping and change testing

How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the “category” field, where it stored the path such as “Cars / Four sits / Audi”. To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.

To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:


Note at the end

When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds – the excess value will be ignored. In the case of “classical” indexing method Solr will return an error.

Leave a Reply

Your email address will not be published. Required fields are marked *