Data Import Handler – import from Solr XML files

In previous articles we looked at importing data from SQL databases. Today it’s time to import data from XML files.

Example

Let’s look at the following example:
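The complete configuration could look like this – a minimal data-config.xml sketch; the directory /home/import and the entity names are assumptions, not values from the original example:

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- outer entity: builds the list of files; its rows are not indexed -->
    <entity name="files"
            processor="FileListEntityProcessor"
            baseDir="/home/import"
            fileName=".*\.xml"
            rootEntity="false"
            dataSource="null">
      <!-- inner entity: reads each file as Solr add XML -->
      <entity name="file"
              processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}"
              useSolrAddSchema="true"
              stream="true" />
    </entity>
  </document>
</dataConfig>
```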

Explanation of the example

Compared with the examples from the earlier articles, a new data source type appears: FileDataSource. Example of a complete definition:
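A complete FileDataSource definition could look like this (the basePath value is an illustrative assumption):

```xml
<dataSource type="FileDataSource" basePath="/home/import" encoding="UTF-8" />
```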

The additional, optional attributes are self-explanatory:

  • basePath – the directory against which relative paths used in the “entity” tag are resolved
  • encoding – the file encoding (default: the operating system’s default encoding)

After the source definition, we have document definition with two nested entities.

The purpose of the main entity is to generate the list of files. To do that, we use the FileListEntityProcessor. This entity is self-sufficient and doesn’t need any data source (hence dataSource=”null”). The attributes used:

  • fileName (mandatory) – a regular expression that selects the files to process
  • recursive – whether subdirectories should also be scanned (default: false)
  • rootEntity – determines whether the data from this entity should be treated as the source of documents. Because we don’t want to index the file list this entity provides, we set this attribute to false. The next (nested) entity is then treated as the root entity, and its data is indexed.
  • baseDir (mandatory) – the directory where the files are located
  • dataSource – in this case we set this parameter to “null”, because the entity doesn’t use a data source (in Solr > 1.3 this parameter can be omitted)
  • excludes – a regular expression that selects files to exclude from indexing
  • newerThan – only files newer than the value of this parameter will be taken into consideration. The value can be a date in the format yyyy-MM-dd HH:mm:ss, a single-quoted date expression such as ‘NOW-7DAYS’, or a variable that contains the date, for example ${variable}
  • olderThan – the same as above, but for older files
  • biggerThan – only files bigger than the value of this parameter will be taken into consideration
  • smallerThan – only files smaller than the value of this parameter will be taken into consideration
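Put together, an outer entity that uses several of the optional filters might be defined like this (the directory, patterns and limits are illustrative values, not ones from the original example):

```xml
<entity name="files"
        processor="FileListEntityProcessor"
        dataSource="null"
        rootEntity="false"
        baseDir="/home/import"
        fileName=".*\.xml"
        excludes=".*\.bak"
        recursive="true"
        newerThan="'NOW-7DAYS'"
        biggerThan="1024">
  <!-- the inner entity described below goes here -->
</entity>
```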

Now that we have the list of files, we can go further. The inner entity’s task is to fetch the specific data contained in the files. The data is read, using the data source, from the file provided by the outer entity. The XPathEntityProcessor, used for XML files, has the following attributes:

  • url – the location of the input data; here it is taken from the outer entity
  • useSolrAddSchema – indicates that the input data is in the Solr add XML format
  • stream – whether to process the document as a stream. For large XML files it’s good to set stream=”true”, which uses far less memory because it doesn’t load the whole XML file into memory.
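The inner entity from our example then boils down to a few attributes (assuming the outer entity is named files):

```xml
<entity name="file"
        processor="XPathEntityProcessor"
        url="${files.fileAbsolutePath}"
        useSolrAddSchema="true"
        stream="true" />
```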

The remaining parameters are not useful in our case; we will describe them on another occasion. :)

But why all this?

This example reads all XML files from the selected directory. We used exactly the same file format as in the “classical” method of sending documents to Solr via HTTP POST. So why use this method?

Push and Pull

The first argument is control over the connection between the Solr server and the system that generates the files to be indexed. When we do not have full control over the data source, it is safer to pull the data from the source than to expose an additional service that could become a target of attack.

Prototyping and change testing

How does it work in practice? In my case, I decided to add more advanced search capabilities and the ability to facet on the category tree. The documents contained a “category” field that stored a path such as “Cars / Four seats / Audi”. To support the new queries, we need additional fields in the index that hold the category name at each level, the level of the category, and the total number of levels.

To add the required fields, we used the ability to define scripts. The previously quoted import file now looks like this:
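The modified file is not reproduced here; a sketch of what such a script transformer could look like (the field names category_0, category_1, … and category_levels are assumptions):

```xml
<script><![CDATA[
  function splitCategory(row) {
    var category = row.get('category');
    if (category != null) {
      var parts = category.split('/');
      for (var i = 0; i < parts.length; i++) {
        // category_0, category_1, ... hold the category name at each level
        row.put('category_' + i, parts[i].trim());
      }
      // total number of levels in the category path
      row.put('category_levels', parts.length);
    }
    return row;
  }
]]></script>
```

The script element goes directly under the dataConfig element, and the inner entity additionally gets the attribute transformer=”script:splitCategory”.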

Note at the end

When using DIH, we need to be aware that it works a little differently. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH – the excess values are simply ignored. With the “classical” indexing method, Solr would return an error.
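For example, with a single-valued field defined in schema.xml like this (the field name is illustrative):

```xml
<field name="category" type="string" indexed="true" stored="true" multiValued="false" />
```

supplying several values for this field through DIH will still succeed, while sending the same document over HTTP POST would be rejected.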

5 thoughts on “Data Import Handler – import from Solr XML files”

  • 25 March 2015 at 18:55

    Issue:
    ———-
    When I am uploading, I get the following result:
    Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.
    Requests: 0, Fetched: 5, Skipped: 0, Processed: 1
    Started: about an hour ago
    In the query result I can see only the last node. It is not looping over all the nodes. Please let me know, is there an issue in the config file?


    • 11 April 2015 at 23:10

      Sorry, but we don’t allow pasting XML files, so I can’t really see what the content of the configuration file is 🙁

  • 26 June 2015 at 12:04

    Thanks for the info. I am able to configure it until I try to run the import. I think I’m doing it wrong, because I’m posting to /update/extract/, where the ExtractingRequestHandler takes action.

    I have set up this requestHandler:

    data-config.xml

    Should I fire the import process with this handler, and how?

    Thanks in advance 😉

  • 29 April 2016 at 10:50

    I want to import data from a MySQL table and a CSV file at the same time, because some of the data is in MySQL tables and some is in a CSV file. I want to match a specific id from the MySQL table in the CSV file, and then add the data to Solr.

    What I think or want to do…

    Is this possible in Solr?

    Please suggest how to import data from CSV and a MySQL table at the same time.

