{"id":363,"date":"2011-08-16T21:46:51","date_gmt":"2011-08-16T19:46:51","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=363"},"modified":"2020-11-11T21:47:53","modified_gmt":"2020-11-11T20:47:53","slug":"data-import-handler-import-from-solr-xml-files","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/08\/16\/data-import-handler-import-from-solr-xml-files\/","title":{"rendered":"Data Import Handler \u2013 import from Solr XML files"},"content":{"rendered":"<p>So far, in previous articles, we looked at the import data from SQL databases. Today it&#8217;s time to import from XML files.<\/p>\n\n\n<!--more-->\n\n\n<h2>Example<\/h2>\n<p>Lets look at the following example:\n<\/p>\n<pre class=\"brush:xml\">&lt;dataConfig&gt;\n  &lt;dataSource type=\"FileDataSource\" \/&gt;\n  &lt;document&gt;\n    &lt;entity\n      name=\"document\"\n      processor=\"FileListEntityProcessor\"\n      baseDir=\"\/home\/import\/data\/2011-06-27\"\n      fileName=\".*\\.xml\n<h2>Explanation of the example<\/h2>\n<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:\n<\/p>\n<pre class=\"brush:xml\">&lt;dataSource\n  type=\"FileDataSource\"\n  basePath=\"\/home\/import\/input\"\n  encoding=\"utf-8\"\/&gt;<\/pre>\n<p>The additional, not mandatory attributes are obvious:<\/p>\n<ul>\n<li><strong>basePath<\/strong> \u2013 the directory which will be used to calculate the relative path of the \"entity\" tag<\/li>\n<li><strong>encoding<\/strong> \u2013 the file encoding (default: the OS default encoding)<\/li>\n<\/ul>\n<p>After the source definition, we have document definition with two nested entities.<\/p>\n<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource=\"null\"<\/em>). 
The attributes used are:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 a regular expression that selects which files to process<\/li>\n<li><strong>recursive<\/strong> \u2013 whether subdirectories should be scanned as well (default: false)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 determines whether the data from this entity should be treated as the source of documents. Because we don't want to index the file list this entity produces, we set this attribute to <em>false<\/em>. With <em>rootEntity=\"false\"<\/em>, the nested entity is treated as the root entity and its data is indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) \u2013 the directory where the files are located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to \"null\" because the entity doesn't use a data source (this parameter can be omitted in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 a regular expression that selects which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 only files newer than the parameter value will be taken into consideration. This parameter accepts a value in the format YYYY-MM-dd HH:mm:ss, a single-quoted date expression such as 'NOW - 7DAYS', or a variable that contains the date, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but for older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of this parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013 only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>Once we have the list of files, we can move on to the inner entity, whose task is to read the actual data contained in the files. The data is read from the file indicated by the outer entity, using the data source. 
The processor used for XML files, <strong>XPathEntityProcessor<\/strong>, has the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the location of the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 indicates that the input data is in the Solr XML update format<\/li>\n<li><strong>stream<\/strong> \u2013 whether to use streaming for document processing. In the case of large XML files, it's good to use <em>stream=\"true\"<\/em>, which uses far less memory because it doesn't try to load the whole XML file into memory.<\/li>\n<\/ul>\n<p>The additional parameters are not useful in our case, so we will describe them on another occasion. :)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example above reads all XML files from the selected directory. We used exactly the same file format as the \"classical\" method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument is control over the connections between the Solr server and the system responsible for generating the files to be indexed. When we do not have full control over the data source, it's better to pull the data from the source than to expose additional services that could become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case, I decided to add more advanced search capabilities and the ability to facet on the category tree. The documents contained a \"category\" field storing a path such as \"Cars \/ Four seats \/ Audi\". To support the new queries, we need additional fields in the index that hold the category name at each level, as well as how many levels there are.<\/p>\n<p>To add the required fields, we used the ability to define scripts. 
The previously quoted import configuration now looks like this:\n<\/p>\n<pre class=\"brush:xml\">&lt;dataConfig&gt;\n  &lt;script&gt;&lt;![CDATA[\n    function CategoryPieces(row) {\n      var pieces = row.get('category').split('\/');\n      var arr = new Array();\n      for (var i = 0; i &lt; pieces.length; i++) {\n        row.put('category_level_' + i, pieces[i].trim());\n        arr[i] = pieces[i].trim();\n      }\n      row.put('category_level_max', (pieces.length - 1).toFixed());\n      row.put('category', arr.join('\/'));\n      return row;\n    }\n  ]]&gt;&lt;\/script&gt;\n  &lt;dataSource type=\"FileDataSource\" \/&gt;\n  &lt;document&gt;\n    &lt;entity\n      name=\"document\"\n      processor=\"FileListEntityProcessor\"\n      baseDir=\"\/home\/import\/data\/2011-06-27\"\n      fileName=\".*\\.xml\"\n      recursive=\"false\"\n      rootEntity=\"false\"\n      dataSource=\"null\"&gt;\n      &lt;entity\n        processor=\"XPathEntityProcessor\"\n        transformer=\"script:CategoryPieces\"\n        url=\"${document.fileAbsolutePath}\"\n        useSolrAddSchema=\"true\"\n        stream=\"true\"&gt;\n      &lt;\/entity&gt;\n    &lt;\/entity&gt;\n  &lt;\/document&gt;\n&lt;\/dataConfig&gt;<\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently from the standard update handler. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH \u2013 the excess values are simply ignored. With the \"classical\" indexing method, Solr would return an error.<\/p>
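\n<p>As a reminder, the files read by this configuration are ordinary Solr XML update files \u2013 exactly the same format used when posting documents over HTTP. A minimal sketch of such a file (the field names here are hypothetical and must match your schema.xml):\n<\/p>\n<pre class=\"brush:xml\">&lt;add&gt;\n  &lt;doc&gt;\n    &lt;field name=\"id\"&gt;1&lt;\/field&gt;\n    &lt;field name=\"category\"&gt;Cars \/ Four seats \/ Audi&lt;\/field&gt;\n  &lt;\/doc&gt;\n&lt;\/add&gt;<\/pre>\n<p>Assuming the DIH request handler is registered under \/dataimport in solrconfig.xml, the import itself can then be started with a request like: http:\/\/localhost:8983\/solr\/dataimport?command=full-import<\/p>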
To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource=\"null\"<\/em>). The used attributes:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 regular expression that says which files to choose<\/li>\n<li><strong>recursive<\/strong> \u2013 should subdirectories be checked&nbsp; (default: no)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false<\/em>. After setting this attribute to <em>false<\/em> the next entity will be treated as main entity and its data will be indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) - the directory where the files should be located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to \"null\" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 regular expression which says which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW \u2013 7DAYS' or an variable that contains the data, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but says about older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of the parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. 
The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor<\/strong> used for XML files have the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 information, that the input data is in Solr XML format<\/li>\n<li><strong>stream<\/strong> \u2013 should we use stream for document processing. In case of large XML files, it's good to use <em>stream=\"true\"<\/em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.<\/li>\n<\/ul>\n<p>Additional parameters are not useful in our case and we describe them on another occasion:)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the \"classical\" method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the \"category\" field, where it stored the path such as \"Cars \/ Four sits \/ Audi\". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.<\/p>\n<p>To add the required fields we used the ability to define scripts. 
Quoted previously imported file now looks like this:\n<\/p>\n<pre wp-pre-tag-2=\"\"><\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of \"classical\" indexing method Solr will return an error.<\/p>\n{document.fileAbsolutePath}\"\n        useSolrAddSchema=\"true\"\n        stream=\"true\"&gt;\n      &lt;\/entity&gt;\n    &lt;\/entity&gt;\n  &lt;\/document&gt;\n&lt;\/dataConfig&gt;<\/pre>\n<h2>Explanation of the example<\/h2>\n<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:\n<\/p>\n<pre wp-pre-tag-1=\"\"><\/pre>\n<p>The additional, not mandatory attributes are obvious:<\/p>\n<ul>\n<li><strong>basePath<\/strong> \u2013 the directory which will be used to calculate the relative path of the \"entity\" tag<\/li>\n<li><strong>encoding<\/strong> \u2013 the file encoding (default: the OS default encoding)<\/li>\n<\/ul>\n<p>After the source definition, we have document definition with two nested entities.<\/p>\n<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource=\"null\"<\/em>). The used attributes:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 regular expression that says which files to choose<\/li>\n<li><strong>recursive<\/strong> \u2013 should subdirectories be checked&nbsp; (default: no)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false<\/em>. 
After setting this attribute to <em>false<\/em> the next entity will be treated as main entity and its data will be indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) - the directory where the files should be located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to \"null\" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 regular expression which says which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW \u2013 7DAYS' or an variable that contains the data, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but says about older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of the parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor<\/strong> used for XML files have the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 information, that the input data is in Solr XML format<\/li>\n<li><strong>stream<\/strong> \u2013 should we use stream for document processing. 
In case of large XML files, it's good to use <em>stream=\"true\"<\/em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.<\/li>\n<\/ul>\n<p>Additional parameters are not useful in our case and we describe them on another occasion:)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the \"classical\" method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the \"category\" field, where it stored the path such as \"Cars \/ Four sits \/ Audi\". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.<\/p>\n<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:\n<\/p>\n<pre wp-pre-tag-2=\"\"><\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. 
In the case of \"classical\" indexing method Solr will return an error.<\/p>\n\"\n      recursive=\"false\"\n      rootEntity=\"false\"\n      dataSource=\"null\"&gt;\n      &lt;entity\n        processor=\"XPathEntityProcessor\"\n        transformer=\u201dscript:CategoryPieces\u201d\n        url=\"\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of \"classical\" indexing method Solr will return an error.<\/p>\n\"\n      recursive=\"false\"\n      rootEntity=\"false\"\n      dataSource=\"null\"&gt;\n      &lt;entity\n        processor=\"XPathEntityProcessor\"\n        url=\"\n<h2>Explanation of the example<\/h2>\n<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:\n<\/p>\n<pre wp-pre-tag-1=\"\"><\/pre>\n<p>The additional, not mandatory attributes are obvious:<\/p>\n<ul>\n<li><strong>basePath<\/strong> \u2013 the directory which will be used to calculate the relative path of the \"entity\" tag<\/li>\n<li><strong>encoding<\/strong> \u2013 the file encoding (default: the OS default encoding)<\/li>\n<\/ul>\n<p>After the source definition, we have document definition with two nested entities.<\/p>\n<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource=\"null\"<\/em>). 
The used attributes:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 regular expression that says which files to choose<\/li>\n<li><strong>recursive<\/strong> \u2013 should subdirectories be checked&nbsp; (default: no)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false<\/em>. After setting this attribute to <em>false<\/em> the next entity will be treated as main entity and its data will be indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) - the directory where the files should be located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to \"null\" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 regular expression which says which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW \u2013 7DAYS' or an variable that contains the data, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but says about older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of the parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. 
The processor type: <strong>XpathEntityProcessor<\/strong> used for XML files have the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 information, that the input data is in Solr XML format<\/li>\n<li><strong>stream<\/strong> \u2013 should we use stream for document processing. In case of large XML files, it's good to use <em>stream=\"true\"<\/em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.<\/li>\n<\/ul>\n<p>Additional parameters are not useful in our case and we describe them on another occasion:)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the \"classical\" method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the \"category\" field, where it stored the path such as \"Cars \/ Four sits \/ Audi\". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.<\/p>\n<p>To add the required fields we used the ability to define scripts. 
Quoted previously imported file now looks like this:\n<\/p>\n<pre wp-pre-tag-2=\"\"><\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of \"classical\" indexing method Solr will return an error.<\/p>\n{document.fileAbsolutePath}\"\n        useSolrAddSchema=\"true\"\n        stream=\"true\"&gt;\n      &lt;\/entity&gt;\n    &lt;\/entity&gt;\n  &lt;\/document&gt;\n&lt;\/dataConfig&gt;<\/pre>\n<h2>Explanation of the example<\/h2>\n<p>In comparison with the examples from the earlier articles a new type appeared &#8211; the FileDataSource. Example of a complete call:\n<\/p>\n<pre wp-pre-tag-1=\"\"><\/pre>\n<p>The additional, not mandatory attributes are obvious:<\/p>\n<ul>\n<li><strong>basePath<\/strong> \u2013 the directory which will be used to calculate the relative path of the &#8220;entity&#8221; tag<\/li>\n<li><strong>encoding<\/strong> \u2013 the file encoding (default: the OS default encoding)<\/li>\n<\/ul>\n<p>After the source definition, we have document definition with two nested entities.<\/p>\n<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn&#8217;t need any data source (thus <em>dataSource=&#8221;null&#8221;<\/em>). The used attributes:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 regular expression that says which files to choose<\/li>\n<li><strong>recursive<\/strong> \u2013 should subdirectories be checked&nbsp; (default: no)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 says about if the data from the entity should be treated as documents source. Because we don&#8217;t want to index files list, which this entity provides, we need to set this attribute to <em>false<\/em>. 
After setting this attribute to <em>false<\/em> the next entity will be treated as main entity and its data will be indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) &#8211; the directory where the files should be located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to &#8220;null&#8221; because the entity doesn&#8217;t use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 regular expression which says which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: &#8216;NOW \u2013 7DAYS&#8217; or an variable that contains the data, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but says about older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of the parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>If you already have a list of files we can go further, the inner entity &#8211; its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor<\/strong> used for XML files have the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 information, that the input data is in Solr XML format<\/li>\n<li><strong>stream<\/strong> \u2013 should we use stream for document processing. 
In case of large XML files, it&#8217;s good to use <em>stream=&#8221;true&#8221;<\/em> which will&nbsp; use far less memory and won&#8217;t try to load the whole XML file into the memory.<\/li>\n<\/ul>\n<p>Additional parameters are not useful in our case and we describe them on another occasion:)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the &#8220;classical&#8221; method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it&#8217;s better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the &#8220;category&#8221; field, where it stored the path such as &#8220;Cars \/ Four sits \/ Audi&#8221;. To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.<\/p>\n<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:\n<\/p>\n<pre wp-pre-tag-2=\"\"><\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds &#8211; the excess value will be ignored. 
In the case of &#8220;classical&#8221; indexing method Solr will return an error.<\/p>\n{document.fileAbsolutePath}&#8221;\n        useSolrAddSchema=&#8221;true&#8221;\n        stream=&#8221;true&#8221;&gt;\n      &lt;\/entity&gt;\n    &lt;\/entity&gt;\n  &lt;\/document&gt;\n&lt;\/dataConfig&gt;\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values \u200b\u200bto a field that is not multivalued (in the schema.xml file) in DIH succeeds &#8211; the excess value will be ignored. In the case of &#8220;classical&#8221; indexing method Solr will return an error.<\/p>\n&#8221;\n      recursive=&#8221;false&#8221;\n      rootEntity=&#8221;false&#8221;\n      dataSource=&#8221;null&#8221;&gt;\n      &lt;entity\n        processor=&#8221;XPathEntityProcessor&#8221;\n        url=&#8221;\n<h2>Explanation of the example<\/h2>\n<p>In comparison with the examples from the earlier articles a new type appeared &#8211; the FileDataSource. Example of a complete call:\n<\/p>\n<pre wp-pre-tag-1=\"\"><\/pre>\n<p>The additional, not mandatory attributes are obvious:<\/p>\n<ul>\n<li><strong>basePath<\/strong> \u2013 the directory which will be used to calculate the relative path of the &#8220;entity&#8221; tag<\/li>\n<li><strong>encoding<\/strong> \u2013 the file encoding (default: the OS default encoding)<\/li>\n<\/ul>\n<p>After the source definition, we have document definition with two nested entities.<\/p>\n<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor<\/strong>.This entity is self-supporting and doesn&#8217;t need any data source (thus <em>dataSource=&#8221;null&#8221;<\/em>). 
The used attributes:<\/p>\n<ul>\n<li><strong>fileName<\/strong> (mandatory) \u2013 regular expression that says which files to choose<\/li>\n<li><strong>recursive<\/strong> \u2013 should subdirectories be checked&nbsp; (default: no)<\/li>\n<li><strong>rootEntity<\/strong> \u2013 says about if the data from the entity should be treated as documents source. Because we don&#8217;t want to index files list, which this entity provides, we need to set this attribute to <em>false<\/em>. After setting this attribute to <em>false<\/em> the next entity will be treated as main entity and its data will be indexed.<\/li>\n<li><strong>baseDir<\/strong> (mandatory) &#8211; the directory where the files should be located<\/li>\n<li><strong>dataSource<\/strong> \u2013 in this case we set this parameter to &#8220;null&#8221; because the entity doesn&#8217;t use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)<\/li>\n<li><strong>excludes<\/strong> \u2013 regular expression which says which files to exclude from indexing<\/li>\n<li><strong>newerThan<\/strong> \u2013 says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: &#8216;NOW \u2013 7DAYS&#8217; or an variable that contains the data, for example: ${variable}<\/li>\n<li><strong>olderThan<\/strong> \u2013 the same as above, but says about older files<\/li>\n<li><strong>biggerThan<\/strong> \u2013 only files bigger than the value of the parameter will be taken into consideration<\/li>\n<li><strong>smallerThan<\/strong> \u2013only files smaller than the value of this parameter will be taken into consideration<\/li>\n<\/ul>\n<p>If you already have a list of files we can go further, the inner entity &#8211; its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. 
The processor type <strong>XPathEntityProcessor<\/strong>, used for XML files, has the following attributes:<\/p>\n<ul>\n<li><strong>url<\/strong> \u2013 the input data<\/li>\n<li><strong>useSolrAddSchema<\/strong> \u2013 indicates that the input data is in the Solr XML format<\/li>\n<li><strong>stream<\/strong> \u2013 whether to use streaming for document processing. In the case of large XML files it&#8217;s good to use <em>stream=\"true\"<\/em>, which uses far less memory and doesn&#8217;t try to load the whole XML file into memory.<\/li>\n<\/ul>\n<p>The additional parameters are not useful in our case and we will describe them on another occasion :)<\/p>\n<h2>But why all this?<\/h2>\n<p>The example allows reading all XML files from the selected directory. We used exactly the same file format as the &#8220;classical&#8221; method of sending documents to Solr using HTTP POST. So why use this method?<\/p>\n<h2>Push and Pull<\/h2>\n<p>The first argument is control over the connections between the Solr server and the system responsible for generating the files to be indexed. When we do not have full control over the data source, it&#8217;s safer to pull the data from the source than to expose additional services that could become a target of attack.<\/p>\n<h2>Prototyping and change testing<\/h2>\n<p>How does it work in practice? In my case I decided to add the possibility of more advanced search and the ability to facet on the category tree. The document contained the &#8220;category&#8221; field, which stored a path such as &#8220;Cars \/ Four seats \/ Audi&#8221;. To support the new queries, we needed additional fields in the index that hold the category name, the level of the category and the number of levels.<\/p>\n<p>To add the required fields we used the ability to define scripts. 
The configuration file quoted earlier now looks like this:\n<\/p>\n<pre wp-pre-tag-2=\"\"><\/pre>\n<h2>Note at the end<\/h2>\n<p>When using DIH we need to be aware that it works a little differently. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH &#8211; the excess values are simply ignored. With the &#8220;classical&#8221; indexing method Solr will return an error.<\/p>
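<p>Purely as an illustration \u2013 a script-based configuration of the kind described above can be sketched as follows. The function name and the output fields (category_name, category_levels) are assumptions made for this sketch, not the exact code of the original setup:<\/p>
<pre class=\"brush:xml\">&lt;dataConfig&gt;\n  &lt;!-- JavaScript executed by the DIH ScriptTransformer for every row --&gt;\n  &lt;script&gt;&lt;![CDATA[\n    function splitCategory(row) {\n      var path = row.get('category');\n      if (path != null) {\n        var parts = path.split(' \/ ');\n        var names = new java.util.ArrayList();\n        for (var i = 0; i &lt; parts.length; i++) {\n          names.add(parts[i].trim());\n        }\n        row.put('category_name', names);            \/\/ one value per level\n        row.put('category_levels', parts.length);   \/\/ depth of the tree\n      }\n      return row;\n    }\n  ]]&gt;&lt;\/script&gt;\n  &lt;!-- the inner entity from the example, extended with the transformer --&gt;\n  &lt;entity\n    processor=\"XPathEntityProcessor\"\n    url=\"${document.fileAbsolutePath}\"\n    useSolrAddSchema=\"true\"\n    stream=\"true\"\n    transformer=\"script:splitCategory\"&gt;\n  &lt;\/entity&gt;\n&lt;\/dataConfig&gt;<\/pre>
<p>With category_name defined as a multivalued field in schema.xml, faceting on any level of the tree becomes possible, and category_levels allows limiting queries to a chosen depth.<\/p>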
","protected":false},"excerpt":{"rendered":"<p>So far, in previous articles, we looked at the import data from SQL databases. Today it&#8217;s time to import from XML files.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[2],"tags":[407,256,254,204],"class_list":["post-363","post","type-post","status-publish","format-standard","hentry","category-general","tag-apache-solr","tag-data-import-handler-2","tag-dih-2","tag-import-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/363","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=363"}],"version-history":[{"count":2,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/363\/revisions"}],"predecessor-version":[{"id":365,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/363\/revisions\/365"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=363"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=363"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=363"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}