<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>import &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/import-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 20:47:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Data Import Handler – import from Solr XML files</title>
		<link>https://solr.pl/en/2011/08/16/data-import-handler-import-from-solr-xml-files/</link>
					<comments>https://solr.pl/en/2011/08/16/data-import-handler-import-from-solr-xml-files/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Tue, 16 Aug 2011 19:46:51 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[apache solr]]></category>
		<category><![CDATA[data import handler]]></category>
		<category><![CDATA[dih]]></category>
		<category><![CDATA[import]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=363</guid>

					<description><![CDATA[So far, in previous articles, we looked at the import data from SQL databases. Today it&#8217;s time to import from XML files. Example Lets look at the following example: &#60;dataConfig&#62; &#60;dataSource type="FileDataSource" /&#62; &#60;document&#62; &#60;entity name="document" processor="FileListEntityProcessor" baseDir="/home/import/data/2011-06-27" fileName=".*\.xml Explanation]]></description>
					<content:encoded><![CDATA[<p>So far, in previous articles, we looked at importing data from SQL databases. Today it&#8217;s time to import from XML files.</p>
<p><span id="more-363"></span></p>
<h2>Example</h2>
<p>Let's look at the following example:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
  &lt;dataSource type="FileDataSource" /&gt;
  &lt;document&gt;
    &lt;entity
      name="document"
      processor="FileListEntityProcessor"
      baseDir="/home/import/data/2011-06-27"
      fileName=".*\.xml
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles, a new type appears: the FileDataSource. An example of a complete definition:</p>
<pre class="brush:xml">&lt;dataSource
  type="FileDataSource"
  basePath="/home/import/input"
  encoding="utf-8"/&gt;</pre>
<p>The additional, optional attributes are straightforward:</p>
<ul>
<li><strong>basePath</strong> – the base directory used to resolve relative paths used by the entities</li>
<li><strong>encoding</strong> – the file encoding (default: the operating system's default encoding)</li>
</ul>
<p>After the source definition, we have the document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the list of files. To do that, we use the <strong>FileListEntityProcessor</strong>. This entity is self-supporting and doesn't need any data source (thus <em>dataSource="null"</em>). The attributes used are:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – a regular expression that selects the files to import</li>
<li><strong>recursive</strong> – whether subdirectories should also be checked (default: false)</li>
<li><strong>rootEntity</strong> – specifies whether the data from this entity should be treated as the source of documents. Because we don't want to index the file list this entity provides, we set this attribute to <em>false</em>. The nested entity is then treated as the main entity and its data is indexed.</li>
<li><strong>baseDir</strong> (mandatory) – the directory where the files are located</li>
<li><strong>dataSource</strong> – set to "null" because this entity doesn't use a data source (this parameter can be omitted in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – a regular expression that selects files to exclude from indexing</li>
<li><strong>newerThan</strong> – only files newer than the parameter value will be considered. The value can be a date in the format yyyy-MM-dd HH:mm:ss, a single-quoted date expression such as 'NOW-7DAYS', or a variable, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but for older files</li>
<li><strong>biggerThan</strong> – only files bigger than the parameter value will be considered</li>
<li><strong>smallerThan</strong> – only files smaller than the parameter value will be considered</li>
</ul>
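<p>To make the selection rules above concrete, here is a rough sketch of the filtering logic in plain JavaScript. This is an illustrative model only, not the actual FileListEntityProcessor code, and the file list and attribute values are made up:</p>

```javascript
// Illustrative model of FileListEntityProcessor's file selection
// (not the real DIH code). File names and attribute values are hypothetical.
function listFiles(files, opts) {
  var fileName = new RegExp(opts.fileName);                  // mandatory include pattern
  var excludes = opts.excludes ? new RegExp(opts.excludes) : null;
  return files.filter(function (f) {
    if (!fileName.test(f.name)) return false;                // must match fileName
    if (excludes && excludes.test(f.name)) return false;     // skip excluded files
    if (opts.biggerThan != null && f.size <= opts.biggerThan) return false;
    if (opts.smallerThan != null && f.size >= opts.smallerThan) return false;
    return true;
  });
}

var files = [
  { name: 'data-01.xml', size: 1024 },
  { name: 'data-02.xml.bak', size: 2048 },
  { name: 'readme.txt', size: 10 }
];
var selected = listFiles(files, { fileName: '.*\\.xml$', excludes: '.*\\.bak$' });
// selected contains only data-01.xml
```

The real processor additionally applies the date-based filters (newerThan, olderThan) in the same accept/reject fashion.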
<p>Once we have the list of files, we can go further. The inner entity's task is to read the actual data contained in the files; the data comes from the file pointed to by the outer entity and is read through the data source. The <strong>XPathEntityProcessor</strong> used for XML files has the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data (here, the path of the file to process)</li>
<li><strong>useSolrAddSchema</strong> – indicates that the input data is in the Solr XML add format</li>
<li><strong>stream</strong> – whether to use streaming during document processing. For large XML files it's good to set <em>stream="true"</em>, which uses far less memory because it doesn't load the whole XML file into memory at once.</li>
</ul>
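<p>As a reminder, with <em>useSolrAddSchema="true"</em> the input files are expected to use the standard Solr XML update format, for example (the field names here are illustrative):</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="title"&gt;Example document&lt;/field&gt;
    &lt;field name="category"&gt;Cars / Four seats / Audi&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre>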
<p>The additional parameters are not useful in our case; we will describe them on another occasion. :)</p>
<h2>But why all this?</h2>
<p>This example reads all XML files from the selected directory. The files use exactly the same format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument is control over the connections between the Solr server and the system responsible for generating the files to be indexed. When we do not have full control over the data source, it's better to pull the data from the source than to expose additional services that could become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case, I decided to add more advanced search and the ability to facet on the category tree. The document contained a "category" field storing a path such as "Cars / Four seats / Audi". To support the new queries, we need additional fields in the index that hold the category name at each level, the level of the category, and the number of levels.</p>
<p>To add the required fields, we used the ability to define scripts. The file quoted earlier now looks like this:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
  &lt;script&gt;&lt;![CDATA[
    function CategoryPieces(row) {
      var pieces = row.get('category').split('/');
      var arr = new Array();
      for (var i=0; i &lt; pieces.length; i++) {
        row.put('category_level_' + i, pieces[i].trim());
        arr[i] = pieces[i].trim();
      }
      row.put('category_level_max', (pieces.length - 1).toFixed());
      row.put('category', arr.join('/'));
      return row;
  }
  ]]&gt;&lt;/script&gt;
  &lt;dataSource type="FileDataSource" /&gt;
  &lt;document&gt;
    &lt;entity
      name="document"
      processor="FileListEntityProcessor"
      baseDir="/home/import/data/2011-06-27"
      fileName=".*\.xml
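<p>To see what the transformer produces, we can run the same logic outside DIH on a sample row. This is a plain JavaScript sketch in which the Map-like row object that DIH passes to the script is simulated with a small wrapper:</p>

```javascript
// Standalone sketch of what the CategoryPieces transformer does to one row.
// DIH passes a Map-like row object; here we simulate it with a tiny wrapper.
function makeRow(data) {
  return {
    get: function (k) { return data[k]; },
    put: function (k, v) { data[k] = v; },
    data: data
  };
}

function CategoryPieces(row) {
  var pieces = row.get('category').split('/');
  var arr = new Array();
  for (var i = 0; i < pieces.length; i++) {
    row.put('category_level_' + i, pieces[i].trim());
    arr[i] = pieces[i].trim();
  }
  row.put('category_level_max', (pieces.length - 1).toFixed());
  row.put('category', arr.join('/'));
  return row;
}

var row = CategoryPieces(makeRow({ category: 'Cars / Four seats / Audi' }));
// row.data now contains:
//   category_level_0 = 'Cars', category_level_1 = 'Four seats',
//   category_level_2 = 'Audi', category_level_max = '2',
//   category = 'Cars/Four seats/Audi'
```

This is exactly the set of extra fields the schema needs: one field per category level, plus the maximum level for faceting.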
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently than the "classical" indexing method. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds under DIH - the excess values are silently ignored. With the "classical" indexing method, Solr returns an error.</p>
"
      recursive="false"
      rootEntity="false"
      dataSource="null"&gt;
      &lt;entity
        processor="XPathEntityProcessor"
        url="
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:
</p>
<pre wp-pre-tag-1=""></pre>
<p>The additional, not mandatory attributes are obvious:</p>
<ul>
<li><strong>basePath</strong> – the directory which will be used to calculate the relative path of the "entity" tag</li>
<li><strong>encoding</strong> – the file encoding (default: the OS default encoding)</li>
</ul>
<p>After the source definition, we have document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor</strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource="null"</em>). The used attributes:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – regular expression that says which files to choose</li>
<li><strong>recursive</strong> – should subdirectories be checked&nbsp; (default: no)</li>
<li><strong>rootEntity</strong> – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false</em>. After setting this attribute to <em>false</em> the next entity will be treated as main entity and its data will be indexed.</li>
<li><strong>baseDir</strong> (mandatory) - the directory where the files should be located</li>
<li><strong>dataSource</strong> – in this case we set this parameter to "null" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – regular expression which says which files to exclude from indexing</li>
<li><strong>newerThan</strong> – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but says about older files</li>
<li><strong>biggerThan</strong> – only files bigger than the value of the parameter will be taken into consideration</li>
<li><strong>smallerThan</strong> –only files smaller than the value of this parameter will be taken into consideration</li>
</ul>
<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor</strong> used for XML files have the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data</li>
<li><strong>useSolrAddSchema</strong> – information, that the input data is in Solr XML format</li>
<li><strong>stream</strong> – should we use stream for document processing. In case of large XML files, it's good to use <em>stream="true"</em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.</li>
</ul>
<p>Additional parameters are not useful in our case and we describe them on another occasion:)</p>
<h2>But why all this?</h2>
<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.</p>
<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:
</p>
<pre wp-pre-tag-2=""></pre>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.</p>
<p>{document.fileAbsolutePath}"<br />
        useSolrAddSchema="true"<br />
        stream="true"&gt;<br />
      &lt;/entity&gt;<br />
    &lt;/entity&gt;<br />
  &lt;/document&gt;<br />
&lt;/dataConfig&gt;</p>
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:
</p>
<pre wp-pre-tag-1=""></pre>
<p>The additional, not mandatory attributes are obvious:</p>
<ul>
<li><strong>basePath</strong> – the directory which will be used to calculate the relative path of the "entity" tag</li>
<li><strong>encoding</strong> – the file encoding (default: the OS default encoding)</li>
</ul>
<p>After the source definition, we have document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor</strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource="null"</em>). The used attributes:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – regular expression that says which files to choose</li>
<li><strong>recursive</strong> – should subdirectories be checked&nbsp; (default: no)</li>
<li><strong>rootEntity</strong> – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false</em>. After setting this attribute to <em>false</em> the next entity will be treated as main entity and its data will be indexed.</li>
<li><strong>baseDir</strong> (mandatory) - the directory where the files should be located</li>
<li><strong>dataSource</strong> – in this case we set this parameter to "null" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – regular expression which says which files to exclude from indexing</li>
<li><strong>newerThan</strong> – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but says about older files</li>
<li><strong>biggerThan</strong> – only files bigger than the value of the parameter will be taken into consideration</li>
<li><strong>smallerThan</strong> –only files smaller than the value of this parameter will be taken into consideration</li>
</ul>
<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor</strong> used for XML files have the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data</li>
<li><strong>useSolrAddSchema</strong> – information, that the input data is in Solr XML format</li>
<li><strong>stream</strong> – should we use stream for document processing. In case of large XML files, it's good to use <em>stream="true"</em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.</li>
</ul>
<p>Additional parameters are not useful in our case and we describe them on another occasion:)</p>
<h2>But why all this?</h2>
<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.</p>
<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:
</p>
<pre wp-pre-tag-2=""></pre>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.</p>
<p>"<br />
      recursive="false"<br />
      rootEntity="false"<br />
      dataSource="null"&gt;<br />
      &lt;entity<br />
        processor="XPathEntityProcessor"<br />
        transformer=”script:CategoryPieces”<br />
        url="</p>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.</p>
<p>"<br />
      recursive="false"<br />
      rootEntity="false"<br />
      dataSource="null"&gt;<br />
      &lt;entity<br />
        processor="XPathEntityProcessor"<br />
        url="</p>
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles a new type appeared - the FileDataSource. Example of a complete call:
</p>
<pre wp-pre-tag-1=""></pre>
<p>The additional, not mandatory attributes are obvious:</p>
<ul>
<li><strong>basePath</strong> – the directory which will be used to calculate the relative path of the "entity" tag</li>
<li><strong>encoding</strong> – the file encoding (default: the OS default encoding)</li>
</ul>
<p>After the source definition, we have document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor</strong>.This entity is self-supporting and doesn't need any data source (thus <em>dataSource="null"</em>). The used attributes:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – regular expression that says which files to choose</li>
<li><strong>recursive</strong> – should subdirectories be checked&nbsp; (default: no)</li>
<li><strong>rootEntity</strong> – says about if the data from the entity should be treated as documents source. Because we don't want to index files list, which this entity provides, we need to set this attribute to <em>false</em>. After setting this attribute to <em>false</em> the next entity will be treated as main entity and its data will be indexed.</li>
<li><strong>baseDir</strong> (mandatory) - the directory where the files should be located</li>
<li><strong>dataSource</strong> – in this case we set this parameter to "null" because the entity doesn't use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – regular expression which says which files to exclude from indexing</li>
<li><strong>newerThan</strong> – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: 'NOW – 7DAYS' or an variable that contains the data, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but says about older files</li>
<li><strong>biggerThan</strong> – only files bigger than the value of the parameter will be taken into consideration</li>
<li><strong>smallerThan</strong> –only files smaller than the value of this parameter will be taken into consideration</li>
</ul>
<p>If you already have a list of files we can go further, the inner entity - its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor</strong> used for XML files have the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data</li>
<li><strong>useSolrAddSchema</strong> – information, that the input data is in Solr XML format</li>
<li><strong>stream</strong> – should we use stream for document processing. In case of large XML files, it's good to use <em>stream="true"</em> which will&nbsp; use far less memory and won't try to load the whole XML file into the memory.</li>
</ul>
<p>Additional parameters are not useful in our case and we describe them on another occasion:)</p>
<h2>But why all this?</h2>
<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the "classical" method of sending documents to Solr using HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it's better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the "category" field, where it stored the path such as "Cars / Four sits / Audi". To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.</p>
<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:
</p>
<pre wp-pre-tag-2=""></pre>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds - the excess value will be ignored. In the case of "classical" indexing method Solr will return an error.</p>
<p>{document.fileAbsolutePath}"<br />
        useSolrAddSchema="true"<br />
        stream="true"&gt;<br />
      &lt;/entity&gt;<br />
    &lt;/entity&gt;<br />
  &lt;/document&gt;<br />
&lt;/dataConfig&gt;</p>
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles a new type appeared &#8211; the FileDataSource. Example of a complete call:
</p>
<pre wp-pre-tag-1=""></pre>
<p>The additional, not mandatory attributes are obvious:</p>
<ul>
<li><strong>basePath</strong> – the directory which will be used to calculate the relative path of the &#8220;entity&#8221; tag</li>
<li><strong>encoding</strong> – the file encoding (default: the OS default encoding)</li>
</ul>
<p>After the source definition, we have document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor</strong>.This entity is self-supporting and doesn&#8217;t need any data source (thus <em>dataSource=&#8221;null&#8221;</em>). The used attributes:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – regular expression that says which files to choose</li>
<li><strong>recursive</strong> – should subdirectories be checked&nbsp; (default: no)</li>
<li><strong>rootEntity</strong> – says about if the data from the entity should be treated as documents source. Because we don&#8217;t want to index files list, which this entity provides, we need to set this attribute to <em>false</em>. After setting this attribute to <em>false</em> the next entity will be treated as main entity and its data will be indexed.</li>
<li><strong>baseDir</strong> (mandatory) &#8211; the directory where the files should be located</li>
<li><strong>dataSource</strong> – in this case we set this parameter to &#8220;null&#8221; because the entity doesn&#8217;t use data source&nbsp; (we can ommit this parameter in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – regular expression which says which files to exclude from indexing</li>
<li><strong>newerThan</strong> – says, that only files newer that the parameter value will be taken into consideration. This parameter can take a value in the format: YYYY-MM-dd HH:mm:ss or a single quoted string, like: &#8216;NOW – 7DAYS&#8217; or an variable that contains the data, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but says about older files</li>
<li><strong>biggerThan</strong> – only files bigger than the value of the parameter will be taken into consideration</li>
<li><strong>smallerThan</strong> –only files smaller than the value of this parameter will be taken into consideration</li>
</ul>
<p>If you already have a list of files we can go further, the inner entity &#8211; its task is to download specific data contained in the files. The data is taken from the file specified by an external entity using a data source. The processor type: <strong>XpathEntityProcessor</strong> used for XML files have the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data</li>
<li><strong>useSolrAddSchema</strong> – information, that the input data is in Solr XML format</li>
<li><strong>stream</strong> – should we use stream for document processing. In case of large XML files, it&#8217;s good to use <em>stream=&#8221;true&#8221;</em> which will&nbsp; use far less memory and won&#8217;t try to load the whole XML file into the memory.</li>
</ul>
<p>Additional parameters are not useful in our case and we describe them on another occasion:)</p>
<h2>But why all this?</h2>
<p>The example allows to read all XML files from the selected directory. We used exactly the same file format as the &#8220;classical&#8221; method of sending documents to Solr using HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument can be the control of the connections between the Solr server and the system that is responsible for generating the files that are going to be indexed. When we do not have full control over the data source, it&#8217;s better to retrieve the data from the source than to provide additional services that could potentially become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case I decided to add the possibility of a more advanced search and an ability to do faceting on the category tree. The document contained the &#8220;category&#8221; field, where it stored the path such as &#8220;Cars / Four sits / Audi&#8221;. To create a new queries, we should have an additional fields in the index that hold the category name, the level of the category and also how many levels we have.</p>
<p>To add the required fields we used the ability to define scripts. Quoted previously imported file now looks like this:
</p>
<pre wp-pre-tag-2=""></pre>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds &#8211; the excess value will be ignored. In the case of &#8220;classical&#8221; indexing method Solr will return an error.</p>
<p>{document.fileAbsolutePath}&#8221;<br />
        useSolrAddSchema=&#8221;true&#8221;<br />
        stream=&#8221;true&#8221;&gt;<br />
      &lt;/entity&gt;<br />
    &lt;/entity&gt;<br />
  &lt;/document&gt;<br />
&lt;/dataConfig&gt;</p>
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, the attempt to load multiple values ​​to a field that is not multivalued (in the schema.xml file) in DIH succeeds &#8211; the excess value will be ignored. In the case of &#8220;classical&#8221; indexing method Solr will return an error.</p>
<p>&#8221;<br />
      recursive=&#8221;false&#8221;<br />
      rootEntity=&#8221;false&#8221;<br />
      dataSource=&#8221;null&#8221;&gt;<br />
      &lt;entity<br />
        processor=&#8221;XPathEntityProcessor&#8221;<br />
        url=&#8221;</p>
<h2>Explanation of the example</h2>
<p>In comparison with the examples from the earlier articles a new type appeared &#8211; the FileDataSource. Example of a complete call:
</p>
<pre wp-pre-tag-1=""></pre>
<p>The additional, not mandatory attributes are obvious:</p>
<ul>
<li><strong>basePath</strong> – the directory which will be used to calculate the relative path of the &#8220;entity&#8221; tag</li>
<li><strong>encoding</strong> – the file encoding (default: the OS default encoding)</li>
</ul>
<p>After the source definition, we have document definition with two nested entities.</p>
<p>The purpose of the main entity is to generate the files list. To do that, we use the <strong>FileListEntityProcessor</strong>.This entity is self-supporting and doesn&#8217;t need any data source (thus <em>dataSource=&#8221;null&#8221;</em>). The used attributes:</p>
<ul>
<li><strong>fileName</strong> (mandatory) – a regular expression defining which files to choose</li>
<li><strong>recursive</strong> – whether subdirectories should also be checked (default: no)</li>
<li><strong>rootEntity</strong> – defines whether the data from this entity should be treated as the source of documents. Because we don&#8217;t want to index the file list this entity provides, we set this attribute to <em>false</em>. With it set to <em>false</em>, the next (nested) entity is treated as the main entity and its data will be indexed.</li>
<li><strong>baseDir</strong> (mandatory) &#8211; the directory where the files are located</li>
<li><strong>dataSource</strong> – in this case we set this parameter to &#8220;null&#8221; because the entity doesn&#8217;t use a data source (this parameter can be omitted in Solr &gt; 1.3)</li>
<li><strong>excludes</strong> – a regular expression defining which files to exclude from indexing</li>
<li><strong>newerThan</strong> – only files newer than the parameter value will be taken into consideration. The parameter can take a value in the format YYYY-MM-dd HH:mm:ss, a single-quoted string like &#8216;NOW &#8211; 7DAYS&#8217;, or a variable containing the date, for example: ${variable}</li>
<li><strong>olderThan</strong> – the same as above, but for older files</li>
<li><strong>biggerThan</strong> – only files bigger than the parameter value will be taken into consideration</li>
<li><strong>smallerThan</strong> – only files smaller than the parameter value will be taken into consideration</li>
</ul>
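<p>Putting several of these attributes together, a file-listing entity could look like this sketch (the directory, the exclusion pattern and the date condition are only illustrative):</p>
<pre class="brush:xml">&lt;entity name="document"
        processor="FileListEntityProcessor"
        baseDir="/home/import/data"
        fileName=".*\.xml"
        excludes=".*_old\.xml"
        newerThan="'NOW-7DAYS'"
        recursive="true"
        rootEntity="false"
        dataSource="null"&gt;
  &lt;!-- the inner entity reading file contents goes here --&gt;
&lt;/entity&gt;</pre>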
<p>Now that we have the list of files, we can go further. The task of the inner entity is to extract the actual data contained in the files. The data is read from the file provided by the outer entity, using a data source. The processor used for XML files, <strong>XPathEntityProcessor</strong>, has the following attributes:</p>
<ul>
<li><strong>url</strong> – the input data</li>
<li><strong>useSolrAddSchema</strong> – information that the input data is in the Solr XML format</li>
<li><strong>stream</strong> – whether to use streaming for document processing. For large XML files it&#8217;s good to use <em>stream=&#8221;true&#8221;</em>, which uses far less memory and doesn&#8217;t try to load the whole XML file into memory.</li>
</ul>
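<p>Combining both entities, the whole configuration from this example (reassembled here from the fragments scattered through the text; the inner entity name is an assumption) looks more or less like this:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
  &lt;dataSource type="FileDataSource" /&gt;
  &lt;document&gt;
    &lt;entity name="document"
            processor="FileListEntityProcessor"
            baseDir="/home/import/data/2011-06-27"
            fileName=".*\.xml"
            rootEntity="false"
            dataSource="null"&gt;
      &lt;entity name="data"
              processor="XPathEntityProcessor"
              url="${document.fileAbsolutePath}"
              useSolrAddSchema="true"
              stream="true"&gt;
      &lt;/entity&gt;
    &lt;/entity&gt;
  &lt;/document&gt;
&lt;/dataConfig&gt;</pre>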
<p>The additional parameters are not useful in our case; we will describe them on another occasion :)</p>
<h2>But why all this?</h2>
<p>The example reads all XML files from the selected directory. We used exactly the same file format as in the &#8220;classical&#8221; method of sending documents to Solr over HTTP POST. So why use this method?</p>
<h2>Push and Pull</h2>
<p>The first argument is control over the connections between the Solr server and the system responsible for generating the files to be indexed. When we do not have full control over the data source, it&#8217;s better to pull the data from the source than to expose additional services that could potentially become a target of attack.</p>
<h2>Prototyping and change testing</h2>
<p>How does it work in practice? In my case I decided to add the possibility of more advanced search and the ability to facet on the category tree. The document contained the &#8220;category&#8221; field, which stored a path such as &#8220;Cars / Four seats / Audi&#8221;. To support the new queries, we need additional fields in the index that hold the category name, the level of the category and the total number of levels.</p>
<p>To add the required fields we used the ability to define scripts. The previously quoted import file now looks like this:
</p>
<pre wp-pre-tag-2=""></pre>
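<p>The original listing was lost in formatting, but a sketch of such a script transformer (the function name and the produced field names are only assumptions) could look like this:</p>
<pre class="brush:xml">&lt;script&gt;&lt;![CDATA[
  function splitCategory(row) {
    var path = row.get('category');
    if (path != null) {
      var parts = path.split(' / ');
      row.put('category_name', parts[parts.length - 1]);
      row.put('category_levels', parts.length);
    }
    return row;
  }
]]&gt;&lt;/script&gt;
...
&lt;entity name="data" transformer="script:splitCategory" ...&gt;</pre>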
<h2>Note at the end</h2>
<p>When using DIH we need to be aware that it works a little differently. In particular, an attempt to load multiple values into a field that is not multivalued (in the schema.xml file) succeeds in DIH &#8211; the excess values will simply be ignored. With the &#8220;classical&#8221; indexing method Solr would return an error.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/08/16/data-import-handler-import-from-solr-xml-files/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Data Import Handler – sharding</title>
		<link>https://solr.pl/en/2010/12/27/data-import-handler-sharding/</link>
					<comments>https://solr.pl/en/2010/12/27/data-import-handler-sharding/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 27 Dec 2010 08:04:17 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[data import handler]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[dih]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[sharding]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=184</guid>

					<description><![CDATA[Our reader (greetings!) reported a problem with the cooperation of DIH and the sharding mechanism. The Solr project wiki, in my opinion, does discuss the solution to this issue, but only indirectly and in passing. What is]]></description>
					<content:encoded><![CDATA[<p>Our reader (greetings!) reported a problem with making DIH cooperate with the sharding mechanism. The Solr project wiki does, in my opinion, discuss the solution to this issue, but only indirectly and in passing.</p>
<p><span id="more-184"></span></p>
<h2>What is sharding?</h2>
<p>Sharding means dividing data into several parts and storing and processing those parts independently. Additional logic within the application allows you to select the appropriate part of the data and/or to combine results from various sources. In the case of DIH and sharding we have to deal with the following cases:</p>
<ul>
<li>sharding on the side of the data source &#8211; this means multiple locations/tables with different parts of the data set</li>
<li>sharding on the SOLR side &#8211; that is, dividing the data from a source on many independent instances of SOLR</li>
<li>both of these simultaneously</li>
</ul>
<p>In our case we have one set of data and we want to create a lot of sets (called shards) on the SOLR side.</p>
<h2>When to use sharding?</h2>
<p>A very important question: <strong>why use the sharding mechanism?</strong> In my opinion sharding is abused too often, generating lots of additional complications and limitations. The main reason to use sharding is a volume of data so large that the Solr index <strong>does not fit on one machine</strong>. If that is not the case &#8211; sharding is often redundant. Another reason is performance, but sharding can help here only if other optimizations fail and the queries are so complex that the additional cost of sharding (forwarding queries to the individual shards and combining their answers) is smaller than the performance gain that can be achieved.</p>
<h2>Test data</h2>
<p>Let&#8217;s assume that we do need sharding. In the example below I used data from MusicBrainz, creating a simple PostgreSQL table:
</p>
<pre>                              Table "public.albums"
 Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------
 id     | integer | not null default nextval('albums_id_seq'::regclass)
 name   | text    | not null
 author | text    | not null
Indexes:
    "albums_pk" PRIMARY KEY, btree (id)</pre>
<p>The table contains 825,661 records. Let me stress that both the structure and the amount of data are so small that the practical usefulness of sharding here is negligible.</p>
<h2>Test installation</h2>
<p>For the tests we use three instances of Solr. All instances are identical and differ only in port numbers (8983, 7872, 6761) &#8211; the tests will be performed on one physical machine.</p>
<p>Definition in schema.xml:
</p>
<pre class="brush:xml">&lt;fields&gt;
 &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
 &lt;field name="album" type="text" indexed="true" stored="true" multiValued="true"/&gt;
 &lt;field name="author" type="text" indexed="true" stored="true" multiValued="true"/&gt;
&lt;/fields&gt;
&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;
&lt;defaultSearchField&gt;album&lt;/defaultSearchField&gt;</pre>
<p>Definition of DIH in solrconfig.xml:
</p>
<pre class="brush:xml">&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>And the DIH config file, db-data-config.xml:
</p>
<pre class="brush:xml">&lt;dataConfig&gt;
 &lt;dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" /&gt;
 &lt;document&gt;
  &lt;entity name="album" query="SELECT * from albums"&gt;
   &lt;field column="id" name="id" /&gt;
   &lt;field column="name" name="album" /&gt;
   &lt;field column="author" name="author" /&gt;
  &lt;/entity&gt;
 &lt;/document&gt;
&lt;/dataConfig&gt;</pre>
<p>At this point, each instance would simply import the complete data set.</p>
<h2>So let&#8217;s set up sharding</h2>
<p>Our goal is to modify the configuration so that each DIH instance indexes only &#8220;its&#8221; part of the data. The easiest way to do this is to modify the query retrieving the data to one like this:</p>
<pre>SELECT * from albums where id % NUMBER_OF_INSTANCES = INSTANCE_NUMBER</pre>
<p>where:</p>
<ul>
<li>NUMBER_OF_INSTANCES &#8211; the number of Solr servers that store the number of unique parts of the data set</li>
<li>INSTANCE_NUMBER &#8211; instance number (starting from zero)</li>
</ul>
<p>Such a query does not guarantee a perfectly equal distribution, but it satisfies two necessary conditions:</p>
<ul>
<li>a given record will always go to one specific, <strong>always the same</strong> instance</li>
<li>a single record will always go to <strong>only one</strong> instance</li>
</ul>
<p>So the db-data-config.xml on each machine is now different, with queries looking like this:</p>
<ul>
<li>SELECT * from albums where id % 3 = 0</li>
<li>SELECT * from albums where id % 3 = 1</li>
<li>SELECT * from albums where id % 3 = 2</li>
</ul>
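<p>For example, for instance number 0 the only change in db-data-config.xml is the query of the main entity:</p>
<pre class="brush:xml">&lt;entity name="album" query="SELECT * from albums where id % 3 = 0"&gt;
 &lt;field column="id" name="id" /&gt;
 &lt;field column="name" name="album" /&gt;
 &lt;field column="author" name="author" /&gt;
&lt;/entity&gt;</pre>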
<h2>How it works</h2>
<p>After starting up each of the Solr instances we run the following query on each of them:</p>
<p><em>/solr/dataimport?command=full-import</em></p>
<p>When the DIH command finishes, we send the following command:</p>
<p><em> /solr/dataimport?command=status</em></p>
<p>We should get the following responses:</p>
<ul>
<li>Added/Updated: 	275220 documents.</li>
<li>Added/Updated: 	275221 documents.</li>
<li>Added/Updated: 	275220 documents.</li>
</ul>
<p>Doing the simple addition, we see that across all instances we have a total of 825,661 documents, exactly as many as there should be <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /><br />
Let&#8217;s make another request &#8211; ask for all documents. Using sharding we can send the following query to any instance:</p>
<p><em>/solr/select/?q=*:*&amp;shards=localhost:6761/solr,localhost:7872/solr,localhost:8983/solr</em></p>
<p>Result:  825661.</p>
<p>It works! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/12/27/data-import-handler-sharding/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Data Import Handler – How to import data from SQL databases (part 3)</title>
		<link>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/</link>
					<comments>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 22 Nov 2010 22:38:15 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[data import handler]]></category>
		<category><![CDATA[dih]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[integration]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=154</guid>

					<description><![CDATA[In previous episodes&#160; (part 1 and part 2) we were able to import data from a database in both ways: full and incremental. Today is the time for a short summary. Setting dataSource Recall the line with our setup:]]></description>
					<content:encoded><![CDATA[<p>In previous episodes&nbsp; (<a title="Data Import Handler - How to import data from SQL databases (part 1)" href="http://solr.pl/en/2010/10/11/data-import-handler-%E2%80%93-how-to-import-data-from-sql-databases-part-1/">part 1</a> and <a title="Data Import Handler - How to import data from SQL databases (part 2)" href="http://solr.pl/en/2010/11/01/data-import-handler-%E2%80%93-how-to-import-data-from-sql-databases-part-2/">part 2</a>) we imported data from a database in both ways: full and incremental. Today it is time for a short summary.</p>
<p><span id="more-154"></span></p>
<h2>Setting dataSource</h2>
<p>Recall the line with our setup:
</p>
<pre class="brush:xml">&lt;dataSource
    driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost:5432/wikipedia"
    user="wikipedia"
    password="secret" /&gt;</pre>
<p>These are not all the attributes that can appear. For completeness, let&#8217;s mention them all:</p>
<ul>
<li><strong>name</strong> – the name of the source &#8211; you can define many different sources and refer to them with the &#8220;<em>dataSource</em>&#8221; attribute of the &#8220;<em>entity</em>&#8221; tag</li>
<li><strong>driver</strong> – JDBC driver class name</li>
<li><strong>url</strong> – JDBC database url</li>
<li><strong>user</strong> – database user name (if not defined, or empty, the connection to the database occurs without a pair user/password)</li>
<li><strong>password</strong> – user password</li>
<li><strong>jndiName</strong> – instead of giving elements: driver/url/user/password, you can specify the JNDI name under which the data source implementation (<em>javax.sql.DataSource</em>) is made available by the container (eg Jetty/Tomcat)</li>
</ul>
<p>Advanced arguments:</p>
<ul>
<li><strong>batchSize</strong> (default: 500) &#8211; sets the maximum number of records (or rather a suggestion for the driver) retrieved from the database in one query. Changing this parameter can help in situations where queries return too many results. It may also not help, since the implementation of this mechanism depends on the JDBC driver.</li>
<li><strong>convertType</strong> (default: false) &#8211; Applies an additional conversion from the field type returned by the database to the field type defined in the schema.xml. The default value seems to be safer, because it does not cause extra, magical conversion. However, in special cases (eg BLOB fields), that conversion is one of the ways of solving the problem.</li>
<li><strong>maxRows</strong> (default: 0 &#8211; no limit) &#8211; sets the maximum number of results returned by a query to the database.</li>
<li><strong>readOnly</strong> – sets the database connection to read-only mode. In principle this could mean that the driver will be able to perform additional optimizations. At the same time it changes the defaults (!): <em>transactionIsolation</em> to <em>TRANSACTION_READ_UNCOMMITTED</em>, <em>holdability</em> to <em>CLOSE_CURSORS_AT_COMMIT</em> and <em>autoCommit</em> to <em>true</em>.</li>
<li><strong>autoCommit</strong> – sets automatic commit of the transaction after each query.</li>
<li><strong>transactionIsolation </strong>(<em>TRANSACTION_READ_UNCOMMITTED</em>, <em>TRANSACTION_READ_COMMITTED</em>, <em>TRANSACTION_REPEATABLE_READ</em>, <em>TRANSACTION_SERIALIZABLE</em>, <em>TRANSACTION_NONE</em>) &#8211; sets the transaction isolation (ie, the visibility of the data changed within a transaction)</li>
<li><strong>holdability</strong> (<em>CLOSE_CURSORS_AT_COMMIT</em>, <em>HOLD_CURSORS_OVER_COMMIT</em>) &#8211; defines how the results (ResultSet) will be closed when the transaction is closed</li>
<li><strong>&#8230;</strong> &#8211; importantly, any other attribute may appear here. All of them will be forwarded by DIH to the JDBC driver, which allows you to trigger special behavior defined by a specific JDBC driver.</li>
<li><strong>type</strong> – the type of the source. The default value (JdbcDataSource) is sufficient here, so the attribute can be omitted (I will come back to it when describing non-SQL sources)</li>
</ul>
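<p>As a sketch, a source definition combining a few of the attributes above (the values are only examples) could look like this:</p>
<pre class="brush:xml">&lt;dataSource
    driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost:5432/wikipedia"
    user="wikipedia"
    password="secret"
    batchSize="1000"
    readOnly="true"
    maxRows="0" /&gt;</pre>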
<h2>The &#8220;entity&#8221; element</h2>
<p>Let us now turn to the description of the &#8220;entity&#8221; item.</p>
<p>As a reminder:
</p>
<pre class="brush:xml">&lt;entity name="page" query="SELECT page_id as id, page_title as name from page"&gt;
    &lt;entity name="revision" query="select rev_id from revision where rev_page=${page.id}"&gt;
        &lt;entity name="pagecontent" query="select old_text as text from pagecontent where old_id=${revision.rev_id}"&gt;
        &lt;/entity&gt;
    &lt;/entity&gt;
&lt;/entity&gt;</pre>
<p>And all the attributes:</p>
<p>Primary:</p>
<ul>
<li><strong>name</strong> – the name of the entity</li>
<li><strong>query</strong> – SQL query used to retrieve data associated with that entity.</li>
<li><strong>deltaQuery &#8211; </strong>query responsible for returning the IDs of those records that have changed since the last crawl (full or incremental) &#8211; the last crawl time is provided by DIH in the variable: ${dataimporter.last_index_time}. This query is used by Solr to find those records that have changed.</li>
<li><strong>parentDeltaQuery &#8211; </strong>query requesting data for the parent entity record. With these queries Solr is able to retrieve all the data that make up the document, regardless of the entity from which they originate. This is necessary because the indexing engine is not able to modify the indexed data &#8211; so we need to index the entire document, regardless of the fact that some data has not changed.</li>
<li><strong>deletedPkQuery &#8211; </strong>provides identifiers of deleted items.</li>
<li><strong>deltaImportQuery &#8211; </strong>query retrieving the data for a given record identified by an ID, which is available in the DIH variable: <em>${dataimport.delta.id}</em>.</li>
<li><strong>dataSource</strong> – the name of the source to use; useful when several sources are defined (see dataSource.name)</li>
</ul>
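<p>A sketch of an entity using these attributes (the last_modified column is an assumption &#8211; the wikipedia schema used in the previous parts does not necessarily have it):</p>
<pre class="brush:xml">&lt;entity name="page" pk="page_id"
    query="SELECT page_id as id, page_title as name from page"
    deltaQuery="select page_id from page where last_modified &gt; '${dataimporter.last_index_time}'"
    deltaImportQuery="SELECT page_id as id, page_title as name from page where page_id=${dataimport.delta.id}"&gt;
&lt;/entity&gt;</pre>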
<p>and advanced:</p>
<ul>
<li><strong>processor &#8211; </strong><em>SQLEntityProcessor</em> by default. A component whose function is to feed data from the data source to the further stages of the crawl. In the case of databases the default implementation is usually sufficient</li>
<li><strong>transformer </strong>&#8211; the data retrieved from the source can be further modified before being passed on for indexing. In particular, a transformer may return additional records, which makes it a very powerful tool</li>
<li><strong>rootEntity</strong> &#8211; true by default for an entity element directly below the document element. This marks the element that is treated as the root, that is, the one used to create new documents in the index</li>
<li><strong>threads </strong>&#8211;<em> </em>the number of threads used to process the entity</li>
<li><strong>onError</strong> (<em>abort</em>, <em>skip</em>, <em>continue</em>) – how to react to problems: stop the import (abort, the default behavior), ignore the document (skip) or ignore the error (continue)</li>
<li><strong>preImportDeleteQuery</strong> &#8211; used instead of &#8220;*:*&#8221; to delete data from the index before a full import (note: this is a query against the index, not against the database) &#8211; makes sense only in the main entity element</li>
<li><strong>postImportDeleteQuery</strong> &#8211; executed after a full import (like <em>preImportDeleteQuery</em>, a query against the index) &#8211; makes sense only in the main entity element</li>
<li><strong>pk </strong>&#8211; the primary key (in the database; not to be confused with the unique key of the document) &#8211; relevant only for incremental indexing, if we let DIH guess the <em>deltaImportQuery</em> based on the <em>query</em></li>
</ul>
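<p>And a sketch combining two of the advanced attributes (the delete query value is just an illustration):</p>
<pre class="brush:xml">&lt;entity name="page" onError="skip"
    preImportDeleteQuery="type:page"
    query="SELECT page_id as id, page_title as name from page"&gt;
&lt;/entity&gt;</pre>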
<p>In the text above the word &#8220;guess&#8221; appeared. DIH tries to streamline the work by adopting reasonable defaults. For example, as mentioned above, during an incremental import it can try to determine the <em>deltaImportQuery</em> on its own. Actually, that was the only behavior in earlier versions, until it was realized that the generated queries do not always work. Hence I suggest caution and a limited level of trust <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>Another thing is the ability to skip explicit field definitions when the column names returned by the query match the names of the fields in schema.xml.&nbsp; (Hands up: who noticed that the example above is not a copy of the one from part two, but relies on exactly that mechanism?)</p>
<p>Yet another example of DIH&#8217;s flexibility: note that, having the variable
</p>
<pre>${dataimporter.last_index_time}</pre>
<p>available, we can write the full-import definition in such a way that, once an import has already been carried out, it will effectively be performed as an incremental import! I think this functionality &#8220;came about&#8221; a little by accident <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
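<p>A sketch of such a full-import query (again, the last_modified column is an assumption):</p>
<pre class="brush:xml">&lt;entity name="page"
    query="SELECT page_id as id, page_title as name from page
           where last_modified &gt; '${dataimporter.last_index_time}'"&gt;
&lt;/entity&gt;</pre>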
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr: data indexing for fun and profit</title>
		<link>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/</link>
					<comments>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 06 Sep 2010 12:10:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[acf]]></category>
		<category><![CDATA[cell]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lcf]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=73</guid>

					<description><![CDATA[Solr is not very friendly to novice users. Preparing good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of]]></description>
					<content:encoded><![CDATA[<p>Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains is to feed our data to the search server and to take care of keeping it up to date.</p>
<p><span id="more-73"></span></p>
<p>There are a few ways to import data:</p>
<ul>
<li>Update Handler</li>
<li>CSV Request Handler</li>
<li>Data Import Handler</li>
<li>Extracting Request Handler (Solr Cell)</li>
<li>Client libraries (for example Solrj)</li>
<li>Apache Connector Framework (formerly Lucene Connector Framework)</li>
<li>Apache Nutch</li>
</ul>
<p>In addition to the methods mentioned above, you can stream your data to the search server. As you can see, there is some confusion here, and it is not easy to tell at first glance which method is best in a particular case.</p>
<h2>Update Handler</h2>
<p>Perhaps the most popular method because of its simplicity. It requires preparing a corresponding XML file and sending it via HTTP to the Solr server. It supports boosting of documents and of individual fields.</p>
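<p>As a reminder, a minimal document in this format (the field names are just an example) looks like this:</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc boost="2.0"&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="title" boost="1.5"&gt;Example title&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre>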
<h2>CSV Request Handler</h2>
<p>When we have data in CSV format (Comma Separated Values) or TSV format (Tab Separated Values), this option may be the most convenient. Unfortunately, in contrast to the Update Handler, it is not possible to boost documents or fields.</p>
<h2>Data Import Handler</h2>
<p>This method is less common and requires additional, sometimes quite complicated, configuration, but it allows connecting directly to the data source. Using DIH we do not need any additional scripts to export data from a source to the format required by Solr. What we get out of the box is: integration with databases (based on JDBC), integration with sources available as XML (for example RSS), e-mail integration (via the IMAP protocol) and integration with documents that can be parsed by Apache Tika (like OpenOffice documents, Microsoft Word, RTF, HTML and many, many more). In addition it is possible to develop your own sources and transformations.</p>
<h2>Extracting Request Handler (Solr Cell)</h2>
<p>A specialized handler for indexing the content of documents stored in files of different formats. The list of supported formats is quite extensive and the parsing is performed by Apache Tika. The drawbacks of this method are the need to build an additional solution that provides Solr with information about the document and its identifier, and the lack of support for additional metadata external to the document.</p>
<h2>Client Libraries</h2>
<p>Solr provides client libraries for many programming languages. Their capabilities differ, but if the data is generated by the application itself and the time after which the data must be available for searching is very short, this way of indexing is often the only available option.</p>
<h2>Apache Connector Framework</h2>
<p>ACF is a relatively new project, revealed to a wider audience in early 2010. It started as an internal project of the company MetaCarta, was donated to the open source community and is currently being developed within the Apache incubator. The idea is to build a system that allows connecting to data sources with the help of a series of plug-ins. At the moment there is no published version, but the system itself is already worth attention if you need to integrate with systems such as: FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Patriarch (Memex), Meridio (Autonomy), Windows shares (Microsoft) and SharePoint (Microsoft).</p>
<h2>Apache Nutch</h2>
<p>Nutch is, in fact, a separate Apache project (previously under Apache Lucene, now a top-level project). For a Solr user, Nutch is interesting because it can crawl Web pages and have them indexed by Solr.</p>
<h2>Word about streaming</h2>
<p>Streaming means the ability to tell Solr where to fetch the data to be indexed. This avoids unnecessary data transmission over the network when the data resides on the same server as the indexer, and avoids double transmission (from the source to the importer and from the importer to Solr).</p>
<h2>And a word about security</h2>
<p>Solr, by design, is intended to be used in an architecture that assumes a safe environment. It is very important to control who is able to query Solr and how. While the returned data can easily be restricted by forcing the use of filters in the definition of the handler, in the case of indexing it is not so easy. In particular, the most dangerous one seems to be Solr Cell &#8211; it will not only read any file Solr has access to (e.g. files with passwords), but will also provide a convenient method of searching in those files <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Other options</h2>
<p>I tried to mention all the methods that do not require additional work to make indexing possible. The problem may lie in the definition of that additional work, because sometimes it is easier to write an additional plug-in than to fight through numerous configuration options and create a giant XML file. Therefore the choice of methods was guided by my own judgment, which resulted in skipping some methods (like fetching data from WWW pages with the use of Apache Droids or Heritrix, or solutions based on Open Pipeline or Open Pipe).</p>
<p>Certainly in this short article I have missed some interesting methods. If so, please comment &#8211; I&#8217;ll be glad to update this entry <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
