<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>integration &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/integration/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 08:05:56 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Data Import Handler – removing data from index</title>
		<link>https://solr.pl/en/2011/01/03/data-import-handler-removing-data-from-index/</link>
					<comments>https://solr.pl/en/2011/01/03/data-import-handler-removing-data-from-index/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 03 Jan 2011 08:05:13 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[data import handler]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[dih]]></category>
		<category><![CDATA[integration]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=186</guid>

					<description><![CDATA[On the Solr wiki, deleting data from an index during DIH incremental indexing is treated only in passing, as something that works much like updating records. I used the same shortcut in a previous article, all the more so because the example there]]></description>
										<content:encoded><![CDATA[<p>On the Solr wiki, deleting data from an index during DIH incremental indexing is treated only in passing, as something that works much like updating records. I used the same shortcut in a previous article, all the more so because the example there &#8211; indexing wikipedia data &#8211; did not need to delete anything.</p>
<p>Having at hand sample data about albums and performers, I decided to show my way of dealing with such cases. For simplicity and clarity, I assume that after the first import the data can only shrink.</p>
<p><span id="more-186"></span></p>
<h2>Test data</h2>
<p>My test data are located in a PostgreSQL database, in a table defined as follows:
</p>
<pre>Table "public.albums"
Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------
id     | integer | not null default nextval('albums_id_seq'::regclass)
name   | text    | not null
author | text    | not null
Indexes:
"albums_pk" PRIMARY KEY, btree (id)</pre>
<p>The table has 825,661 records.</p>
<h2>Test installation</h2>
<p>For testing purposes I used a Solr instance with the following characteristics:</p>
<p>Definition in schema.xml:
</p>
<pre class="brush:xml">&lt;fields&gt;
 &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
 &lt;field name="album" type="text" indexed="true" stored="true" multiValued="true"/&gt;
 &lt;field name="author" type="text" indexed="true" stored="true" multiValued="true"/&gt;
&lt;/fields&gt;
&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;
&lt;defaultSearchField&gt;album&lt;/defaultSearchField&gt;</pre>
<p>Definition of DIH in solrconfig.xml:
</p>
<pre class="brush:xml">&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>And the file DIH db-data-config.xml:
</p>
<pre class="brush:xml">&lt;dataConfig&gt;
 &lt;dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" /&gt;
 &lt;document&gt;
  &lt;entity name="album" query="SELECT * from albums"&gt;
   &lt;field column="id" name="id" /&gt;
   &lt;field column="name" name="album" /&gt;
   &lt;field column="author" name="author" /&gt;
  &lt;/entity&gt;
 &lt;/document&gt;
&lt;/dataConfig&gt;</pre>
<p>Before our test, I imported all the data from the table <em>albums</em>.</p>
<h2>Deleting Data</h2>
<p>Looking at the table, we see that when a record is removed it is deleted without leaving a trace, and the only way to update our index would be to compare the document identifiers in the index with the identifiers in the database and delete those that no longer exist in the database. Slow and cumbersome. Another way is to add a <em>deleted_at</em> column: instead of physically deleting a record, we only store the deletion time in this column. DIH can then retrieve all records whose deletion date is later than the last crawl (see the sketch below). The disadvantage of this solution is that the application may have to be modified to take this information into account.</p>
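<p>A minimal sketch of that soft-delete variant, assuming we are free to alter the <em>albums</em> table (the column name and the queries are illustrative only):</p>
<pre class="brush:sql">-- mark the record as deleted instead of removing it
ALTER TABLE albums ADD COLUMN deleted_at timestamp without time zone;
UPDATE albums SET deleted_at = now() WHERE id = 35;
-- a deletedPkQuery could then select the soft-deleted identifiers:
-- SELECT id FROM albums WHERE deleted_at &gt; '${dataimporter.last_index_time}'</pre>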
<p>I apply a different solution, transparent to applications. Let&#8217;s create a new table:
</p>
<pre class="brush:sql">CREATE TABLE deletes
(
id serial NOT NULL,
deleted_id bigint,
deleted_at timestamp without time zone NOT NULL,
CONSTRAINT deletes_pk PRIMARY KEY (id)
);</pre>
<p>This table will automatically collect the identifiers of the records removed from the <em>albums</em> table, together with the time of their removal.</p>
<p>Now we add a function:
</p>
<pre class="brush:sql">CREATE OR REPLACE FUNCTION insert_after_delete()
RETURNS trigger AS
$BODY$BEGIN
IF tg_op = 'DELETE' THEN
INSERT INTO deletes(deleted_id, deleted_at)
VALUES (old.id, now());
RETURN old;
END IF;
END$BODY$
LANGUAGE plpgsql VOLATILE;</pre>
<p>and a trigger:
</p>
<pre class="brush:sql">CREATE TRIGGER deleted_trg
BEFORE DELETE
ON albums
FOR EACH ROW
EXECUTE PROCEDURE insert_after_delete();</pre>
<h2>How it works</h2>
<p>Each record deleted from the <em>albums</em> table should result in an entry being added to the <em>deletes</em> table. Let&#8217;s check it out. We remove a few records:
</p>
<pre class="brush:sql">=&gt; DELETE FROM albums where id &lt; 37;
DELETE 2
=&gt; SELECT * from deletes;
id | deleted_id |         deleted_at
----+------------+----------------------------
26 |         35 | 2010-12-23 13:53:18.034612
27 |         36 | 2010-12-23 13:53:18.034612
(2 rows)</pre>
<p>So the database part works.</p>
<p>We extend the DIH configuration file so that the <em>entity</em> is defined as follows:
</p>
<pre class="brush:xml">&lt;entity name="album" query="SELECT * from albums"
  deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at &gt; '
<p>This allows the import DIH incremental import to use the <em>deletedPkQuery </em>attribute to get the identifiers of the documents which should be removed.</p>
<p>A clever reader will probably begin to wonder, are you sure we need the column with the date of deletion. We could delete all records that are found in the table <em>deletes </em>and then delete the contents of this table. Theoretically this is true, but in the event of a problem with the Solr indexing server we can easily replace it with another - the degree of synchronization with the database is not very important - just the next incremental imports will sync with the database. If we would delete the contents of the <em>deletes </em>table such possibility does not exist.</p>
<p>We can now do the incremental import by calling the following address:&nbsp; <em>/solr/dataimport?command=delta-import</em><br>
In the logs you should see a line similar to this:<br>
<em>INFO: {delete=[35, 36],optimize=} 0 2</em><br>
Which means that DIH properly removed from the index the documents, which were previously removed from the database.</p>
{dataimporter.last_index_time}'"&gt;</pre>
<p>During an incremental import, DIH uses the <em>deletedPkQuery</em> attribute to fetch the identifiers of the documents that should be removed from the index.</p>
<p>An astute reader will probably begin to wonder whether we really need the column with the deletion date. We could simply delete all records found in the <em>deletes</em> table and then empty that table. Theoretically this is true, but in the event of a problem with the Solr indexing server we can easily replace it with another one &#8211; the degree of synchronization with the database is not very important, because the next incremental imports will re-sync with the database. If we emptied the <em>deletes</em> table, that possibility would not exist.</p>
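<p>If the ever-growing <em>deletes</em> table becomes a concern, a compromise is to prune only entries that are comfortably older than the last successful import. A sketch, with a purely illustrative 30-day margin:</p>
<pre class="brush:sql">-- prune delete markers that every Solr instance has long since consumed
DELETE FROM deletes WHERE deleted_at &lt; now() - interval '30 days';</pre>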
<p>We can now run the incremental import by calling the following address:&nbsp; <em>/solr/dataimport?command=delta-import</em><br />
In the logs you should see a line similar to this:<br />
<em>INFO: {delete=[35, 36],optimize=} 0 2</em><br />
This means that DIH correctly removed from the index the documents that had previously been removed from the database.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/03/data-import-handler-removing-data-from-index/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Data Import Handler – How to import data from SQL databases (part 3)</title>
		<link>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/</link>
					<comments>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 22 Nov 2010 22:38:15 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[data import handler]]></category>
		<category><![CDATA[dih]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[integration]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=154</guid>

					<description><![CDATA[In previous episodes&#160; (part 1 and part 2) we were able to import data from a database in both ways: full and incremental. Today it is time for a short summary. Setting dataSource Recall the line with our setup:]]></description>
										<content:encoded><![CDATA[<p>In previous episodes&nbsp; (<a title="Data Import Handler - How to import data from SQL databases (part 1)" href="http://solr.pl/en/2010/10/11/data-import-handler-%E2%80%93-how-to-import-data-from-sql-databases-part-1/">part 1</a> and <a title="Data Import Handler - How to import data from SQL databases (part 2)" href="http://solr.pl/en/2010/11/01/data-import-handler-%E2%80%93-how-to-import-data-from-sql-databases-part-2/">part 2</a>) we were able to import data from a database in both ways: full and incremental. Today it is time for a short summary.</p>
<p><span id="more-154"></span></p>
<h2>Setting dataSource</h2>
<p>Recall the line with our setup:
</p>
<pre class="brush:xml">&lt;dataSource
    driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost:5432/wikipedia"
    user="wikipedia"
    password="secret" /&gt;</pre>
<p>These are not all the attributes that can appear. For completeness, let&#8217;s mention them all:</p>
<ul>
<li><strong>name</strong> – the name of the source &#8211; you can define many different sources and refer to them with the <em>dataSource</em> attribute of the <em>entity</em> tag (see the sketch after this list)</li>
<li><strong>driver</strong> – JDBC driver class name</li>
<li><strong>url</strong> – JDBC database url</li>
<li><strong>user</strong> – database user name (if not defined or empty, the connection to the database is made without a user/password pair)</li>
<li><strong>password</strong> – user password</li>
<li><strong>jndiName</strong> – instead of providing driver/url/user/password, you can specify the JNDI name under which a data source implementation (<em>javax.sql.DataSource</em>) is made available by the container (e.g. Jetty/Tomcat)</li>
</ul>
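<p>A minimal sketch of how those attributes fit together; the source name <em>db1</em> and the JNDI path are made up for this example:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
 &lt;!-- a named source, obtained via JNDI instead of driver/url/user/password --&gt;
 &lt;dataSource name="db1" jndiName="java:comp/env/jdbc/wikipedia" /&gt;
 &lt;document&gt;
  &lt;!-- the entity points at the source by its name --&gt;
  &lt;entity name="page" dataSource="db1" query="SELECT page_id as id, page_title as name from page" /&gt;
 &lt;/document&gt;
&lt;/dataConfig&gt;</pre>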
<p>Advanced attributes:</p>
<ul>
<li><strong>batchSize</strong> (default: 500) &#8211; sets the maximum number of records (or rather a suggestion for the driver) retrieved from the database in one batch. Changing this parameter can help in situations where queries return too many results. It may also not help, since the implementation of this mechanism depends on the JDBC driver. A sketch using this attribute follows the list.</li>
<li><strong>convertType</strong> (default: false) &#8211; applies an additional conversion from the field type returned by the database to the field type defined in schema.xml. The default value seems safer, because it does not cause extra, magical conversions. However, in special cases (e.g. BLOB fields) that conversion is one way of solving the problem.</li>
<li><strong>maxRows</strong> (default: 0 &#8211; no limit) &#8211; sets the maximum number of results returned by a query to the database.</li>
<li><strong>readOnly</strong> – sets the database connection to read-only mode. In principle, this could mean that the driver will be able to perform additional optimizations. At the same time it means (!) that by default <em>transactionIsolation</em> is set to <em>TRANSACTION_READ_UNCOMMITTED</em>, <em>holdability</em> to <em>CLOSE_CURSORS_AT_COMMIT</em>, and <em>autoCommit</em> to <em>true</em>.</li>
<li><strong>autoCommit</strong> – makes the connection automatically commit after each query.</li>
<li><strong>transactionIsolation </strong>(<em>TRANSACTION_READ_UNCOMMITTED</em>, <em>TRANSACTION_READ_COMMITTED</em>, <em>TRANSACTION_REPEATABLE_READ</em>, <em>TRANSACTION_SERIALIZABLE</em>, <em>TRANSACTION_NONE</em>) &#8211; sets the transaction isolation level (i.e. the visibility of data changed within a transaction)</li>
<li><strong>holdability</strong> (<em>CLOSE_CURSORS_AT_COMMIT</em>, <em>HOLD_CURSORS_OVER_COMMIT</em>) &#8211; defines whether result sets (ResultSet) are closed when the transaction is committed</li>
<li><strong>&#8230;</strong> &#8211; importantly, any other attributes may also appear. All of them will be forwarded by DIH to the JDBC driver, which allows you to use behavior specific to a given JDBC driver.</li>
<li><strong>type</strong> – the type of the source. The default value (JdbcDataSource) is sufficient here, so the attribute can be omitted (I will come back to it when describing non-SQL data sources)</li>
</ul>
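<p>A sketch of a source definition using some of the advanced attributes; the values are illustrative, not recommendations:</p>
<pre class="brush:xml">&lt;dataSource
    driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost:5432/wikipedia"
    user="wikipedia"
    password="secret"
    batchSize="1000"
    readOnly="true" /&gt;</pre>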
<h2>The &#8220;entity&#8221; element</h2>
<p>Let us now turn to the description of the &#8220;entity&#8221; element.</p>
<p>As a reminder:
</p>
<pre class="brush:xml">&lt;entity name="page" query="SELECT page_id as id, page_title as name from page"&gt;
    &lt;entity name="revision" query="select rev_id from revision where rev_page=${page.id}"&gt;
        &lt;entity name="pagecontent" query="select old_text as text from pagecontent where old_id=${revision.rev_id}"&gt;
        &lt;/entity&gt;
    &lt;/entity&gt;
&lt;/entity&gt;</pre>
<p>And all the attributes:</p>
<p>Primary:</p>
<ul>
<li><strong>name</strong> – the name of the entity</li>
<li><strong>query</strong> – SQL query used to retrieve data associated with that entity.</li>
<li><strong>deltaQuery &#8211; </strong>a query responsible for returning the IDs of the records that have changed since the last crawl (full or incremental) &#8211; the last crawl time is provided by DIH in the variable ${dataimporter.last_index_time}. This query is used by Solr to find the records that have changed (see the sketch after this list).</li>
<li><strong>parentDeltaQuery &#8211; </strong>a query returning the parent entity records affected by a change. With these queries Solr is able to retrieve all the data that makes up a document, regardless of the entity it comes from. This is necessary because the indexing engine is not able to modify indexed data in place &#8211; we need to re-index the entire document, even though some of its data has not changed.</li>
<li><strong>deletedPkQuery &#8211; </strong>provides the identifiers of deleted items.</li>
<li><strong>deltaImportQuery &#8211; </strong>a query requesting the data of a given record, identified by the ID available in the DIH variable <em>${dataimporter.delta.id}</em>.</li>
<li><strong>dataSource</strong> – the name of the source to use; needed when several sources are defined (see the <em>name</em> attribute of <em>dataSource</em>)</li>
</ul>
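<p>A sketch of how these attributes fit together for the wikipedia <em>page</em> entity; the <em>page_touched</em> modification-time column is an assumption made only for this example:</p>
<pre class="brush:xml">&lt;entity name="page"
    query="SELECT page_id as id, page_title as name from page"
    deltaQuery="SELECT page_id as id from page
                WHERE page_touched &gt; '${dataimporter.last_index_time}'"
    deltaImportQuery="SELECT page_id as id, page_title as name from page
                      WHERE page_id = ${dataimporter.delta.id}"&gt;
&lt;/entity&gt;</pre>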
<p>and advanced:</p>
<ul>
<li><strong>processor &#8211; </strong><em>SqlEntityProcessor</em> by default. The component responsible for fetching consecutive records from the data source for indexing. In the case of databases the default implementation is usually sufficient</li>
<li><strong>transformer </strong>&#8211; the data retrieved from the source can be further modified before being passed on for indexing. In particular, a transformer may return additional records, which makes it a very powerful tool</li>
<li><strong>rootEntity</strong> &#8211; true by default for an entity directly below the document element. It marks the entity treated as the root, i.e. the one whose records are used to create new documents in the index</li>
<li><strong>threads </strong>&#8211;<em> </em>the number of threads used to process the entity</li>
<li><strong>onError</strong> (<em>abort</em>, <em>skip</em>, <em>continue</em>) – the way to respond to errors: stop the import (abort, the default behavior), skip the document (skip), or ignore the error (continue)</li>
<li><strong>preImportDeleteQuery</strong> &#8211; used instead of &#8220;*:*&#8221; to delete data from the index before a full import (note: this is a query against the index, not the database) &#8211; makes sense only in the root entity element; a sketch follows the list</li>
<li><strong>postImportDeleteQuery</strong> &#8211; executed after a full import (like <em>preImportDeleteQuery</em>, a query against the index) &#8211; makes sense only in the root entity element</li>
<li><strong>pk </strong>&#8211; the primary key (in the database; not to be confused with the unique key of the document) &#8211; relevant only in incremental indexing, if we let DIH guess <em>deltaImportQuery</em> based on <em>query</em></li>
</ul>
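<p>A short, hypothetical sketch of some of the advanced attributes in use; the <em>type:page</em> delete query assumes a <em>type</em> field that our schema does not actually have:</p>
<pre class="brush:xml">&lt;entity name="page" rootEntity="true" onError="skip"
    preImportDeleteQuery="type:page"
    query="SELECT page_id as id, page_title as name from page"&gt;
&lt;/entity&gt;</pre>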
<p>In the text above the word &#8220;guess&#8221; appeared. DIH tries to streamline the work by adopting reasonable defaults. For example, as mentioned above, during an incremental import it can try to determine <em>deltaImportQuery</em> on its own. In fact, in earlier versions this was the only behavior; it was then realized that the generated queries do not always work. Hence, I suggest caution and a limited principle of trust <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>Another thing is the ability to omit the field definitions in a situation where the column names returned by the query correspond to the names of the fields in schema.xml.&nbsp; (Hands up: who noticed that the example above is not a copy of the one from the second part, but uses exactly that mechanism?)</p>
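<p>In other words, the aliasing in the query above replaces explicit field mappings. A sketch of the equivalent, more verbose form:</p>
<pre class="brush:xml">&lt;entity name="page" query="SELECT page_id, page_title from page"&gt;
    &lt;!-- explicit mappings, unnecessary when the query aliases the columns --&gt;
    &lt;field column="page_id" name="id" /&gt;
    &lt;field column="page_title" name="name" /&gt;
&lt;/entity&gt;</pre>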
<p>Yet another example of how flexible DIH is: note that having at our disposal the variable:
</p>
<pre>${dataimporter.last_index_time}</pre>
<p>we can write the full-import definition in such a way that once the first import has been carried out, subsequent runs behave like incremental imports! I think this functionality &#8220;came about&#8221; a little by accident <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
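<p>A sketch of that trick, again assuming a hypothetical <em>page_touched</em> modification-time column: provided the variable resolves to a sufficiently old date on the very first run, the first full import picks up everything, and every following full import only picks up newer rows.</p>
<pre class="brush:xml">&lt;entity name="page"
    query="SELECT page_id as id, page_title as name from page
           WHERE page_touched &gt; '${dataimporter.last_index_time}'"&gt;
&lt;/entity&gt;</pre>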
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/11/22/data-import-handler-how-to-import-data-from-sql-databases-part-3/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
