Data Import Handler – removing data from index

On the Solr wiki, deleting data from an index during DIH incremental indexing is treated only in passing, as something that works much like updating records. In a previous article I took the same shortcut, all the more so because my example indexed Wikipedia data, which never needed deletions.

Having sample data about albums and performers at hand, I decided to show my way of dealing with such cases. For simplicity and clarity, I assume that after the first import the data set can only shrink.

Test data

My test data are stored in a PostgreSQL table defined as follows:

Table "public.albums"
Column |  Type   |                      Modifiers
id     | integer | not null default nextval('albums_id_seq'::regclass)
name   | text    | not null
author | text    | not null
"albums_pk" PRIMARY KEY, btree (id)

The table has 825,661 records.
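
For reference, here is a sketch of an equivalent CREATE TABLE statement reconstructed from the table definition above (the original DDL may have differed in details):

-- Reconstructed from the table definition shown above; the original DDL may differ.
CREATE TABLE albums (
    id     serial NOT NULL,   -- creates and uses albums_id_seq
    name   text   NOT NULL,
    author text   NOT NULL,
    CONSTRAINT albums_pk PRIMARY KEY (id)
);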

Test installation

For testing purposes I used a Solr instance with the following configuration:

Field definitions in schema.xml:

 <field name="id" type="string" indexed="true" stored="true" required="true" />
 <field name="album" type="text" indexed="true" stored="true" multiValued="true"/>
 <field name="author" type="text" indexed="true" stored="true" multiValued="true"/>

Definition of DIH in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
  <str name="config">db-data-config.xml</str>
 </lst>
</requestHandler>

And the DIH configuration file, db-data-config.xml:

<dataConfig>
 <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" />
 <document>
  <entity name="album" query="SELECT * from albums">
   <field column="id" name="id" />
   <field column="name" name="album" />
   <field column="author" name="author" />
  </entity>
 </document>
</dataConfig>

Before the test, I imported all the data from the albums table using a full import (command=full-import).

Deleting Data

A look at the table shows that when we remove a record, it disappears without leaving a trace; the only way to update our index would then be to compare the document identifiers in the index with the identifiers in the database and delete those that no longer exist in the database. Slow and cumbersome. Another way is to add a deleted_at column: instead of physically deleting a record, we only store the deletion time in this column. DIH can then retrieve all records with a deletion date later than the last crawl. The disadvantage of this solution is that the application may need to be modified to take this information into account.
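
As a rough illustration (the deleted_at column and the UPDATE below are hypothetical, not part of my test schema), the soft-delete variant could look like this:

-- Hypothetical soft-delete variant: mark rows instead of removing them.
ALTER TABLE albums ADD COLUMN deleted_at timestamp without time zone;

-- The application replaces DELETE with an UPDATE that records the deletion time:
UPDATE albums SET deleted_at = now() WHERE id = 36;

-- DIH could then fetch identifiers of rows deleted since the last crawl:
-- SELECT id FROM albums WHERE deleted_at > '${dataimporter.last_index_time}'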

I use a different solution, one that is transparent to the application. Let’s create a new table:

CREATE TABLE deletes (
  id         serial NOT NULL,
  deleted_id bigint,
  deleted_at timestamp without time zone NOT NULL
);

Thanks to a trigger we are about to add, this table will automagically collect the identifiers of records removed from the albums table, along with information about when they were removed.

Now we add the function:

CREATE OR REPLACE FUNCTION insert_after_delete()
RETURNS trigger AS $$
BEGIN
  IF tg_op = 'DELETE' THEN
    INSERT INTO deletes(deleted_id, deleted_at)
    VALUES (old.id, now());
  END IF;
  RETURN old;
END;
$$ LANGUAGE plpgsql;

and a trigger:

CREATE TRIGGER deleted_trg
AFTER DELETE ON albums
FOR EACH ROW
EXECUTE PROCEDURE insert_after_delete();

How it works

Each record deleted from the albums table should now result in a new row in the deletes table. Let’s check it out. Remove a few records:

=> DELETE FROM albums where id < 37;
=> SELECT * from deletes;
 id | deleted_id |         deleted_at
----+------------+----------------------------
 26 |         35 | 2010-12-23 13:53:18.034612
 27 |         36 | 2010-12-23 13:53:18.034612
(2 rows)

So the database part works.

We extend the DIH configuration file so that the entity is defined as follows:

<entity name="album" query="SELECT * from albums"
  deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at > '${dataimporter.last_index_time}'">

During an incremental import, DIH uses the deletedPkQuery attribute to fetch the identifiers of the documents that should be removed from the index.
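
To make the mechanics concrete: before sending the query to the database, DIH substitutes ${dataimporter.last_index_time} with the timestamp of the previous import, so the database effectively receives something like this (the timestamp below is only an illustrative stand-in):

-- Effective query after variable substitution; the timestamp stands in
-- for the real last_index_time recorded by DIH.
SELECT deleted_id AS id
FROM deletes
WHERE deleted_at > '2010-12-23 13:00:00';
-- Every id returned here is removed from the Solr index.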

A clever reader will probably begin to wonder whether we really need the column with the deletion date. We could simply delete all records found in the deletes table and then empty that table. Theoretically this is true, but keeping the dates means that in the event of a problem with the Solr indexing server we can easily replace it with another one; its degree of synchronization with the database hardly matters, because subsequent incremental imports will bring it back in sync. If we emptied the deletes table, that possibility would not exist.
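
The deletes table will of course keep growing. A minimal housekeeping sketch (assuming we prune only entries older than the last successful import, whose timestamp DIH keeps in the dataimport.properties file) might be:

-- Hypothetical periodic cleanup: remove entries DIH has already consumed.
-- The timestamp stands in for last_index_time from dataimport.properties.
DELETE FROM deletes WHERE deleted_at < '2010-12-23 13:00:00';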

We can now run the incremental import by calling the following address: /solr/dataimport?command=delta-import

In the logs you should see a line similar to this:

INFO: {delete=[35, 36],optimize=} 0 2

which means that DIH properly removed from the index the documents that had previously been deleted from the database.



6 Responses to “Data Import Handler – removing data from index”

  1. Ksenia Khailenko Says:


    First of all, thanks for the nice article, it is perfectly understandable and clear, really helpful.

    The possibility of deleting data from the index provided by “deletedPkQuery” makes working with Solr much more flexible.
    I just wonder whether it is possible to use this attribute for subentities?

    For the top-level entities it works great, but for a subentity it doesn’t delete data, and I see the following in the log:

    INFO: Completed DeletedRowKey for Entity: subentity rows obtained : 2
    Mar 18, 2011 3:40:21 PM org.apache.solr.handler.dataimport.SqlEntityProcessor nextModifiedParentRowKey
    INFO: Running parentDeltaQuery for Entity: subentity
    Mar 18, 2011 3:40:21 PM org.apache.solr.handler.dataimport.SqlEntityProcessor nextModifiedParentRowKey
    INFO: Running parentDeltaQuery for Entity: subentity

    Could you be so kind to tell me what am I doing wrong?

    Thanks in Advance, Ksenia

  2. negativ Says:

    First of all, sorry for the delay.

    Note that Solr cannot update just a part of a document, such as removing the information that came from a subentity.
    Therefore, this functionality has rather limited usefulness. Maybe I do not understand your case; could you give more details about the database schema?

  3. Karron Says:

    Thank you for your post. I have a question: once you have performed the deletion, how do you delete the data from the deletes table? The data in the deletes table may grow rapidly.

  4. negativ Says:

    This is a good question. I prefer to periodically clean the table based on the “deleted_at” column and the “last_index_time” value that DIH stores in the dataimport.properties file.

    Another option is to define an event listener in DIH.

  5. Prerna Says:


    Like you said, the “only way to update our index would be to compare the document identifiers in the index to the identifiers in the database and delete those that no longer exist in the database”.

    But shouldn’t Solr be smart enough to compare the IDs, find out that a document no longer exists, and hence delete the corresponding document? (And maybe the same for new entries added to the DB.)

    Is there any facility like that in Solr?


  6. Floux Says:


    Thanks for this article!! Everything works fine for me.

    But I’ve got a little question.

    My DB table ‘album’ receives new data at the same time that other rows are deleted.

    How can I refresh my Solr index (add/delete) with the same DIH?