Automatically Generate Document Identifiers – Solr 4.x

A few days ago I got a question about automatically generating document identifiers in Solr 4.0, because the method used in Solr 3 was deprecated. So we decided to write a quick post on how to have Solr generate a document's unique identifier in Solr 4.x.

Data structure

Our simple data structure (fields section of the schema.xml file) looks as follows:

<fields>
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
 <field name="name" type="text_general" indexed="true" stored="true"/>
 <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

We also tell Solr which field should contain the unique identifier. This is done in the schema.xml file as well:

<uniqueKey>id</uniqueKey>

Solr configuration

In addition to changes in the schema.xml file, we need to modify the solrconfig.xml file and introduce a proper UpdateRequestProcessorChain, like the following one:

<updateRequestProcessorChain>
 <processor class="solr.UUIDUpdateProcessorFactory">
  <str name="fieldName">id</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

By doing this we inform Solr that we want the contents of the id field to be generated automatically when a document is sent without one.
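To make the behaviour concrete, here is a minimal Python sketch of the idea behind the processor: give a document a random UUID only when it has no id yet. This is a conceptual illustration, not Solr's code; the function name assign_uuid_if_missing is hypothetical.

```python
import uuid


def assign_uuid_if_missing(doc, field_name="id"):
    """Return a copy of doc with field_name set to a random UUID when absent."""
    result = dict(doc)
    if field_name not in result:
        # uuid4() yields a random (version 4) UUID in the usual 8-4-4-4-12 text form
        result[field_name] = str(uuid.uuid4())
    return result


# Two identical input documents end up with different identifiers.
first = assign_uuid_if_missing({"name": "Test"})
second = assign_uuid_if_missing({"name": "Test"})
print(first["id"] != second["id"])
```

A document that already carries an id is returned with that id unchanged.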

A simple test

Let’s test what we did. In order to do that we will index a simple document by using the following command:

curl -XPOST 'localhost:8983/solr/update?commit=true' --data-binary '<add><doc><field name="name">Test</field></doc></add>' -H 'Content-type:application/xml'
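If you would rather script the request, the following Python sketch builds the same XML message and posts it using only the standard library. The URL mirrors the curl command above; the helper names build_add_xml and post_document are assumptions made for this illustration, and post_document needs a running Solr instance.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the same local Solr URL as in the curl command.
UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_document(fields, url=UPDATE_URL):
    """POST the document to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_add_xml(fields).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-type": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Note that no id field is supplied; Solr is expected to fill it in.
print(build_add_xml({"name": "Test"}))
```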

If everything went well, the above document was indexed. To check what happened, we will send a simple query and look at the results, using the following command:

curl -XGET 'localhost:8983/solr/select?q=*:*&indent=true'

The result returned by Solr for the above command is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
   <str name="indent">true</str>
   <str name="q">*:*</str>
  </lst>
 </lst>
 <result name="response" numFound="1" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
   <long name="_version_">1439726523307261952</long>
  </doc>
 </result>
</response>

As we can see, the unique identifier was automatically generated. Now, if we send the same indexing command once again:

curl -XPOST 'localhost:8983/solr/update?commit=true' --data-binary '<add><doc><field name="name">Test</field></doc></add>' -H 'Content-type:application/xml'

And run the same query again:

curl -XGET 'localhost:8983/solr/select?q=*:*&indent=true'

We get two documents in the results, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="indent">true</str>
   <str name="q">*:*</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
   <long name="_version_">1439726523307261952</long>
  </doc>
  <doc>
   <str name="name">Test</str>
   <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
   <long name="_version_">1439726693819351040</long>
  </doc>
 </result>
</response>

As you can see, the two above documents have different unique identifiers, so the functionality works.
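The same check can be done programmatically. The sketch below, in Python, parses a trimmed copy of the response shown above and confirms that every returned id is distinct. The helper name extract_ids is an assumption for this example; in practice the XML would come from the select request rather than from a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two-document response (header and _version_ omitted).
SAMPLE_RESPONSE = """<response>
 <result name="response" numFound="2" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
  </doc>
  <doc>
   <str name="name">Test</str>
   <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
  </doc>
 </result>
</response>"""


def extract_ids(response_xml):
    """Return the id value of every document in a Solr XML response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall("./result/doc/str[@name='id']")]


ids = extract_ids(SAMPLE_RESPONSE)
# Number of ids found, followed by the number of distinct ids.
print(len(ids), len(set(ids)))  # 2 2
```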
