A few days ago I got a question about automatically generated document identifiers in Solr 4.0, because the method used in Solr 3 was deprecated. So we decided to write a quick post on how to have Solr generate unique document identifiers in Solr 4.x.
Data structure
Our simple data structure (fields section of the schema.xml file) looks as follows:
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
In addition to that, we’ve specified which field should contain the unique identifiers. This was also done in the schema.xml file:
<uniqueKey>id</uniqueKey>
Solr configuration
In addition to changes in the schema.xml file, we need to modify the solrconfig.xml file and introduce a proper UpdateRequestProcessorChain, like the following one:
<updateRequestProcessorChain>
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
By doing this we inform Solr that we want the id field contents to be automatically generated.
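To make the effect of this chain more concrete, here is a minimal conceptual sketch in Python. It is not Solr's code: the function name assign_uuid_if_missing and the use of a plain dictionary as a "document" are assumptions made purely for illustration. The idea is that a document arriving without a value in the configured field gets a freshly generated random UUID, while a document that already carries an identifier is left alone.

```python
import uuid


def assign_uuid_if_missing(doc, field_name="id"):
    """Return a copy of doc with a random UUID in field_name if it has no value.

    Hypothetical stand-in for the configured processor, for illustration only.
    """
    result = dict(doc)
    if not result.get(field_name):
        # uuid4() produces a random (version 4) UUID
        result[field_name] = str(uuid.uuid4())
    return result


if __name__ == "__main__":
    # No id supplied: one is generated
    print(assign_uuid_if_missing({"name": "Test"}))
    # id supplied by the client: it is kept unchanged
    print(assign_uuid_if_missing({"id": "my-own-id", "name": "Test"}))
```

Because the identifier is random, sending the same content twice yields two different identifiers, which is worth keeping in mind before relying on this for data that may be re-indexed.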
A simple test
Let’s test what we did. In order to do that we will index a simple document by using the following command:
curl -XPOST 'localhost:8983/solr/update?commit=true' --data-binary '<add><doc><field name="name">Test</field></doc></add>' -H 'Content-type:application/xml'
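For readers who prefer to send the document from a script, the same request can be sketched with Python's standard library. This is only an illustration: the helper name build_add_request is hypothetical, and the URL assumes Solr is listening on localhost:8983 as in the curl command above. The sketch builds the request without sending it, so the payload can be inspected first.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: Solr listens on localhost:8983, as in the curl command above.
UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_add_request(name):
    """Build (but do not send) an update request for a document without an id."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    field = ET.SubElement(doc, "field", {"name": "name"})
    field.text = name
    body = ET.tostring(add, encoding="unicode").encode("utf-8")
    return urllib.request.Request(
        UPDATE_URL,
        data=body,
        headers={"Content-type": "application/xml"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_add_request("Test")
    print(request.data.decode("utf-8"))
    # To actually index the document against a running Solr instance:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Note that the payload contains no id field at all; the identifier is expected to be filled in on the Solr side.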
If everything went well, the above document was indexed. To check what happened, we will send a simple query and look at the results, using the following command:
curl -XGET 'localhost:8983/solr/select?q=*:*&indent=true'
The result returned by Solr for the above command is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">true</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="name">Test</str>
      <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
      <long name="_version_">1439726523307261952</long>
    </doc>
  </result>
</response>
As we can see, the unique identifier was generated automatically. Now, if we send the same indexing command once again:
curl -XPOST 'localhost:8983/solr/update?commit=true' --data-binary '<add><doc><field name="name">Test</field></doc></add>' -H 'Content-type:application/xml'
And run the same query again:
curl -XGET 'localhost:8983/solr/select?q=*:*&indent=true'
We would get two documents in the results, just like the following:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">true</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="name">Test</str>
      <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
      <long name="_version_">1439726523307261952</long>
    </doc>
    <doc>
      <str name="name">Test</str>
      <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
      <long name="_version_">1439726693819351040</long>
    </doc>
  </result>
</response>
As you can see, the two above documents have different unique identifiers, so the functionality works.
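If you would rather check this from a script than by eye, the following Python sketch parses a trimmed copy of the response above, verifies that every id is a well-formed UUID and confirms that no identifier repeats. The helper name extract_ids is hypothetical, and the response is embedded as a string so the example runs without a Solr instance.

```python
import uuid
import xml.etree.ElementTree as ET

# Trimmed copy of the two-document response shown above.
RESPONSE = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="name">Test</str>
      <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
    </doc>
    <doc>
      <str name="name">Test</str>
      <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
    </doc>
  </result>
</response>"""


def extract_ids(response_xml):
    """Collect the text of every <str name="id"> element in the result documents."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall("./result/doc/str[@name='id']")]


ids = extract_ids(RESPONSE)
# uuid.UUID raises ValueError if a value is not a well-formed UUID
parsed = [uuid.UUID(value) for value in ids]
# Number of ids, number of distinct ids, and the UUID version of each
print(len(ids), len(set(ids)), [u.version for u in parsed])
# prints: 2 2 [4, 4]
```

Both identifiers are version 4 UUIDs, that is, randomly generated ones.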