As Lucene and Solr 4.0 are slowly showing up on the horizon, I decided to take a look at another Solr feature that may be very useful for users: partial document update.
Solr Version
In order to test the partial document update functionality, I used the upcoming Apache Solr 4.0 alpha.
Assumptions
Let's assume that we need to update a single field in the index and we don't want to send the whole document. Let's say we need to update a product's price, which changes a few times a day. We don't want to index the whole document again and again, because it contains not only the product name but also binary files that are processed by Tika, which makes indexing take quite long. Is there something we can do? Yes, there is, but first let's look at the index structure.
Index Structure
The index structure is very simple: it contains the product identifier (id), its name (name), price (price) and description (description). So, the fields section of the schema.xml file would look like this:
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="price" type="float" indexed="true" stored="true" />
  <field name="description" type="text_general" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
</fields>
It's worth noticing two things. First, all the fields are marked as stored="true"; I'll explain why later. Second, there is the _version_ field, which Solr uses internally and which is required.
Index Contents
In order to test the new functionality, I indexed a single document. The q=*:* query returned the following:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">true</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">Test 1</str>
      <float name="price">479.95</float>
      <str name="description">Description 1</str>
      <long name="_version_">1406418192301031424</long>
    </doc>
  </result>
</response>
Partial Update
In order to update the document's price field, we need to run the following command:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
The above command tells Solr to update the document with id equal to 1 and set its price field to 100. So, what does q=*:* return now? Let's take a look:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">true</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">Test 1</str>
      <float name="price">100.0</float>
      <str name="description">Description 1</str>
      <long name="_version_">1406418399028838400</long>
    </doc>
  </result>
</response>
As you can see, the price field was updated and the rest of the fields are still there with their original values. In addition, the _version_ field was also updated, so we got exactly what we wanted.
Only Field Updates?
Setting a field value is not the only supported operation: you can, for example, add values to multivalued fields. If you are interested in this functionality, I suggest testing the add command (in addition to the set command presented in the example above). As you can see, this new Solr functionality is not only about changing a single field value; it lets you do a bit more with your indexed documents.
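As a rough illustration of the add command, the sketch below builds a request body that appends a value to a multivalued field and sets the price in the same request. Note that the tags field is an assumption: it is not part of the schema shown earlier and would have to be declared there with multiValued="true". The atomic_update function is a hypothetical helper written for this example, not part of any Solr client library.

```python
import json


def atomic_update(doc_id, **modifiers):
    """Build the JSON body for a one-document partial update.

    Each keyword argument maps a field name to an (operation, value)
    pair, for example price=("set", 100).
    """
    doc = {"id": doc_id}
    for field, (operation, value) in modifiers.items():
        doc[field] = {operation: value}
    # Solr's JSON update handler accepts a list of documents.
    return json.dumps([doc])


# "tags" is a hypothetical multivalued field, used only for illustration.
payload = atomic_update("1", tags=("add", "sale"), price=("set", 100))
print(payload)
```

The printed JSON can be passed as the -d argument of the curl command shown earlier.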
Things Worth Remembering
Let's get back to the index structure. As I wrote before, all the fields of our document are marked as stored="true". We need that in order to update single fields of those documents. Solr reads the stored field values and then uses them to reconstruct the document, which is removed and indexed again. Of course, the newly indexed document will contain the changes we asked for.
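The reconstruction described above can be pictured with a small conceptual model. This is only a sketch of the idea, not Solr's actual code: rebuild_document is a hypothetical function, documents are plain dictionaries, and only the set and add operations are modeled. The _version_ field is left out because Solr assigns it itself.

```python
def rebuild_document(stored_fields, update):
    """Merge a partial update into a copy of the stored field values.

    The result stands for the full document that would replace the old one.
    """
    doc = dict(stored_fields)
    for field, change in update.items():
        if field == "id":
            continue  # the id only says which document to rebuild
        for operation, value in change.items():
            if operation == "set":
                doc[field] = value
            elif operation == "add":
                current = doc.get(field, [])
                if not isinstance(current, list):
                    current = [current]
                doc[field] = current + [value]
            else:
                raise ValueError("operation not modeled in this sketch: " + operation)
    return doc


stored = {"id": "1", "name": "Test 1", "price": 479.95, "description": "Description 1"}
update = {"id": "1", "price": {"set": 100.0}}

rebuilt = rebuild_document(stored, update)
print(rebuilt)

# If description were indexed but not stored, it would be absent from the
# stored view, so the rebuilt document would silently lose it.
stored_without_description = {k: v for k, v in stored.items() if k != "description"}
lossy = rebuild_document(stored_without_description, update)
print("description" in lossy)
```

The second call shows why every field has to be stored: anything that cannot be read back from the stored values cannot make it into the re-indexed document.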
To Sum Up
It's clear that partial update in Solr 4.0 doesn't partially update the Lucene index, but instead removes the document from the index and indexes an updated one. Of course, all of that is done on the Solr side, so we don't have to resend the fields that did not change. Remember, though, that this comes at the cost of a larger index, as you need to store all the fields for Solr to be able to use them. We gain simplicity and save time and network usage, but we pay for that with index size.