Solr 4.0: Partial documents update

As Lucene and Solr 4.0 are slowly appearing on the horizon, I decided to take a look at another Solr feature that may be very useful for users: partial document update.

Solr Version

To test the partial document update functionality I used the upcoming Apache Solr 4.0 alpha.

Assumptions

Let’s assume that we need to update a single field in the index and we don’t want to send the whole document. Let’s say that we need to update the product price, which changes a few times a day. We don’t want to index the whole document again and again, because it contains not only the product name, but also binary files that are processed by Tika, and because of that indexing takes quite a long time. Is there something we can do? Yes, there is, but first let’s look at the index structure.

Index Structure

The index structure is very simple: it contains the product identifier (id), its name (name), price (price) and description (description). So, the fields section of the schema.xml file would look like this:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="price" type="float" indexed="true" stored="true" />
  <field name="description" type="text_general" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

It’s worth noticing two things. First of all, all the fields are marked as stored="true". Why such a setup? I’ll talk about it later. The second thing is the _version_ field, which is used internally by Solr and is required.
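Since the partial update depends on both of those things, it may be handy to check a schema for them before relying on the feature. The following is only a minimal sketch: it parses the fields section shown above with Python's standard library, and the helper names (`unstored_fields`, `has_version_field`) are made up for this illustration. It only looks at the explicit `stored` attribute and ignores any defaults inherited from the field type.

```python
import xml.etree.ElementTree as ET

# The fields section from the article; in practice this would come from schema.xml.
SCHEMA_FIELDS = """
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="price" type="float" indexed="true" stored="true" />
  <field name="description" type="text_general" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
"""


def unstored_fields(fields_xml):
    """Return the names of fields that are not explicitly declared stored="true"."""
    root = ET.fromstring(fields_xml)
    return [f.get("name") for f in root.iter("field") if f.get("stored") != "true"]


def has_version_field(fields_xml):
    """Return True if the schema declares the _version_ field."""
    root = ET.fromstring(fields_xml)
    return any(f.get("name") == "_version_" for f in root.iter("field"))
```

An empty list from `unstored_fields(SCHEMA_FIELDS)` together with `has_version_field(SCHEMA_FIELDS)` returning `True` means the schema matches the setup used in this post.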

Index Contents

To test the new functionality I indexed a single document. The q=*:* query returned the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">*:*</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">1</str>
    <str name="name">Test 1</str>
    <float name="price">479.95</float>
    <str name="description">Description 1</str>
    <long name="_version_">1406418192301031424</long>
  </doc>
</result>
</response>

Partial Update

In order to update the document’s price field we need to run the following command:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'

The above command tells Solr to update the document with id equal to 1 and set the price field to 100. So, what does q=*:* return now? Let’s take a look:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">*:*</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">1</str>
    <str name="name">Test 1</str>
    <float name="price">100.0</float>
    <str name="description">Description 1</str>
    <long name="_version_">1406418399028838400</long>
  </doc>
</result>
</response>

As you can see the price field was updated and the rest of the fields are still there with their original values. In addition to that, the _version_ field was also updated, so we got exactly what we wanted.
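If you would rather send the same request from a script than from curl, something along these lines should do. This is a sketch using only the Python standard library; the function name `build_partial_update` is invented here, and the URL simply mirrors the one from the curl command. The request is built but not sent, so the snippet can be run without a Solr instance.

```python
import json
import urllib.request


def build_partial_update(doc_id, field, value, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request that sets one field of one document."""
    body = json.dumps([{"id": doc_id, field: {"set": value}}]).encode("utf-8")
    return urllib.request.Request(
        base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_partial_update("1", "price", 100)

# To actually send it (requires a running Solr instance):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```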

Only Field Updates?

Partial updates are not limited to replacing a field value: you can, for example, add values to multivalued fields. If you are interested in this functionality I suggest testing it with the add command (in addition to the set command presented in the above example). As you can see, this new Solr functionality is not only about changing a single field value and lets you do a bit more with your indexed documents.
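To give an idea of what the add command looks like, here is a small sketch that builds the request body for both commands. The `tags` field is hypothetical (the example schema in this post has no multivalued field), and `atomic_update_body` is just a helper name chosen for this illustration.

```python
import json


def atomic_update_body(doc_id, field, command, value):
    """Return the JSON body for a partial update using the set or add command."""
    if command not in ("set", "add"):
        raise ValueError("only the set and add commands are covered here")
    return json.dumps([{"id": doc_id, field: {command: value}}])


# Replace the price, as in the curl example above.
set_body = atomic_update_body("1", "price", "set", 100)

# Append a value to a hypothetical multivalued field called "tags".
add_body = atomic_update_body("1", "tags", "add", "promo")
```

The resulting string is sent to the update handler exactly like the body of the curl command shown earlier.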

Things Worth Remembering

Let’s get back to the index structure. As I wrote before, all the fields of our document are marked as stored="true". We need that in order to update single fields of those documents. Solr reads the stored field values and then uses this data to reconstruct the document, which is removed and indexed again. Of course the newly indexed document will contain the changes we asked Solr for.
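The flow described above can be modelled in a few lines. This is a toy simulation of the idea, not Solr's actual code: `rebuild_document` is an invented name, documents are plain dictionaries, and every non-id field in the update is assumed to carry a set or add command.

```python
def rebuild_document(stored_fields, update):
    """Return the document that would be reindexed after a partial update.

    stored_fields: the values recoverable from the index (stored fields only).
    update: a partial update such as {"id": "1", "price": {"set": 100}}.
    """
    rebuilt = dict(stored_fields)  # start from what can be read back
    for field, change in update.items():
        if field == "id":
            continue  # the id only selects the document
        if "set" in change:
            rebuilt[field] = change["set"]
        elif "add" in change:
            current = rebuilt.get(field, [])
            if not isinstance(current, list):
                current = [current]
            rebuilt[field] = current + [change["add"]]
    return rebuilt


original = {"id": "1", "name": "Test 1", "price": 479.95, "description": "Description 1"}
rebuilt = rebuild_document(original, {"id": "1", "price": {"set": 100}})

# If description were indexed but not stored, it could not be read back,
# so it would be missing from the rebuilt document.
stored_only = {"id": "1", "name": "Test 1", "price": 479.95}
lossy = rebuild_document(stored_only, {"id": "1", "price": {"set": 100}})
```

In the first case `rebuilt` keeps the name and description and only the price changes. In the second case `lossy` has no description at all, which is exactly why every field needs to be stored.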

To Sum Up

It’s clear that the partial update in Solr 4.0 doesn’t partially update the Lucene index, but instead removes a document from the index and indexes an updated one. Of course all of that is done on the Solr side, so we don’t have to resend the whole document just to change a single field (or a few fields). You need to remember though that this comes at the cost of a larger index, as you need to store all the fields in order for Solr to be able to use them. We gain simplicity and save time and network usage, but we need to sacrifice index size for that.


This entry was posted on Monday, July 9th, 2012 at 07:58 and is filed under Indexing, Solr.

66 Responses to “Solr 4.0: Partial documents update”

  1. David Schönfeld Says:

    Hi,
    thank you for this post, it was very helpful.

  2. dhanesh Says:

    Hi,
    Thanks a lot. Very nice article. Can you suggest any php library that supports partial document update?

  3. gr0 Says:

    I’m glad you find that post useful. As for the PHP library I’m afraid I don’t know any :(

  4. dhanesh Says:

    Hi,
    I have around 50 fields in my schema; 20 of them are stored="true" and the rest are stored="false".
    For this partial update, you mentioned that the fields should be stored="true".
    Which means I have to change all my fields to stored="true", right?
    Will it affect the performance of the Solr?
    Thanks
    dhanesh

  5. gr0 Says:

    Your index will be bigger, that’s for sure, you shouldn’t see drastic performance degradation though.

  6. rocky Says:

    That’s a nice little article – was very useful! Do you have an example on how to do this in a csv style update instead of json? Is it possible at all?

  7. gr0 Says:

    I think only xml and json is possible, but I’m not 100% sure :(

  8. bnm Says:

    What happens if I defined a dynamic field, say AT*?
    In my document there are fields with AT_test, AT_test1.
    Now I want to add the field AT_test2 (it hasn´t been in the document so far). Is this possible?

  9. bnm Says:

    how do I have to execute the command for update?
    curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'

    where do I have to put this? I´m confused

  10. gr0 Says:

    The example command is a bash command that uses curl to send the request to Solr.

  11. gr0 Says:

    As for adding a new field, it should be possible, I just haven’t tested it.

  12. dm Says:

    curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'

    How do we post XML instead of JSON for the partial update?

  13. dm Says:

    Here is how I got the XML partial update working

    1
    100

  14. dm Says:

    curl ‘localhost:8983/solr/update?commit=true’ -H ‘Content-type:application/xml’ -d ‘1100‘
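The XML in the two comments above was lost, so the following is only a sketch of what an XML partial update body looks like in Solr's XML update format, where the changed field carries an `update` attribute. The helper name `xml_partial_update` is invented for this illustration, and the values mirror the surviving "1" and "100".

```python
import xml.etree.ElementTree as ET


def xml_partial_update(doc_id, field, value, command="set"):
    """Return an XML update body that changes a single field of one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = str(doc_id)
    # The update attribute marks this field as a partial update.
    changed = ET.SubElement(doc, "field", name=field, update=command)
    changed.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_body = xml_partial_update("1", "price", 100)
# xml_body is:
# <add><doc><field name="id">1</field><field name="price" update="set">100</field></doc></add>
```

The string would be sent with the application/xml content type instead of application/json.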

  15. Yoni Says:

    Any idea if the SolrJ Java API is up to speed with this feature?
    I am using the latest solrj (4 beta) and I can’t figure out how to use the api for partial updates.
    Thanks,

  16. gr0 Says:

    @dm – XML is stripped from comments by default ;(

  17. gr0 Says:

    @Yoni, I will have to check that. Last time I looked there were no “update” methods, but you could do some trick to make it work.

  18. Yoni Says:

    I played a bit with solrj api and I found out that you can add a hashmap as the value of the SolrInputField. It causes solrj to add an “update” attribute to the field tag. However, it didn’t get me anywhere, solr didn’t update that field, so I am still stuck

  19. n00b Says:

    Any idea how I could update a field in a custom made component?

  20. deniz Says:

    do you know how to do the reverse when updating multivalued fields?
    for adding another value to a multiValued field, simply modifying your query above works (assuming price_f is multivalued)

    'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"add":250}}]'

    now price_f is something like

    100
    250

    so how to remove 250 from here again?

    these don’t work at all:

    'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"delete":250}}]'

    'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"remove":250}}]'
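One possible workaround, sketched here under the assumption that the current values of the field are already known (for example from a query), is to send a set command with the full list of values that should remain. Using set with a list on a multivalued field is also what a later comment in this thread does. The function name `remove_value_via_set` is made up for this example.

```python
import json


def remove_value_via_set(doc_id, field, current_values, value_to_remove):
    """Build a set update that keeps every current value except the given one."""
    remaining = [v for v in current_values if v != value_to_remove]
    return json.dumps([{"id": doc_id, field: {"set": remaining}}])


# The field currently holds 100 and 250; keep only 100.
remove_body = remove_value_via_set("1", "price", [100, 250], 250)
```

Note that this replaces the whole value list, so it is only safe when nothing else modifies the field between reading the values and sending the update.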

  21. Gavin Says:

    thank u ,the article is very useful !

    but ‘[{"id":"1","price":{"set":100}}]‘ seems a multi-dimensional array, would you tell me how call it with php cURL?

  22. Gavin Says:

    Can you give me a example of PHP ?THANKS!

  23. gr0 Says:

    I’m afraid I can’t help you with PHP :(

  24. redguy Says:

    Hi, is there *any* possibility to update documents with fields not marked as ‘stored’? The updated field is not copied or included into those fields…

    We have a case like this: documents with schema: name (stored), score (stored), description (stored), text (not stored, only indexed).
    ‘name’ and ‘description’ are copied into the ‘text’ field (but it can also have another value assigned explicitly). I want to change the value of the ‘score’ field without modifying any other fields…

  25. gr0 Says:

    So you fill the ‘text’ field from stored fields with the use of copyField, right? That should work, because what Solr does is a total reindex of the document with the changed value. Solr will do the same as if you had sent the whole document with a single value updated (of course only if all the fields you copy data from are stored).

  26. redguy Says:

    so to summarize: when non stored fields are just copied from other stored fields – this will work, but when I have non stored fields with some value assigned – this will not work, because during updating process solr will not have the previous value…

    that’s a pity in fact.. I wanted to achieve some kind of ‘scoring’ mechanism where users can update ‘score’ for indexed documents, but the main content of those documents is fetched with Tika from a document repository… And this means that I need to access the document file again just to update ‘score’…

    do you know how could I achieve something like that without needing to reindex whole documents? TIA

  27. gr0 Says:

    I don’t know how you want to achieve scoring and how you want to access it. You can see if ExternalFileField matches your needs.

  28. redguy Says:

    scoring is just a process of updating a single field in the index through the UI (comparing to SQL, I would like to have some interface to run UPDATE documents SET score=XXX WHERE document_id=YYY, which modifies only a single field without touching others).
    Access is a little more important, because I would like to search on that field and get facets (which ExternalFileField does not allow…)

  29. gr0 Says:

    I suppose you can’t have your fields set to stored because they hold a lot of data, right? If you are worried about index size, maybe it’s worth looking at 4.1 and compressed stored fields? I don’t see any other possibility than reindexing the whole document in your case. There may be a ‘real’ partial document update in Lucene somewhere in the future, but for now it’s just reindexing done on the Solr side.

  30. karunaker reddy v Says:

    Thank you for your time and valuable share

  31. JerryH Says:

    How would you update a document with multiple required fields? I have two required fields, and if I send the two required fields plus one field to be updated, it overwrites the document with only those three values and deletes the old ones.

  32. gr0 Says:

    Are other fields stored in your documents or only indexed ?

  33. JerryH Says:

    Many are stored, but not all. I was able to do it using set on the value that needs to be updated and querying against the other keys. It just was not so obvious in the RSolr library. Based on what you are saying, I assume all values that are not set to be stored will lose their indexed value.

  34. gr0 Says:

    Yes, if you don’t have all your fields stored, then you lose those during the update. Remember that the current update feature is just ordinary indexing, but done on the Solr side. So Solr needs to reconstruct the whole document, and it can’t do that from the inverted index, which is why it needs the original values in stored fields.

  35. JerryH Says:

    I am searching for a way to store all fields, but display only certain fields. I will look for it, but if you know the answer already, I’d really appreciate it.

  36. gr0 Says:

    You can use the fl parameter and specify the list of fields (separated by comma) that will be returned by Solr.
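As a small illustration of the fl parameter, the sketch below builds a query URL that asks Solr to return only selected fields. The helper name `select_url` and the base URL are assumptions for this example; the standard library takes care of URL-encoding the comma-separated list.

```python
from urllib.parse import urlencode


def select_url(query, fields, base_url="http://localhost:8983/solr"):
    """Build a select URL that returns only the listed fields via fl."""
    params = urlencode({"q": query, "fl": ",".join(fields)})
    return base_url + "/select?" + params


# Return only id, name and price, even though every field is stored.
url = select_url("*:*", ["id", "name", "price"])
```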

  37. Raghav Says:

    Thanks for the great post!

    We have observed that partial updates to solr fields only works when you have the transaction logging, currently used by the NRT Handler, enabled in solrconfig.xml, by including in the section the following

    ${solr.data.dir:}

    We disabled transaction logging and NRT because it generated very large transaction log files in production and caused long solr restart times.

    Any idea about this?

  38. Raghav Says:

    my previous post was garbled. What needs to be included, inside the updateHandler section, is an updateLog section.

  39. Brad Says:

    Thanks for the article. Anyway, an updateByQuery method would be useful; the update method described here requires a uniqueKey. Any suggestions?
    Thanks

  40. gr0 Says:

    As far as I know, if you don’t do a hard commit, the transaction log can get quite big. To avoid that it’s good to send a commit with openSearcher=false or just set up autocommit so that Solr does that for us. I suppose this is why you have experienced big transaction logs. As for partial document updates, I think the transaction log is needed, but I can’t recall now.

  41. gr0 Says:

    I’m afraid tags are not allowed and that’s why your examples are truncated.

  42. gr0 Says:

    Currently Solr only supports partial document update with the use of primary key – at least that is what I’m aware of. If that’s not true I would like to see an example :)

  43. Nim Says:

    I get a:

    4000Unexpected character ‘[' (code 91) in prolog; expected '<'
    at [row,col {unknown-source}]: [1,1]400

    I use: text/xml for ‘Content-type’ in the curl command. Do you have any idea about this error?

  44. gr0 Says:

    By using the text/xml content type you told Solr to expect XML, but you still sent JSON. If you change the content type to application/json it should work without a problem.

  45. Mary Says:

    Hi,

    I tried to use partial update following your example. I added the _version_ field and I see it in the results.

    Next I used
    “curl -i -H ‘Content-type:application/json’ -d ‘[{"id":"1","price":{"set":5}}]‘ http://localhost:8080/solr-4.1.0/update/json?commit=true

    _version_ was not changed and price was not changed. I received {“responseHeader”:{“status”:0,”QTime”:297}} response. Could you advise me why it does not work?

  46. gr0 Says:

    Sorry, but this is not enough information to see what is happening. What is shown in the Solr logs?

  47. Mary Says:

    10:16:38,511 INFO [org.apache.solr.core.SolrCore] (http–127.0.0.1-8080-4) [collection1] webapp=/solr-4.1.0 path=/admin/logging params={since=1362384853120&wt=json} status=0 QTime=0
    10:16:39,745 INFO [org.apache.solr.update.UpdateHandler] (http–127.0.0.1-8080-5) start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}
    10:16:39,792 INFO [org.apache.solr.core.SolrCore] (http–127.0.0.1-8080-5) SolrDeletionPolicy.onCommit: commits:num=2
    commit{dir=D:\Job\Tmp\solr4\home\data\index,segFN=segments_s,generation=28,filenames=[_3_nrm.cfe, _3.fdx, _3_nrm.cfs, segments_s, _3_Lucene41_0.pos, _3_Lucene41_0.doc, _3_Lucene41_0.tip, _3.si, _3.fnm, _3_Lucene41_0.tim, _3.fdt]
    commit{dir=D:\Job\Tmp\solr4\home\data\index,segFN=segments_t,generation=29,filenames=[_3_nrm.cfe, _3.fdx, _3_nrm.cfs, segments_t, _3_Lucene41_0.pos, _3_Lucene41_0.doc, _3_Lucene41_0.tip, _3.si, _3.fnm, _3_Lucene41_0.tim, _3.fdt]
    10:16:39,807 INFO [org.apache.solr.core.SolrCore] (http–127.0.0.1-8080-5) newest commit = 29[_3_nrm.cfe, _3.fdx, _3_nrm.cfs, segments_t, _3_Lucene41_0.pos, _3_Lucene41_0.doc, _3_Lucene41_0.tip, _3.si, _3.fnm, _3_Lucene41_0.tim, _3.fdt]
    10:16:39,823 INFO [org.apache.solr.core.CachingDirectoryFactory] (http–127.0.0.1-8080-5) Releasing directory:D:\Job\Tmp\solr4\home\.\data
    10:16:39,823 INFO [org.apache.solr.search.SolrIndexSearcher] (http–127.0.0.1-8080-5) Opening Searcher@313f4e main
    10:16:39,823 INFO [org.apache.solr.core.SolrCore] (searcherExecutor-4-thread-1) QuerySenderListener sending requests to Searcher@313f4e main{StandardDirectoryReader(segments_r:13 _3(4.1):C6)}
    10:16:39,823 INFO [org.apache.solr.update.UpdateHandler] (http–127.0.0.1-8080-5) end_commit_flush
    10:16:39,839 INFO [org.apache.solr.core.SolrCore] (searcherExecutor-4-thread-1) QuerySenderListener done.
    10:16:39,839 INFO [org.apache.solr.core.SolrCore] (searcherExecutor-4-thread-1) [collection1] Registered new searcher Searcher@313f4e main{StandardDirectoryReader(segments_r:13 _3(4.1):C6)}
    10:16:39,839 INFO [org.apache.solr.core.CachingDirectoryFactory] (searcherExecutor-4-thread-1) Releasing directory:D:\Job\Tmp\solr4\home\.\data\index
    10:16:39,854 INFO [org.apache.solr.update.processor.LogUpdateProcessor] (http–127.0.0.1-8080-5) [collection1] webapp=/solr-4.1.0 path=/update/json params={commit=true&’[{id::1,price:{set:5}}]‘=} {commit=} 0 109

  48. gr0 Says:

    The last line of your logs looks suspicious, I’m talking about params={commit=true&’[{id::1,price:{set:5}}]‘=}

    It looks like you are using the wrong quote character with the curl command.

  49. Barry C Says:

    Can you please update this article to include the correct terminology for this? It makes it so much easier to find documentation on it.
    The correct term is “Atomic updates” as shown in the Solr documentation here: http://wiki.apache.org/solr/Atomic_Updates

    It includes additional information about making sure your schema is properly set up to allow atomic updates, including things like making sure your config has an updateLog set and any copyField destination fields are set to stored="false".

  50. Avneet Says:

    Hi
    Could anyone help me with updating a field using SolrJ? I am using version 4.2.1.
    Thanks in advance

  51. Dan Says:

    Just wanted to say thanks.
    I just batched 25000 updates inside a script following the instructions here.
    http://stackoverflow.com/questions/8829167/use-curl-with-a-file-of-commands

    Took about 15 mins to process through, but all updates executed perfectly, very very happy.
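For a large number of updates, one option is to group several partial updates into a single request body instead of sending them one by one. The sketch below only builds the JSON bodies and sends nothing; `batched_bodies`, the batch size and the choice of the price field are assumptions made for this illustration.

```python
import json


def batched_bodies(price_by_id, batch_size=1000):
    """Yield JSON update bodies, each holding up to batch_size partial updates."""
    items = list(price_by_id.items())
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        yield json.dumps(
            [{"id": doc_id, "price": {"set": price}} for doc_id, price in chunk]
        )


# 2500 hypothetical price changes split into requests of at most 1000 documents.
bodies = list(batched_bodies({str(i): float(i) for i in range(2500)}))
```

Each body can then be posted to the update handler, with a single commit after the last batch.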

  52. gr0 Says:

    Thanks ! I’m glad we could help :)

  53. vibhor Says:

    here is the code to update a multivalued field:

    curl 'localhost:8984/solr/ads/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"108851417","localities_ss":{"set":["5u0","300","400"]}}]'

  54. gr0 Says:

    Thanks ! :)

  55. snehalata Says:

    Hi,
    I want to update the employee id from 10 to 20.
    Please give me a solution.

  56. anupam Says:

    I did the same thing, but not on Solr 4. I am using Solr 3.4, and to update a field for a particular document I copied all the fields, created a new Solr document with the field values, including the updated one, and committed the document. But an issue arises with multivalued fields: after the commit the new document shows fields with duplicate field values, and with leading and trailing square brackets too. Please help me out on this.

  57. Hamed Says:

    Hey guys,
    Is there any way to update fields through DIH?

    I want to import Dbpedia instance_types to Solr. My schema is . Types and prettyTypes are multivalue but DIH just adds last field to types.

    Is there any solution?
    http://stackoverflow.com/questions/17788282/solrs-data-import-request-handler-for-update-the-index

  58. Hamed Says:

    Oops, my schema was ignored. (uri,label,prettyTypes,types)

  59. gr0 Says:

    Solr 3.4 doesn’t support the described functionality, so I assume you are merging the documents somewhere in your application. In such case your application is the one responsible for the merge and you should look there.

  60. gr0 Says:

    I don’t think DIH is able to use partial document updates.

  61. Gupta Says:

    Does this work for Solr 3.5.0 too?

  62. gr0 Says:

    No, it doesn’t. It’s only available in Solr 4.0 and later.

  63. NR Says:

    How can it be performed using SolrJ 4.4?

  64. gr0 Says:

    I didn’t look at SolrJ recently I’m afraid.

  65. snehalata Says:

    Hi,
    without curl can we update field?

  66. Behzad Qureshi Says:

    Hi,

    @snehalata: It can be done in version 4.6, as updates are available from the admin panel in this version. See the “Documents” tab.

    Further, I think when a real update becomes available (meaning no stored="true" requirement), we will be able to update using any field instead of only the unique key.