A question I asked myself recently seems like one that should have a quick and painless answer: when should you send the commit command to Solr (or Lucene)? Despite the simplicity of the question, the answer is not clear, at least in my opinion.
To answer it, you have to look at the different ways data gets indexed and at how quickly you want the data to be available on the slave servers. Looking at the typical implementations I have had the pleasure of working with, we can distinguish the following categories:
Data can be made available only after a total index update
This is the simplest situation, both in theory and in practice. We send the commit command only once, when we run out of documents to index.
The data may be available in batches, without waiting for a full update of the index
Here we have three possibilities:
- If it does not matter whether the data is made available in batches, we can send the commit command after sending the last document.
- If we want to publish data in batches, our application can send a commit command from time to time.
- If we do not want to send commit commands from the indexing application, we can tell Solr to do it for us by setting up the autocommit mechanism.
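The second option, committing from the application from time to time, can be sketched roughly as below. This is only an illustration: `send_docs` and `send_commit` are hypothetical callables standing in for whatever client code actually talks to Solr, and `commit_every` is an assumed batch size, not a recommended value.

```python
from typing import Callable, Iterable, List


def index_with_periodic_commit(
    docs: Iterable[dict],
    send_docs: Callable[[List[dict]], None],  # hypothetical: posts documents to Solr
    send_commit: Callable[[], None],          # hypothetical: sends the commit command
    commit_every: int,
) -> int:
    """Send documents in batches and commit after each batch.

    Returns the number of commits issued.
    """
    if commit_every < 1:
        raise ValueError("commit_every must be at least 1")
    commits = 0
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == commit_every:
            send_docs(batch)
            send_commit()
            commits += 1
            batch = []
    if batch:
        # Leftover documents that did not fill a whole batch.
        send_docs(batch)
        send_commit()
        commits += 1
    return commits
```

Passing a `commit_every` larger than the number of documents gives the first option (a single commit after the last document).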
Data must be indexed as fast as possible
If your data should be indexed as fast as possible, send the commit only after sending all the data. Commit is quite expensive in terms of performance, so in this case it should be used only at the end of the indexing process.
Data must be published as soon as possible
This is probably the most difficult of the described cases. It all depends on how quickly we want the data to be available on the slave servers. For example, in a CMS, when the user saves an edited page, we want its updated content to be available right away, so a commit after every document and fast replication are needed. When you add items to an online store, you can allow some delay before commit and replication. Such examples could be multiplied indefinitely. But remember to set up your warming queries properly, so that Solr is prepared for the usual query load.
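One way to picture the difference between the CMS and the online store is as a single "maximum delay" setting in the indexing application. The sketch below is an assumption about how such application-side logic could look, not a Solr feature; the class name and the delay values are made up for illustration, and the caller supplies the current time so the logic stays easy to test.

```python
from typing import Optional


class CommitScheduler:
    """Decides when a commit is due, given a maximum acceptable delay."""

    def __init__(self, max_delay_seconds: float) -> None:
        self.max_delay_seconds = max_delay_seconds
        # Time at which the oldest uncommitted document was added.
        self._oldest_pending: Optional[float] = None

    def document_added(self, now: float) -> None:
        if self._oldest_pending is None:
            self._oldest_pending = now

    def commit_due(self, now: float) -> bool:
        return (
            self._oldest_pending is not None
            and now - self._oldest_pending >= self.max_delay_seconds
        )

    def committed(self) -> None:
        self._oldest_pending = None


# CMS-like case: a delay of 0 means a commit is due after every document.
cms = CommitScheduler(max_delay_seconds=0)
# Store-like case: tolerate up to a minute before publishing (illustrative value).
shop = CommitScheduler(max_delay_seconds=60)
```

The application would call `document_added` after each update, check `commit_due` periodically, and call `committed` once the commit command has been sent.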
Anyone interested in very frequent index updates should follow the NRT (near real time) work going on in Lucene and Solr.
Index optimization is also worth keeping in mind. If we send the commit command only once, at the end of indexing, it is worth considering whether to send optimize instead of commit. Our slaves will then get an optimized version of the index along with the newest data. Note, however, that optimizing the index takes longer than a commit.
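As a rough sketch, the choice between the two can be a single flag at the end of the indexing run. The example assumes Solr's HTTP update handler accepts `commit=true` or `optimize=true` as request parameters (check this against the version you run); the URL is a hypothetical address, and the code only builds the request URL without sending anything.

```python
from urllib.parse import urlencode


def final_update_url(update_url: str, optimize: bool) -> str:
    """Build the URL for the request that ends an indexing run."""
    # Assumption: the update handler accepts these as request parameters.
    params = {"optimize": "true"} if optimize else {"commit": "true"}
    return f"{update_url}?{urlencode(params)}"


# Hypothetical address of the master's update handler.
UPDATE_URL = "http://localhost:8983/solr/update"
print(final_update_url(UPDATE_URL, optimize=True))
```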
It is also worth remembering that postponing the commit indefinitely carries a risk of losing data that has not yet been physically written to the index files. Of course, nothing happens to the data if Solr is shut down properly, but if the machine fails we can lose everything we have indexed since the last commit.
To sum up
As you can see, there is no single answer to when to send the commit command, because it depends on the situation and individual needs. Note, however, that the work Lucene/Solr performs after receiving a commit is costly in terms of system resources. Do not send this command too frequently, or Lucene/Solr may spend most of its time processing commits instead of indexing data.