SolrCloud: What happens when ZooKeeper fails – part two

In the previous blog post about SolrCloud we talked about what happens when the ZooKeeper connection fails and how Solr handles that situation. However, we only covered the query time behavior of SolrCloud and promised to get back to the topic of indexing in the future. That future is finally here: let’s see what happens to indexing when the ZooKeeper connection is not available.

Looking back at the old post

In the SolrCloud – What happens when ZooKeeper fails? blog post, we showed that Solr can handle queries without any issues when the connection to ZooKeeper has been lost (which can happen for various reasons). Of course, this holds only as long as the cluster topology doesn’t change. Indexing and cluster change operations are a different story: we can’t change the cluster state or index documents when the ZooKeeper connection is not working or when ZooKeeper fails to read or write the data we need.

Why can we run queries?

The situation is quite simple: querying is not an operation that needs to alter the SolrCloud cluster state. The only thing Solr needs to do is accept the query, run it against the known shards/replicas and gather the results. The cluster topology is not retrieved with each query, so when there is no active ZooKeeper connection (or ZooKeeper has failed) we don’t have a problem with running queries.

There is also one important and not widely known feature of SolrCloud: the ability to return partial results. By adding the shards.tolerant=true parameter to our queries we tell Solr that we can live with partial results and that it should ignore shards that are not available. This means that Solr will return results even if some of the shards in our collection are not available. By default, when this parameter is not present or is set to false, Solr will return an error when a query is run against a collection that doesn’t have all of its shards available.
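
To make the parameter concrete, here is a small Python sketch that builds such a query URL. The base URL and the gettingstarted collection match the test setup used later in this post; the function name select_url is ours and purely illustrative.

```python
from urllib.parse import urlencode


def select_url(collection, query="*:*", tolerant=True,
               base="http://localhost:8983/solr"):
    """Build a /select URL; shards.tolerant controls partial results."""
    params = {
        "q": query,
        "rows": 0,
        # "true" asks Solr to skip unavailable shards instead of failing
        "shards.tolerant": "true" if tolerant else "false",
    }
    return f"{base}/{collection}/select?{urlencode(params)}"


print(select_url("gettingstarted"))
```

The resulting URL can be passed to curl or any HTTP client in the same way as the plain select query shown later.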

Why can’t we index data?

So, why can’t we index data when the ZooKeeper connection is not available or when ZooKeeper doesn’t have a quorum? Because there is potentially not enough information about the cluster state to process the indexing operation. Solr may simply not have fresh information about all the shards, replicas, and so on. Because of that, an indexing operation may be sent to the wrong place (for example, to a node that is no longer the current leader), which can lead to data corruption. That is why indexing (or cluster change) operations are just not possible.

It is generally worth remembering that all operations that can lead to a cluster state update or a collection update won’t be possible when the ZooKeeper quorum is not visible to Solr (in our test case, that means losing connectivity to our single ZooKeeper server).
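
The difference between the two code paths can be pictured with a toy model. This is not Solr’s implementation; ToyNode and its fields are hypothetical names used only to illustrate the rule described above: reads rely on the topology the node already knows, while writes are refused as long as ZooKeeper is unreachable.

```python
class ToyNode:
    """Toy stand-in for a Solr node; not Solr's real classes."""

    def __init__(self, known_shards):
        # Last cluster topology this node learned from ZooKeeper
        self.known_shards = list(known_shards)
        self.zk_connected = True

    def query(self, q):
        # Reads only need the already known topology, so they still work
        return [f"{shard}: results for {q}" for shard in self.known_shards]

    def update(self, doc):
        # Writes need a fresh view of the cluster, so refuse without ZooKeeper
        if not self.zk_connected:
            return 503, "Cannot talk to ZooKeeper - Updates are disabled."
        return 200, "OK"


node = ToyNode(["shard1", "shard2"])
node.zk_connected = False            # simulate losing ZooKeeper
print(node.query("*:*"))             # still answers from both shards
print(node.update({"id": "1"}))      # refused with a 503
```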

Of course, we could leave you with what we wrote above, but let’s check if all that is true.

Running ZooKeeper

A very simple step. For the purpose of the test we only need a single ZooKeeper instance, which we start using the following command from the ZooKeeper installation directory:

bin/zkServer.sh start

We should see the following information on the console:

JMX enabled by default
Using config: /Users/gro/Solry/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

And that means that we have a running ZooKeeper server.
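
If you prefer to verify this from a script, ZooKeeper answers the four-letter command ruok with imok when it is running. Below is a minimal Python sketch, assuming the default client port 2181 on localhost; the function name zk_is_ok is ours. Note that newer ZooKeeper releases may require four-letter commands to be whitelisted (the 4lw.commands.whitelist setting), so a False result there does not necessarily mean the server is down.

```python
import socket


def zk_is_ok(host="localhost", port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' command and report whether it replied 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Connection refused, timed out, or otherwise unreachable
        return False


if __name__ == "__main__":
    print("ZooKeeper OK:", zk_is_ok())
```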

Starting two Solr instances

To run the test we used the newest Solr version available when this blog post was published, which was 5.2.1. To run two Solr instances we used the following command:

bin/solr start -e cloud -z localhost:2181

Solr asked us a few questions when it was starting and our answers were the following:

number of instances: 2
collection name: gettingstarted
number of shards: 2
replication count: 1
configuration name: data_driven_schema_configs

Cluster topology after Solr started was as follows:

[Screenshot: cluster topology, 2015-06-21 at 11.13.31]

Let’s index a few documents

To see that Solr is really running, we’ve indexed a few documents by running the following command:

bin/post -c gettingstarted docs/

If everything went well, after running the following command:

curl -XGET 'localhost:8983/solr/gettingstarted/select?indent=true&q=*:*&rows=0'

we should see Solr respond with XML similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">38</int>
  <lst name="params">
   <str name="q">*:*</str>
   <str name="indent">true</str>
   <str name="rows">0</str>
  </lst>
 </lst>
 <result name="response" numFound="3577" start="0" maxScore="1.0">
 </result>
</response>
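
When scripting such a check, the document count can be pulled out of the response. Here is a minimal Python sketch using only the standard library; the function name num_found is ours, and SAMPLE is a trimmed copy of the response above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
 </lst>
 <result name="response" numFound="3577" start="0" maxScore="1.0">
 </result>
</response>"""


def num_found(xml_text):
    """Return the numFound attribute of a Solr XML select response."""
    if isinstance(xml_text, str):
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    # <result name="response"> is a direct child of <response>
    result = root.find("result[@name='response']")
    return int(result.get("numFound"))


print(num_found(SAMPLE))  # 3577
```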

We’ve indexed our documents and we have Solr running.

Let’s stop ZooKeeper and index data

To stop the ZooKeeper server, we just run the following command in the ZooKeeper installation directory:

bin/zkServer.sh stop

And now, let’s try to index our data again:

bin/post -c gettingstarted docs/

This time, instead of the data being written into the collection, we get an error response similar to the following one:

POSTing file index.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=%2FUsers%2Fgro%2FSolry%2F5.2.1%2Fdocs%2Findex.html&literal.id=%2FUsers%2Fgro%2FSolry%2F5.2.1%2Fdocs%2Findex.html
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">503</int><int name="QTime">3</int></lst><lst name="error"><str name="msg">Cannot talk to ZooKeeper - Updates are disabled.</str><int name="code">503</int></lst>
</response>

As we can see, the lack of ZooKeeper connectivity resulted in Solr not being able to index data. Of course, querying still works. Once we start ZooKeeper again, retrying the indexing will succeed, because Solr automatically reconnects to ZooKeeper and starts accepting updates again.
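
Since updates start working again once ZooKeeper is back, a client can treat the 503 response as temporary and retry. The following Python sketch shows one way to do that, under a few assumptions of ours: send is a caller-supplied function (hypothetical here) that posts the documents and returns the XML response body, any 503 error code in that body is treated as retryable, and the waiting time doubles after each failed attempt.

```python
import time
import xml.etree.ElementTree as ET


def is_503_error(body):
    """Return True if a Solr XML response carries a 503 error code."""
    if isinstance(body, str):
        body = body.encode("utf-8")
    root = ET.fromstring(body)
    return root.findtext("lst[@name='error']/int[@name='code']") == "503"


def index_with_retry(send, attempts=5, delay=1.0, sleep=time.sleep):
    """Call send() until the response is not a 503 error or attempts run out."""
    body = send()
    for _ in range(attempts - 1):
        if not is_503_error(body):
            break
        sleep(delay)   # wait before the next try
        delay *= 2     # exponential backoff
        body = send()
    return body
```

The function returns the last response body, so the caller can still inspect it if every attempt failed.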

Short summary

Of course, this and the previous blog post about ZooKeeper and SolrCloud only scratch the surface of what happens when the ZooKeeper connection is not available. A very good test that looks at data consistency can be found at http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/. I really recommend it if you would like to know what will happen with SolrCloud in various emergency situations.

Solr.pl