SolrCloud – What happens when ZooKeeper fails?

One of the questions I tend to get is what happens to a SolrCloud cluster when ZooKeeper fails. Of course, we are not talking about the failure of a single ZooKeeper instance, but about the whole ensemble being inaccessible, so that no quorum is present. Because the answer to this question is very easy to verify, I decided to write a simple blog post to show what happens when ZooKeeper fails.

Test environment

The test environment was very simple:

A single virtual machine running Linux
A single instance of ZooKeeper (which is sufficient for our test)
Two Solr instances with a single collection deployed
Solr 4.6

In order to create our test collection I’ve uploaded the configuration to ZooKeeper and used the following command:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'
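
When the same step has to be scripted, the request URL can be assembled from its parameters. The following is only a minimal sketch and was not part of the original test. The helper name build_create_url is hypothetical, and the host and port are simply the ones from the curl command above.

```python
from urllib.parse import urlencode


def build_create_url(name, num_shards, replication_factor,
                     base="http://localhost:8983/solr"):
    """Build a Collections API CREATE URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    # urlencode keeps the insertion order of the dict and escapes values.
    return base + "/admin/collections?" + urlencode(params)


# Example: the URL for the test collection used in this post.
create_url = build_create_url("collection1", 2, 1)
```

The resulting URL can be passed to curl or to any HTTP client.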

The cloud view of the example cluster showed the two shards of collection1, one on each Solr instance.

Test data indexing

The next step in our test is indexing. We index a few of the example documents that are provided with Solr in the exampledocs directory. The following commands were used to index the data:

curl 'localhost:8983/solr/collection1/update?commit=true' --data-binary @mem.xml -H 'Content-type:application/xml'
curl 'localhost:8983/solr/collection1/update?commit=true' --data-binary @monitor.xml -H 'Content-type:application/xml'
curl 'localhost:8983/solr/collection1/update?commit=true' --data-binary @monitor2.xml -H 'Content-type:application/xml'

After executing the above commands we get the following number of documents:

The whole collection holds 5 documents
Shard located on Solr running on port 8983 has 1 document
Shard located on Solr running on port 7983 has 4 documents
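
The post does not say how these counts were obtained. One possible way, sketched below under that assumption, is to send a match-all query with rows=0 to each node and add distrib=false so that the node answers only from its local index. The helper names count_url and num_found are hypothetical, and addressing the local core through the collection name is assumed to work as it does for the queries in this post.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def count_url(port, collection="collection1", local_only=True):
    """URL of a match-all query that returns only the hit count."""
    params = {"q": "*:*", "rows": 0}
    if local_only:
        # distrib=false asks the receiving node not to fan the query out.
        params["distrib"] = "false"
    return "http://localhost:%d/solr/%s/select?%s" % (
        port, collection, urlencode(params))


def num_found(xml_text):
    """Extract numFound from a Solr XML query response."""
    result = ET.fromstring(xml_text).find("result")
    return int(result.get("numFound"))
```

Fetching count_url(8983) and count_url(7983) and passing each response body to num_found would give the per-shard numbers, while local_only=False would give the total for the whole collection.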

Querying with ZooKeeper not present

Now for the next step: we shut down our ZooKeeper instance and run a simple query by sending the following command:

curl 'localhost:8983/solr/collection1/select?q=*:*&indent=true'

As a result we get the following response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">16</int>
  <lst name="params">
   <str name="indent">true</str>
   <str name="q">*:*</str>
  </lst>
 </lst>
<result name="response" numFound="5" start="0" maxScore="1.0">
<doc>
 <str name="id">TWINX2048-3200PRO</str> 
 <str name="name">CORSAIR  XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail</str>
 <str name="manu">Corsair Microsystems Inc.</str>
 <str name="manu_id_s">corsair</str>
 <arr name="cat">
  <str>electronics</str>
  <str>memory</str>
 </arr>
 <arr name="features">
  <str>CAS latency 2,    2-3-3-6 timing, 2.75v, unbuffered, heat-spreader</str>
 </arr>
 <float name="price">185.0</float>
 <str name="price_c">185,USD</str>
 <int name="popularity">5</int>
 <bool name="inStock">true</bool>
 <str name="store">37.7752,-122.4232</str>
 <date name="manufacturedate_dt">2006-02-13T15:26:37Z</date>
 <str name="payloads">electronics|6.0 memory|3.0</str>
 <long name="_version_">1453219034197655552</long>
</doc>
<doc>
 <str name="id">VS1GB400C3</str>
 <str name="name">CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
 <str name="manu">Corsair Microsystems Inc.</str>
 <str name="manu_id_s">corsair</str>
 <arr name="cat">
  <str>electronics</str>
  <str>memory</str>
 </arr>
 <float name="price">74.99</float>
 <str name="price_c">74.99,USD</str>
 <int name="popularity">7</int>
 <bool name="inStock">true</bool>
 <str name="store">37.7752,-100.0232</str>
 <date name="manufacturedate_dt">2006-02-13T15:26:37Z</date>
 <str name="payloads">electronics|4.0 memory|2.0</str>
 <long name="_version_">1453219034252181504</long>
</doc>
<doc>
 <str name="id">VDBDB1A16</str>
 <str name="name">A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM</str>
 <str name="manu">A-DATA Technology Inc.</str>
 <str name="manu_id_s">corsair</str>
 <arr name="cat">
  <str>electronics</str>
  <str>memory</str>
 </arr>
 <arr name="features">
  <str>CAS latency 3,     2.7v</str>
 </arr>
 <int name="popularity">0</int>
 <bool name="inStock">true</bool>
 <str name="store">45.18414,-93.88141</str>
 <date name="manufacturedate_dt">2006-02-13T15:26:37Z</date>
 <str name="payloads">electronics|0.9 memory|0.1</str>
 <long name="_version_">1453219034255327232</long>
</doc>
<doc>
 <str name="id">3007WFP</str>
 <str name="name">Dell Widescreen UltraSharp 3007WFP</str>
 <str name="manu">Dell, Inc.</str>
 <str name="manu_id_s">dell</str>
 <arr name="cat">
  <str>electronics</str>
  <str>monitor</str>
 </arr>
 <arr name="features">
  <str>30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</str>
 </arr>
 <str name="includes">USB cable</str>
 <float name="weight">401.6</float>
 <float name="price">2199.0</float>
 <str name="price_c">2199,USD</str>
 <int name="popularity">6</int>
 <bool name="inStock">true</bool>
 <str name="store">43.17614,-90.57341</str>
 <long name="_version_">1453219041357332480</long>
</doc>
<doc>
 <str name="id">VA902B</str>
 <str name="name">ViewSonic VA902B - flat panel display - TFT - 19"</str>
 <str name="manu">ViewSonic Corp.</str>
 <str name="manu_id_s">viewsonic</str>
 <arr name="cat">
  <str>electronics</str>
  <str>monitor</str>
 </arr>
 <arr name="features">
  <str>19" TFT active matrix LCD, 8ms response time, 1280 x 1024 native resolution</str>
 </arr>
 <float name="weight">190.4</float>
 <float name="price">279.95</float>
 <str name="price_c">279.95,USD</str>
 <int name="popularity">6</int>
 <bool name="inStock">true</bool>
 <str name="store">45.18814,-93.88541</str>
 <long name="_version_">1453219045997281280</long></doc>
</result>
</response>

As we can see, Solr responded correctly. This is because Solr already has the clusterstate.json file cached. Searching does not require Solr to update that file, so queries keep working, as the response above shows.

Indexing with failed ZooKeeper

Without starting our ZooKeeper instance again, we try to run the following command:

curl 'localhost:8983/solr/collection1/update?commit=true' --data-binary @hd.xml -H 'Content-type:application/xml'

The above command should index the contents of the hd.xml file. Instead, after about 15 seconds (see the QTime value below), Solr responds with the following information:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">503</int><int name="QTime">15096</int></lst><lst name="error"><str name="msg">Cannot talk to ZooKeeper - Updates are disabled.</str><int name="code">503</int></lst>
</response>

So, as you can see, we are not able to index data without a working ZooKeeper ensemble.
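
An indexing client can detect this situation by looking at the status and the error message in the response. The following is a minimal sketch with hypothetical helper names. Matching on the word "ZooKeeper" in the message is an assumption based only on the single response shown above, not a documented contract.

```python
import xml.etree.ElementTree as ET


def parse_update_response(xml_text):
    """Return (status, error_message) from a Solr XML update response.

    error_message is None when the response has no error section.
    """
    root = ET.fromstring(xml_text)
    status = int(
        root.find("./lst[@name='responseHeader']/int[@name='status']").text)
    msg = root.find("./lst[@name='error']/str[@name='msg']")
    return status, (msg.text if msg is not None else None)


def zookeeper_unavailable(xml_text):
    """True for a 503 rejection whose message mentions ZooKeeper."""
    status, message = parse_update_response(xml_text)
    return status == 503 and message is not None and "ZooKeeper" in message
```

A client could use zookeeper_unavailable to decide whether to queue the documents and try again later instead of treating the request as a permanent failure.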

Starting ZooKeeper again

So let’s see what happens when we start our ZooKeeper instance again without restarting the Solr nodes. After starting ZooKeeper, we run the same indexing command once again:

curl 'localhost:8983/solr/collection1/update?commit=true' --data-binary @hd.xml -H 'Content-type:application/xml'

And this time the response is different:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">118</int></lst>
</response>

As we can see, the indexing request was successful this time. This allows us to assume that Solr re-established the connection to ZooKeeper on its own, which can be confirmed in the Solr and ZooKeeper logs.
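
Since updates start working again once ZooKeeper is back, a client may simply retry a rejected update after a pause. The sketch below illustrates that idea and is not something the test itself did. The function retry_update and its parameters are hypothetical, and send stands for any callable that performs the update request and returns the response status as an integer.

```python
import time


def retry_update(send, attempts=5, delay=2.0, sleep=time.sleep):
    """Call send() again while it keeps returning status 503.

    send is any callable returning the Solr response status as an int.
    sleep is injectable so the loop can be exercised without waiting.
    Returns the last status seen.
    """
    status = None
    for attempt in range(attempts):
        status = send()
        if status != 503:
            # Success (0) or an error that retrying will not fix.
            return status
        if attempt < attempts - 1:
            sleep(delay)
    return status
```

The number of attempts and the delay are arbitrary defaults. In practice they would be chosen to match how long a ZooKeeper outage the client is expected to ride out.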

Short summary

As you can see, our short test allowed us to see what happens when the ZooKeeper ensemble fails and what we can expect from Solr in such rare cases. I hope this blog entry clears up some doubts about SolrCloud and its usefulness.

Please also remember that during the test the cluster state did not change: all shards were accessible and working. In the next blog entry about SolrCloud we will see what happens when shards or replicas fail while ZooKeeper is down.

Solr.pl