SolrCloud HOWTO

What is the most important change in the 4.x version of Apache Solr? I think there are many of them, but SolrCloud is definitely the one that changed the Solr architecture the most. Until now, bigger installations suffered from a single point of failure (SPOF): there was only one master server, and when this server went down, the whole cluster lost the ability to receive new data. Of course you could go for multiple masters, where each master was responsible for indexing some part of the data, but a SPOF was still present in your deployment. Even when everything worked, due to the commit interval and the fact that slave instances checked for new data only periodically, the solution was far from ideal: new data appeared in the cluster minutes after the commit.

SolrCloud changed this behavior. In this article we will set up a new SolrCloud cluster from scratch and see how it works.

Our example cluster

In our example we will use three Solr servers. Every server in the cluster is capable of handling both indexing and query requests. This is the main difference from the old-fashioned Solr architecture with a single master and multiple slave servers. In the new architecture one additional element is present: Zookeeper, which is responsible for holding the configuration of the cluster and for synchronizing its work. It is crucial to understand that Solr relies on the information stored in Zookeeper: if Zookeeper fails, the whole cluster is useless. Because of this it is very important to have a fault-tolerant Zookeeper ensemble, and that is why we use three independent Zookeeper instances that together form the ensemble.
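Throughout this article I will assume the following example hostnames (they are only placeholders used in the commands below; adjust them to your own environment):

    zk1, zk2, zk3        - Zookeeper ensemble members (client port 2181)
    solr1, solr2, solr3  - Solr instances (default Jetty port 8983)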

Zookeeper installation

As we said previously, Zookeeper is a vital part of a SolrCloud cluster. Although we can use the embedded Zookeeper, this is only handy for testing. For production you definitely want Zookeeper to be installed independently from Solr and run in a separate Java virtual machine process, so that the two do not interrupt each other or influence each other's work.

The installation of Apache Zookeeper is straightforward and can be described by the following steps:

  1. Download the Zookeeper archive from: http://www.apache.org/dyn/closer.cgi/zookeeper/
  2. Unpack the downloaded archive and copy conf/zoo_sample.cfg to conf/zoo.cfg
  3. Modify zoo.cfg:
    1. Change dataDir to the directory where you want to hold all cluster configuration data
    2. Add information about all Zookeeper servers (see below)

After these changes my zoo.cfg looks like the following (2181 is the client port, 2888 is used for connections between ensemble members, and 3888 for leader election):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
  4. Copy the archive to all servers on which the Zookeeper service should run
  5. Create the file /var/zookeeper/data/myid containing the server identifier. This identifier is different for each instance (for example, on zk2 this file should contain the number 2)
  6. Start all instances using “bin/zkServer.sh start-foreground” and verify that the installation works (see the sketch below)
  7. Add “bin/zkServer.sh start” to the startup scripts and make sure that the operating system monitors whether the Zookeeper service is available.
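A minimal sketch of steps 5 and 6 on one of the nodes (I assume the zk2 hostname and the dataDir from the zoo.cfg above; nc is used only as a quick way to send Zookeeper's built-in ruok command):

    # on zk2 (example hostname); the identifier in myid is different on every server
    mkdir -p /var/zookeeper/data
    echo "2" > /var/zookeeper/data/myid
    bin/zkServer.sh start-foreground

    # from any machine, a quick sanity check; a healthy instance answers with "imok"
    echo ruok | nc zk2 2181
    # once all three instances are up, this shows whether the node is a leader or a follower
    bin/zkServer.sh status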

Solr installation

The installation of Solr looks as follows:

  1. Download the Solr archive from: http://www.apache.org/dyn/closer.cgi/lucene/solr/4.1.0
  2. Unpack the downloaded archive
  3. In this tutorial we will use the ready-made Solr installation from the example directory and all changes are made to this example installation
  4. Copy the archive to all servers that are part of the cluster
  5. Load the configuration data that will be used by the Solr cluster into Zookeeper. To do this, run the first instance with:
    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=solr1 -DzkHost=zk1:2181 -DnumShards=2 -jar start.jar

This has to be run only once. Subsequent runs will use the configuration stored in the Zookeeper ensemble and the local configuration files are no longer needed.
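To confirm that the bootstrap worked, you can look at what Solr stored in Zookeeper. A minimal sketch using the zkCli.sh client shipped with Zookeeper (zk1 is the hostname assumed above; the shown znode names come from the configName and collection used in this tutorial):

    bin/zkCli.sh -server zk1:2181
    # inside the Zookeeper shell:
    ls /configs
    # should list the uploaded configuration set, e.g. [solr1]
    ls /collections
    # should list the collections known to the cluster, e.g. [collection1]
    quit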

  6. Run all instances using
    java -DzkHost=zk1:2181 -jar start.jar

Verify the installation

Go to the administration panel on any Solr instance. For our deployment the URL should be something like http://solr1:8983/solr. When you click on the Cloud tab, and then on Graph, you should see something similar to the following screenshot:

[Screenshot: the Cloud graph view showing collection1, its shards and replicas]

Collection

Our first collection, collection1, is divided into two shards (shard1 and shard2). Each of those shards is placed on two Solr instances (OK, in the picture you can see that every Solr instance is placed on the same host; I currently have only one physical server available for tests, any volunteers for a donation? ;)). The type of the dot tells us whether a given core is the primary shard or a replica.
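Besides looking at the graph, a quick smoke test is to index a single document and query it back. A minimal sketch using curl (the solr1 hostname and the default 8983 port are the ones assumed above; the title field is part of the stock example schema):

    # index one document; any node of the cluster can accept it
    curl "http://solr1:8983/solr/collection1/update?commit=true" \
      -H "Content-Type: application/json" \
      -d '[{"id":"1","title":"SolrCloud test document"}]'

    # query any node; the document is returned no matter which shard it landed on,
    # because SolrCloud performs the distributed search for you
    curl "http://solr1:8983/solr/collection1/select?q=*:*&wt=json"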

Summary

I hope this is only the first note about SolrCloud. I know it is very short and skips details about shards, replicas and the architecture of this solution. Treat it as a simple checklist for a basic (but real) configuration of your cloud.



5 Responses to “SolrCloud HOWTO”

  1. Chris Says:

    Great tutorial. Why are both shards not placed on each host? Is that something that can be configured?

  2. gr0 Says:

    We have two primary shards and each of them has a replica, an exact copy of itself. This is why we have a total of 4 shards: two primaries and two replicas. And yes, we can configure the number of shards (during index creation) and the number of replicas.

  3. Kamil Leduchowski Says:

    First you show how to run a Zookeeper ensemble, and then on Solr startup you use only a single instance instead of giving all the instances in the zkHost parameter.

    The start command on both nodes should be:
    java -DzkHost=zk1:2181,zk2:2181 -jar start.jar

  4. Kamil Leduchowski Says:

    Forgot to add the third one:
    java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

  5. gr0 Says:

    You are right, thanks for the comment! :)