<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>indexing &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/indexing-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Sat, 14 Nov 2020 14:27:47 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>SolrCloud &#8211; write and read tolerance</title>
		<link>https://solr.pl/en/2018/12/31/solrcloud-write-and-read-tolerance/</link>
					<comments>https://solr.pl/en/2018/12/31/solrcloud-write-and-read-tolerance/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 31 Dec 2018 14:27:21 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[querying]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=978</guid>

					<description><![CDATA[SolrCloud similar to most of the distributed systems is designed with some rules in mind. There are also rules that each distributed system is subject to. For example the CAP theorem tells that a system can&#8217;t achieve availability, data consistency]]></description>
										<content:encoded><![CDATA[
<p>SolrCloud, like most distributed systems, is designed with certain rules in mind, and there are also rules that every distributed system is subject to. For example, the <a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP</a> theorem states that a system can&#8217;t provide availability, data consistency and network partition tolerance all at the same time &#8211; you can have at most two out of the three. Of course, in this blog entry we will not be discussing the principles of distributed systems; instead we will focus on write and read tolerance in SolrCloud.</p>



<span id="more-978"></span>



<h2 class="wp-block-heading">Write time tolerance</h2>



<p>Write tolerance is not a simple topic. First of all, Solr 7.0 introduced a variety of replica types. NRT replicas write data to the transaction log and index it locally. TLOG replicas also write to the transaction log, but instead of indexing the data on their own they use the replication mechanism to pull the index from the leader. Finally, PULL replicas do not use the transaction log at all and only periodically pull the index from the shard leader.</p>



<p>However, we won&#8217;t be analyzing how each replica type works; we will focus on NRT replicas, because this type has been here since the beginning of SolrCloud and, what&#8217;s more, it is still the default replica type.</p>



<p>When it comes to NRT replicas, the indexing process is as follows. The leader accepts the data, writes it into the transaction log and sends it to all of its replicas (assuming all are of the NRT type). Then each of the replicas writes the data to its transaction log and returns an acknowledgment. At this point we know that the data is safe. Of course, somewhere in the meantime the data will also be written to the inverted index. But the question is &#8211; what will happen when not all shards are available? I would bet on the indexing not succeeding, but to be perfectly sure let&#8217;s check that by starting two Solr instances with the following commands:</p>



<pre class="wp-block-code"><code class="">$ bin/solr start -c</code></pre>



<pre class="wp-block-code"><code class="">$ bin/solr start -z localhost:9983 -p 6983</code></pre>



<p>Next, let&#8217;s create a collection built of two shards:</p>



<pre class="wp-block-code"><code class="">$ bin/solr create_collection -c test_index -shards 2 -replicationFactor 1</code></pre>



<p>Once the collection is created, let&#8217;s stop one of the instances:</p>



<pre class="wp-block-code"><code class="">$ bin/solr stop -p 6983</code></pre>



<p>And finally let&#8217;s try indexing some data by using the following command:</p>



<pre class="wp-block-code"><code class="">$ curl -XPOST -H 'Content-type:application/json' 'localhost:8983/solr/test_index/update' -d '{
 "id" : 2,
 "name" : "Test document"
}'</code></pre>



<p>As we would expect, Solr returns an error:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "status":503,
    "QTime":4011},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"No registered leader was found after waiting for 4000ms , collection: test_index slice: shard2 saw state=DocCollection(test_index//collections/test_index/state.json/8)={\n  \"pullReplicas\":\"0\",\n  \"replicationFactor\":\"1\",\n  \"shards\":{\n    \"shard1\":{\n      \"range\":\"80000000-ffffffff\",\n      \"state\":\"active\",\n      \"replicas\":{\"core_node3\":{\n          \"core\":\"test_index_shard1_replica_n1\",\n          \"base_url\":\"http://192.168.1.11:8983/solr\",\n          \"node_name\":\"192.168.1.11:8983_solr\",\n          \"state\":\"active\",\n          \"type\":\"NRT\",\n          \"force_set_state\":\"false\",\n          \"leader\":\"true\"}}},\n    \"shard2\":{\n      \"range\":\"0-7fffffff\",\n      \"state\":\"active\",\n      \"replicas\":{\"core_node4\":{\n          \"core\":\"test_index_shard2_replica_n2\",\n          \"base_url\":\"http://192.168.1.11:6983/solr\",\n          \"node_name\":\"192.168.1.11:6983_solr\",\n          \"state\":\"down\",\n          \"type\":\"NRT\",\n          \"force_set_state\":\"false\",\n          \"leader\":\"true\"}}}},\n  \"router\":{\"name\":\"compositeId\"},\n  \"maxShardsPerNode\":\"-1\",\n  \"autoAddReplicas\":\"false\",\n  \"nrtReplicas\":\"1\",\n  \"tlogReplicas\":\"0\"} with live_nodes=[192.168.1.11:8983_solr]",
    "code":503}}</code></pre>



<p>In that case we can&#8217;t really do anything. We don&#8217;t want to route the data manually, and even if we did, we would have no guarantee that the data would end up in one of the available shards. The best we can do is bring the missing shards back to life as soon as possible <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<p>What will happen if we have multiple replicas and only some of them are missing? In that case the write should succeed and Solr should inform us how many replicas wrote the data (at least in its newest versions) by including the <em>rf</em> parameter in the response. Let&#8217;s check that out.</p>



<p>Let&#8217;s create another collection, this time with a single shard and two replicas on our two Solr nodes:</p>



<pre class="wp-block-code"><code class="">$ bin/solr create_collection -c test_index_2 -shards 1 -replicationFactor 2</code></pre>



<p>If we tried to index data using exactly the same command, Solr (7.6.0 in our case) would return the following response:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "rf":2,
    "status":0,
    "QTime":316}}</code></pre>



<p>As we can see the <em>rf</em> parameter is set to <em>2</em>. This means that a replication factor of 2 was achieved. In the scope of our collection it means that the write was successful on both the shard leader and the replica. If we stopped the Solr instance running on port <em>6983</em> and tried to index the same data once again, we would get the following response:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "rf":1,
    "status":0,
    "QTime":4}}</code></pre>



<p>In earlier Solr versions, in order to get information about the achieved replication factor, we had to include the <em>min_rf</em> parameter in the indexing request and set it to a value higher than 1.</p>
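<p>The <em>rf</em> value can also be acted on programmatically. Below is a minimal client-side sketch (plain Python; the helper names are illustrative, not part of any Solr client library) of how an application might check whether a write reached enough replicas:</p>

```python
# Sketch: act on the achieved replication factor ("rf") returned by Solr.
# The response dicts mirror the JSON responses shown above; the helper
# names are hypothetical, not part of any official Solr client.

def achieved_rf(response):
    """Extract the achieved replication factor from an update response."""
    return response.get("responseHeader", {}).get("rf", 0)

def write_is_durable(response, wanted_rf):
    """True when at least wanted_rf replicas acknowledged the write."""
    return achieved_rf(response) >= wanted_rf

full = {"responseHeader": {"rf": 2, "status": 0, "QTime": 316}}
degraded = {"responseHeader": {"rf": 1, "status": 0, "QTime": 4}}

print(write_is_durable(full, 2))      # True
print(write_is_durable(degraded, 2))  # False - consider re-sending later
```

<p>An application that sees a lower <em>rf</em> than expected can, for example, buffer the document and re-send it once the missing replicas are back.</p>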



<h2 class="wp-block-heading">Read time tolerance</h2>



<p>When it comes to reads the situation is a bit different. If not all shards are available we will lose visibility over a portion of the data. For example, with a collection of 10 shards, losing one of them means losing access to approximately 10% of the data. And during a query, by default, Solr will not return the remaining 90% of the documents; it will just throw an error. Let&#8217;s check if that is true. To do that we will start two instances of Solr by using the following commands:</p>



<pre class="wp-block-code"><code class="">$ bin/solr start -c</code></pre>



<pre class="wp-block-code"><code class="">$ bin/solr start -z localhost:9983 -p 6983</code></pre>



<p> Next, let&#8217;s create a simple collection built of two shards:</p>



<pre class="wp-block-code"><code class="">$ bin/solr create_collection -c test -shards 2 -replicationFactor 1</code></pre>



<p>And now, without indexing the data let&#8217;s just stop one instance, the one that is running on port <em>6983</em>:</p>



<pre class="wp-block-code"><code class="">$ bin/solr stop -p 6983</code></pre>



<p>Now all it takes to get an error is to run the following query:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*</code></pre>



<p>In response, instead of an empty result list, we will get an error similar to the following one:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "status":503,
    "QTime":6,
    "params":{
      "q":"*:*"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"no servers hosting shard: shard2",
    "code":503}}</code></pre>



<p>OK, the default behavior is good &#8211; we get an error, because we don&#8217;t have full visibility of the data. But what if we would like to show partial results, taking the risk of not delivering the most valuable results, but also not showing errors or empty pages? To achieve that Solr gives us two parameters: <em>shards.tolerant</em> and <em>shards.info</em>. If we would like partial results to be returned we should set the first one to <em>true</em>; if we would like detailed information about the shards we should set the second one to <em>true</em>. For example:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.tolerant=true&amp;shards.info=true</code></pre>



<p>For the above query Solr will not return an error; it will return partial results along with information about the error on one of the shards:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "zkConnected":true,
    "partialResults":true,
    "status":0,
    "QTime":45,
    "params":{
      "q":"*:*",
      "shards.tolerant":"true",
      "shards.info":"true"}},
  "shards.info":{
    "":{
      "error":"org.apache.solr.common.SolrException: no servers hosting shard: ",
      "trace":"org.apache.solr.common.SolrException: no servers hosting shard: \n\tat org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:165)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)\n\tat org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "time":0},
    "http://192.168.1.11:8983/solr/test_shard1_replica_n1/":{
      "numFound":0,
      "maxScore":0.0,
      "shardAddress":"http://192.168.1.11:8983/solr/test_shard1_replica_n1/",
      "time":18}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  }}</code></pre>



<p>As you can see, everything works as we wanted. We got the results, and we got the information that they are partial (the <em>partialResults</em> property set to <em>true</em> in the response header), so our application knows that the result set is incomplete and something went wrong. What&#8217;s more, we also got full information about which shard is to blame, because we added the <em>shards.info=true</em> parameter to our query.</p>
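<p>On the application side, a shards-tolerant response can be handled defensively with a few lines of code. A minimal sketch (plain Python; the helper names are illustrative, not part of any Solr client library):</p>

```python
# Sketch: defensive handling of a shards-tolerant Solr response.
# The response layout mirrors the JSON shown above; the helper names
# are hypothetical, not part of any official Solr client.

def is_partial(response):
    """True when Solr flagged the result set as incomplete."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))

def failed_shards(response):
    """Shards that reported an error in the shards.info section."""
    info = response.get("shards.info", {})
    return [shard for shard, details in info.items() if "error" in details]

response = {
    "responseHeader": {"partialResults": True, "status": 0},
    "shards.info": {
        "": {"error": "no servers hosting shard: ", "time": 0},
        "http://192.168.1.11:8983/solr/test_shard1_replica_n1/": {
            "numFound": 0, "time": 18},
    },
}

if is_partial(response):
    # e.g. show a "results may be incomplete" banner in the UI
    print("partial results, failing shards:", failed_shards(response))
```
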
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2018/12/31/solrcloud-write-and-read-tolerance/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>SolrCloud: What happens when ZooKeeper fails &#8211; part two</title>
		<link>https://solr.pl/en/2015/06/29/solrcloud-what-happens-when-zookeeper-fails-part-two/</link>
					<comments>https://solr.pl/en/2015/06/29/solrcloud-what-happens-when-zookeeper-fails-part-two/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 29 Jun 2015 12:54:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[zookeeper]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=869</guid>

					<description><![CDATA[In the previous blog post about SolrCloud we&#8217;ve talked about the situation when ZooKeeper connection failed and how Solr handles that situation. However, we only talked about query time behavior of SolrCloud and we said that we will get back]]></description>
										<content:encoded><![CDATA[<p>In the <a href="http://solr.pl/en/2013/12/02/solrcloud-what-happens-when-zookeeper-fails/">previous</a> blog post about SolrCloud we&#8217;ve talked about what happens when the ZooKeeper connection fails and how Solr handles that situation. However, we only talked about the query time behavior of SolrCloud and said that we would get back to the topic of indexing in the future. That future is finally here &#8211; let&#8217;s see what happens to indexing when the ZooKeeper connection is not available.</p>
<p><span id="more-869"></span></p>
<h3>Looking back at the old post</h3>
<p>In the <a href="http://solr.pl/en/2013/12/02/solrcloud-what-happens-when-zookeeper-fails/">SolrCloud &#8211; What happens when ZooKeeper fails?</a> blog post, we&#8217;ve shown that Solr can handle querying without any issues when the connection to ZooKeeper has been lost (which can happen for different reasons). Of course, this is true only until we need to change the cluster topology. Unfortunately, when it comes to indexing or cluster change operations, we can&#8217;t change the cluster state or index documents when the ZooKeeper connection is not working or ZooKeeper failed to read/write the data we want.</p>
<h3>Why can we run queries?</h3>
<p>The situation is quite simple &#8211; querying is not an operation that needs to alter the SolrCloud cluster state. The only thing Solr needs to do is accept the query, run it against the known shards/replicas and gather the results. The cluster topology is not retrieved with each query, so when there is no active ZooKeeper connection (or ZooKeeper failed) we don&#8217;t have a problem with running queries.</p>
<p>There is also one important and not widely known feature of SolrCloud &#8211; the ability to return partial results. By adding the <em>shards.tolerant=true</em> parameter to our queries we inform Solr that we can live with partial results and it should ignore shards that are not available. This means that Solr will return results even if some of the shards in our collection are not available. By default, when this parameter is not present or set to <em>false</em>, Solr will just return an error when running a query against a collection that doesn&#8217;t have all its shards available.</p>
<h3>Why can&#8217;t we index data?</h3>
<p>So, why can&#8217;t we index data when the ZooKeeper connection is not available or ZooKeeper doesn&#8217;t have a quorum? Because there is potentially not enough information about the cluster state to process the indexing operation. Solr may not have fresh information about all the shards, replicas, and so on. Because of that, an indexing operation could be pointed at the wrong shard (for example, not at the current leader), which could lead to data corruption. That is why indexing (or cluster change) operations are just not possible.</p>
<p>It is generally worth remembering that all operations that lead to a cluster state or collection update won&#8217;t be possible when the ZooKeeper quorum is not visible to Solr (in our test case, it will be the lack of connectivity to a single ZooKeeper server).</p>
<p>Of course, we could leave you with what we wrote above, but let&#8217;s check if all that is true.</p>
<h4>Running ZooKeeper</h4>
<p>A very simple step. For the purpose of the test we will only need a single ZooKeeper instance, which is run using the following command from the ZooKeeper installation directory:
</p>
<pre class="brush:xml">bin/zkServer.sh start
</pre>
<p>We should see the following information on the console:
</p>
<pre class="brush:xml">JMX enabled by default
Using config: /Users/gro/Solry/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
</pre>
<p>And that means that we have a running ZooKeeper server.</p>
<h4>Starting two Solr instances</h4>
<p>To run the test we&#8217;ve used the newest available Solr version &#8211; <em>5.2.1</em> at the time this blog post was published. To run two Solr instances we&#8217;ve used the following command:
</p>
<pre class="brush:xml">bin/solr start -e cloud -z localhost:2181
</pre>
<p>Solr asked us a few questions when it was starting and the answers were the following:</p>
<ul>
<li>number of instances: <em>2</em></li>
<li>collection name: <em>gettingstarted</em></li>
<li>number of shards: <em>2</em></li>
<li>replication count: <em>1</em></li>
<li>configuration name: <em>data_driven_schema_configs</em></li>
</ul>
<p>Cluster topology after Solr started was as follows:</p>
<p><img decoding="async" class="aligncenter  wp-image-3617" src="http://solr.pl/wp-content/uploads/2015/06/zookeeper_kolekcja.png" alt="Screenshot 2015-06-21 at 11.13.31" width="683" height="58"></p>
<h4>Let&#8217;s index a few documents</h4>
<p>To see that Solr is really running, we&#8217;ve indexed a few documents by running the following command:
</p>
<pre class="brush:xml">bin/post -c gettingstarted docs/
</pre>
<p>If everything went well, after running the following command:
</p>
<pre class="brush:xml">curl -XGET 'localhost:8983/solr/gettingstarted/select?indent=true&amp;q=*:*&amp;rows=0'
</pre>
<p>we should see Solr responding with similar XML:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;38&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="indent"&gt;true&lt;/str&gt;
   &lt;str name="rows"&gt;0&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="3577" start="0" maxScore="1.0"&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>We&#8217;ve indexed our documents, we have Solr running.</p>
<h4>Let&#8217;s stop ZooKeeper and index data</h4>
<p>To stop ZooKeeper server we will just run the following command in the ZooKeeper installation directory:
</p>
<pre class="brush:xml">bin/zkServer.sh stop
</pre>
<p>And now, let&#8217;s again try to index our data:
</p>
<pre class="brush:xml">bin/post -c gettingstarted docs/
</pre>
<p>This time, instead of data being written into the collection we will get an error response similar to the following one:
</p>
<pre class="brush:xml">POSTing file index.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=%2FUsers%2Fgro%2FSolry%2F5.2.1%2Fdocs%2Findex.html&amp;literal.id=%2FUsers%2Fgro%2FSolry%2F5.2.1%2Fdocs%2Findex.html
SimplePostTool: WARNING: Response: &lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;&lt;int name="status"&gt;503&lt;/int&gt;&lt;int name="QTime"&gt;3&lt;/int&gt;&lt;/lst&gt;&lt;lst name="error"&gt;&lt;str name="msg"&gt;Cannot talk to ZooKeeper - Updates are disabled.&lt;/str&gt;&lt;int name="code"&gt;503&lt;/int&gt;&lt;/lst&gt;
&lt;/response&gt;
</pre>
<p>As we can see, the lack of ZooKeeper connectivity resulted in Solr not being able to index data. Of course querying still works. After turning ZooKeeper on again, retrying the indexing will be successful, because Solr automatically reconnects to ZooKeeper and starts working again.</p>
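<p>A client application can recognize this condition and buffer its writes until ZooKeeper is back. A minimal sketch (plain Python; the function name is illustrative, and matching on the message text is an assumption based on the error response shown above):</p>

```python
# Sketch: detect the "Updates are disabled" rejection shown above, so the
# application can queue the document and retry once ZooKeeper is back.
# The function name is hypothetical; Solr returns this message with
# HTTP status 503.

def updates_disabled(status_code, error_msg):
    """True when Solr rejected a write because ZooKeeper is unreachable."""
    return status_code == 503 and "Updates are disabled" in (error_msg or "")

print(updates_disabled(503, "Cannot talk to ZooKeeper - Updates are disabled."))  # True
print(updates_disabled(200, None))  # False
```
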
<h3>Short summary</h3>
<p>Of course, this and the previous blog post related to ZooKeeper and SolrCloud only touch the surface of what happens when the ZooKeeper connection is not available. A very good test that shows data consistency related information can be found at <a href="http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/">http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/</a>. I really recommend it if you would like to know what will happen to SolrCloud in various emergency situations.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2015/06/29/solrcloud-what-happens-when-zookeeper-fails-part-two/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr: data indexing for fun and profit</title>
		<link>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/</link>
					<comments>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 06 Sep 2010 12:10:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[acf]]></category>
		<category><![CDATA[cell]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lcf]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=73</guid>

					<description><![CDATA[Solr is not very friendly to novice users. Preparing good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of]]></description>
										<content:encoded><![CDATA[<p>Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains is to feed our data to the search server and take care of keeping it up to date.</p>
<p><span id="more-73"></span></p>
<p>There are a few ways to import data:</p>
<ul>
<li>Update Handler</li>
<li>CSV Request Handler</li>
<li>Data Import Handler</li>
<li>Extracting Request Handler (Solr Cell)</li>
<li>Client libraries (for example Solrj)</li>
<li>Apache Connector Framework (formerly Lucene Connector Framework)</li>
<li>Apache Nutch</li>
</ul>
<p>In addition to the methods mentioned above, you can stream your data to the search server. As you can see, there is some confusion here, and it is hard to tell at first glance which method is best in a particular case.</p>
<h2>Update Handler</h2>
<p>Perhaps the most popular method, because of its simplicity. It requires preparing a corresponding XML file, which you then send via HTTP to the Solr server. It allows boosting documents and individual fields.</p>
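<p>As an illustration, a minimal update file could look as follows (the field names are hypothetical; the boost attributes show document and field level boosting):</p>

```xml
<!-- Minimal update file for the Update Handler; field names are
     hypothetical. The boost attributes illustrate document and
     field level boosting. -->
<add>
  <doc boost="2.0">
    <field name="id">1</field>
    <field name="title" boost="1.5">Example document</field>
  </doc>
</add>
```

<p>Such a file is then sent to the update handler via an HTTP POST request.</p>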
<h2>CSV Request Handler</h2>
<p>When we have data in CSV (Comma Separated Values) or TSV (Tab Separated Values) format this option may be the most convenient. Unfortunately, in contrast to the Update Handler, it is not possible to boost documents or fields.</p>
<h2>Data Import Handler</h2>
<p>This method is less common and requires additional, sometimes quite complicated, configuration, but it allows connecting directly to the data source. Using DIH we do not need any additional scripts exporting data from a source to the format required by Solr. What we get out of the box is: integration with databases (via JDBC), with sources available as XML (for example RSS), with e-mail (via the IMAP protocol) and with documents that can be parsed by Apache Tika (such as OpenOffice documents, Microsoft Word, RTF, HTML, and many, many more). In addition it is possible to develop your own sources and transformations.</p>
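<p>As a rough illustration, a JDBC-based DIH configuration (a <em>data-config.xml</em> sketch; the driver, connection data, table and column names are all hypothetical) could look like this:</p>

```xml
<!-- data-config.xml sketch: a JDBC source mapped to Solr documents.
     Driver, connection data, table and column names are hypothetical. -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop"
              user="solr" password="secret"/>
  <document>
    <entity name="product" query="SELECT id, name FROM products">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```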
<h2>Extracting Request Handler (Solr Cell)</h2>
<p>A specialized handler for indexing the content of documents stored in files of different formats. The list of supported formats is quite extensive, and the parsing is performed by Apache Tika. The drawbacks of this method are the need to build additional tooling that provides Solr with information about the document and its identifier, and the lack of support for additional metadata external to the document.</p>
<h2>Client Libraries</h2>
<p>Solr provides client libraries for many programming languages. Their capabilities differ, but if the data is generated by the application itself and must become searchable very quickly, this way of indexing is often the only available option.</p>
<h2>Apache Connector Framework</h2>
<p>ACF is a relatively new project, which was revealed to a wider audience in early 2010. It started as an internal project at MetaCarta, was donated to the open source community, and is currently being developed within the Apache Incubator. The idea is to build a system that connects to data sources with the help of a series of plug-ins. At the moment there is no published release, but the system is already worth a look if you need to integrate with systems such as FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Patriarch (Memex), Meridio (Autonomy), Windows shares (Microsoft) and SharePoint (Microsoft).</p>
<h2>Apache Nutch</h2>
<p>Nutch is in fact a separate Apache project (previously under Apache Lucene, now a top-level project). For a Solr user, Nutch is interesting because it can crawl web pages and have them indexed by Solr.</p>
<h2>Word about streaming</h2>
<p>Streaming means the ability to tell Solr where to download the data to be indexed from. This avoids unnecessary data transmission over the network when the data is on the same server as the indexer, or double transmission (from the source to the importer and from the importer to Solr).</p>
<h2>And a word about security</h2>
<p>Solr, by design, is intended to be used in an architecture that assumes a safe environment. It is very important to control who is able to query Solr and how. While the returned data can easily be restricted by forcing the use of filters in the handler definition, in the case of indexing it is not so easy. In particular, the most dangerous part seems to be Solr Cell &#8211; it will not only read any file Solr has access to (e.g. files with passwords), but will also provide a convenient method of searching in those files <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Other options</h2>
<p>I tried to mention all the methods that do not require any additional work to make indexing possible. The problem may be defining that additional work, because sometimes it is easier to write an additional plug-in than to break through numerous configuration options and create a giant XML file. Therefore, the choice of methods was guided by my own judgment, which resulted in skipping some methods (like fetching data from WWW pages with the use of Apache Droids or Heritrix, or solutions based on Open Pipeline or Open Pipe).</p>
<p>Certainly in this short article I managed to miss some interesting methods. If so, please comment &#8211; I&#8217;ll be glad to update this entry <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5 sins of schema.xml modifications</title>
		<link>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/</link>
					<comments>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 Aug 2010 12:08:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[attribute]]></category>
		<category><![CDATA[attributes]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index structure]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[mistake]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=71</guid>

					<description><![CDATA[I made a promise and here it is &#8211; the entry on the most common mistakes when designing Solr index, which is when You create or modify the schema.xml file for Your system implementation. Feel free to read on 😉]]></description>
										<content:encoded><![CDATA[<p>I made a promise and here it is &#8211; the entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the <em>schema.xml</em> file for your system implementation. Feel free to read on <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><span id="more-71"></span></p>
<p>Each of us knows what the schema.xml file is and what it is for (if not, I invite you to read the entry located at: <a href="http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en" target="_blank" rel="noopener noreferrer">http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en</a>). What are the most frequently committed errors when creating or updating this file? I have personally met the following:</p>
<h3>1. Trash in the configuration</h3>
<p>I confess that my first principle is to keep the <em>schema.xml</em> file in the simplest possible form. Linked to this is a very important issue &#8211; this file should not be synonymous with chaos. In other words, do not clutter it with unnecessary comments, unused types, fields and so on. Order in the structure of the <em>schema.xml</em> file not only lets us maintain and modify this file with ease, but also assures us that no unnecessary information will be stored in the Solr index.</p>
<h3>2. Cosmetic changes to the default configuration</h3>
<p>How many of those who use Solr in their daily work took the default <em>schema.xml</em> file supplied with the Solr example implementation and only slightly modified its contents &#8211; for example, changing only the names of the fields? I should raise my hand too, because I did it once. This is a pretty big mistake. Someone may ask why. Are you sure you need English stemming when implementing search for content written in Polish? I think not. The same applies to field and type attributes like term vectors.</p>
<h3>3. No updates</h3>
<p>Sometimes I come across search-based applications where an upgrade of Solr does not mean an update of the <em>schema.xml</em> file. If it is a conscious decision, dictated by, for example, costly or even impossible re-indexing of all data, I understand the situation. But there are cases where an upgrade would bring only benefits, and where its costs would be minimal (e.g. an inexpensive re-index or slight changes in the application). Do not be afraid to update the <em>schema.xml</em> file &#8211; whether it is updating fields, updating types, or adding newer features. A good example is the migration from Solr 1.3 to version 1.4 &#8211; the newer version introduced significant changes to numeric types, where migrating to the new types could greatly increase the performance of queries using those types (such as queries using value ranges).</p>
<h3>4. &#8220;I&#8217;ll use it one day&#8221;</h3>
<p>Adding new types without removing ones that are no longer necessary &#8211; and the same in the case of fields or <em>copyField </em>definitions. Most of us think that an old definition may be useful in the future, but remember that each type costs Solr some extra memory and each field takes up space in the index. My small advice &#8211; if you stop using a type, a field, or anything else in your configuration files (not only <em>schema.xml</em>), simply remove it. Applying this principle throughout the life cycle of an application using Solr will ensure that the index is in optimal condition, and a few months after yet another feature implementation you will not need to dig into the application code to determine whether a field is still used in some forgotten code fragment.</p>
<h3>5. Attributes, attributes and again attributes</h3>
<p>Preserving original values or adding term vectors and their properties are just examples of things we don&#8217;t need in every implementation. Sometimes the index holds more than the application requires. A larger index means lower performance, at least in some cases (e.g. indexing). It is worth considering whether you really need all the information you tell Solr to calculate and store. Removing some information that is unnecessary from our point of view may surprise us. Sometimes it is worth a try ;)</p>
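<p>To make the point concrete, here are two hypothetical field definitions (field names and types are illustrative) &#8211; the second one switches off everything the application does not use:</p>

```xml
<!-- Hypothetical field definitions: enable only what the application uses. -->
<!-- Stored, with term vectors - needed e.g. for highlighting: -->
<field name="title" type="text" indexed="true" stored="true"
       termVectors="true"/>
<!-- Search-only field - no stored value, no term vectors: -->
<field name="keywords" type="text" indexed="true" stored="false"/>
```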
<p>Feel free to comment &#8211; I will eagerly read what else you think we should pay attention to when modifying the schema.xml file.</p>
<p>Finally, I think it is worth mentioning the article <em>&#8220;The Seven Deadly Sins of Solr&#8221;</em> published by LucidImagination at: <a href="http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr" target="_blank" rel="noopener noreferrer">http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr</a>. It describes bad practices when working with Solr. In my opinion it is an interesting read. I highly recommend it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
