Distributed IDF

When Lucene and Solr search through the data, each document is assigned a score that is calculated from query term statistics. When we use SolrCloud and the data in a collection is distributed among multiple shards, the inverse document frequency (IDF) calculation is no longer exact. The problem can be defined as follows: each shard stores its term statistics locally and doesn’t share them with other shards during query execution. Can we do something to get a more precise IDF calculation? Let’s see.
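To make the problem concrete, here is a minimal sketch of how per-shard IDF can drift from the collection-wide value. It assumes the classic TF-IDF style formula, 1 + ln((N + 1) / (df + 1)); the shard names and document counts are hypothetical and chosen only for illustration, and this is not how Solr itself computes or exchanges statistics.

```python
import math


def idf(doc_count, doc_freq):
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical shards: (documents in shard, documents containing the term).
shards = {
    "shard1": (1000, 10),   # the term is rare on this shard
    "shard2": (1000, 400),  # the same term is common on this shard
}


def local_idfs(shards):
    """IDF as each shard would compute it from its own statistics only."""
    return {name: idf(n, df) for name, (n, df) in shards.items()}


def global_idf(shards):
    """IDF computed from statistics summed across all shards."""
    total_docs = sum(n for n, _ in shards.values())
    total_freq = sum(df for _, df in shards.values())
    return idf(total_docs, total_freq)


if __name__ == "__main__":
    for name, value in local_idfs(shards).items():
        print(f"{name}: local idf = {value:.3f}")
    print(f"global idf = {global_idf(shards):.3f}")
```

With these made-up numbers, the same term gets a noticeably higher IDF on the shard where it is rare than on the shard where it is common, so documents that match equally well can receive different scores depending on where they live.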


Solr 8 – ByteBuffersDirectory – quick look

One of the new features introduced in the recently released Solr 8.0 is a new implementation of the Directory interface, one that will replace the non-scalable RAMDirectory. The new implementation, called ByteBuffersDirectory, is dedicated to small, short-lived data that is held only in memory. Let’s have a quick look at the potential use cases, advantages and drawbacks of this new implementation.


SolrCloud – write and read tolerance

SolrCloud, like most distributed systems, is designed with some rules in mind. There are also rules that every distributed system is subject to. For example, the CAP theorem states that a system can’t achieve availability, data consistency and network partition tolerance at the same time; you can have at most two of the three. Of course, in this blog entry we will not be discussing the principles of distributed systems; instead, we will focus on write and read tolerance in SolrCloud.
