document – Solr.pl

Solr 6.5 and large stored fields – quick look

Rafał Kuć — Mon, 01 May 2017 13:14:28 +0000

As we know, Solr has a few caches: filterCache for filters, queryResultCache for query results and, of course, documentCache for caching documents for fast retrieval. Today we will focus on the last one and on what can be done to use it better.

The problem

When documentCache is present in solrconfig.xml, the first time a field is retrieved from Lucene, Solr caches its value along with the document in the documentCache. This can be very expensive, especially for large stored fields. Imagine a situation where your documents were OCRed from a book and you show the content of the pages. If you don’t reuse such data, which means a low hit ratio in the documentCache, Solr produces more garbage and the JVM garbage collector has a harder time cleaning it up. That can lead to higher CPU usage and worse Solr performance in general. Let’s look at what we can do with such large stored fields.

Marking the field as large

Starting with Solr 6.5 we can add a new property to the field definition. It is called large and takes a value of true or false, with false being the default. A field that we want to mark as large has to be defined with stored="true" and multiValued="false". For such a field, setting large="true" in the field definition means its value is not cached inside the documentCache.
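To make the rule above concrete, here is a minimal sketch that renders a field definition and refuses large="true" when the field is not stored or is multi-valued. The LargeFieldDefinition class and the text_general type name are assumptions made for illustration. They are not part of Solr and not taken from the example configuration.

```java
// Hypothetical helper, not part of Solr. It models the rule described above:
// large="true" is only valid for a stored, single-valued field.
final class LargeFieldDefinition {

    // Renders a schema field definition; rejects an invalid large field.
    static String render(String name, String type, boolean stored,
                         boolean multiValued, boolean large) {
        if (large && (!stored || multiValued)) {
            throw new IllegalArgumentException(
                "Field " + name + ": large requires stored=true and multiValued=false");
        }
        return String.format(
            "<field name=\"%s\" type=\"%s\" stored=\"%b\" multiValued=\"%b\"%s/>",
            name, type, stored, multiValued, large ? " large=\"true\"" : "");
    }

    public static void main(String[] args) {
        // text_general is an assumed field type name.
        System.out.println(render("body", "text_general", true, false, true));
    }
}
```

Running it prints a single field line for body with large="true", which is the shape of definition the text describes.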

Noticing the difference

Because this is a quick-look type of post, I don’t want to go into too many specifics, but I would like to compare two collections with the same data. Each collection has the same set of fields:

id – identifier of the document,
name – name of the document,
body – text of the document, which can be very, very large.

One collection will have large="true" set for the body field and the other won’t have that property set. We will also index a few large documents and see how the documentCache behaves.

So here are the commands to set up those two collections using files from the Solr.pl GitHub account (https://github.com/solrpl/). First set up one collection and gather statistics, then remove all the files, restart Solr, create the second collection and gather statistics again. The commands are as follows:

$ mkdir /tmp/solr
$ mkdir /tmp/solr/collection_with_large
$ mkdir /tmp/solr/collection_without_large
$ wget -O /tmp/solr/data.xml https://github.com/solrpl/blog/tree/master/posts/large_field/data.xml
$ wget -O /tmp/solr/collection_with_large/managed-schema https://github.com/solrpl/blog/tree/master/posts/large_field/collection_with_large/managed-schema
$ wget -O /tmp/solr/collection_with_large/solrconfig.xml https://github.com/solrpl/blog/tree/master/posts/large_field/collection_with_large/solrconfig.xml
$ wget -O /tmp/solr/collection_without_large/managed-schema https://github.com/solrpl/blog/tree/master/posts/large_field/collection_without_large/managed-schema
$ wget -O /tmp/solr/collection_without_large/solrconfig.xml https://github.com/solrpl/blog/tree/master/posts/large_field/collection_without_large/solrconfig.xml
$ bin/solr zk upconfig -z localhost:9983 -n config_with_large -d /tmp/solr/collection_with_large
$ bin/solr create_collection -c collection_with_large -n config_with_large -shards 1 -replicationFactor 1
$ curl -XPOST 'localhost:8983/solr/collection_with_large/update?commit=true' -H 'Content-Type:application/xml' --data-binary @/tmp/solr/data.xml
$ curl 'localhost:8983/solr/collection_with_large/select?q=*:*'

And now let’s create the second collection using the downloaded data:

$ bin/solr zk upconfig -z localhost:9983 -n config_without_large -d /tmp/solr/collection_without_large
$ bin/solr create_collection -c collection_without_large -n config_without_large -shards 1 -replicationFactor 1
$ curl -XPOST 'localhost:8983/solr/collection_without_large/update?commit=true' -H 'Content-Type:application/xml' --data-binary @/tmp/solr/data.xml
$ curl 'localhost:8983/solr/collection_without_large/select?q=*:*'

And now, let’s check the documentCache usage that we’ve gathered. So we have this for the collection with the body field marked as large="true":

And we have this for the collection with the body field without the large="true" property:

As you can see, the field marked with large="true" was not put into the documentCache directly, but only as a lazily loaded large field, which is what we were aiming for. This means that we can still use the documentCache and not worry about Solr putting large stored fields there, which was the case in the second example.
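If you want to gather the same kind of statistics yourself, one option is the MBeans handler that Solr exposes under /admin/mbeans. The sketch below only builds the request URL and computes a hit ratio from two counters. The DocumentCacheStats class is hypothetical, the base URL and collection name are taken from the commands above, and the numbers in main are made up for illustration, not measurements from this test.

```java
// Hypothetical helper class. Only the /admin/mbeans handler and its
// stats, cat and key parameters are assumed to come from Solr itself.
final class DocumentCacheStats {

    // Builds the URL of the MBeans handler limited to the documentCache entry.
    static String statsUrl(String solrBaseUrl, String collection) {
        return solrBaseUrl + "/" + collection
            + "/admin/mbeans?stats=true&cat=CACHE&key=documentCache&wt=json";
    }

    // Hit ratio as hits divided by lookups; 0.0 when nothing was looked up yet.
    static double hitRatio(long hits, long lookups) {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }

    public static void main(String[] args) {
        System.out.println(statsUrl("http://localhost:8983/solr", "collection_with_large"));
        // Example numbers only, not results from the article.
        System.out.println(hitRatio(5, 100));
    }
}
```

A low ratio for a cache full of large values is exactly the situation in which marking the field as large is worth considering.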

Document language identification

Rafał Kuć — Mon, 23 Jan 2012 20:59:03 +0000

One of the features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today’s entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.

At the beginning

You should remember that the described functionality was introduced in Solr 3.5.

Assumptions

We will be using two fields to identify the document language: title and body. We want to store the information of the detected language in the lang field.

Index structure

The structure of our index is of course simplified and contains only the fields needed for the test. So the field definition part of the schema.xml file looks like this:

All the fields are marked as stored="true" for simplicity.

Update request processor configuration

In order to use the language identification feature we need to configure a Solr update request processor. We will use the one based on Apache Tika (there is a second implementation based on http://code.google.com/p/language-detection/). To configure the processor we add the following to the solrconfig.xml file:


  
    
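<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,body</str>
      <str name="langid.langField">lang</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>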

Other parameters of the TikaLanguageIdentifierUpdateProcessorFactory are described on the Apache Solr wiki at the following URL: http://wiki.apache.org/solr/LanguageDetection.

Additional libraries

For the update request processor to work we need some additional libraries. From the dist directory of the Apache Solr distribution we copy apache-solr-langid-3.5.0.jar to a directory called, for example, tikaDir, which we create on the same level as the webapps directory. Then we add the following line to the solrconfig.xml file:

The next library we need is the Tika jar with all the goodies (tika-app-1.0.jar), which we can download from http://tika.apache.org/. We place it in the same tikaDir directory and then add the following entry to the solrconfig.xml file:

Test documents

For testing purposes I prepared three documents: the first in English, the second in Polish and the third in German. Their content was taken from Wikipedia. They look as follows:

tika_en.xml



<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Water</field>
    <field name="body">Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.</field>
  </doc>
</add>

tika_pl.xml



<add>
  <doc>
    <field name="id">2</field>
    <field name="title">Woda</field>
    <field name="body">Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.</field>
  </doc>
</add>

tika_de.xml



<add>
  <doc>
    <field name="id">3</field>
    <field name="title">Wasser</field>
    <field name="body">Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.</field>
  </doc>
</add>

More testing

To index the data I used the following shell commands:

curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_pl.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_en.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_de.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary '<commit/>' -H 'Content-type:application/xml'

Note the additional update.chain=langid parameter added to the request. It tells Solr which update request processor chain to use when indexing the data. In this example we told Solr to use the chain we defined.

Indexed data

So let’s have a look at the indexed data. We will do that by running the following query: q=*:*&indent=true.




  0
  0
  
    true
    *:*
  


  
    Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.
    2
    pl
    Woda
  
  
    Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.
    1
    en
    Water
  
  
    Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.
    3
    de
    Wasser

As you can see, Solr, with the use of Tika, was able to identify the languages of the indexed documents. Of course, let’s not be too optimistic: mistakes happen, especially with multi-language documents, but that’s understandable.

To sum up

You should remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the document, the better the detection works. Of course the problem is that we can’t use language identification at query time, but this is not only a problem of Solr and Tika. You can deal with that by identifying your user, their web browser or the place they are located in.
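One way to act on the browser hint is to read the Accept-Language header sent with the request and turn it into a filter on the lang field. The sketch below uses only the JDK (Locale.LanguageRange and Locale.lookupTag). The QueryLanguageFilter class and the list of indexed languages are assumptions for illustration, and how you pass the resulting filter to Solr (for example as an fq parameter) is up to your application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: maps an Accept-Language header to a filter on the lang field.
final class QueryLanguageFilter {

    // Languages assumed to be present in the lang field of the index.
    static final List<String> INDEXED = Arrays.asList("en", "pl", "de");

    // Returns a filter such as lang:pl, or null when no indexed language matches.
    static String fromAcceptLanguage(String header) {
        if (header == null || header.trim().isEmpty()) {
            return null;
        }
        // Ranges come back ordered by their q weight, highest first.
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
        String tag = Locale.lookupTag(ranges, INDEXED);
        return tag == null ? null : "lang:" + tag;
    }

    public static void main(String[] args) {
        System.out.println(fromAcceptLanguage("pl-PL,pl;q=0.9,en;q=0.8"));
    }
}
```

When the header names no language that exists in the index the method returns null, and the application can simply skip the filter.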

Solr 4.0: DocTransformers first look

Rafał Kuć — Mon, 05 Dec 2011 20:55:51 +0000

In today’s entry we will look at another feature that will come with version 4.0 of Apache Solr: the ability to modify the fields in the Solr result list.

Do I need it?

Until now we didn’t have much choice when it came to the results returned by Solr. When Solr 4.0 is released we will be given a new tool, the so-called DocTransformers. This feature enables us to modify the fields of the documents returned in the search results. Looking at what is available now, we can for example change the names of the returned fields or mark the documents that were added by the QueryElevationComponent. Right now there are only a few implementations, but implementing your own DocTransformer is not hard.

What is already available?

At the exact moment we are writing this, the following transformers are available:

One that enables you to mark the documents that were added by the QueryElevationComponent.
One that enables you to add the explain information to the document.
One that enables you to add static value as a field of the document.
One that enables you to add the id of the shard from which the document was fetched.
One that enables you to add the docid as the document field (identifier used by Lucene).

How to use DocTransformers?

Let’s look at how to use DocTransformers. To do that I downloaded the trunk version of Apache Solr (4.0) from the svn repository and ran the example deployment. Next, I indexed the example data and ran the following query:

http://localhost:8983/solr/select?q=encoded&fl=name,score,[docid],[explain]

If you look at the fl parameter you will notice that we asked Solr for the name field, the score of the document and two DocTransformers: [docid] and [explain]. As a result I got the following XML:
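When such a request is built in code rather than typed into a browser, the square brackets and commas in the fl parameter should be URL-encoded. A minimal sketch using only the JDK follows. The TransformerQueryUrl class is hypothetical, and the base URL is the one from the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a select URL whose fl list may include DocTransformers.
final class TransformerQueryUrl {

    // Encodes a single parameter value as UTF-8 form data.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    // Joins the fields with commas and appends them as the fl parameter.
    static String build(String solrBaseUrl, String query, String... fields) {
        return solrBaseUrl + "/select?q=" + enc(query)
            + "&fl=" + enc(String.join(",", fields));
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "encoded",
            "name", "score", "[docid]", "[explain]"));
    }
}
```

The encoded form is equivalent to the query shown above. Solr decodes %5B, %5D and %2C back to brackets and commas before parsing the field list.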



 
  0
  2
  
    encoded
    name,score,[docid],[explain]
  
 
 
 
  Test with some GB18030 encoded characters
  0.50524884
  0
  
  0.50524884 = (MATCH) weight(text:encoded in 0) [DefaultSimilarity], result of:
    0.50524884 = score(doc=0,freq=1.0 = termFreq=1), product of:
      1.0000001 = queryWeight, product of:
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.3092536 = queryNorm
      0.5052488 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.15625 = fieldNorm(doc=0)
  
 
 
  Test with some UTF-8 encoded characters
  0.4041991
  25
  
  0.4041991 = (MATCH) weight(text:encoded in 25) [DefaultSimilarity], result of:
    0.4041991 = score(doc=25,freq=1.0 = termFreq=1), product of:
      1.0000001 = queryWeight, product of:
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.3092536 = queryNorm
      0.40419903 = fieldWeight in 25, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.125 = fieldNorm(doc=25)

As you can see, Solr did what we asked for.

Your own implementation

Let’s discuss how to implement your own DocTransformer. Below you have an example class named RenameFieldsTransformer from the org.apache.solr.response.transform package in the Apache Solr source code. In general, all you have to do is override the following two methods of the DocTransformer class from the same package:

String getName() – method returning the transformer’s name,
void transform(SolrDocument doc, int docid) – method which performs the actual transformation.

Implementation looks like this:

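package org.apache.solr.response.transform;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;

public class RenameFieldsTransformer extends DocTransformer {
 // Pairs of (current field name, new field name).
 final NamedList<String> rename;

 public RenameFieldsTransformer( NamedList<String> rename ) {
  this.rename = rename;
 }

 @Override
 public String getName() {
  // Builds a name such as Rename[old>>new,old2>>new2].
  StringBuilder str = new StringBuilder();
  str.append( "Rename[" );
  for( int i=0; i< rename.size(); i++ ) {
   if( i > 0 ) {
    str.append( "," );
   }
   str.append( rename.getName(i) ).append( ">>" ).append( rename.getVal( i ) );
  }
  str.append( "]" );
  return str.toString();
 }

 @Override
 public void transform(SolrDocument doc, int docid) {
  // Loop body reconstructed from the description below: move each value
  // from its old field name to the new one.
  for( int i=0; i<rename.size(); i++ ) {
   Object v = doc.remove( rename.getName(i) );
   if( v != null ) {
    doc.setField( rename.getVal(i), v );
   }
  }
 }
}
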
The code shown above enables us to rename the fields returned in the results. As you can see, the transform method iterates through all the entries in the rename class variable. It consists of name-value pairs: the field name and the name it should have after the transformation. Remember that in order to use your own transformer you need to add its configuration to the solrconfig.xml file. Here is the example which can be found on the Solr wiki page:


To sum up

You should remember that the described functionality is marked as experimental and its behavior can change by the time Lucene and Solr 4.0 are released. We will get back to this topic as soon as Solr 4.0 is released.

Optimization – document cache

Rafał Kuć — Mon, 29 Aug 2011 19:48:04 +0000

A few months ago we looked at filterCache. I’ve decided to continue the optimization topic and take a look at the documentCache.

What does it contain?

So let’s start with what documentCache holds. It contains Lucene documents that were fetched from the index. So little and so much.

What is it used for?

Every object (Lucene document) stored in documentCache contains a list of references to the fields that are stored with the document. Thanks to this, once a document is fetched and put into the cache, it doesn’t have to be fetched again while processing another query. This is why the number of I/O operations is reduced when rendering the query results list.

What to remember when using documentCache?

When using documentCache you have to remember about two important things:

documentCache can’t be autowarmed because it operates on identifiers that change after every commit operation.
If you use lazy field loading (enableLazyFieldLoading=true), documentCache functionality is somewhat limited. The document stored in the documentCache will contain only those fields that were passed to the fl parameter. If the next query tries to get additional fields for a document stored in the cache, those additional fields will be fetched from the index.
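The effect of lazy field loading can be pictured with a toy model. The class below is not Solr’s implementation. It is a hypothetical in-memory cache that keeps only the fields that were asked for and goes back to the "index" for any field it has not seen yet. Keeping the extra field in the cached document after it has been loaded is an assumption of this model.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only, not Solr code: a document cache that loads fields lazily.
final class LazyDocumentCache {

    // Stands in for the stored fields kept in the index.
    private final Map<Integer, Map<String, String>> index;
    // Cached documents, each holding only the fields requested so far.
    private final Map<Integer, Map<String, String>> cache = new HashMap<>();
    // Number of field values that had to be read from the index.
    int indexReads = 0;

    LazyDocumentCache(Map<Integer, Map<String, String>> index) {
        this.index = index;
    }

    // Returns the requested fields, reading from the index only what is missing.
    Map<String, String> get(int docId, String... fields) {
        Map<String, String> cached = cache.computeIfAbsent(docId, id -> new HashMap<>());
        Map<String, String> result = new HashMap<>();
        for (String field : fields) {
            if (!cached.containsKey(field)) {
                cached.put(field, index.get(docId).get(field));
                indexReads++;
            }
            result.put(field, cached.get(field));
        }
        return result;
    }
}
```

Asking twice for the same fields costs index reads only the first time, while asking for one additional field later costs exactly one more read.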

Definition

The standard documentCache definition looks like this:

Let’s recall those parameters:

class – class implementing the cache,
size – the maximum cache size,
initialSize – initial size of the cache.

How to configure?

The usual question about a cache: what size should I set? According to the Solr wiki (http://wiki.apache.org/solr/SolrCaching#documentCache), the maximum size shouldn’t be less than the product of the number of concurrent queries and the maximum number of documents fetched by a query. This simple relation should ensure that Solr won’t have to re-fetch documents from the index while processing a query.
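The rule of thumb above is plain multiplication, which the following sketch spells out. The DocumentCacheSizing class and the traffic numbers in main are assumptions for illustration. Use your own peak concurrency and your largest rows value.

```java
// Hypothetical helper: lower bound for the documentCache size parameter.
final class DocumentCacheSizing {

    // Minimum size = concurrent queries * maximum documents fetched per query.
    static long minimumSize(int concurrentQueries, int maxDocsPerQuery) {
        if (concurrentQueries < 0 || maxDocsPerQuery < 0) {
            throw new IllegalArgumentException("Both values must be non-negative");
        }
        return (long) concurrentQueries * maxDocsPerQuery;
    }

    public static void main(String[] args) {
        // Assumed example: 20 concurrent queries, each fetching at most 50 documents.
        System.out.println(minimumSize(20, 50));
    }
}
```

With these example numbers the size should be at least 1000 entries. Anything smaller means documents may be evicted and read again within a single request.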

Last few words

In the case of documentCache we don’t have to worry about how we construct our queries to use the cache properly. But please remember that documentCache requires memory: the more fields you store in the index, the more memory it needs.