Solr 4.1: Stored fields compression

Despite the fact that Lucene and Solr 4.0 are still very fresh, we decided it's time to take a look at the changes coming in the 4.1 version. One of those changes is stored fields compression, which decreases the size of the index when we use such fields. Let's take a look and see how it works.

Some theory

If our index contains many stored fields, they can consume most of its space compared to the other information in the index. How do we know how much space the stored fields take? It's easy – just go to the directory that holds your index and check how much space the files with the .fdt extension take. Although stored fields don't influence search performance directly, your I/O subsystem and its cache can be forced to work much harder because of the larger amount of data on disk. Because of that, your queries can take longer to execute and you may need more time to index your data.
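
If you prefer to check it programmatically, here is a minimal sketch that sums up the .fdt files in an index directory. The path is just an example – for Solr it is usually the core's data/index directory:

import java.io.File;

public class StoredFieldsSize {
  public static void main(String[] args) {
    // Example path – point this at your index directory
    File indexDir = new File("/home/data/index");
    long total = 0;
    for (File f : indexDir.listFiles()) {
      // .fdt files hold the stored fields data
      if (f.getName().endsWith(".fdt")) {
        System.out.println(f.getName() + ": " + f.length() + " bytes");
        total += f.length();
      }
    }
    System.out.println("Stored fields take " + total + " bytes");
  }
}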

With the incoming release of Lucene 4.1, stored fields will be compressed using the LZ4 algorithm (http://code.google.com/p/lz4/), which should decrease the size of the index when we use a high number of stored fields, while not being CPU-demanding when it comes to compression and decompression.
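
It's worth stressing that the compression is handled entirely inside the codec, so nothing changes in how you store or retrieve fields. A minimal sketch of reading a stored field from such an index – the path and field name here are just examples:

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class ReadStoredFields {
  public static void main(String[] args) throws Exception {
    // Open an existing index; the codec recorded in the segments takes
    // care of LZ4 decompression of stored fields behind the scenes
    DirectoryReader reader = DirectoryReader.open(
        FSDirectory.open(new File("/home/data/index")));
    Document doc = reader.document(0);     // fetch the first document
    System.out.println(doc.get("title")); // stored field access, API unchanged
    reader.close();
  }
}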

Test data

To test the discussed functionality we used Polish Wikipedia articles data from 2012.11.10 (http://dumps.wikimedia.org/plwiki/20121110/plwiki-20121110-pages-articles.xml.bz2). The unpacked XML file was about 4.7GB on disk.

Index structure

We’ve used the following index structure to index the above data:

<field name="id" type="string"  indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userId" type="int" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

DIH configuration

We’ve used the following DIH configuration in order to index Wikipedia data:

<dataConfig>
 <dataSource type="FileDataSource" encoding="UTF-8" />
 <document>
  <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/home/data/wikipedia/plwiki-20121110-pages-articles.xml" transformer="RegexTransformer,DateFormatTransformer">
   <field column="id" xpath="/mediawiki/page/id" />
   <field column="title" xpath="/mediawiki/page/title" />
   <field column="revision" xpath="/mediawiki/page/revision/id" />
   <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
   <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
   <field column="text" xpath="/mediawiki/page/revision/text" />
   <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
   <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
  </entity>
 </document>
</dataConfig>

Indexing time

In both cases indexing time was very similar for the same number of documents (there were 1,301,394 documents after indexing). With Solr 4.0, indexing took 14 minutes and 33 seconds. With Solr 4.1, indexing took 14 minutes and 43 seconds. As you can see, Solr 4.1 was slightly slower, but because I ran the tests on my laptop, we can assume that indexing performance is practically the same.

Index size

The size of the index is what interests us most in this case. The index created from the Wikipedia data by Solr 4.0 was about 5.1GB (5,464,809,863 bytes). The index created by Solr 4.1 weighed approximately 3.24GB (3,480,457,399 bytes). So, comparing the index created by Solr 4.0 to the one created by Solr 4.1, we got an index about 36% smaller: (5,464,809,863 - 3,480,457,399) / 5,464,809,863 ≈ 0.36.

Wrapping up

You can clearly see that the gain from compressing stored fields is quite big. Although we need additional CPU cycles to handle the compression, we benefit from lower I/O subsystem pressure, and the gain is usually greater than the loss of a few CPU cycles. After seeing this, I'm not surprised that stored fields compression is turned on by default in Lucene 4.1, and thus in Solr 4.1 too. However, if you would like to turn that behavior off, you'll need to implement your own codec – one that doesn't use compression – at least for now. You don't need to fork the Lucene code to do that, which again shows how powerful flexible indexing is.

13 Responses to “Solr 4.1: Stored fields compression”

  1. Jeffery Yuan Says:

    Thanks for introducing this new feature in 4.1.

    I downloaded 4.1, but it seems the index size is similar to 3.6 – 11.6 GB.

    So I am guessing that in order to use fields compression, I need to configure Solr to make it use this feature, is that right? If so, can you provide us an example configuration?

    Thanks very much :)

  2. gr0 Says:

    I suppose that is because you are still using the Lucene 4.0 index format. Look at your solrconfig.xml file and find the following:

    <luceneMatchVersion>LUCENE_40</luceneMatchVersion>

    And change it to:

    <luceneMatchVersion>LUCENE_41</luceneMatchVersion>

  3. Jeffery Yuan Says:

    Thanks for your reply. I checked my solrconfig.xml file; the value of luceneMatchVersion is LUCENE_41.

    I also checked https://issues.apache.org/jira/browse/SOLR-3927; it seems CompressingStoredFieldsFormat is already set as the default, so no change in solrconfig.xml should be needed.

    The index size of 69,696 emails decreased from 5.83 GB in 3.6 to 5.26 GB in Solr 4.1 – only 9.7% smaller.

    I also found some strange files in Solr 4.1: files named like _4e_Lucene41_0.doc – 24 files, totaling 239 MB.

    Do you have any idea what these files are for?

    Thanks a lot :)

  4. gr0 Says:

    Yep, it's turned on by default, but only starting from Lucene 4.1 – that's why I asked about the match version in solrconfig.xml.

    As for the 10% smaller index – that depends on how many stored fields you have and how much data is in them. If you don't have much data in stored fields, there will be minimal difference between the sizes of the 3.6 and 4.1 indices.

    As for the files – those are files written with the Lucene41 codec.

  5. Jeffery Yuan Says:

    Thanks and sorry to ask you another question :)

    https://issues.apache.org/jira/browse/LUCENE-4226
    It mentions that we can set the compression mode:
    FAST, HIGH_COMPRESSION, FAST_DECOMPRESSION.

    We can also see these three modes in CompressingStoredFieldsFormat and CompressionMode.

    How can we set the compression mode in Solr 4.1? I want to set the mode to HIGH_COMPRESSION.

  6. Adrien Grand Says:

    @gr0: Nice post! I’m very happy that people find stored fields compression useful!

    @Jeffery: To do this, you need to create a custom codec. Here is how you could do it with the current state of Lucene trunk (you might need to modify the code a bit for the release):

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.StoredFieldsFormat;
    import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
    import org.apache.lucene.codecs.compressing.CompressionMode;
    import org.apache.lucene.codecs.lucene41.Lucene41Codec;

    public class MyCustomCodec extends FilterCodec {

      private static final String CODEC_NAME = "MyCustomCodec";
      private static final Codec DELEGATE = new Lucene41Codec();
      private static final String STORED_FIELDS_FORMAT_NAME = "MyCustomStoredFields";
      private static final CompressionMode COMPRESSION_MODE = CompressionMode.HIGH_COMPRESSION;
      private static final int CHUNK_SIZE = 1 << 16;

      private final StoredFieldsFormat storedFieldsFormat;

      public MyCustomCodec() {
        super(CODEC_NAME, DELEGATE);
        // Same codec as the default, but with HIGH_COMPRESSION stored fields
        this.storedFieldsFormat = new CompressingStoredFieldsFormat(
            STORED_FIELDS_FORMAT_NAME,
            COMPRESSION_MODE,
            CHUNK_SIZE);
      }

      @Override
      public StoredFieldsFormat storedFieldsFormat() {
        return storedFieldsFormat;
      }
    }

    But please note that CompressingStoredFieldsFormat is experimental and might change in incompatible ways in the next releases.
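
    To actually plug such a codec into Solr, one possible approach – a sketch assuming Solr 4.1's CodecFactory plugin point, with hypothetical class and package names – is a small factory plus a codecFactory entry in solrconfig.xml. The codec also has to be registered with Lucene's SPI (a META-INF/services/org.apache.lucene.codecs.Codec file listing the class) so it can be resolved by name when the index is opened:

    import org.apache.lucene.codecs.Codec;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.core.CodecFactory;

    // Hypothetical factory that returns the custom codec shown above
    public class MyCustomCodecFactory extends CodecFactory {

      @Override
      public void init(NamedList args) {
        // no configuration needed for this sketch
      }

      @Override
      public Codec getCodec() {
        return new MyCustomCodec();
      }
    }

    And in solrconfig.xml:

    <codecFactory class="com.example.MyCustomCodecFactory"/>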

  7. gr0 Says:

    @Adrien – thanks :)

  8. Rene Says:

    If we want to use the 4.1 compression, do we have to re-index the existing 4.0 indexes?

  9. gr0 Says:

    No, you don't. The transition to the new stored fields format will happen over time, as your segments get merged. However, if your index doesn't change often, the best way would be to reindex your data.
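
    If you don't want to wait for the merges to happen naturally, a forced merge (optimize) rewrites the segments with the currently configured codec. A minimal SolrJ sketch – the URL is just an example, and the index needs more than one segment for anything to actually be rewritten:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class ForceCodecRewrite {
      public static void main(String[] args) throws Exception {
        // Example URL – point this at your Solr core
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // optimize() merges segments down to one, rewriting stored
        // fields with the codec currently in use (Lucene 4.1 by default)
        server.optimize();
        server.shutdown();
      }
    }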

  10. Jilles Van Gurp Says:

    We experimented with compressing things manually in Solr 3.x some time ago. We were basically using Solr as a key-value store for big blobs of XML; there were only a few fields: id, timestamp, and blob.

    Using gzip compression on the blob, our index size was about 200GB, while the raw XML dump we indexed from was around 50GB (gzip compressed). Each blob was around 5KB on average (uncompressed). So there is a bit of a problem here, as you can see: simply compressing each blob isn't nearly as efficient as one would hope.

    The reason is that the dictionary gzip uses for compression is specific to each blob and gets stored along with it.

    Java allows you to use a custom dictionary, so we generated our own based on frequency counts for things like tag names, namespace URIs, and value strings. This cut our index size nearly in half, to about 100GB, relative to simply gzipping each blob. That's still substantial, but it includes the indexes, and it got us into a pretty good state.

    Maybe in a future version a more pluggable solution could be provided, so that people can customize how their data is compressed?
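
    For anyone who wants to try the custom dictionary technique Jilles describes, here is a minimal java.util.zip sketch – the dictionary below is made up; in practice you would build it from frequency counts over your corpus:

    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class DictionaryCompression {
      public static void main(String[] args) throws Exception {
        // Made-up shared dictionary: frequent tag names, namespace URIs, etc.
        byte[] dictionary = "<entry><id></id><title></title>http://example.com/ns".getBytes("UTF-8");
        byte[] blob = "<entry><id>42</id><title>hello</title></entry>".getBytes("UTF-8");

        // Compress with the shared dictionary, so it is not stored with each blob
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary);
        deflater.setInput(blob);
        deflater.finish();
        byte[] compressed = new byte[blob.length * 2 + 64];
        int compressedLength = deflater.deflate(compressed);
        deflater.end();

        // Decompress: the inflater signals that it needs the same dictionary
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLength);
        byte[] restored = new byte[blob.length];
        int n = inflater.inflate(restored);
        if (n == 0 && inflater.needsDictionary()) {
          inflater.setDictionary(dictionary);
          n = inflater.inflate(restored);
        }
        inflater.end();
        System.out.println(new String(restored, 0, n, "UTF-8"));
      }
    }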

  11. gr0 Says:

    Actually, it is a pluggable solution – stored fields compression is part of the new, default Lucene 4.1 codec. So if you want to change it, you can develop your own codec and use it for indexing. Of course it would be perfect if we could just plug in a new compression method without developing a whole codec, but maybe that will come in the future.

  12. Rahul Says:

    I am using DSE 3.0, with a column name="DateTime" type="timestamp" in a Cassandra 1.1.9.1 column family. I have to index it in Solr and want to run range queries (fq) on DateTime, like this: DateTime:[2011-04-12 TO 2011-05-23]. Do you have any idea how it should be indexed in Solr? Please help me.

  13. gr0 Says:

    Use the date type to index it, like the one used in the example schema provided with Solr. For illustration, the snippet below shows the stock TrieDateField definition, plus a hypothetical field declaration for your column:
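
    <!-- stock TrieDateField definition from the Solr 4.x example schema,
         plus a hypothetical field declaration for DateTime -->
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <field name="DateTime" type="date" indexed="true" stored="true"/>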

    However, please remember that you’ll need to provide the date like this:

    DateTime:[2011-04-12T00:00:00.000Z TO 2011-05-24T00:00:00Z]

    You can find more information here: http://wiki.apache.org/solr/SolrQuerySyntax