Solr 4.1: Stored fields compression

Although Lucene and Solr 4.0 are still very fresh, we decided that it’s time to take a look at the changes coming in version 4.1. One of those changes is stored fields compression, which decreases the size of the index when we use such fields. Let’s take a look and see how that works.

Some theory

If our index contains many stored fields, they can take up most of the space compared to the other information in the index. How do we know how much space the stored fields take? It’s easy: just go to the directory that holds your index and check how much space the files with the .fdt extension take. Although stored fields don’t influence search performance directly, your I/O subsystem and its cache can be forced to work much harder because of the larger amount of data on disk. Because of that, your queries can take longer to execute and you may need more time to index your data.
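The check described above can also be scripted. The following is only a minimal sketch: the function name `stored_fields_usage` is made up for this illustration, and it assumes that all index files sit directly in a single directory.

```python
from pathlib import Path


def stored_fields_usage(index_dir):
    """Return (fdt_bytes, total_bytes) for the files directly inside index_dir.

    Hypothetical helper for illustration: .fdt files hold the stored field
    data, so their combined size is compared with the size of all index files.
    """
    files = [p for p in Path(index_dir).iterdir() if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    fdt_bytes = sum(p.stat().st_size for p in files if p.suffix == ".fdt")
    return fdt_bytes, total_bytes
```

Calling it with the path to your own index directory returns both numbers, so the share of stored fields is simply `fdt_bytes / total_bytes`.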

With the upcoming release of Lucene 4.1, stored fields will be compressed using the LZ4 algorithm (http://code.google.com/p/lz4/), which should decrease the size of the index when we use a large number of stored fields, without being CPU demanding when it comes to compression and decompression.

Test data

To test the discussed functionality we used the Polish Wikipedia articles dump from 2012.11.10 (http://dumps.wikimedia.org/plwiki/20121110/plwiki-20121110-pages-articles.xml.bz2). The unpacked XML file was about 4.7GB on disk.

Index structure

We’ve used the following index structure to index the above data:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userId" type="int" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

DIH configuration

We’ve used the following DIH configuration in order to index Wikipedia data:

<dataConfig>
 <dataSource type="FileDataSource" encoding="UTF-8" />
 <document>
  <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/home/data/wikipedia/plwiki-20121110-pages-articles.xml" transformer="RegexTransformer,DateFormatTransformer">
   <field column="id" xpath="/mediawiki/page/id" />
   <field column="title" xpath="/mediawiki/page/title" />
   <field column="revision" xpath="/mediawiki/page/revision/id" />
   <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
   <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
   <field column="text" xpath="/mediawiki/page/revision/text" />
   <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
   <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
  </entity>
 </document>
</dataConfig>
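The $skipDoc line in the configuration above drops redirect pages. As a rough illustration of which page texts that pattern catches, here is a small sketch that uses Python’s `re` module as a stand-in for the Java regular expressions used by DIH. The helper name `is_redirect` is hypothetical, and the sketch assumes that the transformer looks for the pattern anchored at the start of the text.

```python
import re

# Same pattern as in the $skipDoc field of the DIH configuration.
REDIRECT_PATTERN = re.compile(r"^#REDIRECT .*")


def is_redirect(page_text):
    """Return True when the page text starts with a redirect marker.

    Hypothetical helper: it only mimics the pattern check, not DIH itself.
    """
    return REDIRECT_PATTERN.search(page_text) is not None


print(is_redirect("#REDIRECT [[Polska]]"))         # redirect page, skipped
print(is_redirect("Polska to kraj w Europie."))    # regular article, indexed
```

Note that the pattern is case sensitive and anchored, so a lowercase marker or a marker that appears later in the text would not cause the page to be skipped.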

Indexing time

In both cases the indexing time was very similar for the same number of documents (there were 1,301,394 documents after indexing). With Solr 4.0, indexing took 14 minutes and 33 seconds. With Solr 4.1, it took 14 minutes and 43 seconds. As you can see, Solr 4.1 was slightly slower, but because I ran the tests on my laptop, we can assume that the indexing performance is practically the same.
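To put the two timings side by side, the sketch below converts them to seconds and computes the relative difference and the approximate throughput. The numbers are the ones reported above; the helper name `to_seconds` is made up for this example.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


DOCUMENTS = 1_301_394              # documents in the index after indexing
solr_40_time = to_seconds(14, 33)  # Solr 4.0 indexing time
solr_41_time = to_seconds(14, 43)  # Solr 4.1 indexing time

# Relative slowdown of Solr 4.1 compared to Solr 4.0, in percent.
slowdown_percent = (solr_41_time - solr_40_time) / solr_40_time * 100

print(f"Solr 4.0: {solr_40_time} s, {DOCUMENTS / solr_40_time:.0f} docs/s")
print(f"Solr 4.1: {solr_41_time} s, {DOCUMENTS / solr_41_time:.0f} docs/s")
print(f"Difference: {slowdown_percent:.2f}%")
```

A difference of 10 seconds on a run of more than 14 minutes is close to 1%, which is well within what a single run on a laptop can vary by.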

Index size

The size of the index is what interests us the most in this case. With Solr 4.0, the index created from the Wikipedia data was about 5.1GB (5,464,809,863 bytes). With Solr 4.1, the index weighed approximately 3.24GB (3,480,457,399 bytes). So compared to the index created by Solr 4.0, the one created by Solr 4.1 is roughly 36% smaller.
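The reduction can be recomputed directly from the two byte counts. This is just a small sketch of the arithmetic; the function names are made up for this example, and the gigabyte values are calculated with 1GB taken as 1024^3 bytes, which is what matches the sizes quoted above.

```python
SOLR_40_INDEX_BYTES = 5_464_809_863
SOLR_41_INDEX_BYTES = 3_480_457_399


def to_gb(size_in_bytes):
    """Convert bytes to gigabytes, assuming 1GB = 1024^3 bytes."""
    return size_in_bytes / 1024 ** 3


def reduction_percent(before, after):
    """Return how much smaller 'after' is than 'before', in percent."""
    return (before - after) / before * 100


print(f"Solr 4.0 index: {to_gb(SOLR_40_INDEX_BYTES):.2f}GB")
print(f"Solr 4.1 index: {to_gb(SOLR_41_INDEX_BYTES):.2f}GB")
print(f"Reduction: {reduction_percent(SOLR_40_INDEX_BYTES, SOLR_41_INDEX_BYTES):.1f}%")
```

Keep in mind that this is the reduction of the whole index, not only of the stored fields files, so the .fdt files themselves shrank by a larger factor than the overall number suggests.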

Wrapping up

You can clearly see that the gain from compressing stored fields is quite big. Although we need additional CPU cycles to handle compression, we benefit from less pressure on the I/O subsystem, and we can expect that gain to outweigh the loss of a few CPU cycles. After seeing this, I’m not surprised that stored fields compression is turned on by default in Lucene 4.1 and thus in Solr 4.1 too. If you would like to turn that behavior off, you’ll need to implement your own codec, one that doesn’t use compression, at least for now. However, you don’t need to fork the Lucene code to do that, and this again shows how powerful flexible indexing is.

Solr.pl