CheckIndex for the rescue

When using Lucene and Solr, we are used to very high reliability from these products. However, the day may come when Solr informs us that our index is corrupted and we need to do something about it. Is restoring the index from a backup or doing a full reindex the only way to repair it? Not quite: there is hope in the form of the CheckIndex tool.

What is CheckIndex?

CheckIndex is a tool available in the Lucene library that checks the index files and can write a new segments file that no longer references the problematic segments. This means that the tool can repair a broken index at the cost of some data loss, and thus save us from having to restore the index from a backup (if we have one, of course) or fully reindex all the documents that were stored in Solr.

Where do I start?

Please note that, according to the Javadocs, this tool is experimental and may change in the future. Therefore, before working with it, we should create a copy of the index. It is also worth knowing that the tool analyzes the index byte by byte, so for large indexes the analysis and repair may take a long time. It is important not to run the tool with the -fix option while the index is in use by Solr or another application based on the Lucene library. Finally, be aware that running the tool in repair mode may remove some or all of the documents stored in the index.
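
The advice to copy the index first can be scripted. Below is a minimal Python sketch of that step. The function name `backup_index` and the timestamped folder naming are assumptions made for illustration, not part of Lucene or Solr, and the copy should be taken while nothing is writing to the index.

```python
import shutil
import time
from pathlib import Path


def backup_index(index_dir, backup_root):
    """Copy the whole index directory into a new timestamped folder.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    source = Path(index_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"index directory not found: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-backup-{stamp}"
    # copytree refuses to overwrite an existing directory, so an earlier
    # backup is never silently replaced.
    shutil.copytree(source, target)
    return target
```

Calling `backup_index("E:/Solr/solr/data/index", "E:/Solr/backups")` (the backup location is a made-up path) would return the path of the new copy, which can be put back if a repair removes more than expected.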

How to run it?

To run the utility, go to the directory where the Lucene library files are located and run the following command:

java -cp LUCENE_CORE_JAR -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex INDEX_PATH -fix

In my case, it looked as follows:

java -cp lucene-core-2.9.3.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex E:\Solr\solr\data\index\ -fix

After a while I got the following information:

Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [15 fields]
test: field norms.........OK [15 fields]
test: terms, freq, prox...OK [900 terms; 1517 terms/docs pairs; 1707 tokens]
test: stored fields.......OK [232 total field count; avg 12,211 fields per doc]
test: term vectors........OK [3 total vector count; avg 0,158 term/freq vector fields per doc]

No problems were detected with this index.

This means that the index is correct and no corrective action was needed. As a bonus, you can learn some interesting things about the index ;)
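
The invocation and the success message above can also be handled by a script. The Python sketch below builds the same argument list and looks for the final summary line in the captured output. The helper names `build_checkindex_command`, `index_is_clean` and `run_check` are hypothetical, and the success message is taken only from the listing in this post (Lucene 2.9.3), so other versions may word it differently. Leaving out -fix makes the tool report problems without writing a new segments file.

```python
import subprocess

CHECKINDEX_CLASS = "org.apache.lucene.index.CheckIndex"
# Summary line as printed in the listing above (Lucene 2.9.3).
CLEAN_MESSAGE = "No problems were detected with this index."


def build_checkindex_command(lucene_jar, index_path, fix=False):
    """Return the argument list for the invocation shown in this post."""
    command = [
        "java",
        "-cp", lucene_jar,
        "-ea:org.apache.lucene...",  # enable assertions in Lucene packages
        CHECKINDEX_CLASS,
        index_path,
    ]
    if fix:
        command.append("-fix")
    return command


def index_is_clean(output):
    """True if the captured CheckIndex output contains the success line."""
    return any(line.strip() == CLEAN_MESSAGE for line in output.splitlines())


def run_check(lucene_jar, index_path):
    """Run a check-only pass and report whether the index looks healthy."""
    command = build_checkindex_command(lucene_jar, index_path, fix=False)
    result = subprocess.run(command, capture_output=True, text=True)
    return index_is_clean(result.stdout)
```

Running the check-only pass first and adding -fix only after reading the report keeps the destructive step a deliberate decision.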

Broken index

But what happens when the index is broken? There is only one way to find out: let’s try. I deliberately damaged one of the index files and ran the CheckIndex tool again. The following appeared on the console:

Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file "_0.fnm": read 150 vs size 152
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:370)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:119)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 19 documents) detected
WARNING: 19 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 19 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
5...
4...
3...
2...
1...
Writing...
OK
Wrote new segments file "segments_3"

As you can see, all 19 documents that were in the index have been removed. This is an extreme case, but you should be aware that the tool can behave this way.
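
The two WARNING lines in the output above carry the numbers worth checking before any repair. The Python sketch below pulls them out of captured output. The function name `summarize_damage` is hypothetical, and the patterns are assumptions based only on the message wording shown in this post for Lucene 2.9.3, so other versions may phrase these lines differently.

```python
import re

# Patterns copied from the wording of the console output in this post.
BROKEN_RE = re.compile(
    r"WARNING: (\d+) broken segments \(containing (\d+) documents\) detected"
)
LOST_RE = re.compile(r"WARNING: (\d+) documents will be lost")


def summarize_damage(output):
    """Extract broken-segment and lost-document counts; None where absent."""
    broken = BROKEN_RE.search(output)
    lost = LOST_RE.search(output)
    return {
        "broken_segments": int(broken.group(1)) if broken else None,
        "documents_in_broken_segments": int(broken.group(2)) if broken else None,
        "documents_lost": int(lost.group(1)) if lost else None,
    }
```

A script could compare `documents_lost` with the total number of documents it expects in the index and decide that restoring from a backup is the better option.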

The end

If you keep in mind the basic assumptions behind the CheckIndex tool, you may one day find yourself in a situation where it comes in handy, and you will not have to ask yourself “When was the last backup made?”.

This entry was posted on Monday, January 17th, 2011 at 07:32 and is filed under About Solr.

3 Responses to “CheckIndex for the rescue”

  1. Noman Says:

    hi
    Very informative post about Lucene. But what happens if we use the RAMDirectory class to maintain an in-memory index in Lucene? Can documents be repaired in that case too?
    Thanks

  2. David Smiley Says:

    Great How-To!

  3. gr0 Says:

    Thanks David :)