<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>index &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/index-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Thu, 12 Nov 2020 12:59:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr 4.2: Index structure reading API</title>
		<link>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/</link>
					<comments>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 20 May 2013 11:58:51 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.2]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema api]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=554</guid>

					<description><![CDATA[With the release of Solr 4.2 we&#8217;ve got the possibility to use the HTTP protocol to get information about Solr index structure. Of course, if one wanted to do that prior to Solr 4.2 it could be achieved by fetching]]></description>
										<content:encoded><![CDATA[<p>With the release of Solr 4.2 we&#8217;ve got the possibility to use the HTTP protocol to get information about the Solr index structure. Of course, if one wanted to do that prior to Solr 4.2, it could be achieved by fetching the <em>schema.xml</em> file, parsing it and then extracting the needed information. However, Solr 4.2 introduced a dedicated API which can return the information we need without the need to parse the whole <em>schema.xml</em> file.</p>
<p><span id="more-554"></span></p>
<h3>Possibilities</h3>
<p>Let&#8217;s look at the new API by example.</p>
<h4>Getting information in XML format</h4>
<p>Many Solr users are used to getting their data in the XML format, at least when using the Solr HTTP API. However, the schema API uses JSON as the default format. In order to get the data in the XML format in all the examples below, you&#8217;ll need to append the <em>wt=xml</em> parameter to the call, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes?wt=xml'</pre>
<h4>Defined fields information</h4>
<p>Let&#8217;s start by looking at how to fetch information about the fields that are defined in Solr. In order to do that we have the following possibilities:</p>
<ol>
<li>Get information about all the fields defined in the index</li>
<li>Get information for a single, explicitly specified field</li>
</ol>
<p>In the first case we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fields'</pre>
<p>In the second case we should add the <em>/</em> character and the field name to the above command. For example, in order to get the information about the <em>author</em> field we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fields/author'</pre>
<p>Solr response for the first command will be similar to the following one:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "fields":[{
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"author",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"cat",
      "type":"string",
      "multiValued":true,
      "indexed":true,
      "stored":true},
    {
      "name":"category",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"url",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"weight",
      "type":"float",
      "indexed":true,
      "stored":true}]}</pre>
<p>On the other hand the response for the second command would be as follows:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"author",
    "type":"text_general",
    "indexed":true,
    "stored":true}}</pre>
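<p>As a side note, responses of this shape are easy to consume programmatically. Below is a minimal Python sketch (Python is used here purely for illustration; the JSON is a trimmed copy of the sample response above, where in practice it would be fetched from the <em>/schema/fields</em> endpoint):</p>

```python
import json

# A trimmed copy of the /schema/fields response shown above.
fields_response = json.loads("""
{
  "responseHeader": {"status": 0, "QTime": 1},
  "fields": [
    {"name": "author", "type": "text_general", "indexed": true, "stored": true},
    {"name": "cat", "type": "string", "multiValued": true, "indexed": true, "stored": true},
    {"name": "id", "type": "string", "indexed": true, "required": true, "stored": true, "uniqueKey": true}
  ]
}
""")

# Index the field definitions by name for easy lookup.
fields_by_name = {f["name"]: f for f in fields_response["fields"]}

print(sorted(fields_by_name))                        # ['author', 'cat', 'id']
print(fields_by_name["id"].get("uniqueKey", False))  # True
```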
<h4>Getting information about defined dynamic fields</h4>
<p>Similar to the information we can get about the fields defined in <em>schema.xml</em>, we can get information about dynamic fields. Again we have two options:</p>
<ol>
<li>Get information about all dynamic fields</li>
<li>Get information about specific dynamic field pattern</li>
</ol>
<p>In order to get all the information about dynamic fields we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields'</pre>
<p>In order to get information about a specific pattern we append the <em>/</em> character followed by the pattern, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields/random_*'</pre>
<p>Solr will return the following response for the first query:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "dynamicfields":[{
      "name":"*_coordinate",
      "type":"tdouble",
      "indexed":true,
      "stored":false},
    {
      "name":"ignored_*",
      "type":"ignored",
      "multiValued":true},
    {
      "name":"random_*",
      "type":"random"},
    {
      "name":"*_p",
      "type":"location",
      "indexed":true,
      "stored":true},
    {
      "name":"*_c",
      "type":"currency",
      "indexed":true,
      "stored":true}]}</pre>
<p>And the following response will be returned for the second command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicfield":{
    "name":"random_*",
    "type":"random"}}</pre>
<h4>Getting field types</h4>
<p>As you can probably guess, in a way similar to the examples described above, we can also get information about the field types defined in our <em>schema.xml</em> file. We can fetch the following information:</p>
<ol>
<li>All the field types defined in the <em>schema.xml</em> file</li>
<li>A single type</li>
</ol>
<p>To get all the defined field types we should run the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes'</pre>
<p>To get information about a single type we should again add the <em>/</em> character followed by the field type name, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes/text_gl'</pre>
<p>Solr will return the following information in response to the first command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "fieldTypes":[{
      "name":"alphaOnlySort",
      "class":"solr.TextField",
      "sortMissingLast":true,
      "omitNorms":true,
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.KeywordTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.TrimFilterFactory"},
          {
            "class":"solr.PatternReplaceFilterFactory",
            "replace":"all",
            "replacement":"",
            "pattern":"([^a-z])"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"boolean",
      "class":"solr.BoolField",
      "sortMissingLast":true,
      "fields":["inStock"],
      "dynamicFields":["*_bs",
        "*_b"]},
    {
      "name":"text_gl",
      "class":"solr.TextField",
      "positionIncrementGap":"100",
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.StopFilterFactory",
            "words":"lang/stopwords_gl.txt",
            "ignoreCase":"true",
            "enablePositionIncrements":"true"},
          {
            "class":"solr.GalicianStemFilterFactory"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"tlong",
      "class":"solr.TrieLongField",
      "precisionStep":"8",
      "positionIncrementGap":"0",
      "fields":[],
      "dynamicFields":["*_tl"]}]}</pre>
<p>In response to the second command Solr will return the following:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "fieldType":{
    "name":"text_gl",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "class":"solr.TokenizerChain",
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory"},
      "filters":[{
          "class":"solr.LowerCaseFilterFactory"},
        {
          "class":"solr.StopFilterFactory",
          "words":"lang/stopwords_gl.txt",
          "ignoreCase":"true",
          "enablePositionIncrements":"true"},
        {
          "class":"solr.GalicianStemFilterFactory"}]},
    "fields":[],
    "dynamicFields":[]}}</pre>
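<p>The nested <em>analyzer</em> definition can be walked in the same way. A small Python sketch (illustrative only, using abbreviated <em>text_gl</em> data from the response above) that prints the tokenizer and filter chain:</p>

```python
import json

# The "fieldType" object from the /schema/fieldtypes/text_gl response above (abbreviated).
field_type = json.loads("""
{
  "name": "text_gl",
  "class": "solr.TextField",
  "analyzer": {
    "class": "solr.TokenizerChain",
    "tokenizer": {"class": "solr.StandardTokenizerFactory"},
    "filters": [
      {"class": "solr.LowerCaseFilterFactory"},
      {"class": "solr.StopFilterFactory", "words": "lang/stopwords_gl.txt"},
      {"class": "solr.GalicianStemFilterFactory"}
    ]
  }
}
""")

# The analysis chain is the tokenizer followed by the filters, in order.
analyzer = field_type["analyzer"]
chain = [analyzer["tokenizer"]["class"]] + [f["class"] for f in analyzer["filters"]]
print(" -> ".join(chain))
```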
<p>As you can see, the response is quite rich: we get all the information about the field types and, in addition, the list of fields (both dynamic and non-dynamic) that use a given field type.</p>
<h4>Retrieving information about copyFields</h4>
<p>In addition to what we&#8217;ve discussed so far, we are able to get information about the copyFields section from <em>schema.xml</em>. In order to do that one should run the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/copyfields'</pre>
<p>And in response we will get the following data:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "copyfields":[{
      "source":"author",
      "dest":"text"},
    {
      "source":"cat",
      "dest":"text"},
    {
      "source":"content",
      "dest":"text"},
    {
      "source":"content_type",
      "dest":"text"},
    {
      "source":"description",
      "dest":"text"},
    {
      "source":"features",
      "dest":"text"},
    {
      "source":"author",
      "dest":"author_s",
      "destDynamicBase":"*_s"}]}</pre>
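<p>One handy thing to derive from this data is a map of copy targets to their sources. A small Python sketch (illustrative only, using an abbreviated copy of the response above):</p>

```python
import json
from collections import defaultdict

# An abbreviated "copyfields" list from the response above.
copyfields = json.loads("""
[
  {"source": "author", "dest": "text"},
  {"source": "cat", "dest": "text"},
  {"source": "author", "dest": "author_s", "destDynamicBase": "*_s"}
]
""")

# Group sources by destination: which fields feed each copy target?
sources_by_dest = defaultdict(list)
for rule in copyfields:
    sources_by_dest[rule["dest"]].append(rule["source"])

print(dict(sources_by_dest))  # {'text': ['author', 'cat'], 'author_s': ['author']}
```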
<h3>The future</h3>
<p>In Solr 4.3 the described API was improved and is being prepared to allow not only reading the index structure, but also modifying it with HTTP requests. We can therefore expect that in one of the upcoming versions of Apache Solr we will get the ability to easily change the index structure this way, at least for changes that do not conflict with already indexed data. In my opinion it is worth waiting for, at least for those who need it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Backing Up Your Index</title>
		<link>https://solr.pl/en/2012/08/13/backing-up-your-index/</link>
					<comments>https://solr.pl/en/2012/08/13/backing-up-your-index/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 13 Aug 2012 21:51:12 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[handler]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[replication]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=472</guid>

					<description><![CDATA[Did you ever wonder if you can create a backup of your index with the tools available in Solr? For example after every commit or optimize operation? Or maybe you would like to create backups with the HTTP]]></description>
										<content:encoded><![CDATA[<p>Did you ever wonder if you can create a backup of your index with the tools available in Solr? For example after every <em>commit</em> or <em>optimize</em> operation? Or maybe you would like to create backups with an HTTP API call? Let&#8217;s see what possibilities Solr has to offer.</p>
<p><span id="more-472"></span></p>
<h3>The Beginning</h3>
<p>We decided to write about index backups even though this functionality is fairly simple. We noticed that many people tend to forget about it, not only when it comes to Apache Solr. We hope that this blog entry will help you remember about the backup creation functionality when you need it. But now, let&#8217;s start from the beginning &#8211; before we started the tests, we looked at the directory where Solr keeps its indices and this is what we saw:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:17 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<h3>Manual Backup</h3>
<p>In order to create a backup of your index with the HTTP API, you need to have the replication handler configured. If you do, you need to send the <em>command</em> parameter with the <em>backup</em> value to the master server&#8217;s replication handler, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/replication?command=backup'</pre>
<p>The above will tell Solr to create a new backup of the current index. Let&#8217;s now look at how the directory where the indices live looks after running the above command:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:18 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:19 snapshot.20120812201917
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<p>As you can see, there is a new directory &#8211; <em>snapshot.20120812201917</em>. We can assume that we got what we wanted 🙂</p>
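<p>Note that the timestamp in the snapshot directory name encodes the backup creation time, so it can be parsed if you want to sort or age out backups yourself. A small Python sketch (the helper is purely illustrative, not part of Solr):</p>

```python
from datetime import datetime

def snapshot_time(dirname):
    """Parse the timestamp Solr appends to backup directory names,
    e.g. snapshot.20120812201917 -> 2012-08-12 20:19:17."""
    stamp = dirname.split("snapshot.", 1)[1]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

print(snapshot_time("snapshot.20120812201917"))  # 2012-08-12 20:19:17
```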
<h3>Automatic Backup</h3>
<p>In addition to manual backup creation, you can also configure Solr to create backups after a <em>commit</em> or <em>optimize</em> operation. Please remember, though, that if your index changes rapidly it is usually a bad idea to create a backup after each commit operation. But let&#8217;s get back to automatic backups. In order to configure Solr to create backups for us, you need to add the following line to the replication handler configuration:
</p>
<pre class="brush:xml">&lt;str name="backupAfter"&gt;commit&lt;/str&gt;</pre>
<p>So, the full replication handler configuration (on the <em>master</em> server) would look like this:
</p>
<pre class="brush:xml">&lt;requestHandler name="/replication" &gt;
 &lt;lst name="master"&gt;
  &lt;str name="replicateAfter"&gt;commit&lt;/str&gt;
  &lt;str name="replicateAfter"&gt;startup&lt;/str&gt;
  &lt;str name="confFiles"&gt;schema.xml,stopwords.txt&lt;/str&gt;
  &lt;str name="backupAfter"&gt;commit&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>After sending two <em>commit</em> operations our directory with indices looks like this:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 snapshot.20120812211203
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 snapshot.20120812211216
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<p>As you can see, Solr did what we wanted.</p>
<h3>Keeping Order</h3>
<p>It is possible to control the maximum number of backups that are stored on disk. In order to configure that number you need to add the following line to your replication handler configuration:
</p>
<pre class="brush:xml">&lt;str name="maxNumberOfBackups"&gt;10&lt;/str&gt;</pre>
<p>The above configuration value tells Solr to keep a maximum of ten backups of your index. Of course, you can delete created backups manually if you don&#8217;t need them anymore.</p>
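<p>That kind of cleanup can also be scripted if you prefer to manage it yourself. A Python sketch of the idea (the <em>prune_backups</em> helper is hypothetical, not part of Solr; it relies on the fixed-width timestamp making lexical order equal chronological order):</p>

```python
import os
import shutil
import tempfile

def prune_backups(data_dir, keep=10):
    """Delete all but the `keep` newest snapshot.* directories.
    Fixed-width timestamps make lexical sort == chronological sort."""
    snaps = sorted(d for d in os.listdir(data_dir) if d.startswith("snapshot."))
    doomed = snaps[:-keep] if keep else snaps
    for old in doomed:
        shutil.rmtree(os.path.join(data_dir, old))
    return sorted(d for d in os.listdir(data_dir) if d.startswith("snapshot."))

# Demonstration on a throwaway directory with fake snapshot dirs.
demo = tempfile.mkdtemp()
for stamp in ("20120812201917", "20120812211203", "20120812211216"):
    os.mkdir(os.path.join(demo, "snapshot." + stamp))
print(prune_backups(demo, keep=2))  # the two newest remain
shutil.rmtree(demo)
```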
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/08/13/backing-up-your-index/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Deep paging problem</title>
		<link>https://solr.pl/en/2011/07/18/deep-paging-problem/</link>
					<comments>https://solr.pl/en/2011/07/18/deep-paging-problem/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 18 Jul 2011 19:45:36 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[deep]]></category>
		<category><![CDATA[deep paging]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[paging]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=359</guid>

					<description><![CDATA[Imagine the following problem &#8211; we have an application that expects Solr to return the results sorted on the basis of some field. Those results will then be paged in the GUI. However, if the person using the GUI application]]></description>
										<content:encoded><![CDATA[<p>Imagine the following problem &#8211; we have an application that expects Solr to return results sorted on the basis of some field. Those results will then be paged in the GUI. However, if the person using the GUI application immediately selects the tenth, twentieth, or fiftieth page of search results there is a problem &#8211; the wait time. Is there anything we can do about this? Yes, we can help Solr a bit.</p>
<p><span id="more-359"></span></p>
<h3>A few numbers at the beginning</h3>
<p>Let&#8217;s start with the query and statistics. Imagine that we have the following query, which is sent to Solr to get the five hundredth page of search results:
</p>
<pre class="brush:xml">q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=50000</pre>
<p>What must Solr do to retrieve and render the results list? Of course, read the documents from the Lucene index. The question is how many documents have to be read from the index. Is it 100? Unfortunately, no. Solr must collect 50,100 sorted documents from the Lucene index, because we want 100 documents starting from the 50,000th one. Kinda scary. Now let&#8217;s compare how long it takes Solr to return the first page of search results (the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=0</code>) and how long it takes to render the last page of search results (i.e. the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=999900</code>). The test was performed on an index containing a million documents, consisting of four fields: <code>id</code> (<code>string</code>), <code>name</code> (<code>text</code>), <code>description</code> (<code>text</code>), <code>price</code> (<code>long</code>). Before every test iteration Solr was started, the query was run, and Solr was turned off. These steps were repeated 100 times for each of the queries, and the times seen in the table are the arithmetic mean of the query execution times. Here are the test results:</p>
<p><em>(Test results table missing.)</em></p>

<h3>Typical solutions</h3>
<p>Of course we can try tuning the caches or the <em>queryResultWindowSize</em> value, but there is the problem of how large to make them; there may be situations where they are insufficient or the relevant entry is not in Solr&#8217;s memory, and then the wait time for the n-th search page will be very long. We can also try adding warming queries, but we won&#8217;t be able to prepare all the combinations, and even if we could, the cache would have to be huge. So we won&#8217;t be able to achieve the desired results with any of these solutions.</p>
<h3>Filters, filters and filter again</h3>
<p>This behavior of Solr (and other applications based on Lucene too) is caused by the queue of documents, a so-called priority queue, which in order to display the last page of results must hold all documents matching the query, so that the ones located on the desired page can be returned. In our case, if we want the first page of search results the queue will have 100 entries. However, if we want the last page, Solr will have to put one million documents in the queue. Of course, this is a big simplification.</p>
<p>The idea is to limit the number of documents Lucene must put in the queue. How do we do it? We will use filters to help us, so in Solr we will use the <em>fq</em> parameter. Using a filter limits the number of search results. The ideal queue size would be the one passed in the <em>rows</em> parameter of the query. However, that situation is ideal and not achievable in most cases. An additional problem is that when sending a query with a filter we cannot determine the number of results in advance, because we do not know how many documents the filter will match. The solution is to make two queries instead of one &#8211; the first to see how limiting our filter is (using <code>rows=0</code> and <code>start=0</code>), and the second with an appropriately recalculated <em>start</em> value (example below).</p>
<p>The maximum price of a product in the test data is 10,000 and the minimum is 0. So to the first query we will add the following bracket: <code>&lt;0; 1000&gt;</code>, and to the second query we will add the following bracket: <code>&lt;9000; 10000&gt;</code>.</p>
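<p>The recalculated <em>start</em> value for the second query follows from simple arithmetic on the <em>numFound</em> returned by the first (<code>rows=0</code>) query. A Python sketch of that calculation (the numbers are hypothetical, chosen only to be consistent with the <em>start=100037</em> used in this post, not taken from an actual run):</p>

```python
def filtered_start(global_start, total_docs, filter_num_found):
    """Translate an offset into the full sorted result set into an
    offset into the filtered result set that covers its tail."""
    # Documents the filter cut off the front of the sort order:
    skipped = total_docs - filter_num_found
    return global_start - skipped

# Hypothetical numbers: 1M docs total, the price filter matching
# 100,137 of them, and we want the page starting at offset 999,900.
print(filtered_start(999900, 1000000, 100137))  # 100037
```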
<h3>Disadvantages of solution based on filters</h3>
<p>There is one downside to the filter-based solution, and it is quite significant. It may happen that the number of results to which the filter limits the query is too small. What then? We should widen the chosen bracket for the filter. Of course, we can calculate the optimal brackets on the basis of our data, but that depends on the data and the queries, which is why I won&#8217;t go into it at this point.</p>
<h3>What is the performance after the change?</h3>
<p>So let&#8217;s repeat the tests, but now with the filter-based approach. The first query will just return the first page of results (the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=0&amp;fq=price:[0+TO+1000]</code>). The second case (actually two queries) will first check the number of results and then fetch those results (the two queries: <code>q=*:*&amp;sort=price+asc&amp;rows=0&amp;start=0&amp;fq=price:[9000+TO+10000]</code> and <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=100037&amp;fq=price:[9000+TO+10000]</code>). Note the changed <em>start</em> parameter, due to the fact that we get fewer search results (this is caused by the <em>fq</em> parameter). This test was carried out in a similar way to the previous one &#8211; start Solr, run the query (or queries), and shut down Solr. The numbers seen in the table are the arithmetic mean of the query execution times.</p>
<p><em>(Test results table missing.)</em></p>

<p>As you can see, the query performance improved. We can therefore conclude that we succeeded. Of course, we could be tempted by further optimizations, but for now let&#8217;s say that we are satisfied with the results. I suspect, however, that you may ask the following question:</p>
<h3>How is this handled by other search engines ?</h3>
<p>Good question, but the answer is actually trivial &#8211; one of the methods is simply to prevent very deep paging. Google, for example, uses this method. Try searching for the word &#8220;ask&#8221; and going past the 91st page of search results. Google didn&#8217;t let me 😉</p>
<h3>Conclusions</h3>
<p>As you can see, deep paging performance after our changes increased significantly. We can now let users page through search results without worrying that it will kill our Solr instance and the machine it runs on. Of course, this method is not without flaws, because we need to know certain parameters (e.g., in our case, the maximum price for the sort), but it is a solution that lets you provide search results with relatively low latency compared to the pre-optimization method.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/07/18/deep-paging-problem/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Index &#8211; delete or update?</title>
		<link>https://solr.pl/en/2011/02/16/index-delete-or-update/</link>
					<comments>https://solr.pl/en/2011/02/16/index-delete-or-update/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 16 Feb 2011 08:16:33 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[delete]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=213</guid>

					<description><![CDATA[From time to time, in working with Solr there is a problem &#8211; how to update Solr index structure. There are various reasons for these changes &#8211; the new functional requirements, optimization, or anything else &#8211; it is not important.]]></description>
										<content:encoded><![CDATA[<p>From time to time, when working with Solr, a problem arises &#8211; how to update the Solr index structure. There are various reasons for such changes &#8211; new functional requirements, optimization, or anything else &#8211; it is not important. What is important is the question that arises &#8211; should we remove the index, or simply change the structure and do a full indexing? Contrary to appearances, the answer to this question depends on the changes we made to the structure of the index.</p>
<p><span id="more-213"></span></p>
<p>Personally, I am an advocate of solutions that have the smallest chance of causing problems &#8211; I just like to sleep at night. I think that removing the index after updating its structure and then doing a full re-indexation of the data is one of those solutions, at least in my opinion. I am aware, however, that this type of solution is not always acceptable. So when are we not forced to remove the index, and when does not doing so expose us to potential problems with Solr?</p>
<p>The answer to the question depends on what changed in the structure of the index. Such changes can be divided into three areas covering most of the changes that we make in the structure of the index:</p>
<ul>
<li><strong>Adding / removing a field</strong></li>
<li><strong>Similarity modification</strong></li>
<li><strong>Field modification</strong></li>
</ul>
<h3>Adding / removing a field</h3>
<p>In the case of the first type of modification the matter is quite simple &#8211; if we <strong>add or remove a field</strong> in <em>schema.xml</em> there is no need to remove the entire index before re-indexing. Solr handles adding a new field to the current index. Of course, you should be aware that documents which are not re-indexed after this operation will not be updated automatically.</p>
<h3>Similarity modification</h3>
<p>In the second case &#8211; changing the class that is responsible for <em>Similarity</em> &#8211; we are also not forced to delete the index after the change. But unlike the previous example, if we want Solr to correctly calculate the <em>score</em>, and thus sort in the correct order, we will be forced to re-index all documents previously present in the index.</p>
<h3>Field modification</h3>
<p>Let&#8217;s stop for a minute on the third case. Suppose that we slightly modify a field in the index for a prosaic reason &#8211; we are no longer interested in the normalization of its length. We set <em>omitNorms=&#8221;true&#8221;</em> (I assume that the previous setting was <em>omitNorms=&#8221;false&#8221;</em>). If we only re-index all the documents, the Lucene index, in the merged segments, will still have information about length normalization of the field. Something went wrong. This is precisely the case when it is necessary to delete the index after the change to its structure, and prior to full indexation. At first glance this seems like a very small change, but thinking further, it has some side effects. It is worth remembering that some of the field properties are overridden by others, as in the case of length normalization &#8211; if one segment has length normalization and the second does not, when you merge the segments the newly created one will have length normalization.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/02/16/index-delete-or-update/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>CheckIndex for the rescue</title>
		<link>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/</link>
					<comments>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 17 Jan 2011 08:06:58 +0000</pubDate>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[check]]></category>
		<category><![CDATA[check index]]></category>
		<category><![CDATA[checkindex]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[rescue]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=190</guid>

					<description><![CDATA[While using Lucene and Solr we are used to a very high reliability of these products. However, there may come the day when Solr will inform us that our index is corrupted, and we need to do something about it.]]></description>
										<content:encoded><![CDATA[<p>While using Lucene and Solr we are used to the very high reliability of these products. However, there may come a day when Solr informs us that our index is corrupted, and we need to do something about it. Is restoring from a backup or doing a full indexation the only way to repair the index? Not necessarily &#8211; there is hope in the form of the CheckIndex tool.</p>
<p><span id="more-190"></span></p>
<h3>What is CheckIndex ?</h3>
<p>CheckIndex is a tool available in the Lucene library which allows you to check the index files and create new segments that do not contain problematic entries. This means that this tool, with a small loss of data, is able to repair a broken index, and thus save us from having to restore the index from a backup (if we have one) or do a full indexing of all documents that were stored in Solr.</p>
<h3>Where do I start?</h3>
<p>Please note that, according to the Javadocs, this tool is experimental and may change in the future. Therefore, before starting to work with it we should create a copy of the index. In addition, it is worth knowing that the tool analyzes the index byte by byte, so for large indexes the analysis and repair may take a long time. It is important not to run the tool with the <em>-fix</em> option while the index is in use by Solr or another application based on the Lucene library. Finally, be aware that running the tool in repair mode may result in the removal of some or all documents stored in the index.</p>
<h3>How to run it?</h3>
<p>To run the utility, go to the directory where the Lucene library files are located and run the following command:
</p>
<pre class="brush:bash">java -cp lucene-core.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex INDEX_PATH -fix</pre>
<p>In my case, it looked as follows:
</p>
<pre class="brush:bash">java -cp lucene-core-2.9.3.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex E:\\Solr\\solr\\data\\index\\ -fix</pre>
<p>After a while I got the following information:
</p>
<pre class="brush:bash">Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [15 fields]
test: field norms.........OK [15 fields]
test: terms, freq, prox...OK [900 terms; 1517 terms/docs pairs; 1707 tokens]
test: stored fields.......OK [232 total field count; avg 12,211 fields per doc]
test: term vectors........OK [3 total vector count; avg 0,158 term/freq vector fields per doc]

No problems were detected with this index.</pre>
<p>This means that the index is correct and there was no need for any corrective action. Additionally, you can learn some interesting things about the index <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h3>Broken index</h3>
<p>But what happens in the case of a broken index? There is only one way to find out &#8211; let&#8217;s try. So, I broke one of the index files and ran the CheckIndex tool again. The following appeared on the console:
</p>
<pre class="brush:bash">Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file "_0.fnm": read 150 vs size 152
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:370)
at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)
at org.apache.lucene.index.SegmentReader$CoreReaders.&lt;init&gt;(SegmentReader.java:119)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 19 documents) detected
WARNING: 19 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 19 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
5...
4...
3...
2...
1...
Writing...
OK
Wrote new segments file "segments_3"</pre>
<p>As you can see, all 19 documents that were in the index have been removed. This is an extreme case, but you should be aware that the tool can behave this way.</p>
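<p>For completeness, the same check-and-repair cycle can also be run from Java code. The sketch below assumes the Lucene 2.9.x API used in this post (class and method names differ in later versions &#8211; for instance <em>fixIndex</em> was later renamed) and, like the command line, should only ever be pointed at a copy of the index:</p>

```java
import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class RepairIndex {
  public static void main(String[] args) throws Exception {
    // Open the index directory given as the first argument.
    FSDirectory dir = FSDirectory.open(new File(args[0]));
    CheckIndex checker = new CheckIndex(dir);

    // Analyze all segments; status.clean is true when nothing is broken.
    CheckIndex.Status status = checker.checkIndex();
    if (status.clean) {
      System.out.println("No problems were detected with this index.");
    } else {
      // Programmatic equivalent of the -fix switch: writes a new segments
      // file that drops the broken segments (and the documents in them).
      checker.fixIndex(status);
    }
    dir.close();
  }
}
```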
<h3>The end</h3>
<p>If you keep in mind the basic assumptions associated with using the CheckIndex tool, you may one day find yourself in a situation where it comes in handy and you will not have to ask yourself questions like &#8220;When was the last backup made?&#8221;.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Quick look &#8211; IndexSorter</title>
		<link>https://solr.pl/en/2010/10/04/quick-look-indexsorter/</link>
					<comments>https://solr.pl/en/2010/10/04/quick-look-indexsorter/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 04 Oct 2010 12:14:20 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index sorter]]></category>
		<category><![CDATA[index sorting]]></category>
		<category><![CDATA[indexsorter]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[sorting]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=82</guid>

					<description><![CDATA[At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately the tool he mentioned was not]]></description>
										<content:encoded><![CDATA[<p>At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately the tool he mentioned was not available in Solr &#8211; but that has changed.</p>
<p><span id="more-82"></span></p>
<p>At the time of writing, the described tool is available only in the branch named <em>branch_3x</em> in SVN, but it is planned to migrate this functionality to version 4.x.</p>
<h3>But what is it?</h3>
<p>When using techniques that terminate the search after a predetermined time, without regard to the number of results gathered, at some point we run into the problem of search result quality. Instead of receiving the best results in the context of the query, we get results in a seemingly random fashion (or at least they may look random). This means that we are not able to ensure that the user of the system gets the best matching results. Of course, we are talking about the situation when the search is terminated after a predetermined period of time, which is why Solr cannot gather all the documents that match the query.</p>
<h3>Is it useful for me?</h3>
<p>When may ending a search after a predetermined time be useful? There are many use cases for such a search. Imagine that our deployment is composed of many separate shards, each operating on a large amount of data. When making a distributed query, each of the shards present in the search system must be queried for relevant documents, and then all results must be gathered and displayed to the end user (of course, this need not be a person; it may be an application). But what if each of the shards needs a very long time to process all search results, and we are, for example, only interested in those added recently (e.g. in the last week)? This is where the possibility of early termination of the search query comes in &#8211; assuming that we are more interested in documents added the day before than in those added two weeks ago.</p>
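<p>As a side note, Solr exposes this kind of time limit through the <em>timeAllowed</em> query parameter (in milliseconds); the host and field name in the example below are assumptions for illustration. When the limit is hit, Solr returns the documents collected so far and flags the response header with <em>partialResults=true</em>:</p>

```
http://localhost:8983/solr/select?q=*:*&sort=added+desc&timeAllowed=500
```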
<h3>How to achieve it?</h3>
<p>The example above illustrates a case where we can use a search that is terminated after a specified time. However, looking further into the search results we come to a problem &#8211; to sort search results Solr must collect them all. So when making a query with a sort parameter like <code>sort=added+desc</code>, to get the documents sorted correctly each of the shards would have to return all search results &#8211; does this mean that we can&#8217;t use early termination of the search? Not really. To help us, Solr provides a tool &#8211; IndexSorter, which until now was available only in the Apache Nutch project, but was recently committed to Lucene and Solr. With this tool, we can pre-sort the index by the parameter that we need. Thus, with an index sorted in descending order by the date a document was added, Solr would read the most recently added documents first, and we would be able to use early termination.</p>
<h3>Using IndexSorter</h3>
<p>What do we need to do to use the IndexSorter tool? To tell the truth, it&#8217;s not that complicated. Note, however, that at the time of publication of this entry the tool is only available in <em>branch_3x</em> of the Lucene/Solr project. To sort an index on the basis of a field, run the following command from the command line (of course keeping in mind the appropriate location of the <em>lucene-misc-3.1.jar</em> library &#8211; after building the project you will find it in the <em>lucene/build/contrib/misc</em> directory):
</p>
<pre class="brush:bash">java IndexSorter SOURCE_DIRECTORY TARGET_DIRECTORY FIELD_NAME</pre>
<p>The parameters mean:</p>
<ul>
<li><em>SOURCE_DIRECTORY</em> &#8211; the directory containing the index you want to sort,</li>
<li><em>TARGET_DIRECTORY</em> &#8211; the directory where the sorted index will be saved,</li>
<li><em>FIELD_NAME</em> &#8211; the field on the basis of which the index will be sorted.</li>
</ul>
<p>If everything goes correctly, you should see something like this:
</p>
<pre class="brush:bash">IndexSorter: done, 896 total milliseconds</pre>
<h3>The end</h3>
<p>In my opinion, Lucene and Solr have just gained a very interesting feature, which can be used, for example, wherever the amount of data is very large, where response time cannot exceed a certain limit, or where the results beyond the first ones (the first 100 or 1000) are not significant. All who are interested in index sorting and early termination techniques should look at the slides of the presentation titled &#8220;<em>Munching and Crunching: Lucene Index Post-Processing</em>&#8221; (<a href="http://lucene-eurocon.org/slides/Munching-&amp;-crunching-Lucene-index-post-processing-and-applications_Andrzej-Bialecki.pdf" target="_blank" rel="noopener noreferrer">slides</a>) given by Andrzej Bialecki during the Lucene Eurocon 2010 conference, which discusses these topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/10/04/quick-look-indexsorter/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5 sins of schema.xml modifications</title>
		<link>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/</link>
					<comments>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 Aug 2010 12:08:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[attribute]]></category>
		<category><![CDATA[attributes]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index structure]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[mistake]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=71</guid>

					<description><![CDATA[I made a promise and here it is &#8211; an entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the schema.xml file for your system. Feel free to read on 😉]]></description>
										<content:encoded><![CDATA[<p>I made a promise and here it is &#8211; an entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the <em>schema.xml</em> file for your system. Feel free to read on <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><span id="more-71"></span></p>
<p>Each of us knows what the schema.xml file is and what it is for (if not, I invite you to read the entry at: <a href="http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en" target="_blank" rel="noopener noreferrer">http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en</a>). What are the most frequently committed errors when creating or updating this file? I have personally come across the following:</p>
<h3>1. Trash in the configuration</h3>
<p>I admit that my first principle is to keep the <em>schema.xml</em> file in the simplest possible form. Linked to this is a very important issue &#8211; this file should not be synonymous with chaos. In other words, do not clutter it with unnecessary comments, unwanted types, fields and so on. Order in the structure of the <em>schema.xml</em> file not only helps us maintain and modify it with ease, but also assures us that no unnecessary information will be stored in the Solr index.</p>
<h3>2. Cosmetic changes to the default configuration</h3>
<p>How many of those who use Solr in their daily work took the default <em>schema.xml</em> file supplied with the Solr example and only slightly modified its contents &#8211; for example, changing only the names of the fields? I should raise my hand too, because I did it once. This is a pretty big mistake. Someone may ask why. Are you sure you need English stemming when implementing search for content written in Polish? I think not. The same applies to field and type attributes like term vectors.</p>
<h3>3. No updates</h3>
<p>Sometimes I come across search-based applications where an upgrade of Solr does not mean an update of the <em>schema.xml</em> file. If it is a conscious decision, dictated by, say, costly or even impossible re-indexing of all data, I understand the situation. But there are cases where an upgrade would bring only benefits, and where its costs would be minimal (e.g. an inexpensive re-index or slight changes in the application). Do not be afraid to update the <em>schema.xml</em> file &#8211; whether it is updating fields, updating types, or adding newer features. A good example is the migration from Solr 1.3 to version 1.4 &#8211; the newer version introduced significant changes to numeric types, where migrating to the new types results in a great increase in the performance of queries using those types (such as queries using value ranges).</p>
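<p>To make the Solr 1.4 example concrete, here is a sketch of the two kinds of numeric type definitions (the type names are illustrative &#8211; use whatever naming your schema already has):</p>

```xml
<!-- Solr 1.3-era numeric type: a range query has to examine every term -->
<fieldType name="sint" class="solr.SortableIntField" omitNorms="true"/>

<!-- Solr 1.4 trie-based type: precisionStep indexes additional lower-precision
     terms, so range queries such as price:[10 TO 100] visit far fewer terms -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
```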
<h3>4. &#8220;I`ll use it one day&#8221;</h3>
<p>Adding new types while not removing ones that are no longer necessary &#8211; and the same with fields or <em>copyField</em> definitions. Most of us think that an old definition may be useful in the future, but remember that each type is an extra portion of memory needed by Solr, and each field takes up space in the index. My small piece of advice &#8211; if you stop using a type, a field, or anything else in your configuration files (not only in <em>schema.xml</em>), simply remove it. Applying this principle throughout the life cycle of an application using Solr ensures that the index stays in optimal condition, and a few months after implementing yet another feature you will not need to dig into the application code to determine whether a field is still used in some forgotten fragment.</p>
<h3>5. Attributes, attributes and again attributes</h3>
<p>Preserving original values and adding term vectors and their properties are just examples of things we don&#8217;t need in every implementation. Sometimes the index holds more than the application requires. A larger index means lower performance, at least in some cases (e.g. during indexing). It is worth considering whether you really need all the information that we tell Solr to calculate and store. Removing some information that is, from our point of view, unnecessary may surprise us. Sometimes it is worth a try ;)</p>
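<p>As an illustration of this point (the field name is invented), the difference is all in the attributes &#8211; every attribute switched on below costs index space and indexing time:</p>

```xml
<!-- "Everything on": stored copy, term vectors with positions and offsets -->
<field name="description" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Often all that plain search actually needs -->
<field name="description" type="text" indexed="true" stored="false"/>
```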
<p>Feel free to comment &#8211; I will eagerly read what else we should pay attention to when modifying the schema.xml file.</p>
<p>Finally, I think it is worth mentioning the article <em>&#8220;The Seven Deadly Sins of Solr&#8221;</em> published by LucidImagination at: <a href="http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr" target="_blank" rel="noopener noreferrer">http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr</a>. It describes bad practices when working with Solr. In my opinion, it is an interesting read. I highly recommend it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
