4.0 – Solr.pl

Solr 4.0 and Polish language analysis

Rafał Kuć — Mon, 02 Apr 2012 21:43:51 +0000

Polish language analysis has been available in Lucene (and Solr) for some time, so I decided to take a look and compare the available options using the upcoming Lucene and Solr 4.0.

Options

At the time of writing, the following options were present when it comes to analyzing Polish:

Use the Stempel library (available since Solr 3.1)
Use Hunspell and Polish dictionaries (available since Solr 3.5)
Use the Morfologik library (will be available in Solr 4.0, SOLR-3272).

Configuration

Let’s look at how to configure all of the above options in Solr (please remember that all of the following configuration examples are based on Solr 4.0).

Stempel

To add Polish stemming using the Stempel library, we just need to add the following filter to our type definition:

In addition to that, you need to add the lucene-analyzers-stempel-4.0.jar and apache-solr-analysis-extras-4.0.jar libraries to SOLR_HOME/lib. It’s also a good idea to use solr.LowerCaseFilterFactory before the Stempel filter.

Hunspell

Similarly to the configuration above, to use Hunspell you need to add a new filter to your type definition, for example in the following way:

The dictionary and affix parameters define the dictionary that we want to use. The ignoreCase parameter set to true tells Hunspell to ignore character case. You can find Hunspell dictionaries at the following URL: http://wiki.services.openoffice.org/wiki/Dictionaries.

Morfologik

As in the two examples above, all you need to change in your schema.xml is to add a new filter, this time in the following way:

The dictionary parameter tells Solr which dictionary you would like to use. You can choose one of the following three:

MORFOLOGIK
MORFEUSZ
COMBINED

In addition to that, you need to add the following libraries to SOLR_HOME/lib: lucene-analyzers-morfologik-4.0.jar, apache-solr-analysis-extras-4.0.jar, morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar and morfologik-stemming-1.5.2.jar.

Results Comparison

Of course I wasn’t able to judge the results of analysis from the above three filters on the whole Polish language corpus, and that’s why I decided to choose four words to see how each of the filters behaves. Those words are: “urodzić urodzony urodzona urodzeni” (these words are variations of the word “born” in Polish). The results are as follows:

Stempel

The terms I got from Stempel were the following ones:

[urodzić] [urodzo] [urodzona] [urodzeni]

Not all of them are words, but you have to remember that Stempel is a stemmer, and because of that it produces stems which can differ from the actual words or their root forms. What matters is that the words we are interested in are reduced to the same tokens, which allows Lucene/Solr to find those words. With that in mind, I have to say that the results of analysis using Stempel are not as good as I would like them to be. For example, by searching for the word urodzić you won’t be able to find documents with words like urodzona or urodzeni.

Hunspell

The results of Hunspell analysis were as follows:

[urodzić, urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony, urodzenie]

Comparing the Hunspell results to those produced by Stempel, we can see the difference. Our sample query for the word urodzić would find documents with words like urodzony, urodzona and urodzeni, which is quite nice. You can also notice that for three of the words we got more than one term at the same position. The results I got when using Hunspell are OK and I think they should satisfy most users (they do satisfy me), but let’s have a look at the filter newly introduced in Lucene and Solr: Morfologik.

Morfologik

The results of Morfologik analysis were as follows:

[urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony]

Again, if you compare these to the ones I got when using Hunspell, you can hardly see the difference (in this particular case, of course). The only difference between Hunspell and Morfologik is in the last word, for which we got different results. In my opinion the results achieved with Morfologik are satisfying.
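To make the comparison concrete, here is a small Python sketch (not part of Solr) that takes the three result sets above and checks which of the other test words share at least one term with urodzić. It assumes that the four bracketed groups in each result correspond to the four test words in order; the function name matching_words is only illustrative.

```python
# Terms each filter produced for the four test words, copied from the results above.
# Assumption: the bracketed groups map to the test words in the order they were given.
ANALYSIS = {
    "Stempel": {
        "urodzić": {"urodzić"},
        "urodzony": {"urodzo"},
        "urodzona": {"urodzona"},
        "urodzeni": {"urodzeni"},
    },
    "Hunspell": {
        "urodzić": {"urodzić"},
        "urodzony": {"urodzony", "urodzić"},
        "urodzona": {"urodzić"},
        "urodzeni": {"urodzić", "urodzony", "urodzenie"},
    },
    "Morfologik": {
        "urodzić": {"urodzić"},
        "urodzony": {"urodzony", "urodzić"},
        "urodzona": {"urodzić"},
        "urodzeni": {"urodzić", "urodzony"},
    },
}


def matching_words(filter_name, query_word):
    """Return the other words that share at least one term with the query word."""
    terms = ANALYSIS[filter_name]
    query_terms = terms[query_word]
    return sorted(
        word
        for word, word_terms in terms.items()
        if word != query_word and query_terms & word_terms
    )


for name in ANALYSIS:
    print(name, matching_words(name, "urodzić"))
```

With these term sets, a query for urodzić matches none of the other forms under Stempel and all three under Hunspell and Morfologik.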

Performance

The performance test was done in a simple manner: for each filter I indexed 5 million documents, where all the text fields were based on Polish language analysis with the appropriate filter (in addition to some standard filters like stopwords, synonyms and so on). Every time, the indexing was done on a clean Solr 4.0 instance. Because I used the Data Import Handler, I sent a commit every 100k documents. The index contained several fields, but the actual index structure was not crucial for the test, as I indexed the same set of documents every time. The test results are as follows:

[table “21” not found /]

Warning: At the time of writing, according to the SOLR-3245 JIRA issue, there is a problem with Hunspell performance with Polish dictionaries in Solr 4.0. I’m almost certain that this will be resolved by the time Solr 4.0 is released, but right now the performance of Hunspell with Polish dictionaries in Solr 4.0 may not be sufficient.

Short Summary

Despite not having performance results for Hunspell (because I don’t count the ones I have right now as correct), we can see that Hunspell and Morfologik are good candidates for Polish language analysis. Morfologik has similar performance to Stempel, but its results are better in my opinion, and that will make your users happier.

Solr 4.0: Realtime GET

Rafał Kuć — Mon, 09 Jan 2012 20:57:41 +0000

The next functionality from the upcoming Solr 4.0 that I decided to look at is the so-called “Realtime Get”. It allows you to see data even though it has not yet been added to the index, that is, before a commit operation has been sent to Solr. Let’s see how it works.

Some theory

Data updates in Lucene and Solr have one disadvantage: when you submit index updates, they can’t be seen until a commit operation is run. The problem is that commit is costly in terms of performance, and intensive committing may cause performance problems. So, when you need your data to be visible right after being changed, you may be forced to choose between performance and fast updates. To address that, Lucene and Solr are working towards enabling Near Real Time (NRT) searching. Lucene already has that capability, and in Solr 4.0 we will be able to use it as well, along with more than that.
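The visibility problem can be pictured with a toy model in Python. This is only a sketch of the idea and not how Lucene or Solr are implemented: the ToyIndex class, its method names and the sample document are all made up for illustration. Pending updates sit in a structure that stands in for the transaction log, searches only see committed documents, and a lookup by identifier checks the log first.

```python
class ToyIndex:
    """Toy model of commit visibility; NOT how Solr or Lucene is implemented."""

    def __init__(self):
        self.committed = {}  # documents visible to searches
        self.pending = {}    # stands in for the transaction log

    def add(self, doc):
        # An update lands in the log first and is not searchable yet.
        self.pending[doc["id"]] = doc

    def commit(self):
        # The costly step: make all pending updates visible to searches.
        self.committed.update(self.pending)
        self.pending.clear()

    def search_all(self):
        # Searches only ever see committed documents.
        return list(self.committed.values())

    def realtime_get(self, doc_id):
        # A lookup by identifier checks the log before falling back to the index.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)


index = ToyIndex()
index.add({"id": "doc-1", "name": "illustrative document"})
print(index.search_all())            # the search sees nothing before commit
print(index.realtime_get("doc-1"))   # the lookup by id already finds the document
index.commit()
print(index.search_all())            # after commit the search sees it too
```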

Configuration

In order to use Realtime Get functionality we need to configure the following Solr features:

Transaction log

The first thing to configure is the transaction log writing. In order to do that you need to add the following to your updateHandler configuration:

The above entry (its dir value is ${solr.data.dir:}) says that the directory holding the transaction log will be located in the same directory as the index directory.

Realtime Get handler

The second thing that needs to be done to see Realtime Get in action is the appropriate handler configuration (or adding the component to an already defined handler). To do that, add the following to your solrconfig.xml file:

The above entry is nothing unusual: it just adds a new request handler implementing the solr.RealTimeGetHandler class, which enables checking the transaction log.

Action

To check how Realtime Get works I decided to do a simple test. The first thing I did was to index one file (one of those available in the exampledocs directory) using the following bash command:

curl 'http://localhost:8983/solr/update' -d @hd.xml -H 'Content-type:application/xml'

Of course I did not send a commit after indexing. As we could expect, the following query:

http://localhost:8983/solr/select?q=*:*

didn't return any search results. So let's check whether the handler registered as /get will be able to get us some results. In order to do that I sent the following query:

http://localhost:8983/solr/get?id=SP2514N

As a result I got the following document:

SP2514N
Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133
Samsung Electronics Co. Ltd.
samsung

electronics
hard drive

7200RPM, 8MB cache, IDE Ultra ATA-133
NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor

92.0
6
true
2006-02-13T15:26:37Z
35.0752,-97.032

So Solr returned a result that wasn't yet added to the index. Nice!

Usage possibilities

You probably noticed that in order to fetch a document with the /get handler I needed to provide its unique identifier (or a list of identifiers). That's true: Realtime Get doesn't support searching, because it was not created to support full searching. This functionality lets us see updates to documents whose identifiers are known (for example, the ones already in the index), for instance by adding the component used by solr.RealTimeGetHandler to any of your defined handlers. And the good news is that you don't have to worry about performance: Realtime Get is very fast. So, if one of your problems is frequent updates, you can look to the future with a smile.
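As a small illustration of fetching by identifier, the Python sketch below builds /get URLs like the one used in the test. The helper name realtime_get_url and the second identifier are hypothetical, and passing several identifiers comma-separated in an ids parameter is an assumption about the handler that you should verify against your Solr version.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr"  # host and port used in the examples above


def realtime_get_url(*doc_ids):
    """Build a /get URL for one identifier or for a list of identifiers."""
    if not doc_ids:
        raise ValueError("at least one identifier is required")
    if len(doc_ids) == 1:
        params = {"id": doc_ids[0]}
    else:
        # Assumption: several identifiers are passed comma-separated as 'ids'.
        params = {"ids": ",".join(doc_ids)}
    return BASE_URL + "/get?" + urlencode(params)


print(realtime_get_url("SP2514N"))
# "EXAMPLE-ID-2" is a made-up identifier, used only to show the list form.
print(realtime_get_url("SP2514N", "EXAMPLE-ID-2"))
```

Note that urlencode percent-encodes the comma in the list form, which is equivalent to sending it unencoded.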

Last few words

The Realtime Get functionality brings new possibilities to Solr and is also a step on the road to SolrCloud. With the use of the transaction log one can implement automatic cluster node recovery or near-real-time instance updates. As you can see, Solr 4.0 is not only about search, but also about data storage and bringing Solr closer to NoSQL solutions.

Solr 4.0: DocTransformers first look

Rafał Kuć — Mon, 05 Dec 2011 20:55:51 +0000

In today’s entry we will look at the next feature that will come with version 4.0 of Apache Solr: the functionality which enables us to modify the fields in the Solr result list.

Do I need it ?

Until now, we didn’t have much choice when it comes to the results returned by Solr. When Solr 4.0 is released we will be given a new tool, the so-called DocTransformers. This feature enables us to modify the fields of the documents returned in the search results by Solr. Looking at what is available now, we can for example change the names of the returned fields or mark the documents that were added by the QueryElevationComponent. Right now there are only a few implementations, but implementing your own DocTransformer is not hard.

What is already available ?

At the exact moment we are writing this, the following transformers are available:

One that enables you to mark the documents that were added by the QueryElevationComponent.
One that enables you to add the explain information to the document.
One that enables you to add static value as a field of the document.
One that enables you to add the id of the shard from which the document was fetched.
One that enables you to add the docid as the document field (identifier used by Lucene).

How to use DocTransformers ?

Let’s look at how to use DocTransformers. To do that I downloaded the trunk version of Apache Solr (4.0) from the svn repository and ran the example deployment. Next, I indexed the example data and ran the following query:

http://localhost:8983/solr/select?q=encoded&fl=name,score,[docid],[explain]

If you look at the fl parameter you will notice that we told Solr that we want the name field in the results, the score of the document and two DocTransformers: [docid] and [explain]. As a result I got the following XML:



 
  0
  2
  
    encoded
    name,score,[docid],[explain]
  
 
 
 
  Test with some GB18030 encoded characters
  0.50524884
  0
  
  0.50524884 = (MATCH) weight(text:encoded in 0) [DefaultSimilarity], result of:
    0.50524884 = score(doc=0,freq=1.0 = termFreq=1), product of:
      1.0000001 = queryWeight, product of:
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.3092536 = queryNorm
      0.5052488 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.15625 = fieldNorm(doc=0)
  
 
 
  Test with some UTF-8 encoded characters
  0.4041991
  25
  
  0.4041991 = (MATCH) weight(text:encoded in 25) [DefaultSimilarity], result of:
    0.4041991 = score(doc=25,freq=1.0 = termFreq=1), product of:
      1.0000001 = queryWeight, product of:
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.3092536 = queryNorm
      0.40419903 = fieldWeight in 25, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1
        3.2335923 = idf(docFreq=2, maxDocs=28)
        0.125 = fieldNorm(doc=25)

As you can see, Solr did what we asked for.

Your own implementation

Let’s discuss how to implement your own DocTransformer. Below you have an example class named RenameFieldsTransformer from the org.apache.solr.response.transform package in the Apache Solr source code. In general, all you have to do is override the following two methods of the DocTransformer class from the org.apache.solr.response.transform package:

String getName() – method returning the transformer’s name,
void transform(SolrDocument doc, int docid) – method which makes the actual transformation.
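Before looking at the Java class, here is a rough Python analogue of what a renaming transform step does to a single document. It is only a sketch of the idea, not Solr code: the function name rename_fields, the rename pairs and the sample document are all hypothetical.

```python
def rename_fields(doc, rename):
    """Move each value from its old field name to its new one, in place."""
    for old_name, new_name in rename:
        # Fields missing from the document are simply skipped.
        if old_name in doc:
            doc[new_name] = doc.pop(old_name)
    return doc


# Hypothetical rename pairs and document, for illustration only.
pairs = [("name", "title"), ("manu", "manufacturer")]
document = {"id": "doc-1", "name": "Some product", "price": 92.0}
print(rename_fields(document, pairs))
```

The pairs play the role of the name and value entries that the Java transformer keeps in its rename variable.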

Implementation looks like this:

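package org.apache.solr.response.transform;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;

public class RenameFieldsTransformer extends DocTransformer {
 // Pairs of (current field name, new field name).
 final NamedList<String> rename;

 public RenameFieldsTransformer( NamedList<String> rename ) {
  this.rename = rename;
 }

 @Override
 public String getName() {
  // Builds a readable name such as Rename[old>>new,old2>>new2].
  StringBuilder str = new StringBuilder();
  str.append( "Rename[" );
  for( int i=0; i< rename.size(); i++ ) {
   if( i > 0 ) {
    str.append( "," );
   }
   str.append( rename.getName(i) ).append( ">>" ).append( rename.getVal( i ) );
  }
  str.append( "]" );
  return str.toString();
 }

 @Override
 public void transform(SolrDocument doc, int docid) {
  // The listing was cut off here; the loop body is reconstructed from the
  // description below: move each value from its old field name to the new one.
  for( int i=0; i< rename.size(); i++ ) {
   Object v = doc.remove( rename.getName(i) );
   if( v != null ) {
    doc.setField( rename.getVal(i), v );
   }
  }
 }
}
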
The code shown above enables us to rename the fields returned in the results. As you can see, the transform method iterates through all the values in the rename class variable. The rename variable consists of name and value pairs: the field name and the name it should have after the transformation. You must also remember that in order to use your own transformer you need to add its configuration to the solrconfig.xml file. Here is the example which can be found on the Solr wiki page:


To sum up

You should remember that the described functionality is marked as experimental and its behavior can change by the time Lucene and Solr 4.0 are released. We will get back to this topic as soon as Solr 4.0 is released.

Hierarchical faceting – Pivot facets in trunk

Rafał Kuć — Mon, 25 Oct 2010 12:17:29 +0000

In a large number of implementations I took part in, sooner or later the question arose: what can we do to get faceting as a tree structure? Of course there are some tricks for that; however, they required modifying the data and appropriate processing of the results on the application side. That was neither particularly functional nor especially comfortable. However, a few days ago Solr version 4.0 was enhanced with the code tracked as SOLR-792 in JIRA. Let’s see how to get the faceting results as a tree.

Important Note – at this point this functionality is only available in Solr 4.0, which is the development version. To use this version you need to download the code from the trunk of the Lucene/Solr SVN repository.

A few words at the beginning

In many projects I had the opportunity to work on there was a need for hierarchical faceting. One of the simplest examples is the requirement to show the cities within provinces, along with the number of documents both in each province and in each city. Until recently it was impossible to achieve such functionality without changing the structure of the data. Now it is possible.
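To show what is being asked for, here is a Python sketch that computes such a two-level count over a handful of documents. It only illustrates the expected result, not how Solr computes it; the function name pivot_counts, the field names and the sample documents are chosen purely for illustration.

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count documents per outer value and per inner value within it."""
    totals = Counter()
    nested = defaultdict(Counter)
    for doc in docs:
        totals[doc[outer]] += 1
        nested[doc[outer]][doc[inner]] += 1
    return {
        value: {"count": totals[value], "children": dict(nested[value])}
        for value in totals
    }


# Hypothetical documents, chosen only for illustration.
docs = [
    {"province": "Mazowieckie", "city": "Warszawa"},
    {"province": "Mazowieckie", "city": "Warszawa"},
    {"province": "Mazowieckie", "city": "Radom"},
    {"province": "Podlaskie", "city": "Białystok"},
]
for province, info in pivot_counts(docs, "province", "city").items():
    print(province, info["count"], info["children"])
```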

Indexing

In order not to complicate things unnecessarily, I decided to use the sample XML documents that are available in the /exampledocs directory of the example deployment. I also didn’t modify the schema.xml or solrconfig.xml files, so the configuration is the standard one. That’s all when it comes to configuration, so we can start the indexing process (I ran the command from the $SOLR_HOME/exampledocs/ directory):

./post.sh *.xml

After several screens of information, we have our data indexed.

The mechanism

It is not difficult to use hierarchical faceting. The Solr creators gave us two additional parameters on top of the ones we already know:

facet.pivot – a comma-separated list of fields, which says on which fields and in what order to calculate the structure,
facet.pivot.mincount – the minimum number of documents a value needs in order to be included in the faceting results. The default value is 1.

So let’s try it.

Queries

Let’s start with two fields. I query for all the documents in the index and add the parameter facet.pivot=cat,inStock to tell Solr that I want to get the results of hierarchical faceting, where the first level of the hierarchy is the cat field and the second level is the inStock field. The query looks as follows:

http://localhost:8983/solr/select/?q=*:*&facet=true&facet.pivot=cat,inStock

To shorten the listing I omitted the part responsible for the search results along with a header.



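.
.
.
field: cat, value: electronics, count: 17
  pivot:
    field: inStock, value: true, count: 13
    field: inStock, value: false, count: 4
field: cat, value: memory, count: 6
  pivot:
    field: inStock, value: true, count: 6
field: cat, value: connector, count: 2
  pivot:
    field: inStock, value: false, count: 2
field: cat, value: graphics card, count: 2
  pivot:
    field: inStock, value: false, count: 2
field: cat, value: hard drive, count: 2
  pivot:
    field: inStock, value: true, count: 2
field: cat, value: monitor, count: 2
  pivot:
    field: inStock, value: true, count: 2
field: cat, value: search, count: 2
  pivot:
    field: inStock, value: true, count: 2
field: cat, value: software, count: 2
  pivot:
    field: inStock, value: true, count: 2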

The presentation of faceting results is different in this case. For each first-level entry we have elements defining the field (the tag with the attribute name="field"), the value (the tag with the attribute name="value") and the number of documents (the tag with the attribute name="count"). Next comes the second level of the hierarchy (the tag with the attribute name="pivot"). The second level contains the same elements as the first: the field name, the value and the number of documents with the given value.
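On the application side, such a structure is easy to walk recursively. The Python sketch below assumes the pivot results have already been parsed into nested lists of dictionaries with the keys field, value, count and pivot; that shape is an assumption made for illustration, and the function name format_pivot is hypothetical. The sample data is a fragment of the results above.

```python
def format_pivot(entries, depth=0):
    """Return one indented line per pivot entry, descending into nested pivots."""
    lines = []
    for entry in entries:
        label = f"{entry['field']}={entry['value']} ({entry['count']})"
        lines.append("  " * depth + label)
        # Entries on the deepest level have no 'pivot' key.
        lines.extend(format_pivot(entry.get("pivot", []), depth + 1))
    return lines


# Fragment of the results above, written by hand in the assumed parsed shape.
sample = [
    {"field": "cat", "value": "electronics", "count": 17, "pivot": [
        {"field": "inStock", "value": "true", "count": 13},
        {"field": "inStock", "value": "false", "count": 4},
    ]},
    {"field": "cat", "value": "memory", "count": 6, "pivot": [
        {"field": "inStock", "value": "true", "count": 6},
    ]},
]
print("\n".join(format_pivot(sample)))
```

Because the function calls itself for every nested pivot, it works unchanged for deeper hierarchies.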

Let’s see how this mechanism deals with more levels of depth. To check that I ran the following query:

http://localhost:8983/solr/select/?q=*:*&facet=true&facet.pivot=cat,inStock,features

I omitted the response header and the search results, leaving the faceting results only. In addition, due to the length of the faceting results, I only show a single first-level entry:



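.
.
.
field: cat, value: electronics, count: 17
  pivot:
    field: inStock, value: true, count: 13
      pivot:
        field: features, value: 2, count: 7
        field: features, value: 3, count: 7
        field: features, value: lcd, count: 5
        field: features, value: x, count: 5
        field: features, value: ca, count: 4
        field: features, value: latenc, count: 4
        field: features, value: tft, count: 4
        field: features, value: v, count: 4
        field: features, value: 0, count: 3
        field: features, value: 1, count: 3
        field: features, value: 25, count: 3
        field: features, value: 30, count: 3
        field: features, value: 5, count: 3
        field: features, value: 7, count: 3
        field: features, value: 8, count: 3
        field: features, value: time, count: 3
        field: features, value: up, count: 3
        field: features, value: 000, count: 2
        field: features, value: 19, count: 2
        field: features, value: 20, count: 2
        field: features, value: 2336, count: 2
        field: features, value: 27, count: 2
        field: features, value: 275, count: 2
        field: features, value: 6, count: 2
        field: features, value: 75, count: 2
        field: features, value: activ, count: 2
        field: features, value: built, count: 2
        field: features, value: cach, count: 2
        field: features, value: color, count: 2
        field: features, value: flash, count: 2
        field: features, value: heat, count: 2
        field: features, value: heatspread, count: 2
        field: features, value: matrix, count: 2
        field: features, value: mb, count: 2
        field: features, value: ms, count: 2
        field: features, value: photo, count: 2
        field: features, value: resolut, count: 2
        field: features, value: seek, count: 2
        field: features, value: speed, count: 2
        field: features, value: spreader, count: 2
        field: features, value: unbuff, count: 2
        field: features, value: usb, count: 2
    field: inStock, value: false, count: 4
      pivot:
        field: features, value: 0, count: 2
        field: features, value: 1, count: 2
        field: features, value: 16, count: 2
        field: features, value: 2, count: 2
        field: features, value: 20, count: 2
        field: features, value: 3, count: 2
        field: features, value: 9, count: 2
        field: features, value: 90, count: 2
        field: features, value: adapt, count: 2
        field: features, value: car, count: 2
        field: features, value: clock, count: 2
        field: features, value: direct, count: 2
        field: features, value: directx, count: 2
        field: features, value: dual, count: 2
        field: features, value: dvi, count: 2
        field: features, value: express, count: 2
        field: features, value: gddr, count: 2
        field: features, value: ghz, count: 2
        field: features, value: gl, count: 2
        field: features, value: gpu, count: 2
        field: features, value: gpuvpu, count: 2
        field: features, value: hdtv, count: 2
        field: features, value: mb, count: 2
        field: features, value: mhz, count: 2
        field: features, value: open, count: 2
        field: features, value: opengl, count: 2
        field: features, value: out, count: 2
        field: features, value: pci, count: 2
        field: features, value: power, count: 2
        field: features, value: vpu, count: 2
        field: features, value: white, count: 2
        field: features, value: x, count: 2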

As the example shows, Solr had no problem calculating the hierarchy correctly in this case either. In terms of the data returned, the above example is almost the same as the previous one; it only contains one more level of depth.

A few words at the end

In my opinion this is one of the more useful features for an “ordinary” user. Unfortunately, so far it is only available in the development version of Solr. I have not found any information about whether there are plans to port this functionality to version 1.5 of Solr, which is the branch named branch_3x in SVN. However, the important thing is that this functionality was committed, and sooner or later Solr users will be able to use it.

Quick look – FieldCollapsing

Rafał Kuć — Mon, 20 Sep 2010 12:12:39 +0000

FieldCollapsing, or in other words grouping of search results, has just been committed to the svn repository. I decided to take a look at this functionality and see how it works.

I want to begin with a brief note: FieldCollapsing is only available in version 4.0 of Solr, which is the development version of the Solr project, and it’s rather unlikely to be ported to version 3.X.

FieldCollapsing – what is it ?

Imagine that our index contains information about companies from different cities. We want to show our users one (or, for example, two or three) companies from each city, and of course only the companies that meet the search criteria. How to do that? Just use the FieldCollapsing mechanism. It allows the returned results to be grouped based on field contents. Each group can be represented by a single document or by a fixed number of documents.

Parameters

As with most features available in Solr, the behavior of the FieldCollapsing mechanism can be configured through a number of parameters. Here they are:

group – setting this parameter to true enables FieldCollapsing mechanism. The default value is false.
group.field – the field whose contents the grouping is based on.
group.func – the function whose result the grouping is based on.
group.limit – the number of documents returned in each group. The default is 1.
group.sort – parameter specifying how to sort the documents within groups. The default value is score desc.

It is worth noting that the rows parameter passed to the query determines the number of groups returned in the search results, not the number of individual documents. The behaviour of the sort parameter also changes: it tells Solr how to sort groups, not individual documents. Groups will be sorted based on the field contents of the first document in each group.
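The interplay of rows, group.limit and the group ordering can be illustrated with a toy Python function. This is only a sketch of the behaviour described above, not how Solr implements grouping; the function name group_results, its parameters and the sample documents are hypothetical, and the input list is assumed to be already sorted in result order.

```python
def group_results(docs, field, rows=10, limit=1):
    """Group documents (already in result order) by the value of one field."""
    groups = {}
    for doc in docs:
        groups.setdefault(doc[field], []).append(doc)
    # Dicts keep insertion order, so groups are ordered by their first document.
    grouped = [
        {"groupValue": value, "docs": members[:limit]}  # 'limit' caps docs per group
        for value, members in groups.items()
    ]
    return grouped[:rows]  # 'rows' caps the number of groups, not documents


# Hypothetical companies, listed in the order the search would return them.
companies = [
    {"name": "A", "city": "Warszawa"},
    {"name": "B", "city": "Kraków"},
    {"name": "C", "city": "Warszawa"},
    {"name": "D", "city": "Gdańsk"},
]
for group in group_results(companies, "city", rows=2, limit=1):
    print(group["groupValue"], [doc["name"] for doc in group["docs"]])
```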

Search Results

Search results are different from those we are accustomed to. They are grouped according to the parameters that we have passed. The main element of the search results is no longer a document: when we use FieldCollapsing, the main element is a group of documents. Within each group the documents are shown (their number is defined by the group.limit parameter). For example, sending the following query:

http://localhost:8983/solr/select/?q=*:*&group=true&group.field=inStock&indent=true

to a Solr instance whose index was created by indexing all the XML documents from the exampledocs directory results in the following response:




  0
  0
  
    inStock
    true
    true
    *:*
  


  
    19
    
     
        T
        
          
            electronics
            hard drive
            7200RPM, 8MB cache, IDE Ultra ATA-133
            NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor
            SP2514N
            true
            Samsung Electronics Co. Ltd.
            2006-02-13T15:26:37Z
            Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133
            6
            92.0
            45.17614,-93.87341
            45.17614
            -93.87341
            45.17614,-93.87341
          
        
      
      
        F
        
          
            electronics
            connector
            car power adapter, white
            F8V7067-APL-KIT
            false
            Belkin
            2005-08-01T16:30:25Z
            Belkin Mobile Power Cord for iPod w/ Dock
            1
            19.95
            45.17614,-93.87341
            45.17614
            -93.87341
            45.17614,-93.87341
            4.0

At the end

This is an interesting feature that will certainly find use in some systems. However, please note that this functionality will be developed further. So far there is no support for distributed search or for grouping on multivalued fields. At this time there’s no point in performance testing, first because of the changes that will come to the mechanism, and second because this is Lucene and Solr 4.0, which are both in development. However, I will definitely be watching how this functionality evolves.