<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>query &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/query-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Sat, 14 Nov 2020 15:22:48 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>RankField &#038; Rank Query Parser</title>
		<link>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/</link>
					<comments>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 28 Sep 2020 14:22:14 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[query parser]]></category>
		<category><![CDATA[rank]]></category>
		<category><![CDATA[rankfield]]></category>
		<category><![CDATA[rankqueryparser]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=1036</guid>

					<description><![CDATA[One of the additions to Solr that we didn&#8217;t talk about yet is the new field type called the&#160;RankField&#160;and the&#160;Rank Query Parser&#160;that can leverage it. Together they can be used to introduce scoring based on the content of the document]]></description>
										<content:encoded><![CDATA[
<p>One of the additions to Solr that we haven&#8217;t talked about yet is the new field type called the&nbsp;<strong>RankField</strong>&nbsp;and the&nbsp;<strong>Rank Query Parser</strong>&nbsp;that can leverage it. Together they can be used to introduce scoring based on the content of the document in an optimized way. Let&#8217;s have a quick look at what this pair gives us.</p>



<span id="more-1036"></span>



<h2 class="wp-block-heading">The Idea Behind Rank Query Parser</h2>



<p>The idea behind the&nbsp;<strong>Rank Query Parser</strong>&nbsp;is that it lets us use information stored in the document to modify the score of the matching documents. It provides a subset of what the&nbsp;<strong>Function Query Parser</strong>&nbsp;already offers, but it can also be used with the BlockMax-WAND algorithm for improved query performance.</p>



<h2 class="wp-block-heading">The RankField</h2>



<p>Using&nbsp;<strong>RankField</strong>&nbsp;is very simple. We need to define the appropriate field type, a field using that field type, and of course, populate it with data. Let&#8217;s assume we have the following document structure:</p>



<pre class="wp-block-code"><code class="">{
  "id" : 1,
  "name": "RankField and RankQueryParser",
  "type": "post",
  "views": 1000 
}</code></pre>



<p>We have the document identifier, the name of the document, its type, and the number of views. We will be interested in the last field. In addition to using it for display purposes, we would also like to use it for ranking. Our schema could look as follows:</p>



<pre class="wp-block-code"><code class="">&lt;field name="id" type="string" />
&lt;field name="name" type="text_ws" />
&lt;field name="type" type="string" />
&lt;field name="views" type="rank" /></code></pre>



<p>We also need to define the&nbsp;<strong>rank</strong>&nbsp;type, which could look as follows:</p>



<pre class="wp-block-code"><code class="">&lt;fieldType name="rank" class="solr.RankField" /></code></pre>



<p>That is everything we need &#8211; we are ready to go.</p>
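<p>To populate the field, we index documents just like any others. A minimal sketch, assuming Solr runs locally on the default port and the collection is named <strong>test</strong> (both the port and the collection name are assumptions &#8211; adjust them to your setup):</p>

<pre class="wp-block-code"><code class="">curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/test/update?commit=true' --data-binary '[
  { "id": "1", "name": "RankField and RankQueryParser", "type": "post", "views": 1000 },
  { "id": "2", "name": "Lucene and Solr 8.6.1 were released", "type": "announcement", "views": 10 }
]'</code></pre>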



<h2 class="wp-block-heading">Using the Rank Query Parser</h2>



<p>To use the&nbsp;<strong>RankQueryParser</strong>&nbsp;and include the&nbsp;<strong>views</strong>&nbsp;field in the score calculation, we could run a query similar to the following one:</p>



<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log'}</code></pre>



<p>Knowing that we have two documents that look as follows:</p>



<pre class="wp-block-code"><code class="">[
  {
    "id" : 1,
    "name": "RankField and RankQueryParser",
    "type": "post",
    "views": 1000 
  },
  {
    "id" : 2,
    "name": "Lucene and Solr 8.6.1 were released",
    "type": "announcement",
    "views": 10
  }
]</code></pre>



<p>Our results would look like this:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3,
    "params":{
      "q":"_query_:{!rank f='views' function='log'}",
      "fl":"score,*"}},
  "response":{"numFound":2,"start":0,"maxScore":6.908755,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "name":"RankField and RankQueryParser",
        "type":"post",
        "_version_":1678886835690930176,
        "score":6.908755},
      {
        "id":"2",
        "name": "Lucene and Solr 8.6.1 were released",
        "type":"announcement",
        "_version_":1678886835758039040,
        "score":2.3978953}]
  }}</code></pre>



<p>You can see that even though we&#8217;ve run the&nbsp;<strong>match all</strong>&nbsp;query that gives a score of&nbsp;<strong>1.0</strong>&nbsp;to all matching documents, the score in our case is different. Solr took the&nbsp;<strong>log</strong>&nbsp;function and applied it to all matching results.</p>
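<p>As a quick sanity check of the numbers above: the <strong>log</strong> function appears to compute the natural logarithm of the field value plus the scaling factor, which defaults to <strong>1</strong>, so the observed scores match <em>ln(1 + views)</em>. A few lines of Python confirm this:</p>

```python
import math

# Scores returned by Solr for the log rank function (from the response above),
# keyed by the value of the "views" field.
observed = {1000: 6.908755, 10: 2.3978953}

for views, score in observed.items():
    # log function with the default scalingFactor of 1: ln(1 + value)
    computed = math.log(1 + views)
    assert abs(computed - score) < 1e-4, (views, computed, score)
    print(f"views={views}: ln(1 + views) = {computed:.6f}, Solr score = {score}")
```
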



<h2 class="wp-block-heading">Performance</h2>



<p>Of course, the above behavior can easily be achieved with the standard&nbsp;<strong>Function Query Parser</strong>, but the key point of the&nbsp;<strong>Rank Query Parser</strong>&nbsp;is that it can use the BlockMax-WAND algorithm to improve query performance. To do this we need to add the&nbsp;<strong>minExactCount</strong>&nbsp;parameter to our query to define how many hits must be counted exactly. Beyond that point, Solr may skip scoring documents that cannot enter the top N results of the query.</p>
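<p>For example, to tell Solr that a single exact hit is enough, we would extend our earlier query like this:</p>

<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log'}&amp;minExactCount=1&amp;fl=score,*</code></pre>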



<p>The response from Solr when the&nbsp;<strong>minExactCount</strong>&nbsp;parameter is used looks as follows:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":1,
    "params":{
      "q":"_query_:{!rank f='views' function='log'}",
      "fl":"score,*",
      "minExactCount":"1"}},
  "response":{"numFound":2,"start":0,"maxScore":6.908755,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "name":"RankField and RankQueryParser",
        "type":"post",
        "_version_":1678886835690930176,
        "score":6.908755},
      {
        "id":"2",
        "name":"Lucene and Solr 8.6.1 were released",
        "type":"announcement",
        "_version_":1678886835758039040,
        "score":2.3978953}]
  }}</code></pre>



<p>Note the&nbsp;<strong>numFoundExact</strong>&nbsp;attribute in the response &#8211; it tells us whether the reported&nbsp;<strong>numFound</strong>&nbsp;is an exact count or only a lower bound. We will talk about the BlockMax-WAND algorithm in Solr in a dedicated blog post in the next few weeks, so stay tuned if you would like to read about it. There are some pros and cons to it that I think are worth discussing.</p>



<h2 class="wp-block-heading">Available Functions</h2>



<p>At the time of writing there are three functions available that we can use with the&nbsp;<strong>Rank Query Parser</strong>:</p>



<ul class="wp-block-list"><li><strong>log</strong>&nbsp;&#8211; the logarithmic function, which accepts&nbsp;<strong>weight</strong>&nbsp;and&nbsp;<strong>scalingFactor</strong>&nbsp;attributes</li><li><strong>satu</strong>&nbsp;&#8211; the saturation function accepting the&nbsp;<strong>pivot</strong>&nbsp;and&nbsp;<strong>weight</strong>&nbsp;attributes</li><li><strong>sigm</strong>&nbsp;&#8211; the sigmoid function accepting the&nbsp;<strong>pivot</strong>,&nbsp;<strong>weight</strong>, and&nbsp;<strong>exponent</strong>&nbsp;attributes</li></ul>



<p>You can use one of those functions to scale the scoring factor and adjust how the rank field value affects the scoring.</p>
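<p>The attributes are passed as local parameters to the query parser. A sketch of what such queries could look like &#8211; note that the numeric values here are made up purely for illustration:</p>

<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log' weight='1.5' scalingFactor='2'}
q=_query_:{!rank f='views' function='satu' pivot='100' weight='1.2'}
q=_query_:{!rank f='views' function='sigm' pivot='100' weight='1.2' exponent='2'}</code></pre>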



<h2 class="wp-block-heading">Conclusions</h2>



<p>Though we already had the ability to include a function query in our queries and use a field value in scoring, we can now also benefit from the BlockMax-WAND algorithm. This improves query performance in situations where we don&#8217;t need the exact number of matches and are happy with only the top N results. Something worth considering.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>SolrCloud and query execution control</title>
		<link>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/</link>
					<comments>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 14 Jan 2019 14:27:53 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[control]]></category>
		<category><![CDATA[execution]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solrcloud]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=980</guid>

					<description><![CDATA[With the release of Solr 7.0 and introduction of new replica types, in addition to the defa?ult NRT type the question appeared &#8211; can we control the queries and where they are executed? Can we tell Solr to execute the]]></description>
										<content:encoded><![CDATA[
<p>With the release of Solr 7.0 and the introduction of new replica types in addition to the default NRT type, a question appeared &#8211; can we control where queries are executed? Can we tell Solr to execute the queries only on the PULL replicas or give TLOG replicas a priority? Let&#8217;s check that out.</p>



<span id="more-980"></span>



<h2 class="wp-block-heading">Shards parameter</h2>



<p>The first control option that we have in SolrCloud is the <em>shards</em> parameter. Using it we can directly control which shards should be used for querying. For example we can provide a logical shard name in our query:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=shard1</code></pre>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=shard1,shard2,shard3</code></pre>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=localhost:6683/solr/test</code></pre>



<p>The first of the above queries will be executed only on the replicas grouped under the logical <em>shard1</em> name. The second query will be executed on the logical <em>shard1</em>, <em>shard2</em> and <em>shard3</em>, while the third query will be executed on the shards of the <em>test</em> collection that are deployed on the <em>localhost:6683</em> node.</p>



<p>It is also possible to load balance across instances, for example:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=localhost:6683/solr/test|localhost:7783/solr/test</code></pre>



<p>The above query will be executed either on the instance running on port <em>6683</em> or on the one running on port <em>7783</em>. </p>



<h2 class="wp-block-heading">Shards.preference parameter</h2>



<p>While the <em>shards</em> parameter gives us some degree of control over where the query is executed, it is not exactly what we would like to have. To target a certain type of replica with it, we would have to know the physical layout of the shards, and that is not something we want to track. Because of that, the <em>shards.preference</em> parameter was introduced to Solr. It allows us to tell Solr which types of replicas should have priority when executing a query. </p>



<p>For example, to tell Solr that PULL type replicas should have priority when the query is executed one should add the <em>shards.preference</em> parameter to the query and set it to <em>replica.type:PULL</em>:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL</code></pre>



<p>The nice thing is that we can tell Solr to use PULL replicas first and, if they are not available, fall back to TLOG replicas:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.type:TLOG</code></pre>



<p>We can also specify that PULL type replicas should be used first and, if they are not available, that local shards should have priority:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.location:local</code></pre>



<p>In addition to the above examples, we can also define priority based on the location of the replicas. For example, if our <em>192.168.1.1</em> Solr node is far more powerful than the others and we would like to prioritize PULL replicas first and then that node, we would run the following query:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.location:http://192.168.1.1</code></pre>



<h2 class="wp-block-heading">Summary</h2>



<p>The discussed parameters, and <em>shards.preference</em> with its <em>replica.type</em> value in particular, can be very useful when we are using SolrCloud with different types of replicas. By telling Solr that we prefer PULL or TLOG replicas, we can lower the query pressure on the NRT replicas and thus improve the performance of the whole cluster. What&#8217;s more &#8211; dividing the replicas this way can help us achieve query performance close to what the Solr master &#8211; slave architecture provides, without sacrificing all the goodies that come with SolrCloud itself. </p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr 6.0 and graph traversal support</title>
		<link>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/</link>
					<comments>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 18 Apr 2016 13:02:05 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[6.0]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[traversal]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=889</guid>

					<description><![CDATA[One of the new features that are present in the recently released Solr 6.0 is the graph traversal query that allows us to work with graphs. Having a root set and relations between documents (like parent identifier of the document)]]></description>
										<content:encoded><![CDATA[<p>One of the new features that are present in the recently released <a href="http://solr.pl/en/2016/04/08/lucene-and-solr-6-0/">Solr 6.0</a> is the graph traversal query that allows us to work with graphs. Having a root set and relations between documents (like parent identifier of the document) we can use a single query to get multiple levels of joins in the same request. Let&#8217;s look at this new feature working both in old fashioned Solr master &#8211; slave as well as in SolrCloud.</p>
<p><span id="more-889"></span></p>
<p>For the purpose of this blog post we will use a very simple data set that we can index using a single command and a configuration that can be downloaded from our github account available at: <a href="https://github.com/solrpl/blog">https://github.com/solrpl/blog</a>.</p>
<h3>Creating the collection and indexing the data</h3>
<p>What we need first is to create the collection and index the data itself. We will start Solr using the following command:
</p>
<pre class="brush:xml">bin/solr start -c
</pre>
<p>This will launch our Solr instance in cloud mode. Now we need to send our configuration files to ZooKeeper, which we will do by running the following command:
</p>
<pre class="brush:xml">bin/solr zk -upconfig -n graph_test_config -z localhost:9983 -d graph_test/conf
</pre>
<p>Next we will create our <i>graph</i> collection by running the following command:
</p>
<pre class="brush:xml">curl -XGET 'http://localhost:8983/solr/admin/collections?action=CREATE&amp;name=graph&amp;numShards=2&amp;replicationFactor=1&amp;maxShardsPerNode=2&amp;collection.configName=graph_test_config'
</pre>
<p>Now, after the collection has been created, we can finally index the data by running the following command:
</p>
<pre class="brush:xml">curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/graph/update' --data-binary '{
 "add" : { "doc" : { "id" : "1", "name" : "Root document one" } },
 "add" : { "doc" : { "id" : "2", "name" : "Root document two" } },
 "add" : { "doc" : { "id" : "3", "name" : "Root document three" } },
 "add" : { "doc" : { "id" : "11", "parent_id" : "1", "name" : "First level document 1, child one" } },
 "add" : { "doc" : { "id" : "12", "parent_id" : "1", "name" : "First level document 1, child two" } },
 "add" : { "doc" : { "id" : "13", "parent_id" : "1", "name" : "First level document 1, child three" } },
 "add" : { "doc" : { "id" : "21", "parent_id" : "2", "name" : "First level document 2, child one" } },
 "add" : { "doc" : { "id" : "22", "parent_id" : "2", "name" : "First level document 2, child two" } },
 "add" : { "doc" : { "id" : "121", "parent_id" : "12", "name" : "Second level document 12, child one" } },
 "add" : { "doc" : { "id" : "122", "parent_id" : "12", "name" : "Second level document 12, child two" } },
 "add" : { "doc" : { "id" : "131", "parent_id" : "13", "name" : "Second level document 13, child three" } },
 "commit" : {}
}'
</pre>
<p>So our data has the following relations:</p>
<p><a href="http://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/graph-documents-layout/" rel="attachment wp-att-3768"><img fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-3768" src="http://solr.pl/wp-content/uploads/2016/04/Graph-Documents-Layout.png" alt="Graph Documents Layout" width="605" height="469"></a></p>
<p>Let&#8217;s try searching in the structure.</p>
<h3>Basic graph traversal query usage</h3>
<p>In its basic form, the graph traversal query needs a root set of documents, the identifier field, and the parent identifier field. For example, if we would like to find the root documents along with everything related to them across the next levels of relations, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id}name:"root document"
</pre>
<p>The documents returned for such a query would look as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;8&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="8" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;122&lt;/str&gt;
  &lt;str name="parent_id"&gt;12&lt;/str&gt;
  &lt;str name="name"&gt;Second level document 12, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026120343552&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As you can see, we&#8217;ve got our root documents returned (identifiers 1, 2 and 3) and we&#8217;ve got the related documents from the deeper levels as well, which is very nice.</p>
<h3>Filtering</h3>
<p>The results can be filtered using the <i>traversalFilter</i> property. The defined filter is applied at each join iteration. For example, if we would like to narrow the resulting documents to only those that contain the term <i>one</i> in the <i>name</i> field, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id traversalFilter=name:one}name:"root document"</pre>
<p>The results would be as follows:
</p>
<pre class="brush:xml">&lt;pre class="brush:xml"&gt;
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;7&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id traversalFilter=name:one}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="5" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As you can see, only the documents matching the filter were returned at each join level, along with the root document set, of course. It seems that the filter is working 🙂</p>
<h3>Returning the root or leafs</h3>
<p>Apart from filtering, we can also tell Solr to omit the root set documents and return only the leaves. For example, to omit the root set documents we would add the <i>returnRoot</i> property set to <i>false</i> (it defaults to <i>true</i>) to our query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id returnRoot=false}name:"root document"
</pre>
<p>The results are as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;10&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id returnRoot=false}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="5" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;122&lt;/str&gt;
  &lt;str name="parent_id"&gt;12&lt;/str&gt;
  &lt;str name="name"&gt;Second level document 12, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026120343552&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As we can see, the results no longer include the root documents.</p>
<p>If we are interested in the leaves only, we should add the <i>returnOnlyLeaf</i> parameter and set it to <i>true</i> (it defaults to <i>false</i>).</p>
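<p>A sketch of such a query, reusing our earlier filter:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id returnOnlyLeaf=true}name:"root document"
</pre>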
<h3>Controlling maximum depth</h3>
<p>Finally, using the <i>maxDepth</i> property we can control the maximum depth of the traversal. By default it is set to <i>-1</i>, which stands for unlimited. For example, if we are only interested in the first level of the graph, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id maxDepth=1}name:"root document"
</pre>
<p>The result includes only documents that are at most one join away from the documents in the root set:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;10&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id maxDepth=1}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="7" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<h3>Summary</h3>
<p>As you can see, in addition to the Parallel SQL over MapReduce functionality and the cross data center replication, we&#8217;ve got pretty neat graph traversal support in Solr 6.0. We haven&#8217;t had a chance to test the performance of the query on a larger data set, but we will try to come up with one, along with some sample queries, to see what we can expect from Solr when it comes to graph traversal query performance.</p>
<h3>Update</h3>
<p>We didn&#8217;t mention it earlier, but you can see that not all documents from our sample data set were included in the results. This is because our collection was created with two shards and we ran a distributed query. To avoid that, we can simply create the collection with a single shard and live with that until the graph query supports more 🙂</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Switch Query Parser &#8211; quick look</title>
		<link>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/</link>
					<comments>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 03 Jun 2013 11:59:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.2]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[query parser]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[switch]]></category>
		<category><![CDATA[SwitchQueryParser]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=556</guid>

					<description><![CDATA[The number of different query parsers in Solr was always an amusement for me. Can someone here say how many of them are currently available and say what are those? Anyway, in this entry we won&#8217;t be talking about all]]></description>
										<content:encoded><![CDATA[<p>The number of different query parsers in Solr was always an amusement for me. Can someone here say how many of them are currently available and say what are those? Anyway, in this entry we won&#8217;t be talking about all the query parsers available in Solr, but we will take a quick look at one of them &#8211; the <em>SwitchQueryParser</em> introduced in Solr 4.2.</p>
<p><span id="more-556"></span></p>
<h3>Logic behind the parser</h3>
<p>The logic of the&nbsp;<em>SwitchQueryParser</em> is quite simple &#8211; it allows handling some simple logic on the Solr side by choosing a sub-query based on a parameter value. For example, let&#8217;s say that we have an application that understands the following four values of the&nbsp;<em>priceRange</em> field:</p>
<ul>
<li><em>cheap</em> &#8211; when the price of the product (indexed in the&nbsp;<em>price</em> field) is lower than $10,</li>
<li><em>average</em> &#8211; when the price is between $10 and $30,</li>
<li><em>expensive</em> &#8211; when the product price is higher than $30,</li>
<li><em>all</em> &#8211; in case we want to return all the documents without looking at the price.</li>
</ul>
<p>We want to have this logic stored in Solr so that we don&#8217;t have to change our application or its configuration every time we want to change the above ranges. For this purpose we will use the <em>SwitchQueryParser</em>.</p>
<h3>Our query</h3>
<p>Let&#8217;s assume that our application will be able to send the following query:
</p>
<pre class="brush:bash">http://localhost:8983/solr/collection1/price?q=*:*&amp;priceRange=cheap</pre>
<p>We want the above query to return all the documents (<em>q=*:*</em>) narrowed down to only those that are cheap, which in our case means those with a price lower than $10 (the <em>priceRange=cheap</em> parameter).</p>
<h3>Solr configuration</h3>
<p>Of course we don&#8217;t want to send the actual <em>price</em> range in the query, because that wouldn&#8217;t make much sense. Because of that, we decided to alter the <em>solrconfig.xml</em> file and add a new SearchHandler named /<em>price</em>, configured as follows:
</p>
<pre class="brush:xml">&lt;requestHandler name="/price"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="priceRange"&gt;all&lt;/str&gt;
 &lt;/lst&gt;
 &lt;lst name="appends"&gt;
  &lt;str name="fq"&gt;{!switch case.all='price:[* TO *]' case.cheap='price:[0 TO 10]' case.average='price:[10 TO 30]' case.expensive='price:[30 TO *]' v=$priceRange}&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>As you can see, the configuration of our SearchHandler consists of two elements. In the <em>defaults</em> section we defined the default value for the <em>priceRange</em> parameter, which is <em>all</em>. In addition to that, we&#8217;ve defined a filter (<em>fq</em>) which uses the <em>SwitchQueryParser</em> (<em>!switch</em>). For each of the possible values the <em>priceRange</em> parameter can take (<em>v=$priceRange</em>) we defined a filter on the <em>price</em> field using the pattern <em>case.priceRangeValue=filter</em>. So, when the value of the <em>priceRange</em> parameter in the query is equal to <strong><em>cheap</em></strong>, Solr will use the filter defined by the <em>case.<strong>cheap</strong></em> part of the definition; when it is equal to <strong><em>expensive</em></strong>, Solr will use the filter defined by <em>case.<strong>expensive</strong></em>, and so on.</p>
<h3>What to remember</h3>
<p>There is one crucial thing to remember when using the described parser. In our case, if we send a <em>priceRange</em> parameter value different from the four mentioned above, Solr will return an error.</p>
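<p>One way to avoid that error (a sketch based on the parser&#8217;s <em>default</em> parameter, which &#8211; to the best of our knowledge &#8211; is used when none of the <em>case</em> values match) would be to extend our filter definition like this:
</p>
<pre class="brush:xml">&lt;str name="fq"&gt;{!switch default='price:[* TO *]' case.cheap='price:[0 TO 10]' case.average='price:[10 TO 30]' case.expensive='price:[30 TO *]' v=$priceRange}&lt;/str&gt;</pre>
<p>With such a definition any unrecognized <em>priceRange</em> value would simply fall back to matching all documents instead of causing an error.</p>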
<h3>Summary</h3>
<p>In my opinion the <em>SwitchQueryParser</em> is a nice addition, although it is a feature that will probably be used by a minority of users. However, taking into consideration that it can help implement some very basic logic and that it is simple (and thus not hungry for resources), there will be users who will find this query parser useful <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Do I have to look for maxBooleanClauses when using filters ?</title>
		<link>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/</link>
					<comments>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 19 Dec 2011 20:56:29 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[bool]]></category>
		<category><![CDATA[boolean]]></category>
		<category><![CDATA[clause]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[maxBooleanClauses]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=388</guid>

					<description><![CDATA[One of the configuration variables we can find in the solrconfig.xml file is&#160;maxBooleanClauses, which specifies the maximum number of boolean clauses that can be combined in a single query. The question is, do I have to worry about it when]]></description>
										<content:encoded><![CDATA[<p>One of the configuration variables we can find in the <em>solrconfig.xml</em> file is&nbsp;<em>maxBooleanClauses</em>, which specifies the maximum number of boolean clauses that can be combined in a single query. The question is: do I have to worry about it when using filters in Solr? Let&#8217;s try to answer that question without digging into the Lucene and Solr source code.</p>
<p><span id="more-388"></span></p>
<h3>Query</h3>
<p>Let&#8217;s assume that we have the following query we want to change:
</p>
<pre class="brush:xml">q=category:1 AND category:2 AND category:3 ... AND category:2000</pre>
<p>Sending such a query to a Solr instance with the default configuration would result in the following exception: &#8220;<em>too many boolean clauses</em>&#8221;. Of course we could increase the <em>maxBooleanClauses</em> variable in the <em>solrconfig.xml</em> file and get rid of the exception, but let&#8217;s try to do it another way:</p>
<h3>Let&#8217;s change the query to use filters</h3>
<p>So, we change the above query to use filters &#8211; the <em>fq</em> parameter:
</p>
<pre class="brush:xml">q=*:*&amp;fq=category:(1 2 3 ... 2000)</pre>
<p>We send the query to Solr and &#8230; the same situation happens again &#8211; an exception with the &#8220;<em>too many boolean clauses</em>&#8221; message. This happens because Solr still has to &#8220;calculate&#8221; the filter contents by running the appropriate query, which is again a single large boolean query. So, let&#8217;s modify the query once again:</p>
<h3>Final query change</h3>
<p>After the final modification our query should look like this:
</p>
<pre class="brush:xml">q=*:*&amp;fq=category:1&amp;fq=category:2&amp;fq=category:3&amp;....&amp;fq=category:2000</pre>
<p>After sending such a query we will get the search results (provided, of course, that there are documents matching the query in the index). This time Solr didn&#8217;t have to run a single, large boolean query and that&#8217;s why we didn&#8217;t exceed the <em>maxBooleanClauses</em> limit.</p>
<h3>To sum up</h3>
<p>As you can see, the answer to the question asked at the beginning of the post depends on the query we want to run. If our query uses the <em>AND</em> boolean operator we can use the <em>fq</em> parameter, because multiple <em>fq</em> parameters are combined using <em>AND</em>. However, if we need <em>OR</em> semantics we would have to raise the limit defined by the <em>maxBooleanClauses</em> configuration variable. But please remember that raising this limit can have a negative effect on performance and memory usage.</p>
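<p>As a side note &#8211; if you are reading this on a newer Solr version (4.10 or later, if we are not mistaken), the <em>terms</em> query parser can express the <em>OR</em> case without building a large boolean query at all, for example:
</p>
<pre class="brush:xml">q=*:*&amp;fq={!terms f=category}1,2,3,...,2000</pre>
<p>The values are passed as a comma-separated list, so the <em>maxBooleanClauses</em> limit is not an issue for such a filter.</p>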
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Quick look: frange</title>
		<link>https://solr.pl/en/2011/05/30/quick-look-frange/</link>
					<comments>https://solr.pl/en/2011/05/30/quick-look-frange/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 May 2011 18:46:57 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[frange]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[range]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=270</guid>

					<description><![CDATA[In Solr 1.4 there were a new type of queries presented the frange queries. This new type of queries let you search for a range of values. According to the Solr developers this queries should be much faster from normal]]></description>
										<content:encoded><![CDATA[<p>In Solr 1.4 a new type of query was introduced &#8211; the <em>frange</em> query. This type of query lets you search for a range of values. According to the Solr developers these queries should be much faster than normal range queries. I thought that I should run a simple test to see how much faster the new range queries can be expected to be.</p>
<p><span id="more-270"></span></p>
<h2>Querying Solr</h2>
<p>To use the <em>frange</em> query syntax we need to modify the query. So far, a range query over the data may look as follows:
</p>
<pre class="brush:xml">fq=test_si:[0+TO+10000]</pre>
<p>The same query made using the <em>frange </em>looks like this:
</p>
<pre class="brush:xml">fq={!frange l=0 u=10000}test_si</pre>
<p>Of course, it is also possible to send queries for data types other than numbers, for example:
</p>
<pre class="brush:xml">fq={!frange l=adam u=mariusz}name</pre>
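<p>The real power of <em>frange</em> (short for &#8220;function range&#8221;) is that the part after the closing brace can be any function query, not just a field name. For example, a filter like the following (using hypothetical <em>price</em> and <em>quantity</em> fields) keeps only the documents for which the product of the two fields falls into the given range:
</p>
<pre class="brush:xml">fq={!frange l=0 u=100}product(price,quantity)</pre>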
<h2>Performance</h2>
<p>The logic of the test is pretty simple. The structure of the index contains two fields: <em>id</em>, a unique identifier, and a <em>namestr</em> field (string) which contains the values I will query for. I generated about 100,000 distinct documents. In addition, the terms in the <em>namestr</em> field of each document are unique, so we can easily determine the percentage of terms covered by a given query. Then I started sending queries to Solr that covered a certain percentage of terms in the index. Each of the queries was sent multiple times; the execution times were summed and divided by the number of queries sent. The following table shows the test results:</p>
<p><em>[Results table not available in this archive.]</em></p>
<p>As you can see, a standard range query is faster only for queries that cover a small number of terms from the given field &#8211; the performance gain from using <em>frange</em> queries starts at about 5% of terms covered. Above that threshold we get a nice increase in query speed, which is encouraging.</p>
<h2>To sum up</h2>
<p>The results of my test differ in terms of performance from what Yonik Seeley wrote on his blog (my test data may be the cause of this), but what we can say is that with <em>frange</em> queries we can expect an increase in performance for queries that need to search over a range of values.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/05/30/quick-look-frange/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Optimization &#8211; query result window size</title>
		<link>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/</link>
					<comments>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 10 Jan 2011 08:06:02 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[queryResultCache]]></category>
		<category><![CDATA[queryResultWindowCache]]></category>
		<category><![CDATA[result]]></category>
		<category><![CDATA[size]]></category>
		<category><![CDATA[window]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=188</guid>

					<description><![CDATA[Hereby I would like to start a small series of articles describing the elements of the optimization of Solr instances. At first glance I decided to describe the parameter that specifies the data fetch window size &#8211; the query result]]></description>
										<content:encoded><![CDATA[<p>With this post I would like to start a small series of articles describing the elements of Solr instance optimization. To begin with, I decided to describe the parameter that specifies the data fetch window size &#8211; the query result window size. Hopefully, this article will explain how this parameter works and how to modify and adapt it to your needs.</p>
<p><span id="more-188"></span></p>
<h3>The beginning</h3>
<p>To start talking about this configuration parameter, we must first say how Solr fetches results from the index. When passing the <em>rows</em> parameter to Solr, with the value of 20 for example, we tell Solr that we want the result list to contain a maximum of 20 documents. However, the number of results actually taken from the index varies and is determined precisely by the <em>queryResultWindowSize</em> parameter. This parameter, defined in the <em>solrconfig.xml</em> file, determines how many results will be retrieved from the index and stored in <em>queryResultCache</em>.</p>
<h3>But what can I use <em>queryResultWindowSize</em> for?</h3>
<p>The <em>queryResultWindowSize</em> parameter specifies the size of the so-called results window, which is simply the number of documents that will be fetched from the index when retrieving search results. For example, setting <em>queryResultWindowSize</em> to 100 and sending the following query:
</p>
<pre class="brush:xml">q=car&amp;rows=30&amp;start=10</pre>
<p>will result in a maximum of 30 documents in the search result list; however, Solr will fetch 100 documents from the index (starting at index 0 and ending at index 99) and place them in <em>queryResultCache</em>. The next query, which differs only in the <em>start</em> and <em>rows</em> parameters, can then be served from <em>queryResultCache</em>.</p>
<h3>Configuration</h3>
<p>To set the <em>queryResultWindowSize </em>to the value of 100, you must add the following entry to the <em>solrconfig.xml </em>file:
</p>
<pre class="brush:xml">&lt;queryResultWindowSize&gt;100&lt;/queryResultWindowSize&gt;</pre>
<h3>What to remember?</h3>
<p>Of course, setting only the <em>queryResultWindowSize</em> is not everything. You should still provide adequate space in <em>queryResultCache</em> for Solr to be able to store the necessary information. However, <em>queryResultCache</em> configuration is a topic for another article.</p>
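<p>Just to give you a rough idea (detailed sizing is a topic for that other article), the <em>queryResultCache</em> is configured in the <em>solrconfig.xml</em> file and could look, for example, like this:
</p>
<pre class="brush:xml">&lt;queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/&gt;</pre>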
<h3>But why use it?</h3>
<p>The answer to that question is quite simple &#8211; if your application and your users often use paging, it is reasonable to consider changing the default value of the <em>queryResultWindowSize</em>. In most cases where the implementation was based on paging, changing the value of this parameter caused a significant increase in performance when switching between pages of the results.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The scope of Solr faceting</title>
		<link>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/</link>
					<comments>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 23 Aug 2010 12:07:58 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[date faceting]]></category>
		<category><![CDATA[facet]]></category>
		<category><![CDATA[facet method]]></category>
		<category><![CDATA[facet parameter]]></category>
		<category><![CDATA[facet query]]></category>
		<category><![CDATA[faceting]]></category>
		<category><![CDATA[local]]></category>
		<category><![CDATA[local params]]></category>
		<category><![CDATA[params]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[range faceting]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=69</guid>

					<description><![CDATA[Faceting is one of the ways to categorize the content found in the process of information retrieval. In case of Solr this is the division of set of documents on the basis of certain criteria: content of individual fields, queries]]></description>
										<content:encoded><![CDATA[<p>Faceting is one of the ways to categorize the content found in the process of information retrieval. In the case of Solr this is the division of the set of found documents on the basis of certain criteria: the contents of individual fields, queries, or ranges of numbers or dates. In today&#8217;s entry I will try to shed some light on the possibilities of the faceting mechanism, both what is currently available in Solr 1.4.1 and what will be available in the future.</p>
<p><span id="more-69"></span></p>
<p>One of the few sources of information about faceting is the Solr wiki &#8211; to be more specific, the page at: <a href="http://wiki.apache.org/solr/SimpleFacetParameters" target="_blank" rel="noopener noreferrer">http://wiki.apache.org/solr/SimpleFacetParameters</a>. The following article is an extension of the information available on the wiki.</p>
<p>Solr faceting mechanism can be divided into four basic types:</p>
<ul>
<li>field faceting,</li>
<li>query faceting,</li>
<li>date faceting,</li>
<li>range faceting.</li>
</ul>
<p>To turn the Solr faceting mechanism on, one needs to pass the <em>facet</em> parameter with the value <em>true</em>.</p>
<h3>Field faceting</h3>
<p>The first type of faceting. It categorizes the found documents by the values of a specified field. With this type of faceting we are able to get the number of documents found, for example, in each category or geographical location. Faceting by field is characterized by a large number of options which configure its behavior. These are the parameters available for use:</p>
<ul>
<li><em>facet.field</em> &#8211; parameter specifying which field will be  used to perform faceting. This parameter can be specified multiple  times. Remember that adding multiple <em>facet.field</em> parameters to the query can affect performance.</li>
<li><em>facet.prefix</em> &#8211; restricts faceting results to those that begin with the specified prefix. The parameter can be defined for each field specified by the <em>facet.field</em> parameter &#8211; you can do it by adding a parameter of the form <em>f.field_name.facet.prefix</em>. This parameter is a relatively simple way to implement an autocomplete mechanism.</li>
<li><em>facet.sort</em> &#8211; specifies how to sort faceting results. If you use a Solr version lower than 1.4, this parameter takes the values <em>true</em> or <em>false</em>, indicating, respectively, sorting by the number of results and sorting by index order (in the case of ASCII this means alphabetical sorting). If, however, you are using Solr version 1.4 or higher, you should use the <em>count</em> value (meaning the same as <em>true</em>) or the <em>index</em> value (meaning the same as <em>false</em>). The default value for this parameter is <em>true/count</em> when <em>facet.limit</em> is greater than 0 and <em>false/index</em> otherwise. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.limit</em> &#8211; parameter specifying how many unique values to include in the faceting results. A negative value for this parameter means no limit. Please note that the larger the limit, the more memory is needed and the longer the query execution takes. The default parameter value is 100. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.offset</em> &#8211; parameter defining the offset (counted from the first faceting result) at which the presented faceting results start. The default parameter value is 0. This parameter is designed to help implement paging over faceting results. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.mincount</em> &#8211; parameter specifying the minimum count a value must have to be included in the faceting results. The default value is 0. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.missing</em> &#8211; parameter specifying whether, in addition to standard faceting  results, number of documents without a value in the specified field  should be included. This parameter can take values of <em>true</em> or <em>false</em>. The default parameter value is <em>false</em>. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.method</em> &#8211; parameter introduced in Solr 1.4. It takes the value of <em>enum</em> or <em>fc</em> and specifies the method of faceting calculation. Setting this parameter to <em>enum</em> results in using term enumeration to calculate the faceting results. This method proves most efficient when dealing with fields with a small number of unique terms. The second method, labeled <em>fc</em> (field cache), is the standard method of faceting calculation: it iterates over all documents in the result set and collects the values of the given field. The parameter can be defined for each field specified by the <em>facet.field</em> parameter. The default value is <em>fc</em> for all fields not based on the <em>Boolean</em> type.</li>
<li><em>facet.enum.cache.minDf</em> &#8211; parameter with a strange-sounding name, specifying the minimum number of documents that must match a term for its filter to be cached in the filterCache when the <em>enum</em> method is used &#8211; filters for terms matching fewer documents are calculated without using the cache. It sounds strange, but that is more or less it <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></li>
</ul>
<p>These are the parameters of field faceting. For most of them I have written that there is a possibility to define their values for each field specified by the <em>facet.field</em> parameter. What does that look like in practice? Suppose we have a query like this:
</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field=category&amp;facet.field=location</pre>
<p>It is a simple query for the &#8216;solr&#8217; term with the faceting mechanism turned on. There are two facet fields defined &#8211; <em>category</em> and <em>location</em>. Let&#8217;s say that we would like to have 200 facet results for the <em>category</em> field sorted by count and 50 facet results for the <em>location</em> field sorted alphabetically. To do that we add the following fragment to the query shown above:
</p>
<pre class="brush:xml">f.category.facet.limit=200&amp;f.category.facet.sort=count&amp;f.location.facet.limit=50&amp;f.location.facet.sort=index</pre>
<p>As shown, we can easily modify the facet mechanism behavior for individual facet fields.</p>
<h3>Query faceting</h3>
<p>A faceting mechanism based on a single parameter &#8211; <em>facet.query</em> &#8211; to which we pass a query. The query passed to the parameter must be constructed so that the standard Lucene query parser can understand it. An example use of this parameter is querying for a group of prices, which could look like this:
</p>
<pre class="brush:xml">facet.query=price:[0+TO+100]</pre>
<p>Note, however, that each added <em>facet.query</em> parameter is another query run by Lucene, which means a performance cost. Many <em>facet.query</em> parameters in a single query can be painful for Solr.</p>
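<p>Still, a handful of <em>facet.query</em> parameters is perfectly fine &#8211; for example, a few price buckets can be calculated in a single request like this:
</p>
<pre class="brush:xml">q=*:*&amp;facet=true&amp;facet.query=price:[0+TO+100]&amp;facet.query=price:[100+TO+500]&amp;facet.query=price:[500+TO+*]</pre>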
<p>There is one more thing worth mentioning when talking about query faceting &#8211; there is a possibility to define your own parser for the <em>facet.query</em> parameter value. To use your own parser &#8211; for example one called <em>myParser</em> &#8211; the parameter passed to Solr should look like this:
</p>
<pre class="brush:xml">facet.query={!myParser}aaa</pre>
<h3>Date faceting</h3>
<p>New faceting functionality introduced in Solr 1.3. Date faceting allows you to calculate faceting results including all the intricacies of processing dates. Please note that date faceting can only be used with fields based on the type <em>solr.</em><em>DateField</em>. Now let&#8217;s get on with the parameters associated with date faceting:</p>
<ul>
<li><em> facet.date</em> &#8211; like <em>facet.field</em> parameter, this parameter is used to identify fields where dates faceting should be used. As  in the case of <em>facet.field</em> parameter you can specify this parameter  several times to allow date faceting on many fields in one  query.</li>
<li><em>facet.date.start &#8211; </em>parameter  specifying the lower limit of date on which the faceting calculation  should be started. This parameter can be defined for each field  specified by the <em>facet.date </em>parameter. This parameter is mandatory when using <em>facet.date</em> and should be defined for each <em>facet.date</em> parameter.</li>
<li><em>facet.date.end</em> &#8211; parameter defining the upper limit of the date, on which the faceting  calculation should be ended. This parameter can be defined for each  field specified by the <em>facet.date </em>parameter. This parameter is mandatory when using <em>facet.date </em>and should be defined for each <em>facet.date</em> parameter.</li>
<li><em>facet.date.gap</em> &#8211; parameter specifying date compartments to be generated for the defined boundaries. This parameter is mandatory when using <em>facet.date</em> and should be defined for each <em>facet.date</em> parameter. The parameter can be defined for each field specified by the <em>facet.date </em>parameter.</li>
<li><em>facet.date.hardend</em> &#8211; parameter taking values <em>true</em> and <em>false</em>, telling Solr what to do when the <em>facet.date.gap</em> parameter does not split the defined boundaries evenly. If we set this parameter to <em>true</em>, the last range will end exactly at <em>facet.date.end</em>, even if it is then smaller than the other ranges. If we set it to <em>false</em> (the default value), the last range generated by <em>facet.date.gap</em> will keep its full width and may extend beyond the boundary defined by the <em>facet.date.end</em> parameter. The parameter can be defined for each field specified by the <em>facet.date</em> parameter.</li>
<li><em>facet.date.other</em> &#8211; parameter specifying what values besides the standard ones (ranges)  should be added to results of date faceting. The parameter can be  defined for each field specified by the <em>facet.date </em>parameter. The parameter can take following values:
<ul>
<li><em>before </em>&#8211; in addition to the standard date faceting  results, there will be one more &#8211; number of documents with a date before  the one defined in the <em>facet.date.start</em> parameter,</li>
<li><em>after</em> &#8211; in addition to the standard date faceting results, there will be one  more &#8211; number of documents with the date after the one defined in the <em>facet.date.end</em> parameter,</li>
<li><em>between </em>&#8211; in addition to the standard date faceting results, there will be one more &#8211; number of documents with the date between <em>facet.date.start</em> and <em>facet.date.end</em> parameters,</li>
<li><em>all</em> &#8211; a shortcut to define all the above,</li>
<li><em>none </em>&#8211; none of the additional results will be added to date faceting results.</li>
</ul>
</li>
<li><em>facet.date.include</em> &#8211; parameter that will be introduced in Solr 4.0. It allows closing or opening the ranges defined by the boundaries and the gap. The parameter will accept the following values:
<ul>
<li><em>lower</em> &#8211; each of the resulting compartment will contain its lower limit,</li>
<li><em>upper</em> &#8211; each of the resulting compartment will contain its upper limit,</li>
<li><em>edge</em> &#8211; the first and last ranges will include their outer boundaries &#8211; that is, the lower boundary for the first range and the upper boundary for the last one,</li>
<li><em>outer</em> &#8211; a value specifying that the ranges defined by the <em>before</em> and <em>after</em> values of the <em>facet.date.other</em> parameter will include their boundaries, even if other ranges already include them,</li>
<li><em>all</em> &#8211; a parameter that causes the inclusion of all of the above options.</li>
</ul>
</li>
</ul>
<p>That is how we can modify the behavior of the date faceting. Now, some example of using this kind of faceting:
</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.date=addDate&amp;facet.date.start=NOW/DAY-30DAYS&amp;facet.date.end=NOW/DAY%2B30DAYS&amp;facet.date.gap=%2B1DAY</pre>
<p>What does the above query do? We turn the faceting mechanism on and define date faceting for the <em>addDate</em> field. What we want to get are the ranges between 30 days before today (<em>NOW/DAY-30DAYS</em>) and 30 days after today (<em>NOW/DAY+30DAYS</em>). The ranges will be one day wide.</p>
<h3>Range faceting</h3>
<p>Functionality which will be available in Solr 3.1. If someone wants to test it right now, both trunk and the 3.x branch have this functionality implemented. This method of faceting is a generalization of date faceting and works similarly &#8211; as a result we get a list of ranges constructed automatically based on the parameters. Here is the list of parameters that can be used to define range faceting behavior:</p>
<ul>
<li><em> facet.range </em>&#8211; like <em>facet.field</em> parameter, this parameter is used to identify fields where range faceting should be used. As  in the case of <em>facet.field</em> parameter you can specify this parameter  several times to allow range faceting on many fields in one  query.</li>
<li><em>facet.range.start &#8211; </em>parameter   specifying the lower limit of range on which the faceting calculation   should be started. This parameter can be defined for each field  specified  by the <em>facet.range </em>parameter. This parameter is mandatory when using <em>facet.range </em>and should be defined<em> </em>for each <em>facet.range</em> parameter.</li>
<li><em>facet.range.end</em> &#8211; parameter defining the upper limit of the range, on which the  faceting  calculation should be ended. This parameter can be defined for  each field specified  by the <em>facet.range </em>parameter. This parameter is mandatory when using <em>facet.range </em>and should be defined<em> </em>for each <em>facet.range</em> parameter.</li>
<li><em>facet.range.gap</em> &#8211; parameter specifying the size of the ranges to be generated within the defined boundaries. This parameter is mandatory when using <em>facet.range</em> and should be defined for each <em>facet.range</em> parameter. The parameter can be defined for each field specified by the <em>facet.range</em> parameter.</li>
<li><em>facet.range.hardend</em> &#8211; parameter taking values <em>true</em> and <em>false</em>, telling Solr what to do when the <em>facet.range.gap</em> parameter does not split the defined boundaries evenly. If we set this parameter to <em>true</em>, the last range will end exactly at <em>facet.range.end</em>, even if it is then smaller than the other ranges. If we set it to <em>false</em> (the default value), the last range generated by <em>facet.range.gap</em> will keep its full width and may extend beyond the boundary defined by the <em>facet.range.end</em> parameter. The parameter can be defined for each field specified by the <em>facet.range</em> parameter.</li>
<li><em>facet.range.other</em> &#8211; parameter specifying what counts besides the standard ones (the ranges themselves) should be added to the range faceting results. The parameter can be defined for each field specified by the <em>facet.range</em> parameter and can take the following values:
<ul>
<li><em>before</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values lower than the one defined in the <em>facet.range.start</em> parameter,</li>
<li><em>after</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values higher than the one defined in the <em>facet.range.end</em> parameter,</li>
<li><em>between</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values between the <em>facet.range.start</em> and <em>facet.range.end</em> parameters,</li>
<li><em>all</em> &#8211; a shortcut to define all the above,</li>
<li><em>none</em> &#8211; none of the additional results will be added to the range faceting results.</li>
</ul>
</li>
<li><em>facet.range.include</em> &#8211; parameter controlling whether the lower and upper bounds of the ranges defined by the boundaries and the gap are inclusive or exclusive. The parameter accepts the following values:
<ul>
<li><em>lower</em> &#8211; each of the resulting ranges will include its lower bound,</li>
<li><em>upper</em> &#8211; each of the resulting ranges will include its upper bound,</li>
<li><em>edge</em> &#8211; the first and last ranges will include their outer bounds &#8211; that is, the lower bound for the first range and the upper bound for the last range,</li>
<li><em>outer</em> &#8211; the <em>before</em> and <em>after</em> counts of the <em>facet.range.other</em> parameter will include the boundary values, even if other ranges already include them,</li>
<li><em>all</em> &#8211; a shortcut that enables all of the above options.</li>
</ul>
</li>
</ul>
<p>As you can see, the range faceting parameters are almost identical to those of date faceting, and the behavior is also almost identical. An example query using range faceting could be the following:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.range=price&amp;facet.range.start=0&amp;facet.range.end=1000&amp;facet.range.gap=100</pre>
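<p>To see the additional parameters in action, a query for the same hypothetical <em>price</em> field that also counts documents outside the defined boundaries (<em>facet.range.other</em>) and makes every range include its lower bound (<em>facet.range.include</em>) could look like this:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.range=price&amp;facet.range.start=0&amp;facet.range.end=1000&amp;facet.range.gap=100&amp;facet.range.other=all&amp;facet.range.include=lower</pre>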
<p>So, we went through all of the types of faceting. But that&#8217;s not all. Users of Solr version 1.4 and higher have the opportunity to use the so-called LocalParams.</p>
<h3>LocalParams and faceting</h3>
<p>Suppose we have the following requirement. We have a query that returns search results for the term &#8216;solr&#8217; and in which we have defined two filters, one for the category and one for the country of origin of the document. In addition to the search results we want to enable navigation through the regions and categories, but we would like them not to be dependent on each other. That is, we want to give the user the opportunity to navigate through the regions for the term &#8216;solr&#8217; without limiting it to the selected category, and vice versa. To do it in Solr version 1.3 or earlier, we would write the following queries:</p>
<pre class="brush:xml">q=solr&amp;fq=category:search&amp;fq=region:poland
q=solr&amp;facet=true&amp;facet.field=category&amp;facet.field=region</pre>
<p>Two queries, because on the one hand we have to get the narrowed search results, and on the other hand we need the faceting results not to be narrowed by the filters. For Solr version 1.4 or higher, we can shorten this to one query. For this purpose, we use the possibility of tagging filters and excluding the tagged parameters. First we change the query as follows:</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;fq={!tag=regionFQ}region:poland</pre>
<p>For now, the search results will not change. We added tags to the filters in the above query so we can later exclude them in faceting. Then we modify the second query as follows:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field={!ex=categoryFQ,regionFQ}category&amp;facet.field={!ex=categoryFQ,regionFQ}region</pre>
<p>So far the faceting results will not change. We added exclusions to the <em>facet.field</em> parameters, so filters named <em>categoryFQ</em> and <em>regionFQ</em> will not be taken into consideration when calculating faceting results.</p>
<p>Then we combine the modified query, so it should look as follows:
</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;fq={!tag=regionFQ}region:poland&amp;facet=true&amp;facet.field={!ex=categoryFQ,regionFQ}category&amp;facet.field={!ex=categoryFQ,regionFQ}region</pre>
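<p>As a side note, the <em>key</em> local parameter can be combined with the exclusion to rename a facet in the response. For example, the following sketch (with a hypothetical label <em>allCategories</em>) would return the unfiltered category counts under that name:</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;facet=true&amp;facet.field={!ex=categoryFQ key=allCategories}category</pre>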
<p>I&#8217;ll write more about LocalParams in future entries.</p>
<h3>A few words at the end</h3>
<p>I hope that this article has brought you closer to the possibilities of Solr faceting &#8211; both in earlier versions of Solr and in the present one, as well as in those coming in the near future.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>6 deadly sins in the context of query</title>
		<link>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/</link>
					<comments>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 11 Aug 2010 12:04:37 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[facet]]></category>
		<category><![CDATA[facet.limit]]></category>
		<category><![CDATA[facet.offset]]></category>
		<category><![CDATA[faceting]]></category>
		<category><![CDATA[how to query]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr query]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=62</guid>

					<description><![CDATA[In my work related to Lucene and Solr I have seen various queries. While in the case of Lucene, developer usually knows what he/she wants to achieve and use more or less optimal solution, but when it comes to Solr]]></description>
										<content:encoded><![CDATA[<p>In my work related to Lucene and Solr I have seen various queries. While in the case of Lucene, developer usually knows what he/she wants to achieve and use more or less optimal solution, but when it comes to Solr it is not always like this. Solr is a product which could theoretically be used by everyone, both the person who knows Java, one that does not have a broad and specialized technical knowledge, as well as programmer. Precisely because of that Solr is a product which is easy to run and use it, at least when it comes to simple functionalities. I suppose, that is why not many people are worried about reading Solr wiki or at least review the mailing list. As a result, sooner or later people tend to make mistakes. Those errors arise from various shortcomings &#8211; lack of knowledge about Solr, lack of skills, lack of experience or simply a lack of time and tight deadlines. Today I would like to show some major mistakes when submitting queries to Solr and how to avoid those mistakes.</p>
<p><span id="more-62"></span></p>
<h3>1. Lack of filters</h3>
<p>One of the fundamental errors that I encounter from time to time is the lack of filters, which in the context of a query means no <em>fq</em> parameter. Let us remember that filters are our friends 😉 Remember that thanks to filters the Solr cache is used more optimally. Filters do not affect the relevance of a document in the context of the query and the search results (the score factor), and thus we can perform filtering without fear of changing the score value of individual documents (useful for example in e-commerce for narrowing results to product groups).</p>
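<p>As a simple illustration, assuming a hypothetical <em>category</em> field, narrowing the results with a filter instead of an extra query clause lets the filterCache do the work without influencing the score:</p>
<pre class="brush:xml">q=solr&amp;fq=category:books</pre>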
<h3>2. Logical conditions and q parameter</h3>
<p>Another of the &#8220;sins&#8221; that I come across quite often is one closely related to the previous point. It is not a bug in the literal sense, but it is an area where a simple change can have a significant influence on performance. Assuming that the default logical operator is <em>OR</em>, imagine a query in the form of: <code>q=(java+design+patterns)+AND+category:books+AND+promotion:true+AND+publisher:ABC</code>. This query is correct from the perspective of the application logic &#8211; we get the appropriate group of search results. But what if we also want to optimally use the Solr cache and thus boost performance? The answer is quite simple &#8211; move some of the terms to filters. By changing our query to <code>q=java+design+patterns&amp;fq=category:books&amp;fq=promotion:true&amp;fq=publisher:ABC</code> Solr benefits from two types of cache &#8211; the queryResultCache to retrieve documents for the query in the <em>q</em> parameter and the filterCache for each of the filters. With this change we were able to optimize the query to use two types of cache and, in addition, to optimize the entries of the queryResultCache (due to the shortening of the <em>q</em> parameter).</p>
<h3>3. Huge numbers of facet queries</h3>
<p>Another &#8220;sin&#8221; associated with handling groups of documents. Quite often, especially in applications that can categorize products in many ways, I have met queries with a lot of <em>facet.query</em> parameters that correspond to the grouping of documents: grouping by price, location, product groups, and so on. A good example is grouping by price, where the business customer can set the price ranges for each category and then the application must group products by those ranges. This leads to queries that have 100, 200 or more <em>facet.query</em> parameters added. Please remember that each <em>facet.query</em> has an impact on performance, not to mention 100 or 200 of them. If we are interested in a quick response from Solr we cannot make such queries. In such cases, I always propose modifying the index structure if needed, and modifications are needed in most cases. Some modifications (like defining ranges at index time) allow eliminating tens or hundreds of <em>facet.query</em> parameters in favor of one <em>facet.field</em> parameter. But this method is associated with another problem &#8211; explaining to the customer why the &#8220;re-index button&#8221; must be pressed after the ranges change. As a rule, however, performance tests at high loads and with a large variety of queries speak for themselves.</p>
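<p>To sketch the difference, compare a query with many <em>facet.query</em> parameters (shown truncated) to one that uses a single <em>facet.field</em> on a hypothetical <em>price_range</em> field filled at index time:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.query=price:[0+TO+100]&amp;facet.query=price:[100+TO+200]&amp;facet.query=price:[200+TO+300]
q=solr&amp;facet=true&amp;facet.field=price_range</pre>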
<h3>4. Facet limits</h3>
<p>This problem appears where Solr meets business logic. An example of this &#8220;sin&#8221; is a simple list of categories that a customer wants to have displayed depending on the user&#8217;s location on the website. When we have a small number of categories we do not have a problem, but what about thousands of categories? Very often I have met the approach taken by developers to retrieve all categories from Solr (with the <em>facet.limit</em> parameter increased compared to the default value) and choose the right categories in the application that is using Solr. This approach can generate problems &#8211; first of all faceting requires memory, secondly aggregating the facet elements takes time, and of course getting all of the 50,000 categories with their counts can be painful for Solr. If we want fast queries, we should use the <em>facet.limit</em> parameter reasonably. If you need many facet results, try to build your application so it can use the <em>facet.offset</em> parameter and therefore use paging. If this is not possible, at least configure your container to have enough memory to handle parallel queries and be prepared for queries that take longer when a high <em>facet.limit</em> value is used.</p>
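<p>A sketch of facet paging on a hypothetical <em>category</em> field: each query fetches the next 20 facet values instead of all of them at once.</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field=category&amp;facet.limit=20&amp;facet.offset=0
q=solr&amp;facet=true&amp;facet.field=category&amp;facet.limit=20&amp;facet.offset=20</pre>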
<h3>5. Downloading of unnecessary data</h3>
<p>A very common problem is the retrieval of all information, not just what we need. Of course, the problem does not apply to deployments where Solr returns only, for example, the product ID. However, a large number of deployments that I have dealt with were based almost entirely on Solr, and hence the Solr index was made up of multiple stored fields. Developers using Solr very rarely used the <em>fl</em> parameter and the possibility of limiting the fields that are returned. In extreme cases, this led to problems with the amount of data that had to be sent over the network.</p>
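<p>For example, if the application only needs identifiers and names (assuming hypothetical <em>id</em> and <em>name</em> fields), a query limiting the returned fields would look like this:</p>
<pre class="brush:xml">q=solr&amp;fl=id,name</pre>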
<h3>6. Many requests to obtain count of groups of documents</h3>
<p>In some applications more important than the actual search capabilities was the navigation, where users can browse the document repository by its features, like department, category, subcategory, and so on. Very often, in addition to the names, there are also numbers displayed &#8211; the numbers of documents with a given feature. I have met cases where those numbers were obtained using a separate query for each feature. The effect &#8211; 100 categories displayed on a web page led to 100 separate queries to Solr. Do not go this way &#8211; modify the Solr index if you have to and use the facet mechanism instead. Maybe it will be more work at that time, but in the long run it is certainly worth it.</p>
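<p>As a sketch, a single facet query on a hypothetical <em>category</em> field returns the counts for all categories at once, replacing the separate per-category count queries:</p>
<pre class="brush:xml">q=*:*&amp;rows=0&amp;facet=true&amp;facet.field=category</pre>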
<h3>A few words at the end</h3>
<p>Please note that these are just examples that I think are fairly universal &#8211; at least ones which I quite often encountered during my work. They are not all the errors that happen when using Solr, but I hope I highlighted some of the mistakes people tend to make and how to avoid them.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr and PhraseQuery &#8211; phrase bonus in query stage</title>
		<link>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/</link>
					<comments>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 14 Jul 2010 09:19:38 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[boosting]]></category>
		<category><![CDATA[dismax]]></category>
		<category><![CDATA[edismax]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[phrase]]></category>
		<category><![CDATA[phrase query]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[standard]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=54</guid>

					<description><![CDATA[In the majority of system implementations I dealt with, sooner or later, there was a problem &#8211; search results tunning. One of the simplest ways to improve the search results quality was phrase boosting. Having the three most popular query]]></description>
										<content:encoded><![CDATA[<p>In the majority of system implementations I dealt with, sooner or later, there was a problem &#8211; search results tunning. One of the simplest ways to improve the search results quality was phrase boosting. Having the three most popular query parsers in Solr and the variety of parameters to control them I though it will be a good idea to check how they behave and how they affect performance.</p>
<p><span id="more-54"></span></p>
<p>In the current <em>trunk</em> of Solr we have three query parsers:</p>
<ul>
<li>Standard Solr Query Parser &#8211; default parser for Solr based on Lucene query parser</li>
<li>DisMax Query Parser</li>
<li>Extended DisMax Query Parser</li>
</ul>
<p>Each of the mentioned query parsers has its own capabilities in the case of phrase boosting at query time. I won&#8217;t mention index-time term proximity in this post &#8211; I&#8217;ll get back to it some other time. So, about the parsers now.</p>
<p><strong>Standard Solr Query Parser</strong></p>
<p>A parser based on the standard Lucene query parser, enhancing its parent&#8217;s capabilities. When it comes to phrase boosting, we don&#8217;t have much choice here. Let&#8217;s say that our system is a search system for a large Internet library, where users can rate books, leave comments and discuss books in the library forums. Our goal is to index all the data generated by the users and our suppliers and then present this data in our search results. When a user searches for &#8220;Java design patterns&#8221;&nbsp;we want to show him the books that have those words in a document. No problem, let&#8217;s make a Solr query like this:</p>
<p><code>q=java+design+patterns</code></p>
<p>So we get the results and we can say that our search engine is behaving well and we don&#8217;t need to improve the search quality. But I would add another part to the query &#8211; a part that would favor documents which contain the phrase (the words given in the query are next to each other in the document) in the searchable fields. It&#8217;s an easy step; our modified query would look like this:</p>
<p><code>q=java+design+patterns+OR+"java+design+patterns"^30</code></p>
<p>By adding that additional query part (<em>+OR+&#8221;java+design+patterns&#8221;^30</em>) we modified our search results &#8211; now the first position in our results is taken by the books which have the exact phrase in the searched fields. The Lucene query generated by the parser looks like this:</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;name:java name:design name:patterns PhraseQuery(name:"java design patterns"^30.0)&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;name:java name:design name:patterns name:"java design patterns"^30.0&lt;/str&gt;</pre>
<p>The search results for the above query are as follows:</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="q"&gt;java design patterns OR "java design patterns"^30&lt;/str&gt;
      &lt;str name="fl"&gt;score,id,name&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.2399161"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.2399161&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p><strong>DisMax Query Parser</strong></p>
<p>In addition to constructing queries in the manner described above, we can use the <strong>pf</strong> parameter and modify its behavior with the <strong>ps</strong> parameter. The <strong>pf</strong> parameter provides information about the fields in which phrases should be identified. The <strong>pf</strong> parameter is often used in a manner analogous to the <strong>qf</strong> parameter, specifying a list of searchable fields. In addition to that, we must specify the boost for the phrase, otherwise the default boost will be taken into consideration. The query using DisMax would look like this:</p>
<p><code>q=java+design+patterns&amp;defType=dismax&amp;qf=name&amp;pf=name^30&amp;ps=0</code></p>
<p>While the query passed to Lucene looks as follows:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;+((DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns)))~3) DisjunctionMaxQuery((name:"java design patterns"^30.0))&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;+(((name:java) (name:design) (name:patterns))~3) (name:"java design patterns"^30.0)&lt;/str&gt;</pre>
<p>The results for the query thus constructed are as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="pf"&gt;name^30&lt;/str&gt;
      &lt;str name="fl"&gt;id,name,score&lt;/str&gt;
      &lt;str name="q"&gt;java design patterns&lt;/str&gt;
      &lt;str name="qf"&gt;name&lt;/str&gt;
      &lt;str name="defType"&gt;dismax&lt;/str&gt;
      &lt;str name="ps"&gt;0&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.2399161"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.2399161&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>It is noteworthy that the order of the results for both methods is the same. This follows from the fact that the phrase has been identified only in the document with the id of 1. Note that there is no difference in the value of <em>score</em> for the first document between the two methods. Of course the other documents, located on positions from 2 to 5, are in both cases on the same positions, but have different <em>score</em> values because of the difference in the query passed to Lucene.</p>
<p>But I used the <strong>ps</strong> parameter (set to 0) and didn&#8217;t mention why I did it. When you use the <strong>pf</strong> (and <strong>pf2</strong>, but more on that later) parameter, the <strong>ps</strong> parameter means <em>phrase slop</em> &#8211; the maximum distance of the words from each other at which they still form a phrase. For instance, <strong>ps=2</strong> means that the words can be a maximum of two positions from each other to form a phrase. Note, however, that although with <strong>ps=2</strong> both &#8220;Java sample design patterns&#8221; and &#8220;Java design patterns&#8221; will match the phrase, the document entitled &#8220;Java design patterns&#8221; will have a bigger <em>score</em> value, because its terms are located closer together.</p>
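<p>For illustration, a DisMax query allowing the phrase terms to be up to two positions apart would look like this (same hypothetical <em>name</em> field as above):</p>
<pre class="brush:xml">q=java+design+patterns&amp;defType=dismax&amp;qf=name&amp;pf=name^30&amp;ps=2</pre>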
<p><strong>Extended DisMax Query Parser</strong></p>
<p>Unfortunately, without using trunk you cannot use eDisMax yet. But anyway, a query using the eDisMax <em>enhanced term proximity boosting</em> would look like this:</p>
<p><code>q=java+design+patterns&amp;defType=edismax&amp;qf=name&amp;pf2=name^30&amp;ps=0</code></p>
<p>The above query creates the following query to Lucene:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;+(DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns))) (DisjunctionMaxQuery((name:"java design"^30.0)) DisjunctionMaxQuery((name:"design patterns"^30.0)))&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;+((name:java) (name:design) (name:patterns)) ((name:"java design"^30.0) (name:"design patterns"^30.0))&lt;/str&gt;</pre>
<p>As can be seen, in addition to the standard DisjunctionMaxQuery produced by DisMax (as eDisMax is its expanded version), the extended DisMax parser also produced two additional queries &#8211; the ones responsible for <em>enhanced term proximity boosting</em>.&nbsp;The additional queries boost pairs of words created from the terms in the user query. In the presented case the created pairs were &#8220;java design&#8221; and &#8220;design patterns&#8221;. As you can guess, the most significant documents in the results list will be those containing both pairs, the next ones will have one of the pairs, and the remaining ones will have none. As proof I present the result of the above query sent to Solr:</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="fl"&gt;id,name,score&lt;/str&gt;
      &lt;str name="q"&gt;java design patterns&lt;/str&gt;
      &lt;str name="qf"&gt;name&lt;/str&gt;
      &lt;str name="pf2"&gt;name^30&lt;/str&gt;
      &lt;str name="defType"&gt;edismax&lt;/str&gt;
      &lt;str name="ps"&gt;0&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.1705827"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.1705827&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.3034844&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.3034844&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.014451639&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.014451639&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>As you can see, the first document has not changed its position. On the second and third places are the documents that have one of the pairs generated by the parser. As a result, the documents with ids 2 and 5 have the same <em>score</em> value. The result list is closed by the documents that only have the individual terms present in the searchable fields.</p>
<p><strong>Performance</strong></p>
<p>In any case, it must be taken into account that individual features will affect the performance of applications based on Solr, so I thought I would do a simple performance test. The assumptions of the test are quite simple &#8211; index data from Wikipedia and for each phrase boost method create five queries, each of the queries assembled from two to six tokens. Solr cache disabled, restart of Solr after each query. The result is the arithmetic mean of 10 repetitions of each test. Before the test results, a few words about the index:</p>
<ul>
<li>Number of documents in the index: 1,177,239</li>
<li>Number of segments: 1</li>
<li>Number of terms: 18,506,646</li>
<li>Number of term/document pairs: 230,297,212</li>
<li>Number of tokens: 418,135,268</li>
<li>The size of the index: 4.6GB (optimized)</li>
<li>Lucene version used to build the index: 4.0-dev 964000</li>
</ul>
<p>Phrases that were selected for each iteration of the test:</p>
<ul>
<li>Iteration I: &#8220;Great Peter&#8221;</li>
<li>Iteration II: &#8220;World War Two&#8221;</li>
<li>Iteration III: &#8220;World War Two Germany&#8221;</li>
<li>Iteration IV: &#8220;Move Time Eastern Poland Reformation&#8221;</li>
<li>Iteration V: &#8220;Change Winter Cloths To Summer Cloths Now&#8221;</li>
</ul>
<p>The results were as follows:</p>
<p>[table “1” not found /]<br />
</p>
<p>Please note that the reported results concern only performance and do not suggest which phrase boosting method to choose &#8211; the choice of method is a matter of requirements and implementation. As for the results, you can see that the DisMax method is the quickest one.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
