Solr and PhraseQuery – phrase bonus in query stage

In most of the implementations I have dealt with, sooner or later one problem came up: tuning the search results. One of the simplest ways to improve result quality was phrase boosting. With three popular query parsers in Solr and a variety of parameters to control them, I thought it would be a good idea to check how they behave and how they affect performance.

In the current trunk of Solr we have three query parsers:

Standard Solr Query Parser – default parser for Solr based on Lucene query parser
DisMax Query Parser
Extended DisMax Query Parser

Each of these query parsers has its own capabilities when it comes to phrase boosting at query time. I won't cover index-time term proximity in this post; I'll get back to it some other time. So, on to the parsers.

Standard Solr Query Parser

This parser is based on the standard Lucene query parser and extends its parent's capabilities. When it comes to phrase boosting, we don't have much choice. Let's say that our system is a search engine for a large Internet library, where users can rate books, leave comments and discuss books in the library forums. Our goal is to index all the data generated by the users and our suppliers and then present this data in our search results. When a user searches for “Java design patterns” we want to show them the books that contain those words. No problem, let's make a Solr query like this:

q=java+design+patterns

We get the results, and we could say that our search engine behaves well and leave search quality alone. But I would add another part to the query: one that favors documents which contain the phrase (the query words next to each other, in the same order) in the searchable fields. It's an easy step; our modified query would look like this:
q=java+design+patterns+OR+"java+design+patterns"^30

By adding the additional query part (+OR+"java+design+patterns"^30) we changed the search results: books that have the exact phrase in the searched fields now come first. The Lucene query generated by the parser looks like this:

<str name="parsedquery">name:java name:design name:patterns PhraseQuery(name:"java design patterns"^30.0)</str>
<str name="parsedquery_toString">name:java name:design name:patterns name:"java design patterns"^30.0</str>

The search results for the above query are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">0</int>
   <lst name="params">
      <str name="q">java design patterns OR "java design patterns"^30</str>
      <str name="fl">score,id,name</str>
   </lst>
</lst>
<result name="response" numFound="5" start="0" maxScore="1.2399161">
   <doc>
      <float name="score">1.2399161</float>
      <str name="id">1</str>
      <str name="name">Java design patterns</str>
   </doc>
   <doc>
      <float name="score">0.010219089</float>
      <str name="id">2</str>
      <str name="name">Design patterns java</str>
   </doc>
   <doc>
      <float name="score">0.010219089</float>
      <str name="id">3</str>
      <str name="name">Design java patterns</str>
   </doc>
   <doc>
      <float name="score">0.010219089</float>
      <str name="id">4</str>
      <str name="name">Patterns design java</str>
   </doc>
   <doc>
      <float name="score">0.010219089</float>
      <str name="id">5</str>
      <str name="name">Patterns java design</str>
   </doc>
</result>
</response>
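
The manual boost shown above is easy to generate in application code. What follows is only a minimal Python sketch: it assumes the user input contains no quotes or other characters that are special to the query parser, and the function name and the default boost of 30 are illustrative choices, not part of Solr.

```python
from urllib.parse import urlencode


def phrase_boosted_query(user_input, boost=30):
    """Return the user's terms followed by an OR'ed, boosted exact phrase.

    Assumption: user_input has no quotes or other query-syntax characters.
    """
    terms = user_input.split()
    phrase = " ".join(terms)
    return f'{phrase} OR "{phrase}"^{boost}'


q = phrase_boosted_query("java design patterns")
print(q)
# URL-encoded parameters, ready to append to a select request
print(urlencode({"q": q, "fl": "score,id,name"}))
```

The first line printed is the same q value that appears in the params section of the response above.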

DisMax Query Parser

In addition to constructing queries in the manner described above, we can use the pf parameter and modify its behavior with the ps parameter. The pf parameter lists the fields in which phrases will be looked for. It is often used in a manner analogous to the qf parameter, which specifies the list of searchable fields. We should also specify a boost for the phrase; otherwise the default boost will be used. The query using DisMax would look like this:

q=java+design+patterns&defType=dismax&qf=name&pf=name^30&ps=0

The query passed to Lucene looks as follows:

<str name="parsedquery">+((DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns)))~3) DisjunctionMaxQuery((name:"java design patterns"^30.0))</str>
<str name="parsedquery_toString">+(((name:java) (name:design) (name:patterns))~3) (name:"java design patterns"^30.0)</str>

The results for this query are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">0</int>
   <lst name="params">
      <str name="pf">name^30</str>
      <str name="fl">id,name,score</str>
      <str name="q">java design patterns</str>
      <str name="qf">name</str>
      <str name="defType">dismax</str>
      <str name="ps">0</str>
   </lst>
</lst>
<result name="response" numFound="5" start="0" maxScore="1.2399161">
   <doc>
      <float name="score">1.2399161</float>
      <str name="id">1</str>
      <str name="name">Java design patterns</str>
   </doc>
   <doc>
      <float name="score">0.013625451</float>
      <str name="id">2</str>
      <str name="name">Design patterns java</str>
   </doc>
   <doc>
      <float name="score">0.013625451</float>
      <str name="id">3</str>
      <str name="name">Design java patterns</str>
   </doc>
   <doc>
      <float name="score">0.013625451</float>
      <str name="id">4</str>
      <str name="name">Patterns design java</str>
   </doc>
   <doc>
      <float name="score">0.013625451</float>
      <str name="id">5</str>
      <str name="name">Patterns java design</str>
   </doc>
</result>
</response>

It is noteworthy that the order of the results is the same for both methods. This follows from the fact that the phrase was found only in the document with id 1. Note that the score of the first document is identical in both methods. The other documents, in positions 2 to 5, are also in the same positions in both cases, but they have different score values because of the difference in the query passed to Lucene.
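
For comparison with the hand-built query, the DisMax request can be assembled from the raw user input without touching the q value at all. This is a small sketch only; the helper name and its defaults (field name, boost of 30, slop of 0) are assumptions taken from the example in this post.

```python
from urllib.parse import urlencode


def dismax_params(user_input, field="name", phrase_boost=30, slop=0):
    """Build request parameters that boost whole-phrase matches via pf."""
    return {
        "q": user_input,
        "defType": "dismax",
        "qf": field,                      # fields searched for the single terms
        "pf": f"{field}^{phrase_boost}",  # fields (with boost) checked for the phrase
        "ps": str(slop),                  # phrase slop applied to the pf phrase
    }


params = dismax_params("java design patterns")
print(urlencode(params))
```

The practical advantage is that the phrase boost lives in parameters, so it can be changed (or moved into request handler defaults) without rewriting the user's query.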

I set the ps parameter to 0 above without explaining why. When you use the pf parameter (and pf2, but more on that later), ps stands for phrase slop: the maximum distance the words may be from each other and still form a phrase. For instance, ps=2 means that the words can be at most two positions away from each other. Note, however, that although both “Java sample design patterns” and “Java design patterns” will match as a phrase with ps=2, the document entitled “Java design patterns” will get the higher score, because its terms are located closer together.
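
To get a feeling for slop values, here is a simplified model of how much slop a title needs before it matches the phrase. It is a sketch based on my understanding of sloppy phrase matching, not Lucene's actual implementation, and it assumes that every query term occurs exactly once in the title. The function name is made up for this example.

```python
def required_slop(query_terms, doc_terms):
    """Smallest slop at which the terms match as a phrase (simplified model).

    Assumption: every query term occurs exactly once in doc_terms.
    Each term's position in the document is shifted back by its position
    in the query; the spread of the shifted values is the slop needed.
    """
    shifted = [doc_terms.index(t) - i for i, t in enumerate(query_terms)]
    return max(shifted) - min(shifted)


query = ["java", "design", "patterns"]
for title in ("java design patterns",
              "java sample design patterns",
              "design java patterns"):
    print(title, "->", required_slop(query, title.split()))
```

In this model the exact phrase needs a slop of 0, the title with one extra word needs 1, and swapping two neighbouring words needs 2, which is consistent with only document 1 receiving the phrase boost when ps=0.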

Extended DisMax Query Parser

Unfortunately, at the time of writing you cannot use eDisMax without running trunk. Still, a query using the eDisMax enhanced term proximity boosting would look like this:

q=java+design+patterns&defType=edismax&qf=name&pf2=name^30&ps=0

The above query creates the following query to Lucene:

<str name="parsedquery">+(DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns))) (DisjunctionMaxQuery((name:"java design"^30.0)) DisjunctionMaxQuery((name:"design patterns"^30.0)))</str>
<str name="parsedquery_toString">+((name:java) (name:design) (name:patterns)) ((name:"java design"^30.0) (name:"design patterns"^30.0))</str>

As you can see, in addition to the standard DisjunctionMaxQuery clauses produced by DisMax (eDisMax being its extended version), the extended DisMax parser also produced two additional queries, the ones responsible for enhanced term proximity boosting. These additional queries boost pairs of words built from the terms in the user query. In this case the generated pairs were “java design” and “design patterns”. As you can guess, the documents at the top of the result list will be those containing both pairs, followed by documents containing one of the pairs, and finally those containing none. As proof, here is the result of the above query sent to Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">0</int>
   <lst name="params">
      <str name="fl">id,name,score</str>
      <str name="q">java design patterns</str>
      <str name="qf">name</str>
      <str name="pf2">name^30</str>
      <str name="defType">edismax</str>
      <str name="ps">0</str>
   </lst>
</lst>
<result name="response" numFound="5" start="0" maxScore="1.1705827">
   <doc>
      <float name="score">1.1705827</float>
      <str name="id">1</str>
      <str name="name">Java design patterns</str>
   </doc>
   <doc>
      <float name="score">0.3034844</float>
      <str name="id">2</str>
      <str name="name">Design patterns java</str>
   </doc>
   <doc>
      <float name="score">0.3034844</float>
      <str name="id">5</str>
      <str name="name">Patterns java design</str>
   </doc>
   <doc>
      <float name="score">0.014451639</float>
      <str name="id">3</str>
      <str name="name">Design java patterns</str>
   </doc>
   <doc>
      <float name="score">0.014451639</float>
      <str name="id">4</str>
      <str name="name">Patterns design java</str>
   </doc>
</result>
</response>

As you can see, the first document has not changed its position. The second and third places are taken by the documents that contain one of the pairs generated by the parser, which is why the documents with id 2 and 5 have the same score. The list is closed by the documents that only have the individual terms in the searchable fields.
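
The grouping in this result list can be reproduced with a few lines of Python. The sketch below only imitates the pair generation and exact (ps=0) pair matching on lowercased, whitespace-split titles; it does not compute scores, and both function names are invented for the illustration.

```python
def word_pairs(terms):
    """Adjacent pairs of terms, in the way pf2 pairs up the query words."""
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]


def matched_pairs(title, pairs):
    """Query pairs that occur as adjacent words in the title (ps=0)."""
    title_pairs = set(word_pairs(title.lower().split()))
    return [p for p in pairs if p in title_pairs]


pairs = word_pairs("java design patterns".split())
titles = {
    1: "Java design patterns",
    2: "Design patterns java",
    3: "Design java patterns",
    4: "Patterns design java",
    5: "Patterns java design",
}
for doc_id, title in titles.items():
    print(doc_id, len(matched_pairs(title, pairs)))
```

Document 1 matches both pairs, documents 2 and 5 match one pair each, and documents 3 and 4 match none, which mirrors the three score levels in the response.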

Performance

Whichever method you choose, keep in mind that each of these features affects the performance of applications based on Solr, so I decided to run a simple performance test. The assumptions are quite simple: index data from Wikipedia and, for each phrase boosting method, create five queries made of two to six tokens. Solr caches were disabled and Solr was restarted after each query. Each result is the arithmetic mean of 10 repetitions of the test. Before the test results, a few words about the index:

Number of documents in the index: 1,177,239
Number of segments: 1
Number of terms: 18,506,646
Number of term/document pairs: 230,297,212
Number of tokens: 418,135,268
The size of the index: 4.6 GB (optimized)
Lucene version used to build the index: 4.0-dev 964000

Phrases that were selected for each iteration of the test:

Iteration I: “Great Peter”
Iteration II: “World War Two”
Iteration III: “World War Two Germany”
Iteration IV: “Move Time Eastern Poland Reformation”
Iteration V: “Change Winter Cloths To Summer Cloths Now”
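
A test of this shape could be driven by a small harness along the following lines. This is a hedged sketch, not the code used for the measurements: the field name, the boost of 30 and ps=0 are carried over from the earlier examples as assumptions, run_query stands for any hypothetical callable that sends one request to Solr, and restarting Solr and disabling its caches are not covered here.

```python
from statistics import mean
import time


def average_seconds(run_query, repetitions=10):
    """Arithmetic mean of wall-clock time over repeated calls of run_query."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # hypothetical callable that issues one Solr request
        timings.append(time.perf_counter() - start)
    return mean(timings)


def boost_variants(phrase, field="name", boost=30):
    """Request parameters for the three boosting methods compared here."""
    return {
        "standard": {"q": f'{phrase} OR "{phrase}"^{boost}'},
        "dismax": {"q": phrase, "defType": "dismax", "qf": field,
                   "pf": f"{field}^{boost}", "ps": "0"},
        "edismax": {"q": phrase, "defType": "edismax", "qf": field,
                    "pf2": f"{field}^{boost}", "ps": "0"},
    }


for method, params in boost_variants("World War Two").items():
    print(method, params)
```

Measuring on the client side includes network and response-writing time; reading QTime from the response header instead would isolate the time spent in Solr itself.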

The results were as follows:

[table “1” not found /]

Please note that the reported results concern performance only and do not recommend any particular method of phrase boosting. The choice of method is a matter of requirements and implementation. As for the results, you can see that the DisMax method is the quickest one.

Solr.pl
