<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>query &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/query-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Sat, 14 Nov 2020 15:22:48 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>RankField &#038; Rank Query Parser</title>
		<link>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/</link>
					<comments>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 28 Sep 2020 14:22:14 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[query parser]]></category>
		<category><![CDATA[rank]]></category>
		<category><![CDATA[rankfield]]></category>
		<category><![CDATA[rankqueryparser]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=1036</guid>

					<description><![CDATA[One of the additions to Solr that we didn&#8217;t talk about yet is the new field type called the&#160;RankField&#160;and the&#160;Rank Query Parser&#160;that can leverage it. Together they can be used to introduce scoring based on the content of the document]]></description>
										<content:encoded><![CDATA[
<p>One of the additions to Solr that we haven&#8217;t talked about yet is the new field type called the&nbsp;<strong>RankField</strong>&nbsp;and the&nbsp;<strong>Rank Query Parser</strong>&nbsp;that can leverage it. Together they can be used to introduce scoring based on the content of the document in an optimized way. Let&#8217;s have a quick look at what this pair gives us.</p>



<span id="more-1036"></span>



<h2 class="wp-block-heading">The Idea Behind Rank Query Parser</h2>



<p>The idea behind the&nbsp;<strong>Rank Query Parser</strong>&nbsp;is that it lets us use information stored in the document to modify the score of the matching documents. It provides a subset of what the&nbsp;<strong>Function Query Parser</strong>&nbsp;already offers, but it can also be used with the BlockMax-WAND algorithm for improved query performance.</p>



<h2 class="wp-block-heading">The RankField</h2>



<p>Using&nbsp;<strong>RankField</strong>&nbsp;is very simple. We need to define the appropriate field type, a field using that field type, and of course, populate it with data. Let&#8217;s assume we have the following document structure:</p>



<pre class="wp-block-code"><code class="">{
  "id" : 1,
  "name": "RankField and RankQueryParser",
  "type": "post",
  "views": 1000 
}</code></pre>



<p>We have the document identifier, the name of the document, its type, and the number of views. We will be interested in the last field. In addition to using it for display purposes, we would also like to use it for ranking. Our schema could look as follows:</p>



<pre class="wp-block-code"><code class="">&lt;field name="id" type="string" />
&lt;field name="name" type="text_ws" />
&lt;field name="type" type="string" />
&lt;field name="views" type="rank" /></code></pre>



<p>We also need to define the&nbsp;<strong>rank</strong>&nbsp;type, which could look as follows:</p>



<pre class="wp-block-code"><code class="">&lt;fieldType name="rank" class="solr.RankField" /></code></pre>



<p>That is everything we need &#8211; we are ready to go.</p>
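<p>To populate the field, we index documents just like any others. A minimal sketch, assuming Solr runs locally on the default port and the collection is named <strong>test</strong> (both the port and the collection name are assumptions &#8211; adjust them to your setup):</p>

<pre class="wp-block-code"><code class="">curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/test/update?commit=true' --data-binary '[
  { "id": "1", "name": "RankField and RankQueryParser", "type": "post", "views": 1000 },
  { "id": "2", "name": "Lucene and Solr 8.6.1 were released", "type": "announcement", "views": 10 }
]'</code></pre>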



<h2 class="wp-block-heading">Using the Rank Query Parser</h2>



<p>To use the&nbsp;<strong>RankQueryParser</strong>&nbsp;and include the&nbsp;<strong>views</strong>&nbsp;field in the score calculation, we could run a query similar to the following one:</p>



<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log'}</code></pre>



<p>Knowing that we have two documents that look as follows:</p>



<pre class="wp-block-code"><code class="">[
  {
    "id" : 1,
    "name": "RankField and RankQueryParser",
    "type": "post",
    "views": 1000 
  },
  {
    "id" : 2,
    "name": "Lucene and Solr 8.6.1 were released",
    "type": "announcement",
    "views": 10
  }
]</code></pre>



<p>Our results would look like this:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3,
    "params":{
      "q":"_query_:{!rank f='views' function='log'}",
      "fl":"score,*"}},
  "response":{"numFound":2,"start":0,"maxScore":6.908755,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "name":"RankField and RankQueryParser",
        "type":"post",
        "_version_":1678886835690930176,
        "score":6.908755},
      {
        "id":"2",
        "name": "Lucene and Solr 8.6.1 were released",
        "type":"announcement",
        "_version_":1678886835758039040,
        "score":2.3978953}]
  }}</code></pre>



<p>You can see that even though we&#8217;ve run the&nbsp;<strong>match all</strong>&nbsp;query that gives a score of&nbsp;<strong>1.0</strong>&nbsp;to all matching documents, the score in our case is different. Solr took the&nbsp;<strong>log</strong>&nbsp;function and applied it to all matching results.</p>
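<p>As a quick sanity check of the numbers above: the <strong>log</strong> function appears to compute the natural logarithm of the field value plus the scaling factor, which defaults to <strong>1</strong>, so the observed scores match <em>ln(1 + views)</em>. A few lines of Python confirm this:</p>

```python
import math

# Scores returned by Solr for the log rank function (from the response above),
# keyed by the value of the "views" field.
observed = {1000: 6.908755, 10: 2.3978953}

for views, score in observed.items():
    # log function with the default scalingFactor of 1: ln(1 + value)
    computed = math.log(1 + views)
    assert abs(computed - score) < 1e-4, (views, computed, score)
    print(f"views={views}: ln(1 + views) = {computed:.6f}, Solr score = {score}")
```
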



<h2 class="wp-block-heading">Performance</h2>



<p>Of course, the above behavior can easily be achieved with the standard&nbsp;<strong>Function Query Parser</strong>, but the key point of the&nbsp;<strong>Rank Query Parser</strong>&nbsp;is that it can use the BlockMax-WAND algorithm to improve query performance. To do this we need to add the&nbsp;<strong>minExactCount</strong>&nbsp;parameter to our query to define how many hits must be counted exactly. Beyond that point, Solr may skip scoring documents that cannot enter the top N results of the query.</p>
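<p>For example, to tell Solr that a single exact hit is enough, we would extend our earlier query like this:</p>

<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log'}&amp;minExactCount=1&amp;fl=score,*</code></pre>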



<p>The response from Solr when the&nbsp;<strong>minExactCount</strong>&nbsp;parameter is used looks as follows:</p>



<pre class="wp-block-code"><code class="">{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":1,
    "params":{
      "q":"_query_:{!rank f='views' function='log'}",
      "fl":"score,*",
      "minExactCount":"1"}},
  "response":{"numFound":2,"start":0,"maxScore":6.908755,"numFoundExact":true,"docs":[
      {
        "id":"1",
        "name":"RankField and RankQueryParser",
        "type":"post",
        "_version_":1678886835690930176,
        "score":6.908755},
      {
        "id":"2",
        "name":"Lucene and Solr 8.6.1 were released",
        "type":"announcement",
        "_version_":1678886835758039040,
        "score":2.3978953}]
  }}</code></pre>



<p>Note the&nbsp;<strong>numFoundExact</strong>&nbsp;attribute in the response &#8211; it tells us whether the reported&nbsp;<strong>numFound</strong>&nbsp;is an exact count or only a lower bound. We will talk about the BlockMax-WAND algorithm in Solr in a dedicated blog post in the next few weeks, so stay tuned if you would like to read about it. There are some pros and cons to it that I think are worth discussing.</p>



<h2 class="wp-block-heading">Available Functions</h2>



<p>At the time of writing there are three functions available that we can use with the&nbsp;<strong>Rank Query Parser</strong>:</p>



<ul class="wp-block-list"><li><strong>log</strong>&nbsp;&#8211; the logarithmic function, which accepts&nbsp;<strong>weight</strong>&nbsp;and&nbsp;<strong>scalingFactor</strong>&nbsp;attributes</li><li><strong>satu</strong>&nbsp;&#8211; the saturation function accepting the&nbsp;<strong>pivot</strong>&nbsp;and&nbsp;<strong>weight</strong>&nbsp;attributes</li><li><strong>sigm</strong>&nbsp;&#8211; the sigmoid function accepting the&nbsp;<strong>pivot</strong>,&nbsp;<strong>weight</strong>, and&nbsp;<strong>exponent</strong>&nbsp;attributes</li></ul>



<p>You can use one of those functions to scale the scoring factor and adjust how the rank field value affects the scoring.</p>
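<p>The attributes are passed as local parameters to the query parser. A sketch of what such queries could look like &#8211; note that the numeric values here are made up purely for illustration:</p>

<pre class="wp-block-code"><code class="">q=_query_:{!rank f='views' function='log' weight='1.5' scalingFactor='2'}
q=_query_:{!rank f='views' function='satu' pivot='100' weight='1.2'}
q=_query_:{!rank f='views' function='sigm' pivot='100' weight='1.2' exponent='2'}</code></pre>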



<h2 class="wp-block-heading">Conclusions</h2>



<p>Though we already had the ability to include a function query in our queries and use a field value in scoring, we can now also benefit from the BlockMax-WAND algorithm. This improves query performance in situations where we don&#8217;t need the exact number of matches and are happy with only the top N results. Something worth considering.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2020/09/28/rankfield-rank-query-parser/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>SolrCloud and query execution control</title>
		<link>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/</link>
					<comments>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 14 Jan 2019 14:27:53 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[control]]></category>
		<category><![CDATA[execution]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solrcloud]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=980</guid>

					<description><![CDATA[With the release of Solr 7.0 and introduction of new replica types, in addition to the defa?ult NRT type the question appeared &#8211; can we control the queries and where they are executed? Can we tell Solr to execute the]]></description>
										<content:encoded><![CDATA[
<p>With the release of Solr 7.0 and the introduction of new replica types in addition to the default NRT type, a question appeared &#8211; can we control where queries are executed? Can we tell Solr to execute the queries only on the PULL replicas or give TLOG replicas a priority? Let&#8217;s check that out.</p>



<span id="more-980"></span>



<h2 class="wp-block-heading">Shards parameter</h2>



<p>The first control option that we have in SolrCloud is the <em>shards</em> parameter. Using it we can directly control which shards should be used for querying. For example we can provide a logical shard name in our query:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=shard1</code></pre>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=shard1,shard2,shard3</code></pre>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=localhost:6683/solr/test</code></pre>



<p>The first of the above queries will be executed only on the replicas grouped under the logical <em>shard1</em> name. The second query will be executed on the logical <em>shard1</em>, <em>shard2</em> and <em>shard3</em>, while the third query will be executed on the shards of the <em>test</em> collection that are deployed on the <em>localhost:6683</em> node.</p>



<p>It is also possible to load balance across instances, for example:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards=localhost:6683/solr/test|localhost:7783/solr/test</code></pre>



<p>The above query will be executed either on the instance running on port <em>6683</em> or on the one running on port <em>7783</em>. </p>



<h2 class="wp-block-heading">Shards.preference parameter</h2>



<p>While the <em>shards</em> parameter gives us some degree of control over where the query is executed, it is not exactly what we would like to have. To target a certain type of replica with it, we would have to know the physical layout of the shards, and that is not something we want to track. Because of that, the <em>shards.preference</em> parameter was introduced to Solr. It allows us to tell Solr which types of replicas should have priority when executing a query. </p>



<p>For example, to tell Solr that PULL type replicas should have priority when the query is executed one should add the <em>shards.preference</em> parameter to the query and set it to <em>replica.type:PULL</em>:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL</code></pre>



<p>The nice thing is that we can tell Solr to use PULL replicas first and, if they are not available, fall back to TLOG replicas:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.type:TLOG</code></pre>



<p>We can also specify that PULL type replicas should be used first and, if they are not available, that local shards should have priority:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.location:local</code></pre>



<p>In addition to the above examples, we can also define priority based on the location of the replicas. For example, if our <em>192.168.1.1</em> Solr node is far more powerful than the others and we would like to prioritize PULL replicas first and then that node, we would run the following query:</p>



<pre class="wp-block-code"><code class="">http://localhost:8983/solr/test/select?q=*:*&amp;shards.preference=replica.type:PULL,replica.location:http://192.168.1.1</code></pre>



<h2 class="wp-block-heading">Summary</h2>



<p>The discussed parameters, and <em>shards.preference</em> with its <em>replica.type</em> value in particular, can be very useful when we are using SolrCloud with different types of replicas. By telling Solr that we prefer PULL or TLOG replicas, we can lower the query pressure on the NRT replicas and thus improve the performance of the whole cluster. What&#8217;s more &#8211; dividing the replicas this way can help us achieve query performance close to what the Solr master &#8211; slave architecture provides, without sacrificing all the goodies that come with SolrCloud itself. </p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2019/01/14/solrcloud-and-query-execution-control/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr 6.0 and graph traversal support</title>
		<link>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/</link>
					<comments>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 18 Apr 2016 13:02:05 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[6.0]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[traversal]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=889</guid>

					<description><![CDATA[One of the new features that are present in the recently released Solr 6.0 is the graph traversal query that allows us to work with graphs. Having a root set and relations between documents (like parent identifier of the document)]]></description>
										<content:encoded><![CDATA[<p>One of the new features that are present in the recently released <a href="http://solr.pl/en/2016/04/08/lucene-and-solr-6-0/">Solr 6.0</a> is the graph traversal query that allows us to work with graphs. Having a root set and relations between documents (like parent identifier of the document) we can use a single query to get multiple levels of joins in the same request. Let&#8217;s look at this new feature working both in old fashioned Solr master &#8211; slave as well as in SolrCloud.</p>
<p><span id="more-889"></span></p>
<p>For the purpose of this blog post we will use a very simple data set that we can index using a single command and a configuration that can be downloaded from our github account available at: <a href="https://github.com/solrpl/blog">https://github.com/solrpl/blog</a>.</p>
<h3>Creating the collection and indexing the data</h3>
<p>What we need first is to create the collection and index the data itself. We will start Solr using the following command:
</p>
<pre class="brush:xml">bin/solr start -c
</pre>
<p>This will launch our Solr instance in cloud mode. Now we need to send our configuration files to ZooKeeper, which we will do by running the following command:
</p>
<pre class="brush:xml">bin/solr zk -upconfig -n graph_test_config -z localhost:9983 -d graph_test/conf
</pre>
<p>Next we will create our <i>graph</i> collection by running the following command:
</p>
<pre class="brush:xml">curl -XGET 'http://localhost:8983/solr/admin/collections?action=CREATE&amp;name=graph&amp;numShards=2&amp;replicationFactor=1&amp;maxShardsPerNode=2&amp;collection.configName=graph_test_config'
</pre>
<p>Now, after the collection has been created, we can finally index the data by running the following command:
</p>
<pre class="brush:xml">curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/graph/update' --data-binary '{
 "add" : { "doc" : { "id" : "1", "name" : "Root document one" } },
 "add" : { "doc" : { "id" : "2", "name" : "Root document two" } },
 "add" : { "doc" : { "id" : "3", "name" : "Root document three" } },
 "add" : { "doc" : { "id" : "11", "parent_id" : "1", "name" : "First level document 1, child one" } },
 "add" : { "doc" : { "id" : "12", "parent_id" : "1", "name" : "First level document 1, child two" } },
 "add" : { "doc" : { "id" : "13", "parent_id" : "1", "name" : "First level document 1, child three" } },
 "add" : { "doc" : { "id" : "21", "parent_id" : "2", "name" : "First level document 2, child one" } },
 "add" : { "doc" : { "id" : "22", "parent_id" : "2", "name" : "First level document 2, child two" } },
 "add" : { "doc" : { "id" : "121", "parent_id" : "12", "name" : "Second level document 12, child one" } },
 "add" : { "doc" : { "id" : "122", "parent_id" : "12", "name" : "Second level document 12, child two" } },
 "add" : { "doc" : { "id" : "131", "parent_id" : "13", "name" : "Second level document 13, child three" } },
 "commit" : {}
}'
</pre>
<p>So our data has the following relations:</p>
<p><a href="http://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/graph-documents-layout/" rel="attachment wp-att-3768"><img fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-3768" src="http://solr.pl/wp-content/uploads/2016/04/Graph-Documents-Layout.png" alt="Graph Documents Layout" width="605" height="469"></a></p>
<p>Let&#8217;s try searching in the structure.</p>
<h3>Basic graph traversal query usage</h3>
<p>In its basic form, the graph traversal query needs a root set of documents, the identifier field, and the parent identifier field. For example, if we would like to find the root documents along with everything related to them across the next levels of relations, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id}name:"root document"
</pre>
<p>The documents returned for such a query would look as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;8&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="8" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;122&lt;/str&gt;
  &lt;str name="parent_id"&gt;12&lt;/str&gt;
  &lt;str name="name"&gt;Second level document 12, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026120343552&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As you can see, we&#8217;ve got our root documents returned (identifiers 1, 2 and 3) and we&#8217;ve got the related documents from the deeper levels as well, which is very nice.</p>
<h3>Filtering</h3>
<p>The results can be filtered using the <i>traversalFilter</i> property. The defined filter is applied at each join iteration. For example, if we would like to narrow the resulting documents to only those that contain the term <i>one</i> in the <i>name</i> field, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id traversalFilter=name:one}name:"root document"</pre>
<p>The results would be as follows:
</p>
<pre class="brush:xml">&lt;pre class="brush:xml"&gt;
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;7&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id traversalFilter=name:one}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="5" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As you can see, only the documents matching the filter were returned at each join level, along with the root document set, of course. It seems that the filter is working 🙂</p>
<h3>Returning the root or leafs</h3>
<p>Apart from filtering, we can also tell Solr to omit the root set documents and return only the leaves. For example, to omit the root set documents we would add the <i>returnRoot</i> property set to <i>false</i> (it defaults to <i>true</i>) to our query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id returnRoot=false}name:"root document"
</pre>
<p>The results are as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;10&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id returnRoot=false}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="5" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;122&lt;/str&gt;
  &lt;str name="parent_id"&gt;12&lt;/str&gt;
  &lt;str name="name"&gt;Second level document 12, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026120343552&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<p>As we can see, the results no longer include the root documents.</p>
<p>If we are interested in the leaves only, we should add the <i>returnOnlyLeaf</i> parameter and set it to <i>true</i> (it defaults to <i>false</i>).</p>
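<p>A sketch of such a query, reusing our earlier filter:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id returnOnlyLeaf=true}name:"root document"
</pre>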
<h3>Controlling maximum depth</h3>
<p>Finally, using the <i>maxDepth</i> property we can control the maximum depth of the traversal. By default it is set to <i>-1</i>, which stands for unlimited. For example, if we are only interested in the first level of the graph, we could run the following query:
</p>
<pre class="brush:xml">http://localhost:8983/solr/graph/select?q=*:*&amp;fq={!graph from=parent_id to=id maxDepth=1}name:"root document"
</pre>
<p>The result includes only documents that are at most one join away from the documents in the root set:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;bool name="zkConnected"&gt;true&lt;/bool&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;10&lt;/int&gt;
  &lt;lst name="params"&gt;
   &lt;str name="q"&gt;*:*&lt;/str&gt;
   &lt;str name="fq"&gt;{!graph from=parent_id to=id maxDepth=1}name:"root document"&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
 &lt;result name="response" numFound="7" start="0" maxScore="1.0"&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;Root document one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026113003520&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;11&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026114052096&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;12&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100672&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;13&lt;/str&gt;
  &lt;str name="parent_id"&gt;1&lt;/str&gt;
  &lt;str name="name"&gt;First level document 1, child three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026115100673&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;Root document two&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026109857792&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;str name="name"&gt;Root document three&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026110906368&lt;/long&gt;&lt;/doc&gt;
 &lt;doc&gt;
  &lt;str name="id"&gt;21&lt;/str&gt;
  &lt;str name="parent_id"&gt;2&lt;/str&gt;
  &lt;str name="name"&gt;First level document 2, child one&lt;/str&gt;
  &lt;long name="_version_"&gt;1531331026111954944&lt;/long&gt;&lt;/doc&gt;
 &lt;/result&gt;
&lt;/response&gt;
</pre>
<h3>Summary</h3>
<p>As you can see, in addition to the Parallel SQL over MapReduce functionality and the cross data center replication, we&#8217;ve got pretty neat graph traversal support in Solr 6.0. We haven&#8217;t had a chance to test the performance of the query on a larger data set, but we will try to come up with one, along with some sample queries, to see what we can expect from Solr when it comes to graph traversal query performance.</p>
<h3>Update</h3>
<p>We didn&#8217;t mention it earlier, but you can see that not all documents from our sample data set were included in the results. This is because our collection was created with two shards and we ran a distributed query. To avoid that, we can simply create the collection with a single shard and live with that until the graph query supports more 🙂</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Switch Query Parser &#8211; quick look</title>
		<link>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/</link>
					<comments>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 03 Jun 2013 11:59:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.2]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[query parser]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[switch]]></category>
		<category><![CDATA[SwitchQueryParser]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=556</guid>

					<description><![CDATA[The number of different query parsers in Solr was always an amusement for me. Can someone here say how many of them are currently available and say what are those? Anyway, in this entry we won&#8217;t be talking about all]]></description>
										<content:encoded><![CDATA[<p>The number of different query parsers in Solr was always an amusement for me. Can someone here say how many of them are currently available and say what are those? Anyway, in this entry we won&#8217;t be talking about all the query parsers available in Solr, but we will take a quick look at one of them &#8211; the <em>SwitchQueryParser</em> introduced in Solr 4.2.</p>
<p><span id="more-556"></span></p>
<h3>Logic behind the parser</h3>
<p>The logic of the&nbsp;<em>SwitchQueryParser</em> is quite simple &#8211; it allows handling some simple logic on the Solr side by choosing a sub-query based on a parameter value. For example, let&#8217;s say that we have an application that understands the following four values of the&nbsp;<em>priceRange</em> field:</p>
<ul>
<li><em>cheap</em> &#8211; when the price of the product (indexed in the&nbsp;<em>price</em> field) is lower than $10,</li>
<li><em>average</em> &#8211; when the price is between $10 and $30,</li>
<li><em>expensive</em> &#8211; when the product price is higher than $30,</li>
<li><em>all</em> &#8211; in case we want to return all the documents without looking at the price.</li>
</ul>
<p>We want to have this logic stored in Solr so that we don&#8217;t have to change our application or its configuration every time we want to change the above ranges. For this purpose we will use the <em>SwitchQueryParser</em>.</p>
<h3>Our query</h3>
<p>Let&#8217;s assume that our application will be able to send the following query:
</p>
<pre class="brush:bash">http://localhost:8983/solr/collection1/price?q=*:*&amp;priceRange=cheap</pre>
<p>We want the above query to return all the documents (<em>q=*:*</em>) narrowed down to only those that are cheap, which in our case means those with a price lower than $10 (the <em>priceRange=cheap</em> parameter).</p>
<h3>Solr configuration</h3>
<p>Of course we don&#8217;t want to send the actual <em>price</em> range in the query, because that wouldn&#8217;t make much sense. Because of that, we decided to alter the <em>solrconfig.xml</em> file and add a new SearchHandler named /<em>price</em>, configured as follows:
</p>
<pre class="brush:xml">&lt;requestHandler name="/price"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="priceRange"&gt;all&lt;/str&gt;
 &lt;/lst&gt;
 &lt;lst name="appends"&gt;
  &lt;str name="fq"&gt;{!switch case.all='price:[* TO *]' case.cheap='price:[0 TO 10]' case.average='price:[10 TO 30]' case.expensive='price:[30 TO *]' v=$priceRange}&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>As you can see, the configuration of our SearchHandler consists of two elements. In the <em>defaults</em> section we defined the default value for the <em>priceRange</em> parameter, which is <em>all</em>. In addition to that, we&#8217;ve defined a filter (<em>fq</em>) which uses the <em>SwitchQueryParser</em> (<em>!switch</em>). For each of the possible values the <em>priceRange</em> parameter can take (<em>v=$priceRange</em>) we defined a filter on the <em>price</em> field using the pattern <em>case.priceRangeValue=filter</em>. So, when the value of the <em>priceRange</em> parameter in the query is equal to <strong><em>cheap</em></strong>, Solr will use the filter defined by the <em>case.<strong>cheap</strong></em> part of the definition; when it is equal to <strong><em>expensive</em></strong>, Solr will use the filter defined by <em>case.<strong>expensive</strong></em>, and so on.</p>
<h3>What to remember</h3>
<p>There is one crucial thing to remember when using the described parser. In our case, if we send a <em>priceRange</em> parameter value different from the four mentioned above, Solr will return an error.</p>
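<p>One way to avoid that error (a sketch based on the parser&#8217;s <em>default</em> parameter, which &#8211; to the best of our knowledge &#8211; is used when none of the <em>case</em> values match) would be to extend our filter definition like this:
</p>
<pre class="brush:xml">&lt;str name="fq"&gt;{!switch default='price:[* TO *]' case.cheap='price:[0 TO 10]' case.average='price:[10 TO 30]' case.expensive='price:[30 TO *]' v=$priceRange}&lt;/str&gt;</pre>
<p>With such a definition any unrecognized <em>priceRange</em> value would simply fall back to matching all documents instead of causing an error.</p>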
<h3>Summary</h3>
<p>In my opinion the <em>SwitchQueryParser</em> is a nice addition, although it is a feature that will probably be used by a minority of users. However, taking into consideration that it can help implement some very basic logic and that it is simple (and thus not hungry for resources), there will be users who will find this query parser useful <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2013/06/03/switch-query-parser-quick-look/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Do I have to look for maxBooleanClauses when using filters ?</title>
		<link>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/</link>
					<comments>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 19 Dec 2011 20:56:29 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[bool]]></category>
		<category><![CDATA[boolean]]></category>
		<category><![CDATA[clause]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[maxBooleanClauses]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=388</guid>

					<description><![CDATA[One of the configuration variables we can find in the solrconfig.xml file is&#160;maxBooleanClauses, which specifies the maximum number of boolean clauses that can be combined in a single query. The question is, do I have to worry about it when]]></description>
										<content:encoded><![CDATA[<p>One of the configuration variables we can find in the <em>solrconfig.xml</em> file is&nbsp;<em>maxBooleanClauses</em>, which specifies the maximum number of boolean clauses that can be combined in a single query. The question is: do I have to worry about it when using filters in Solr? Let&#8217;s try to answer that question without digging into the Lucene and Solr source code.</p>
<p><span id="more-388"></span></p>
<h3>Query</h3>
<p>Let&#8217;s assume that we have the following query we want to change:
</p>
<pre class="brush:xml">q=category:1 AND category:2 AND category:3 ... AND category:2000</pre>
<p>Sending such a query to a Solr instance with the default configuration would result in the following exception: &#8220;<em>too many boolean clauses</em>&#8221;. Of course we could increase the <em>maxBooleanClauses</em> variable in the <em>solrconfig.xml</em> file and get rid of the exception, but let&#8217;s try to do it another way:</p>
<h3>Let&#8217;s change the query to use filters</h3>
<p>So, we change the above query to use filters &#8211; the <em>fq</em> parameter:
</p>
<pre class="brush:xml">q=*:*&amp;fq=category:(1 2 3 ... 2000)</pre>
<p>We send the query to Solr and &#8230; the same situation happens again &#8211; an exception with the &#8220;<em>too many boolean clauses</em>&#8221; message. This happens because Solr still has to &#8220;calculate&#8221; the filter contents by running the appropriate query, which is again a single large boolean query. So, let&#8217;s modify the query once again:</p>
<h3>Final query change</h3>
<p>After the final modification our query should look like this:
</p>
<pre class="brush:xml">q=*:*&amp;fq=category:1&amp;fq=category:2&amp;fq=category:3&amp;....&amp;fq=category:2000</pre>
<p>After sending such a query we will get the search results (provided, of course, that there are documents matching the query in the index). This time Solr didn&#8217;t have to run a single, large boolean query and that&#8217;s why we didn&#8217;t exceed the <em>maxBooleanClauses</em> limit.</p>
<h3>To sum up</h3>
<p>As you can see, the answer to the question asked at the beginning of the post depends on the query we want to run. If our query uses the <em>AND</em> boolean operator we can use the <em>fq</em> parameter, because multiple <em>fq</em> parameters are combined using <em>AND</em>. However, if we need <em>OR</em> semantics we would have to raise the limit defined by the <em>maxBooleanClauses</em> configuration variable. But please remember that raising this limit can have a negative effect on performance and memory usage.</p>
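<p>As a side note &#8211; if you are reading this on a newer Solr version (4.10 or later, if we are not mistaken), the <em>terms</em> query parser can express the <em>OR</em> case without building a large boolean query at all, for example:
</p>
<pre class="brush:xml">q=*:*&amp;fq={!terms f=category}1,2,3,...,2000</pre>
<p>The values are passed as a comma-separated list, so the <em>maxBooleanClauses</em> limit is not an issue for such a filter.</p>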
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Quick look: frange</title>
		<link>https://solr.pl/en/2011/05/30/quick-look-frange/</link>
					<comments>https://solr.pl/en/2011/05/30/quick-look-frange/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 May 2011 18:46:57 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[frange]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[range]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=270</guid>

					<description><![CDATA[In Solr 1.4 there were a new type of queries presented the frange queries. This new type of queries let you search for a range of values. According to the Solr developers this queries should be much faster from normal]]></description>
										<content:encoded><![CDATA[<p>In Solr 1.4 a new type of query was introduced &#8211; the <em>frange</em> query. This type of query lets you search for a range of values. According to the Solr developers these queries should be much faster than normal range queries. I thought that I should run a simple test to see how much faster the new range queries can be expected to be.</p>
<p><span id="more-270"></span></p>
<h2>Querying Solr</h2>
<p>To use the <em>frange</em> query syntax we need to modify the query. So far, a range query over the data may look as follows:
</p>
<pre class="brush:xml">fq=test_si:[0+TO+10000]</pre>
<p>The same query made using the <em>frange </em>looks like this:
</p>
<pre class="brush:xml">fq={!frange l=0 u=10000}test_si</pre>
<p>Of course, it is also possible to send queries for data types other than numbers, for example:
</p>
<pre class="brush:xml">fq={!frange l=adam u=mariusz}name</pre>
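<p>The real power of <em>frange</em> (short for &#8220;function range&#8221;) is that the part after the closing brace can be any function query, not just a field name. For example, a filter like the following (using hypothetical <em>price</em> and <em>quantity</em> fields) keeps only the documents for which the product of the two fields falls into the given range:
</p>
<pre class="brush:xml">fq={!frange l=0 u=100}product(price,quantity)</pre>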
<h2>Performance</h2>
<p>The logic of the test is pretty simple. The structure of the index contains two fields: <em>id</em>, a unique identifier, and a <em>namestr</em> field (string) which contains the values I will query for. I generated about 100,000 distinct documents. In addition, the terms in the <em>namestr</em> field of each document are unique, so we can easily determine the percentage of terms covered by a given query. Then I started sending queries to Solr that covered a certain percentage of terms in the index. Each of the queries was sent multiple times; the execution times were summed and divided by the number of queries sent. The following table shows the test results:</p>
<p><em>[Results table not available in this archive.]</em></p>
<p>As you can see, a standard range query is faster only for queries that cover a small number of terms from the given field &#8211; the performance gain from using <em>frange</em> queries starts at about 5% of terms covered. Above that threshold we get a nice increase in query speed, which is encouraging.</p>
<h2>To sum up</h2>
<p>The results of my test differ in terms of performance from what Yonik Seeley wrote on his blog (my test data may be the cause of this), but what we can say is that with <em>frange</em> queries we can expect an increase in performance for queries that need to search over a range of values.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/05/30/quick-look-frange/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Optimization &#8211; query result window size</title>
		<link>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/</link>
					<comments>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 10 Jan 2011 08:06:02 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[queryResultCache]]></category>
		<category><![CDATA[queryResultWindowCache]]></category>
		<category><![CDATA[result]]></category>
		<category><![CDATA[size]]></category>
		<category><![CDATA[window]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=188</guid>

					<description><![CDATA[Hereby I would like to start a small series of articles describing the elements of the optimization of Solr instances. At first glance I decided to describe the parameter that specifies the data fetch window size &#8211; the query result]]></description>
										<content:encoded><![CDATA[<p>With this post I would like to start a small series of articles describing the elements of Solr instance optimization. To begin with, I decided to describe the parameter that specifies the data fetch window size &#8211; the query result window size. Hopefully, this article will explain how this parameter works and how to modify and adapt it to your needs.</p>
<p><span id="more-188"></span></p>
<h3>The beginning</h3>
<p>To start talking about this configuration parameter, we must first say how Solr fetches results from the index. When passing the <em>rows</em> parameter to Solr, with the value of 20 for example, we tell Solr that we want the result list to contain a maximum of 20 documents. However, the number of results actually taken from the index varies and is determined precisely by the <em>queryResultWindowSize</em> parameter. This parameter, defined in the <em>solrconfig.xml</em> file, determines how many results will be retrieved from the index and stored in <em>queryResultCache</em>.</p>
<h3>But what can I use <em>queryResultWindowSize</em> for?</h3>
<p>The <em>queryResultWindowSize</em> parameter specifies the size of the so-called results window, which is simply the number of documents that will be fetched from the index when retrieving search results. For example, setting <em>queryResultWindowSize</em> to 100 and sending the following query:
</p>
<pre class="brush:xml">q=car&amp;rows=30&amp;start=10</pre>
<p>will result in a maximum of 30 documents in the search result list; however, Solr will fetch 100 documents from the index (starting at index 0 and ending at index 99) and place them in <em>queryResultCache</em>. The next query, which differs only in the <em>start</em> and <em>rows</em> parameters, can then be served from <em>queryResultCache</em>.</p>
<h3>Configuration</h3>
<p>To set the <em>queryResultWindowSize </em>to the value of 100, you must add the following entry to the <em>solrconfig.xml </em>file:
</p>
<pre class="brush:xml">&lt;queryResultWindowSize&gt;100&lt;/queryResultWindowSize&gt;</pre>
<h3>What to remember?</h3>
<p>Of course, setting only the <em>queryResultWindowSize</em> is not everything. You should still provide adequate space in <em>queryResultCache</em> for Solr to be able to store the necessary information. However, <em>queryResultCache</em> configuration is a topic for another article.</p>
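<p>Just to give you a rough idea (detailed sizing is a topic for that other article), the <em>queryResultCache</em> is configured in the <em>solrconfig.xml</em> file and could look, for example, like this:
</p>
<pre class="brush:xml">&lt;queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/&gt;</pre>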
<h3>But why use it?</h3>
<p>The answer to that question is quite simple &#8211; if your application and your users often use paging, it is reasonable to consider changing the default value of the <em>queryResultWindowSize</em>. In most cases where the implementation was based on paging, changing the value of this parameter caused a significant increase in performance when switching between pages of the results.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/10/optimization-query-result-window-size/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The scope of Solr faceting</title>
		<link>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/</link>
					<comments>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 23 Aug 2010 12:07:58 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[date faceting]]></category>
		<category><![CDATA[facet]]></category>
		<category><![CDATA[facet method]]></category>
		<category><![CDATA[facet parameter]]></category>
		<category><![CDATA[facet query]]></category>
		<category><![CDATA[faceting]]></category>
		<category><![CDATA[local]]></category>
		<category><![CDATA[local params]]></category>
		<category><![CDATA[params]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[range faceting]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=69</guid>

					<description><![CDATA[Faceting is one of the ways to categorize the content found in the process of information retrieval. In case of Solr this is the division of set of documents on the basis of certain criteria: content of individual fields, queries]]></description>
										<content:encoded><![CDATA[<p>Faceting is one of the ways to categorize the content found in the process of information retrieval. In the case of Solr this is the division of the set of found documents on the basis of certain criteria: the contents of individual fields, queries, or ranges of numbers or dates. In today&#8217;s entry I will try to shed some light on the possibilities of the faceting mechanism, both what is currently available in Solr 1.4.1 and what will be available in the future.</p>
<p><span id="more-69"></span></p>
<p>One of the few sources of information about faceting is the Solr wiki &#8211; to be more specific, the page at: <a href="http://wiki.apache.org/solr/SimpleFacetParameters" target="_blank" rel="noopener noreferrer">http://wiki.apache.org/solr/SimpleFacetParameters</a>. The following article is an extension of the information available on the wiki.</p>
<p>Solr faceting mechanism can be divided into four basic types:</p>
<ul>
<li>field faceting,</li>
<li>query faceting,</li>
<li>date faceting,</li>
<li>range faceting.</li>
</ul>
<p>To turn the Solr faceting mechanism on, one needs to pass the <em>facet</em> parameter with the value <em>true</em>.</p>
<h3>Field faceting</h3>
<p>The first type of faceting. It categorizes the found documents by the values of a specified field. With this type of faceting we are able to get the number of documents found, for example, in each category or geographical location. Faceting by field is characterized by a large number of options which configure its behavior. These are the parameters available for use:</p>
<ul>
<li><em>facet.field</em> &#8211; parameter specifying which field will be  used to perform faceting. This parameter can be specified multiple  times. Remember that adding multiple <em>facet.field</em> parameters to the query can affect performance.</li>
<li><em>facet.prefix</em> &#8211; restricts faceting results to those that begin with the specified prefix. The parameter can be defined for each field specified by the <em>facet.field</em> parameter &#8211; you can do it by adding a parameter of the form <em>f.field_name.facet.prefix</em>. This parameter is a relatively simple way to implement an autocomplete mechanism.</li>
<li><em>facet.sort</em> &#8211; specifies how to sort faceting results. If you use a Solr version lower than 1.4, this parameter takes the values <em>true</em> or <em>false</em>, indicating, respectively, sorting by the number of results and sorting by index order (in the case of ASCII this means alphabetical sorting). If, however, you are using Solr version 1.4 or higher, you should use the <em>count</em> value (meaning the same as <em>true</em>) or the <em>index</em> value (meaning the same as <em>false</em>). The default value for this parameter is <em>true/count</em> when <em>facet.limit</em> is greater than 0 and <em>false/index</em> otherwise. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.limit</em> &#8211; parameter specifying how many unique values to include in the faceting results. A negative value for this parameter means no limit. Please note that the larger the limit, the more memory is needed and the longer the query execution takes. The default parameter value is 100. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.offset</em> &#8211; parameter defining the offset (counted from the first faceting result) at which the presented faceting results start. The default parameter value is 0. This parameter is designed to help implement paging over faceting results. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.mincount</em> &#8211; parameter specifying the minimum count a value must have to be included in the faceting results. The default value is 0. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.missing</em> &#8211; parameter specifying whether, in addition to standard faceting  results, number of documents without a value in the specified field  should be included. This parameter can take values of <em>true</em> or <em>false</em>. The default parameter value is <em>false</em>. The parameter can be defined for each field specified by the <em>facet.field</em> parameter.</li>
<li><em>facet.method</em> &#8211; parameter introduced in Solr 1.4. It takes the value of <em>enum</em> or <em>fc</em> and specifies the method of faceting calculation. Setting this parameter to <em>enum</em> results in using term enumeration to calculate the faceting results. This method proves most efficient when dealing with fields with a small number of unique terms. The second method, labeled <em>fc</em> (field cache), is the standard method of faceting calculation: it iterates over all documents in the result set and collects the values of the given field. The parameter can be defined for each field specified by the <em>facet.field</em> parameter. The default value is <em>fc</em> for all fields not based on the <em>Boolean</em> type.</li>
<li><em>facet.enum.cache.minDf</em> &#8211; parameter with a strange-sounding name, specifying the minimum number of documents that must match a term for its filter to be cached in the filterCache when the <em>enum</em> method is used &#8211; filters for terms matching fewer documents are calculated without using the cache. It sounds strange, but that is more or less it <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></li>
</ul>
<p>These are the parameters of field faceting. For most of them I have written that there is a possibility to define their values for each field specified by the <em>facet.field</em> parameter. What does that look like in practice? Suppose we have a query like this:
</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field=category&amp;facet.field=location</pre>
<p>It is a simple query for the &#8216;solr&#8217; term with the faceting mechanism turned on. There are two facet fields defined &#8211; <em>category</em> and <em>location</em>. Let&#8217;s say that we would like to have 200 facet results for the <em>category</em> field sorted by count and 50 facet results for the <em>location</em> field sorted alphabetically. To do that we add the following fragment to the query shown above:
</p>
<pre class="brush:xml">f.category.facet.limit=200&amp;f.category.facet.sort=count&amp;f.location.facet.limit=50&amp;f.location.facet.sort=index</pre>
<p>As shown, we can easily modify the facet mechanism behavior for individual facet fields.</p>
<h3>Query faceting</h3>
<p>A faceting mechanism based on a single parameter &#8211; <em>facet.query</em> &#8211; to which we pass a query. The query passed to the parameter must be constructed so that the standard Lucene query parser can understand it. An example use of this parameter is querying for a group of prices, which could look like this:
</p>
<pre class="brush:xml">facet.query=price:[0+TO+100]</pre>
<p>Note, however, that each added <em>facet.query</em> parameter is another query run by Lucene, which means a performance cost. Many <em>facet.query</em> parameters in a single query can be painful for Solr.</p>
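<p>Still, a handful of <em>facet.query</em> parameters is perfectly fine &#8211; for example, a few price buckets can be calculated in a single request like this:
</p>
<pre class="brush:xml">q=*:*&amp;facet=true&amp;facet.query=price:[0+TO+100]&amp;facet.query=price:[100+TO+500]&amp;facet.query=price:[500+TO+*]</pre>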
<p>There is one more thing worth mentioning when talking about query faceting &#8211; there is a possibility to define your own parser for the <em>facet.query</em> parameter value. To use your own parser &#8211; for example one called <em>myParser</em> &#8211; the parameter passed to Solr should look like this:
</p>
<pre class="brush:xml">facet.query={!myParser}aaa</pre>
<h3>Date faceting</h3>
<p>New faceting functionality introduced in Solr 1.3. Date faceting allows you to calculate faceting results including all the intricacies of processing dates. Please note that date faceting can only be used with fields based on the type <em>solr.</em><em>DateField</em>. Now let&#8217;s get on with the parameters associated with date faceting:</p>
<ul>
<li><em> facet.date</em> &#8211; like <em>facet.field</em> parameter, this parameter is used to identify fields where dates faceting should be used. As  in the case of <em>facet.field</em> parameter you can specify this parameter  several times to allow date faceting on many fields in one  query.</li>
<li><em>facet.date.start &#8211; </em>parameter  specifying the lower limit of date on which the faceting calculation  should be started. This parameter can be defined for each field  specified by the <em>facet.date </em>parameter. This parameter is mandatory when using <em>facet.date</em> and should be defined for each <em>facet.date</em> parameter.</li>
<li><em>facet.date.end</em> &#8211; parameter defining the upper limit of the date, on which the faceting  calculation should be ended. This parameter can be defined for each  field specified by the <em>facet.date </em>parameter. This parameter is mandatory when using <em>facet.date </em>and should be defined for each <em>facet.date</em> parameter.</li>
<li><em>facet.date.gap</em> &#8211; parameter specifying date compartments to be generated for the defined boundaries. This parameter is mandatory when using <em>facet.date</em> and should be defined for each <em>facet.date</em> parameter. The parameter can be defined for each field specified by the <em>facet.date </em>parameter.</li>
<li><em>facet.date.hardend</em> &#8211; parameter taking values <em>true</em> and <em>false</em>, telling Solr what to do when the <em>facet.date.gap</em> parameter does not split the defined boundaries evenly. If we set this parameter to <em>true</em>, the last range will end exactly at <em>facet.date.end</em>, even if it is then smaller than the other ranges. If we set it to <em>false</em> (the default value), the last range generated by <em>facet.date.gap</em> will keep its full width and may extend beyond the boundary defined by the <em>facet.date.end</em> parameter. The parameter can be defined for each field specified by the <em>facet.date</em> parameter.</li>
<li><em>facet.date.other</em> &#8211; parameter specifying what values besides the standard ones (ranges)  should be added to results of date faceting. The parameter can be  defined for each field specified by the <em>facet.date </em>parameter. The parameter can take following values:
<ul>
<li><em>before </em>&#8211; in addition to the standard date faceting  results, there will be one more &#8211; number of documents with a date before  the one defined in the <em>facet.date.start</em> parameter,</li>
<li><em>after</em> &#8211; in addition to the standard date faceting results, there will be one  more &#8211; number of documents with the date after the one defined in the <em>facet.date.end</em> parameter,</li>
<li><em>between </em>&#8211; in addition to the standard date faceting results, there will be one more &#8211; number of documents with the date between <em>facet.date.start</em> and <em>facet.date.end</em> parameters,</li>
<li><em>all</em> &#8211; a shortcut to define all the above,</li>
<li><em>none </em>&#8211; none of the additional results will be added to date faceting results.</li>
</ul>
</li>
<li><em>facet.date.include</em> &#8211; parameter that will be introduced in Solr 4.0. It allows closing or opening the ranges defined by the boundaries and the gap. The parameter will accept the following values:
<ul>
<li><em>lower</em> &#8211; each of the resulting compartment will contain its lower limit,</li>
<li><em>upper</em> &#8211; each of the resulting compartment will contain its upper limit,</li>
<li><em>edge</em> &#8211; the first and last ranges will include their outer boundaries &#8211; that is, the lower boundary for the first range and the upper boundary for the last one,</li>
<li><em>outer</em> &#8211; a value specifying that the ranges defined by the <em>before</em> and <em>after</em> values of the <em>facet.date.other</em> parameter will include their boundaries, even if other ranges already include them,</li>
<li><em>all</em> &#8211; a parameter that causes the inclusion of all of the above options.</li>
</ul>
</li>
</ul>
<p>That is how we can modify the behavior of the date faceting. Now, some example of using this kind of faceting:
</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.date=addDate&amp;facet.date.start=NOW/DAY-30DAYS&amp;facet.date.end=NOW/DAY%2B30DAYS&amp;facet.date.gap=%2B1DAY</pre>
<p>What does the above query do? We turn the faceting mechanism on and define date faceting for the <em>addDate</em> field. What we want to get are the ranges between 30 days before today (<em>NOW/DAY-30DAYS</em>) and 30 days after today (<em>NOW/DAY+30DAYS</em>). The ranges will be one day wide.</p>
<h3>Range faceting</h3>
<p>Functionality which will be available in Solr 3.1. If someone wants to test it right now, both trunk and the 3.x branch have this functionality implemented. This method of faceting is a generalization of date faceting and works similarly &#8211; as a result we get a list of ranges constructed automatically based on the parameters. Here is the list of parameters that can be used to define range faceting behavior:</p>
<ul>
<li><em> facet.range </em>&#8211; like <em>facet.field</em> parameter, this parameter is used to identify fields where range faceting should be used. As  in the case of <em>facet.field</em> parameter you can specify this parameter  several times to allow range faceting on many fields in one  query.</li>
<li><em>facet.range.start &#8211; </em>parameter   specifying the lower limit of range on which the faceting calculation   should be started. This parameter can be defined for each field  specified  by the <em>facet.range </em>parameter. This parameter is mandatory when using <em>facet.range </em>and should be defined<em> </em>for each <em>facet.range</em> parameter.</li>
<li><em>facet.range.end</em> &#8211; parameter defining the upper limit of the range, on which the  faceting  calculation should be ended. This parameter can be defined for  each field specified  by the <em>facet.range </em>parameter. This parameter is mandatory when using <em>facet.range </em>and should be defined<em> </em>for each <em>facet.range</em> parameter.</li>
<li><em>facet.range.gap</em> &#8211; parameter specifying the size of the ranges to be generated within the defined boundaries. This parameter is mandatory when using <em>facet.range</em> and should be defined for each <em>facet.range</em> parameter. The parameter can be defined for each field specified by the <em>facet.range</em> parameter.</li>
<li><em>facet.range.hardend</em> &#8211; parameter taking values <em>true</em> and <em>false</em>, telling Solr what to do when the <em>facet.range.gap</em> parameter does not split the defined boundaries evenly. If we set this parameter to <em>true</em>, the last range will end exactly at <em>facet.range.end</em>, even if it is then smaller than the other ranges. If we set it to <em>false</em> (the default value), the last range generated by <em>facet.range.gap</em> will keep its full width and may extend beyond the boundary defined by the <em>facet.range.end</em> parameter. The parameter can be defined for each field specified by the <em>facet.range</em> parameter.</li>
<li><em>facet.range.other</em> &#8211; parameter specifying what counts besides the standard ones (the ranges themselves) should be added to the range faceting results. The parameter can be defined for each field specified by the <em>facet.range</em> parameter and can take the following values:
<ul>
<li><em>before</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values lower than the one defined in the <em>facet.range.start</em> parameter,</li>
<li><em>after</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values higher than the one defined in the <em>facet.range.end</em> parameter,</li>
<li><em>between</em> &#8211; in addition to the standard range faceting results, there will be one more count &#8211; the number of documents with values between the <em>facet.range.start</em> and <em>facet.range.end</em> parameters,</li>
<li><em>all</em> &#8211; a shortcut to define all the above,</li>
<li><em>none</em> &#8211; none of the additional results will be added to the range faceting results.</li>
</ul>
</li>
<li><em>facet.range.include</em> &#8211; parameter controlling whether the lower and upper bounds of the ranges defined by the boundaries and the gap are inclusive or exclusive. The parameter accepts the following values:
<ul>
<li><em>lower</em> &#8211; each of the resulting ranges will include its lower bound,</li>
<li><em>upper</em> &#8211; each of the resulting ranges will include its upper bound,</li>
<li><em>edge</em> &#8211; the first and last ranges will include their outer bounds &#8211; that is, the lower bound for the first range and the upper bound for the last range,</li>
<li><em>outer</em> &#8211; the <em>before</em> and <em>after</em> counts of the <em>facet.range.other</em> parameter will include the boundary values, even if other ranges already include them,</li>
<li><em>all</em> &#8211; a shortcut that enables all of the above options.</li>
</ul>
</li>
</ul>
<p>As you can see, the range faceting parameters are almost identical to those of date faceting, and the behavior is also almost identical. An example query using range faceting could be the following:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.range=price&amp;facet.range.start=0&amp;facet.range.end=1000&amp;facet.range.gap=100</pre>
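<p>To see the additional parameters in action, a query for the same hypothetical <em>price</em> field that also counts documents outside the defined boundaries (<em>facet.range.other</em>) and makes every range include its lower bound (<em>facet.range.include</em>) could look like this:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.range=price&amp;facet.range.start=0&amp;facet.range.end=1000&amp;facet.range.gap=100&amp;facet.range.other=all&amp;facet.range.include=lower</pre>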
<p>So, we went through all of the types of faceting. But that&#8217;s not all. Users of Solr version 1.4 and higher have the opportunity to use the so-called LocalParams.</p>
<h3>LocalParams and faceting</h3>
<p>Suppose we have the following requirement. We have a query that returns search results for the term &#8216;solr&#8217; and in which we have defined two filters, one for the category and one for the country of origin of the document. In addition to the search results we want to enable navigation through the regions and categories, but we would like them not to be dependent on each other. That is, we want to give the user the opportunity to navigate through the regions for the term &#8216;solr&#8217; without limiting it to the selected category, and vice versa. To do it in Solr version 1.3 or earlier, we would write the following queries:</p>
<pre class="brush:xml">q=solr&amp;fq=category:search&amp;fq=region:poland
q=solr&amp;facet=true&amp;facet.field=category&amp;facet.field=region</pre>
<p>Two queries, because on the one hand we have to get the narrowed search results, and on the other hand we need the faceting results not to be narrowed by the filters. For Solr version 1.4 or higher, we can shorten this to one query. For this purpose, we use the possibility of tagging filters and excluding the tagged parameters. First we change the query as follows:</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;fq={!tag=regionFQ}region:poland</pre>
<p>For now, the search results will not change. We added tags to the filters in the above query so we can later exclude them in faceting. Then we modify the second query as follows:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field={!ex=categoryFQ,regionFQ}category&amp;facet.field={!ex=categoryFQ,regionFQ}region</pre>
<p>So far the faceting results will not change. We added exclusions to the <em>facet.field</em> parameters, so filters named <em>categoryFQ</em> and <em>regionFQ</em> will not be taken into consideration when calculating faceting results.</p>
<p>Then we combine the modified query, so it should look as follows:
</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;fq={!tag=regionFQ}region:poland&amp;facet=true&amp;facet.field={!ex=categoryFQ,regionFQ}category&amp;facet.field={!ex=categoryFQ,regionFQ}region</pre>
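<p>As a side note, the <em>key</em> local parameter can be combined with the exclusion to rename a facet in the response. For example, the following sketch (with a hypothetical label <em>allCategories</em>) would return the unfiltered category counts under that name:</p>
<pre class="brush:xml">q=solr&amp;fq={!tag=categoryFQ}category:search&amp;facet=true&amp;facet.field={!ex=categoryFQ key=allCategories}category</pre>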
<p>I&#8217;ll write more about LocalParams in future entries.</p>
<h3>A few words at the end</h3>
<p>I hope that this article has brought you closer to the possibilities of Solr faceting &#8211; both in earlier versions of Solr and in the present one, as well as in those coming in the near future.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/23/the-scope-of-solr-faceting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>6 deadly sins in the context of query</title>
		<link>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/</link>
					<comments>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 11 Aug 2010 12:04:37 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[facet]]></category>
		<category><![CDATA[facet.limit]]></category>
		<category><![CDATA[facet.offset]]></category>
		<category><![CDATA[faceting]]></category>
		<category><![CDATA[how to query]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr query]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=62</guid>

					<description><![CDATA[In my work related to Lucene and Solr I have seen various queries. While in the case of Lucene, developer usually knows what he/she wants to achieve and use more or less optimal solution, but when it comes to Solr]]></description>
										<content:encoded><![CDATA[<p>In my work related to Lucene and Solr I have seen various queries. While in the case of Lucene, developer usually knows what he/she wants to achieve and use more or less optimal solution, but when it comes to Solr it is not always like this. Solr is a product which could theoretically be used by everyone, both the person who knows Java, one that does not have a broad and specialized technical knowledge, as well as programmer. Precisely because of that Solr is a product which is easy to run and use it, at least when it comes to simple functionalities. I suppose, that is why not many people are worried about reading Solr wiki or at least review the mailing list. As a result, sooner or later people tend to make mistakes. Those errors arise from various shortcomings &#8211; lack of knowledge about Solr, lack of skills, lack of experience or simply a lack of time and tight deadlines. Today I would like to show some major mistakes when submitting queries to Solr and how to avoid those mistakes.</p>
<p><span id="more-62"></span></p>
<h3>1. Lack of filters</h3>
<p>One of the fundamental errors that I encounter from time to time is the lack of filters, which in the context of a query means no <em>fq</em> parameter. Let us remember that filters are our friends 😉 Remember that thanks to filters the Solr cache is used more optimally. Filters do not affect the relevance of a document in the context of the query and the search results (the score factor), and thus we can perform filtering without fear of changing the score value of individual documents (useful for example in e-commerce for narrowing results to product groups).</p>
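<p>As a simple illustration, assuming a hypothetical <em>category</em> field, narrowing the results with a filter instead of an extra query clause lets the filterCache do the work without influencing the score:</p>
<pre class="brush:xml">q=solr&amp;fq=category:books</pre>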
<h3>2. Logical conditions and q parameter</h3>
<p>Another of the &#8220;sins&#8221; that I come across quite often is one closely related to the previous point. It is not a bug in the literal sense, but it is an area where a simple change can have a significant influence on performance. Assuming that the default logical operator is <em>OR</em>, imagine a query in the form of: <code>q=(java+design+patterns)+AND+category:books+AND+promotion:true+AND+publisher:ABC</code>. This query is correct from the perspective of the application logic &#8211; we get the appropriate group of search results. But what if we also want to optimally use the Solr cache and thus boost performance? The answer is quite simple &#8211; move some of the terms to filters. By changing our query to <code>q=java+design+patterns&amp;fq=category:books&amp;fq=promotion:true&amp;fq=publisher:ABC</code> Solr benefits from two types of cache &#8211; the queryResultCache to retrieve documents for the query in the <em>q</em> parameter and the filterCache for each of the filters. With this change we were able to optimize the query to use two types of cache and, in addition, to optimize the entries of the queryResultCache (due to the shortening of the <em>q</em> parameter).</p>
<h3>3. Huge numbers of facet queries</h3>
<p>Another &#8220;sin&#8221; associated with handling groups of documents. Quite often, especially in applications that can categorize products in many ways, I have met queries with a lot of <em>facet.query</em> parameters that correspond to the grouping of documents: grouping by price, location, product groups, and so on. A good example is grouping by price, where the business customer can set the price ranges for each category and then the application must group products by those ranges. This leads to queries that have 100, 200 or more <em>facet.query</em> parameters added. Please remember that each <em>facet.query</em> has an impact on performance, not to mention 100 or 200 of them. If we are interested in a quick response from Solr we cannot make such queries. In such cases, I always propose modifying the index structure if needed, and modifications are needed in most cases. Some modifications (like defining ranges at index time) allow eliminating tens or hundreds of <em>facet.query</em> parameters in favor of one <em>facet.field</em> parameter. But this method is associated with another problem &#8211; explaining to the customer why the &#8220;re-index button&#8221; must be pressed after the ranges change. As a rule, however, performance tests at high loads and with a large variety of queries speak for themselves.</p>
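<p>To sketch the difference, compare a query with many <em>facet.query</em> parameters (shown truncated) to one that uses a single <em>facet.field</em> on a hypothetical <em>price_range</em> field filled at index time:</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.query=price:[0+TO+100]&amp;facet.query=price:[100+TO+200]&amp;facet.query=price:[200+TO+300]
q=solr&amp;facet=true&amp;facet.field=price_range</pre>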
<h3>4. Facet limits</h3>
<p>This problem appears where Solr meets business logic. An example of this &#8220;sin&#8221; is a simple list of categories that a customer wants to have displayed depending on the user&#8217;s location on the website. When we have a small number of categories we do not have a problem, but what about thousands of categories? Very often I have met the approach taken by developers to retrieve all categories from Solr (with the <em>facet.limit</em> parameter increased compared to the default value) and choose the right categories in the application that is using Solr. This approach can generate problems &#8211; first of all faceting requires memory, secondly aggregating the facet elements takes time, and of course getting all of the 50,000 categories with their counts can be painful for Solr. If we want fast queries, we should use the <em>facet.limit</em> parameter reasonably. If you need many facet results, try to build your application so it can use the <em>facet.offset</em> parameter and therefore use paging. If this is not possible, at least configure your container to have enough memory to handle parallel queries and be prepared for queries that take longer when a high <em>facet.limit</em> value is used.</p>
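<p>A sketch of facet paging on a hypothetical <em>category</em> field: each query fetches the next 20 facet values instead of all of them at once.</p>
<pre class="brush:xml">q=solr&amp;facet=true&amp;facet.field=category&amp;facet.limit=20&amp;facet.offset=0
q=solr&amp;facet=true&amp;facet.field=category&amp;facet.limit=20&amp;facet.offset=20</pre>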
<h3>5. Downloading of unnecessary data</h3>
<p>A very common problem is the retrieval of all information, not just what we need. Of course, the problem does not apply to deployments where Solr returns only, for example, the product ID. However, a large number of deployments that I have dealt with were based almost entirely on Solr, and hence the Solr index was made up of multiple stored fields. Developers using Solr very rarely used the <em>fl</em> parameter and the possibility of limiting the fields that are returned. In extreme cases, this led to problems with the amount of data that had to be sent over the network.</p>
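<p>For example, if the application only needs identifiers and names (assuming hypothetical <em>id</em> and <em>name</em> fields), a query limiting the returned fields would look like this:</p>
<pre class="brush:xml">q=solr&amp;fl=id,name</pre>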
<h3>6. Many requests to obtain count of groups of documents</h3>
<p>In some applications more important than the actual search capabilities was the navigation, where users can browse the document repository by its features, like department, category, subcategory, and so on. Very often, in addition to the names, there are also numbers displayed &#8211; the numbers of documents with a given feature. I have met cases where those numbers were obtained using a separate query for each feature. The effect &#8211; 100 categories displayed on a web page led to 100 separate queries to Solr. Do not go this way &#8211; modify the Solr index if you have to and use the facet mechanism instead. Maybe it will be more work at that time, but in the long run it is certainly worth it.</p>
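<p>As a sketch, a single facet query on a hypothetical <em>category</em> field returns the counts for all categories at once, replacing the separate per-category count queries:</p>
<pre class="brush:xml">q=*:*&amp;rows=0&amp;facet=true&amp;facet.field=category</pre>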
<h3>A few words at the end</h3>
<p>Please note that these are just examples that I think are fairly universal &#8211; at least ones which I quite often encountered during my work. They are not all the errors that happen when using Solr, but I hope I highlighted some of the mistakes people tend to make and how to avoid them.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/11/6-deadly-sins-in-the-context-of-query/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr and PhraseQuery &#8211; phrase bonus in query stage</title>
		<link>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/</link>
					<comments>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 14 Jul 2010 09:19:38 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[boosting]]></category>
		<category><![CDATA[dismax]]></category>
		<category><![CDATA[edismax]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[phrase]]></category>
		<category><![CDATA[phrase query]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[standard]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=54</guid>

					<description><![CDATA[In the majority of system implementations I dealt with, sooner or later, there was a problem &#8211; search results tunning. One of the simplest ways to improve the search results quality was phrase boosting. Having the three most popular query]]></description>
										<content:encoded><![CDATA[<p>In the majority of system implementations I dealt with, sooner or later, there was a problem &#8211; search results tunning. One of the simplest ways to improve the search results quality was phrase boosting. Having the three most popular query parsers in Solr and the variety of parameters to control them I though it will be a good idea to check how they behave and how they affect performance.</p>
<p><span id="more-54"></span></p>
<p>In the current <em>trunk</em> of Solr we have three query parsers:</p>
<ul>
<li>Standard Solr Query Parser &#8211; default parser for Solr based on Lucene query parser</li>
<li>DisMax Query Parser</li>
<li>Extended DisMax Query Parser</li>
</ul>
<p>Each of the mentioned query parsers has its own capabilities in the case of phrase boosting at query time. I won&#8217;t mention index-time term proximity in this post &#8211; I&#8217;ll get back to it some other time. So, about the parsers now.</p>
<p><strong>Standard Solr Query Parser</strong></p>
<p>A parser based on the standard Lucene query parser, enhancing its parent&#8217;s capabilities. When it comes to phrase boosting, we don&#8217;t have much choice here. Let&#8217;s say that our system is a search system for a large Internet library, where users can rate books, leave comments and discuss books in the library forums. Our goal is to index all the data generated by the users and our suppliers and then present this data in our search results. When a user searches for &#8220;Java design patterns&#8221;&nbsp;we want to show him the books that have those words in a document. No problem, let&#8217;s make a Solr query like this:</p>
<p><code>q=java+design+patterns</code></p>
<p>So we get the results and we can say that our search engine is behaving well and we don&#8217;t need to improve the search quality. But I would add another part to the query &#8211; a part that would favor documents which contain the phrase (the words given in the query are next to each other in the document) in the searchable fields. It&#8217;s an easy step; our modified query would look like this:</p>
<p><code>q=java+design+patterns+OR+"java+design+patterns"^30</code></p>
<p>By adding that additional query part (<em>+OR+&#8221;java+design+patterns&#8221;^30</em>) we modified our search results &#8211; now the first position in our results is taken by the books which have the exact phrase in the searched fields. The Lucene query generated by the parser looks like this:</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;name:java name:design name:patterns PhraseQuery(name:"java design patterns"^30.0)&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;name:java name:design name:patterns name:"java design patterns"^30.0&lt;/str&gt;</pre>
<p>The search results for the above query are as follows:</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="q"&gt;java design patterns OR "java design patterns"^30&lt;/str&gt;
      &lt;str name="fl"&gt;score,id,name&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.2399161"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.2399161&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.010219089&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p><strong>DisMax Query Parser</strong></p>
<p>In addition to constructing queries in the manner described above, we can use the <strong>pf</strong> parameter and modify its behavior with the <strong>ps</strong> parameter. The <strong>pf</strong> parameter provides information about the fields in which phrases should be identified. The <strong>pf</strong> parameter is often used in a manner analogous to the <strong>qf</strong> parameter, specifying a list of searchable fields. In addition to that, we must specify the boost for the phrase, otherwise the default boost will be taken into consideration. The query using DisMax would look like this:</p>
<p><code>q=java+design+patterns&amp;defType=dismax&amp;qf=name&amp;pf=name^30&amp;ps=0</code></p>
<p>While the query passed to Lucene looks as follows:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;+((DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns)))~3) DisjunctionMaxQuery((name:"java design patterns"^30.0))&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;+(((name:java) (name:design) (name:patterns))~3) (name:"java design patterns"^30.0)&lt;/str&gt;</pre>
<p>The results for the query thus constructed are as follows:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="pf"&gt;name^30&lt;/str&gt;
      &lt;str name="fl"&gt;id,name,score&lt;/str&gt;
      &lt;str name="q"&gt;java design patterns&lt;/str&gt;
      &lt;str name="qf"&gt;name&lt;/str&gt;
      &lt;str name="defType"&gt;dismax&lt;/str&gt;
      &lt;str name="ps"&gt;0&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.2399161"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.2399161&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.013625451&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>It is noteworthy that the order of the results for both methods is the same. This follows from the fact that the phrase has been identified only in the document with the id of 1. Note that there is no difference in the value of <em>score</em> for the first document between the two methods. Of course the other documents, located on positions from 2 to 5, are in both cases on the same positions, but have different <em>score</em> values because of the difference in the query passed to Lucene.</p>
<p>But I used the <strong>ps</strong> parameter (set to 0) and didn&#8217;t mention why I did it. When you use the <strong>pf</strong> (and <strong>pf2</strong>, but more on that later) parameter, the <strong>ps</strong> parameter means <em>phrase slop</em> &#8211; the maximum distance of the words from each other at which they still form a phrase. For instance, <strong>ps=2</strong> means that the words can be a maximum of two positions from each other to form a phrase. Note, however, that although with <strong>ps=2</strong> both &#8220;Java sample design patterns&#8221; and &#8220;Java design patterns&#8221; will match the phrase, the document entitled &#8220;Java design patterns&#8221; will have a bigger <em>score</em> value, because its terms are located closer together.</p>
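<p>For illustration, a DisMax query allowing the phrase terms to be up to two positions apart would look like this (same hypothetical <em>name</em> field as above):</p>
<pre class="brush:xml">q=java+design+patterns&amp;defType=dismax&amp;qf=name&amp;pf=name^30&amp;ps=2</pre>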
<p><strong>Extended DisMax Query Parser</strong></p>
<p>Unfortunately, without using trunk you cannot use eDisMax yet. But anyway, a query using the eDisMax <em>enhanced term proximity boosting</em> would look like this:</p>
<p><code>q=java+design+patterns&amp;defType=edismax&amp;qf=name&amp;pf2=name^30&amp;ps=0</code></p>
<p>The above query creates the following query to Lucene:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;+(DisjunctionMaxQuery((name:java)) DisjunctionMaxQuery((name:design)) DisjunctionMaxQuery((name:patterns))) (DisjunctionMaxQuery((name:"java design"^30.0)) DisjunctionMaxQuery((name:"design patterns"^30.0)))&lt;/str&gt;
&lt;str name="parsedquery_toString"&gt;+((name:java) (name:design) (name:patterns)) ((name:"java design"^30.0) (name:"design patterns"^30.0))&lt;/str&gt;</pre>
<p>As can be seen, in addition to the standard DisjunctionMaxQuery produced by DisMax (as eDisMax is its expanded version), the extended DisMax parser also produced two additional queries &#8211; the ones responsible for <em>enhanced term proximity boosting</em>.&nbsp;The additional queries boost pairs of words created from the terms in the user query. In the presented case the created pairs were &#8220;java design&#8221; and &#8220;design patterns&#8221;. As you can guess, the most significant documents in the results list will be those containing both pairs, the next ones will have one of the pairs, and the remaining ones will have none. As proof I present the result of the above query sent to Solr:</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
   &lt;int name="status"&gt;0&lt;/int&gt;
   &lt;int name="QTime"&gt;0&lt;/int&gt;
   &lt;lst name="params"&gt;
      &lt;str name="fl"&gt;id,name,score&lt;/str&gt;
      &lt;str name="q"&gt;java design patterns&lt;/str&gt;
      &lt;str name="qf"&gt;name&lt;/str&gt;
      &lt;str name="pf2"&gt;name^30&lt;/str&gt;
      &lt;str name="defType"&gt;edismax&lt;/str&gt;
      &lt;str name="ps"&gt;0&lt;/str&gt;
   &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="5" start="0" maxScore="1.1705827"&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;1.1705827&lt;/float&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="name"&gt;Java design patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.3034844&lt;/float&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="name"&gt;Design patterns java&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.3034844&lt;/float&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="name"&gt;Patterns java design&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.014451639&lt;/float&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="name"&gt;Design java patterns&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;float name="score"&gt;0.014451639&lt;/float&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="name"&gt;Patterns design java&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>As you can see, the first document has not changed its position. On the second and third places are the documents that have one of the pairs generated by the parser. As a result, the documents with ids 2 and 5 have the same <em>score</em> value. The result list is closed by the documents that only have the individual terms present in the searchable fields.</p>
<p><strong>Performance</strong></p>
<p>In any case, it must be taken into account that individual features will affect the performance of applications based on Solr, so I thought I would do a simple performance test. The assumptions of the test are quite simple &#8211; index data from Wikipedia and for each phrase boost method create five queries, each of the queries assembled from two to six tokens. Solr cache disabled, restart of Solr after each query. The result is the arithmetic mean of 10 repetitions of each test. Before the test results, a few words about the index:</p>
<ul>
<li>Number of documents in the index: 1,177,239</li>
<li>Number of segments: 1</li>
<li>Number of terms: 18,506,646</li>
<li>Number of term/document pairs: 230,297,212</li>
<li>Number of tokens: 418,135,268</li>
<li>The size of the index: 4.6GB (optimized)</li>
<li>Lucene version used to build the index: 4.0-dev 964000</li>
</ul>
<p>Phrases that were selected for each iteration of the test:</p>
<ul>
<li>Iteration I: &#8220;Great Peter&#8221;</li>
<li>Iteration II: &#8220;World War Two&#8221;</li>
<li>Iteration III: &#8220;World War Two Germany&#8221;</li>
<li>Iteration IV: &#8220;Move Time Eastern Poland Reformation&#8221;</li>
<li>Iteration V: &#8220;Change Winter Cloths To Summer Cloths Now&#8221;</li>
</ul>
<p>The results were as follows:</p>
<p>[table “1” not found /]<br />
</p>
<p>Please note that the reported results concern only performance and do not suggest which phrase boosting method to choose &#8211; the choice of method is a matter of requirements and implementation. As for the results, you can see that the DisMax method is the quickest one.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/07/14/solr-and-phrasequery-phrase-bonus-in-query-stage/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
