<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rafał Andrzejewski &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/author/randrzjewski/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 20:51:09 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>“Car sale application” – solr.ReversedWildcardFilter – let&#8217;s optimize  wildcard queries (part 8)</title>
		<link>https://solr.pl/en/2011/10/10/car-sale-application-solr-reversedwildcardfilter-lets-optimize-wildcard-queries-part-8/</link>
					<comments>https://solr.pl/en/2011/10/10/car-sale-application-solr-reversedwildcardfilter-lets-optimize-wildcard-queries-part-8/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Andrzejewski]]></dc:creator>
		<pubDate>Mon, 10 Oct 2011 19:50:25 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=372</guid>

					<description><![CDATA[“Car sale application” users started to use wildcard queries more and more often. This fact forced us to think about wildcard queries optimization. solr.ReversedWildcardFilter comes to rescue us. solr.ReversedWildcardFilter The solr.ReversedWildcardFilter filter provides us with new tokens, which in fact]]></description>
										<content:encoded><![CDATA[<p>“Car sale application” users have started to use wildcard queries more and more often, which forced us to think about optimizing them. solr.ReversedWildcardFilter comes to the rescue.</p>
<p><span id="more-372"></span></p>
<h3>solr.ReversedWildcardFilter</h3>
<p>The solr.ReversedWildcardFilter filter produces additional tokens, which are reversed copies of the original tokens; they are indexed so that leading wildcard queries can run faster. The filter supports the following init arguments:</p>
<ul>
<li><em>withOriginal</em> &#8211; if true, then produce both original and reversed tokens at the same positions. If false, then produce only reversed tokens.</li>
<li><em>maxPosAsterisk</em> &#8211; maximum position (1-based) of the asterisk wildcard (&#8216;*&#8217;) that triggers the reversal of query term. Asterisk that occurs at positions higher than this value will not cause the reversal of query term.</li>
<li><em>maxPosQuestion</em> &#8211; maximum position (1-based) of the question mark wildcard (&#8216;?&#8217;) that triggers the reversal of query term.</li>
<li><em>maxFractionAsterisk</em> &#8211; additional parameter that triggers the reversal if asterisk (&#8216;*&#8217;) position is less than this fraction of the query token length.</li>
<li><em>minTrailing</em> &#8211; minimum number of trailing characters in query token after the last wildcard character. For good performance this should be set to a value larger than 1.</li>
</ul>
<h3>schema.xml changes</h3>
<p>A new filter is added to the “text” field type:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField"
	positionIncrementGap="100"&gt;
	&lt;analyzer type="index"&gt;
		&lt;tokenizer class="solr.WhitespaceTokenizerFactory" /&gt;
		&lt;filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" /&gt;
		&lt;filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" /&gt;
		&lt;filter class="solr.LowerCaseFilterFactory" /&gt;
		<strong>&lt;filter class="solr.ReversedWildcardFilterFactory" /&gt;</strong>
	&lt;/analyzer&gt;
	&lt;analyzer type="query"&gt;
		&lt;tokenizer class="solr.WhitespaceTokenizerFactory" /&gt;
		&lt;filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" /&gt;
		&lt;filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" /&gt;
		&lt;filter class="solr.LowerCaseFilterFactory" /&gt;
	&lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>The solr.ReversedWildcardFilterFactory filter is added only to the index analyzer. We do not define any arguments in the filter definition, because we want to use the default configuration, which is:</p>
<ul>
<li><em>withOriginal</em> &#8211; “true”, because we want to keep the original tokens as well</li>
<li><em>maxPosAsterisk</em> &#8211; 2</li>
<li><em>maxPosQuestion</em> &#8211; 1</li>
<li><em>maxFractionAsterisk</em> &#8211; 0.0f (disabled)</li>
<li><em>minTrailing</em> &#8211; 2</li>
</ul>
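<p>For reference, spelling the same defaults out explicitly would look roughly like the sketch below. The attribute names are the init arguments listed earlier and the values are the defaults, so this definition should behave the same as the argument-less one used in our schema:</p>
<pre class="brush:xml">&lt;filter class="solr.ReversedWildcardFilterFactory"
	withOriginal="true"
	maxPosAsterisk="2" maxPosQuestion="1"
	maxFractionAsterisk="0.0"
	minTrailing="2" /&gt;</pre>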
<h3>Sample data</h3>
<p>Let&#8217;s index some sample data:
</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="make"&gt;Lancia&lt;/field&gt;
    &lt;field name="model"&gt;Delta&lt;/field&gt;
    ...
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;2&lt;/field&gt;
    &lt;field name="make"&gt;Land Rover&lt;/field&gt;
    &lt;field name="model"&gt;Defender&lt;/field&gt;
    ...
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;3&lt;/field&gt;
    &lt;field name="make"&gt;Acura&lt;/field&gt;
    &lt;field name="model"&gt;MDX&lt;/field&gt;
    ...
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;4&lt;/field&gt;
    &lt;field name="make"&gt;Acura&lt;/field&gt;
    &lt;field name="model"&gt;RDX&lt;/field&gt;
    ...
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;field name="id"&gt;5&lt;/field&gt;
    &lt;field name="make"&gt;Acura&lt;/field&gt;
    &lt;field name="model"&gt;RSX&lt;/field&gt;
    ...
  &lt;/doc&gt;
&lt;/add&gt;</pre>
<h3>Let&#8217;s create queries</h3>
<p>Let me remind you that the default search field is the “content” field, which contains, among others, the values of the “make” and “model” fields. To analyse the query results and the behaviour of solr.ReversedWildcardFilter, we will set the “stored” attribute of the “content” field to “true”. We will also add the debugQuery parameter, which shows which tokens (original or reversed) are used in query processing.</p>
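<p>Making the field stored comes down to a single attribute. A minimal sketch, assuming the “content” field is otherwise defined as in part 5 of this series (the data has to be re-indexed before the stored values show up in the results):</p>
<pre class="brush:xml">&lt;field name="content" type="text" indexed="true" stored="true" multiValued="true"/&gt;</pre>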
<ol>
<li>?q=lan*&amp;fl=id,content&amp;debugQuery=on
<pre class="brush:xml">&lt;result name="response" numFound="2" start="0"&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Lancia&lt;/str&gt;
      &lt;str&gt;Delta&lt;/str&gt;
      &lt;str&gt;2002&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;1&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Land Rover&lt;/str&gt;
      &lt;str&gt;Defender&lt;/str&gt;
      &lt;str&gt;2002&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;2&lt;/str&gt;
  &lt;/doc&gt;
&lt;/result&gt;
&lt;lst name="debug"&gt;
  &lt;str name="rawquerystring"&gt;lan*&lt;/str&gt;
  &lt;str name="querystring"&gt;lan*&lt;/str&gt;
  &lt;str name="parsedquery"&gt;content:lan*&lt;/str&gt;
  &lt;str name="parsedquery_toString"&gt;content:lan*&lt;/str&gt;
  ...
&lt;/lst&gt;</pre>
<p>We used the asterisk wildcard (&#8216;*&#8217;) at the end of the query (position 4), which is beyond the maxPosAsterisk value of 2, so the original tokens were used:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;content:lan*&lt;/str&gt;</pre>
</li>
<li>?q=*dx&amp;fl=id,content&amp;debugQuery=on
<pre class="brush:xml">&lt;result name="response" numFound="2" start="0"&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Acura&lt;/str&gt;
      &lt;str&gt;MDX&lt;/str&gt;
      &lt;str&gt;2002&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;3&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Acura&lt;/str&gt;
      &lt;str&gt;RDX&lt;/str&gt;
      &lt;str&gt;2003&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;4&lt;/str&gt;
  &lt;/doc&gt;
&lt;/result&gt;
&lt;lst name="debug"&gt;
  &lt;str name="rawquerystring"&gt;*dx&lt;/str&gt;
  &lt;str name="querystring"&gt;*dx&lt;/str&gt;
  &lt;str name="parsedquery"&gt;content:#1;xd*&lt;/str&gt;
  &lt;str name="parsedquery_toString"&gt;content:#1;xd*&lt;/str&gt;
  ...
&lt;/lst&gt;</pre>
<p>We used the asterisk wildcard (&#8216;*&#8217;) at the beginning of the query (position 1) and we have two trailing characters after the last wildcard. That&#8217;s why the reversed tokens were used:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;content:#1;xd*&lt;/str&gt;</pre>
<p>As we can see, the reversed tokens have a special prefix in order to avoid collisions and false matches.</p>
</li>
<li>?q=r?x&amp;fl=id,content&amp;debugQuery=on
<pre class="brush:xml">&lt;result name="response" numFound="2" start="0"&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Acura&lt;/str&gt;
      &lt;str&gt;RDX&lt;/str&gt;
      &lt;str&gt;2003&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;4&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;arr name="content"&gt;
      &lt;str&gt;Acura&lt;/str&gt;
      &lt;str&gt;RSX&lt;/str&gt;
      &lt;str&gt;2006&lt;/str&gt;
    &lt;/arr&gt;
    &lt;str name="id"&gt;5&lt;/str&gt;
  &lt;/doc&gt;
&lt;/result&gt;
&lt;lst name="debug"&gt;
  &lt;str name="rawquerystring"&gt;r?x&lt;/str&gt;
  &lt;str name="querystring"&gt;r?x&lt;/str&gt;
  &lt;str name="parsedquery"&gt;content:r?x&lt;/str&gt;
  &lt;str name="parsedquery_toString"&gt;content:r?x&lt;/str&gt;
  ...
&lt;/lst&gt;</pre>
<p>We used the question mark wildcard (&#8216;?&#8217;) at position 2, which is beyond the maxPosQuestion value of 1, and we have only one trailing character after the wildcard. The original tokens were used:
</p>
<pre class="brush:xml">&lt;str name="parsedquery"&gt;content:r?x&lt;/str&gt;</pre>
</li>
</ol>
<h3>The end</h3>
<p>Thanks to the solr.ReversedWildcardFilter filter, we have successfully optimized wildcard queries. “Car sale application” users can now effectively use them <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/10/10/car-sale-application-solr-reversedwildcardfilter-lets-optimize-wildcard-queries-part-8/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>“Car sale application”– Result Grouping, two additional parameters description (part 7)</title>
		<link>https://solr.pl/en/2011/08/01/car-sale-application-result-grouping-two-additional-parameters-description-part-7/</link>
					<comments>https://solr.pl/en/2011/08/01/car-sale-application-result-grouping-two-additional-parameters-description-part-7/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Andrzejewski]]></dc:creator>
		<pubDate>Mon, 01 Aug 2011 19:46:10 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=361</guid>

					<description><![CDATA[In the last “car sale application” related post we have described the result grouping functionality. Today I would like to show you how easily we can determine the number of groups and how to sort documents within every group. Requirements specification]]></description>
										<content:encoded><![CDATA[<p>In the last “car sale application” related post we described the result grouping functionality. Today I would like to show you how easily we can determine the number of groups and how to sort documents within every group.</p>
<p><span id="more-361"></span></p>
<h3>Requirements specification</h3>
<p>I would like to create a grouping query that shows the number of generated groups and returns only one result within every group – the one with the lowest price in its year group.</p>
<h3>New functionality request parameters description</h3>
<p>What we need is:</p>
<ul>
<li><em>group.ngroups</em> &#8211; boolean type parameter that allows us to include the number of generated groups</li>
<li><em>group.sort</em> &#8211; parameter describing how to sort documents within a group</li>
</ul>
<h3>Let&#8217;s create a query</h3>
<p>Using the query example from the previous post, let&#8217;s add two new parameters:
</p>
<pre class="brush:xml">?q=audi+a4&amp;group=true&amp;group.field=year_group&amp;group.limit=1&amp;fl=id,mileage,make,model,year,price&amp;group.ngroups=true&amp;group.sort=price+asc</pre>
<p>As you can see, we&#8217;ve also set the group.limit parameter to 1 (in order to have only one result within every group) and extended the fl parameter with the price field. The response is:
</p>
<pre class="brush:xml">&lt;lst name="grouped"&gt;
  &lt;lst name="year_group"&gt;
    &lt;int name="matches"&gt;5&lt;/int&gt;
    &lt;int name="ngroups"&gt;3&lt;/int&gt;
    &lt;arr name="groups"&gt;
      &lt;lst&gt;
        &lt;str name="groupValue"&gt;2002&lt;/str&gt;
        &lt;result name="doclist" numFound="2" start="0"&gt;
          &lt;doc&gt;
            &lt;str name="id"&gt;3&lt;/str&gt;
            &lt;str name="make"&gt;Audi&lt;/str&gt;
            &lt;int name="mileage"&gt;125000&lt;/int&gt;
            &lt;str name="model"&gt;A4&lt;/str&gt;
            &lt;float name="price"&gt;21300.0&lt;/float&gt;
            &lt;int name="year"&gt;2002&lt;/int&gt;
          &lt;/doc&gt;
        &lt;/result&gt;
      &lt;/lst&gt;
      &lt;lst&gt;
        &lt;str name="groupValue"&gt;2003&lt;/str&gt;
        &lt;result name="doclist" numFound="2" start="0"&gt;
          &lt;doc&gt;
            &lt;str name="id"&gt;2&lt;/str&gt;
            &lt;str name="make"&gt;Audi&lt;/str&gt;
            &lt;int name="mileage"&gt;220000&lt;/int&gt;
            &lt;str name="model"&gt;A4&lt;/str&gt;
            &lt;float name="price"&gt;27800.0&lt;/float&gt;
            &lt;int name="year"&gt;2003&lt;/int&gt;
          &lt;/doc&gt;
        &lt;/result&gt;
      &lt;/lst&gt;
      &lt;lst&gt;
        &lt;str name="groupValue"&gt;2006&lt;/str&gt;
        &lt;result name="doclist" numFound="1" start="0"&gt;
          &lt;doc&gt;
            &lt;str name="id"&gt;5&lt;/str&gt;
            &lt;str name="make"&gt;Audi&lt;/str&gt;
            &lt;int name="mileage"&gt;9900&lt;/int&gt;
            &lt;str name="model"&gt;A4&lt;/str&gt;
            &lt;float name="price"&gt;32100.0&lt;/float&gt;
            &lt;int name="year"&gt;2006&lt;/int&gt;
          &lt;/doc&gt;
        &lt;/result&gt;
      &lt;/lst&gt;
    &lt;/arr&gt;
  &lt;/lst&gt;
&lt;/lst&gt;</pre>
<p>As we see in the response, we have a new element that shows us how many groups are generated:
</p>
<pre class="brush:xml">&lt;int name="ngroups"&gt;3&lt;/int&gt;</pre>
<p>We also have only one document in each year group – the car with the lowest price in its year group. Don&#8217;t believe me? Look at the query responses from the previous post and compare the prices <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
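<p>The same mechanism works in the other direction. As a sketch, assuming we wanted the most expensive car in each year group instead, only the sort order in <em>group.sort</em> changes:</p>
<pre class="brush:xml">?q=audi+a4&amp;group=true&amp;group.field=year_group&amp;group.limit=1&amp;fl=id,mileage,make,model,year,price&amp;group.ngroups=true&amp;group.sort=price+desc</pre>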
<h3>The end</h3>
<p>That was a quick review of two more grouping parameters. Big thanks to David Martin, who suggested the subject by asking some questions under the previous post <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/08/01/car-sale-application-result-grouping-two-additional-parameters-description-part-7/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>“Car sale application” – SpellCheckComponent – did you really mean that ? (part 5)</title>
		<link>https://solr.pl/en/2011/05/23/car-sale-application-spellcheckcomponent-did-you-really-mean-that-part-5/</link>
					<comments>https://solr.pl/en/2011/05/23/car-sale-application-spellcheckcomponent-did-you-really-mean-that-part-5/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Andrzejewski]]></dc:creator>
		<pubDate>Mon, 23 May 2011 18:46:24 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=268</guid>

					<description><![CDATA[The time has come to add another important functionality to our car sale application. It will be the spell checking mechanism with the ability to construct a new query from the suggestions. It has become the main functionality of every]]></description>
										<content:encoded><![CDATA[<p>The time has come to add another important functionality to our car sale application: a spell checking mechanism with the ability to construct a new query from the suggestions. It has become a core feature of every search engine, so we will make use of it as well.</p>
<p><span id="more-268"></span></p>
<h2>Requirements specification</h2>
<p>Our car database is so large that it contains many different names of makes and models. Some of those names can be really hard to spell:</p>
<ol>
<li>
<ul>
<li><em>make</em>: Bugatti</li>
<li><em>model</em>: Veyron</li>
</ul>
</li>
<li>
<ul>
<li><em>make</em>: Daewoo</li>
<li><em>model</em>: Lacetti</li>
</ul>
</li>
<li>
<ul>
<li><em>make</em>: Cadillac</li>
<li><em>model</em>: Brougham</li>
</ul>
</li>
<li>
<ul>
<li><em>make</em>: Ford</li>
<li><em>model</em>: Capri</li>
</ul>
</li>
<li>
<ul>
<li><em>make</em>: Maserati</li>
<li><em>model</em>: Coupe</li>
</ul>
</li>
</ol>
<p>Example queries in which misspelled words caused no search results to be returned:</p>
<ol>
<li>?q=bugati+weyron</li>
<li>?q=daewo+laceti</li>
<li>?q=cadilac+brogham</li>
<li>?q=ford+kapri</li>
<li>?q=maseratti+coupe</li>
</ol>
<p>We would like to add a functionality that, when an incorrect name is entered, suggests the phrase the user probably intended. We will then be able to use that suggestion to find the documents related to the proper phrase.</p>
<h2>solrconfig.xml changes</h2>
<p>The most important element, which should be added to the solrconfig.xml configuration file is the <em>solr.SpellCheckComponent</em>. Let&#8217;s try to add the simple standard configuration of this component and find out how it works:
</p>
<pre class="brush:xml">&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
    &lt;lst name="spellchecker"&gt;
      &lt;str name="classname"&gt;solr.IndexBasedSpellChecker&lt;/str&gt;
      &lt;str name="spellcheckIndexDir"&gt;./spellchecker&lt;/str&gt;
      &lt;str name="field"&gt;content&lt;/str&gt;
      &lt;str name="buildOnCommit"&gt;true&lt;/str&gt;
    &lt;/lst&gt;
&lt;/searchComponent&gt;</pre>
<p>Let&#8217;s explain the attributes used in this component:</p>
<ol>
<li>
<ul>
<li><em>classname</em> – the name of the class that implements our spellcheck mechanism. We are using the solr.IndexBasedSpellChecker class, which uses a spelling dictionary based on the Solr/Lucene index.</li>
</ul>
<ul>
<li><em>spellcheckIndexDir</em> – the directory name which holds the spellcheck index.</li>
</ul>
<ul>
<li><em>field</em> – the name of the field defined in the schema.xml file, used as the  source field to generate the spellcheck index. In our case it will be the “content” field (why? &#8211; it will be explained later).</li>
</ul>
<ul>
<li><em>buildOnCommit</em> – if the value of this attribute is set to <em>true</em>, then the spellcheck index will be automatically built after every commit to the main Solr index.</li>
</ul>
</li>
</ol>
<p>Now that we have our component defined, let&#8217;s add it to a handler so that we can make use of it. The best option is to add it to our standard, default handler, so that a single request returns both the query results and the suggestions. Before the changes, our default handler looked like this:
</p>
<pre class="brush:xml">&lt;requestHandler name="standard" class="solr.SearchHandler" default="true"&gt;
     &lt;lst name="defaults"&gt;
       &lt;str name="echoParams"&gt;explicit&lt;/str&gt;
     &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>After the change, it looks like this:</p>
<pre class="brush:xml">&lt;requestHandler name="standard" class="solr.SearchHandler" default="true"&gt;
     &lt;lst name="defaults"&gt;
       &lt;str name="echoParams"&gt;explicit&lt;/str&gt;
       &lt;str name="spellcheck"&gt;true&lt;/str&gt;
       &lt;str name="spellcheck.collate"&gt;true&lt;/str&gt;
     &lt;/lst&gt;
     &lt;arr name="last-components"&gt;
       &lt;str&gt;spellcheck&lt;/str&gt;
     &lt;/arr&gt;
&lt;/requestHandler&gt;</pre>
<p>As we can see, we have added the spellcheck component and two more default values:</p>
<ol>
<li>
<ul>
<li><em>spellcheck</em> – when set to <em>true</em>, every request also generates spellcheck suggestions.</li>
</ul>
<ul>
<li><em>spellcheck.collate</em> &#8211; when set to <em>true</em> causes the mechanism to choose the best suggestion for every word entered and to construct a new query containing proper words. If the spellchecker recognises a word to be correct, it leaves it unchanged.</li>
</ul>
</li>
</ol>
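<p>With <em>buildOnCommit</em> enabled there is nothing more to do. As a hedged aside, if that option were left out, the spellcheck index could be built on demand by adding the <em>spellcheck.build</em> parameter to a request. The query below is only a sketch; it assumes the handler configured above and uses a match-all query as a placeholder:</p>
<pre class="brush:xml">?q=*:*&amp;spellcheck=true&amp;spellcheck.build=true</pre>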
<h2>schema.xml changes</h2>
<p>A possible change to the schema.xml configuration file would be to add a field for the <em>solr.SpellCheckComponent</em> to use as the source of tokens for spell checking. The field should contain the data we want in the spellcheck index, and its type should ensure proper tokenization. It should also be free of any stemming/lemmatization filters, which could badly affect the spellcheck results.</p>
<p>Our schema already contains a field that fulfils all those requirements &#8211; the “content” field. As a reminder, it is the default search field used by our search engine. The current field and type definitions look like this:
</p>
<pre class="brush:xml">&lt;field name="content" type="text" indexed="true" stored="false" multiValued="true"/&gt;</pre>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
 &lt;analyzer&gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
   &lt;filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" /&gt;
   &lt;filter class="solr.WordDelimiterFilterFactory"
    generateWordParts="1"
    generateNumberParts="1"
    catenateWords="1"
    stemEnglishPossessive="0"
  /&gt;
  &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>The values of three fields are copied to the &#8220;content&#8221; field: make, model and year:
</p>
<pre class="brush:xml">&lt;copyField source="make" dest="content"/&gt;
&lt;copyField source="model" dest="content"/&gt;
&lt;copyField source="year" dest="content"/&gt;</pre>
<h2>Let’s create queries</h2>
<p>Let&#8217;s take the queries from the requirements specification that returned no results and add the spellcheck.q parameter, whose value is the same as that of the q parameter. Now, with a single request, we get the search results together with the spellcheck suggestions:</p>
<ol>
<li>?q=bugati+weyron&amp;spellcheck.q=bugati+weyron
<ul>
<pre class="brush:xml">&lt;result name="response" numFound="0" start="0" /&gt;
&lt;lst name="spellcheck"&gt;
  &lt;lst name="suggestions"&gt;
    &lt;lst name="bugati"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;0&lt;/int&gt;
      &lt;int name="endOffset"&gt;6&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;bugatti&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
    &lt;lst name="weyron"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;7&lt;/int&gt;
      &lt;int name="endOffset"&gt;13&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;veyron&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
      &lt;str name="collation"&gt;bugatti veyron&lt;/str&gt;
    &lt;/lst&gt;
&lt;/lst&gt;</pre>
<p>The spellcheck mechanism has corrected the query tokens and the collation functionality has generated the corrected query, which can now simply be used to fetch the proper search results. Let&#8217;s check the rest of the queries:</p>
</ul>
</li>
<li>?q=daewo+laceti&amp;spellcheck.q=daewo+laceti
<ul>
<pre class="brush:xml">&lt;result name="response" numFound="0" start="0" /&gt;
&lt;lst name="spellcheck"&gt;
  &lt;lst name="suggestions"&gt;
    &lt;lst name="daewo"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;0&lt;/int&gt;
      &lt;int name="endOffset"&gt;5&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;daewoo&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
    &lt;lst name="laceti"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;6&lt;/int&gt;
      &lt;int name="endOffset"&gt;12&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;lacetti&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
      &lt;str name="collation"&gt;daewoo lacetti&lt;/str&gt;
    &lt;/lst&gt;
&lt;/lst&gt;</pre>
</ul>
</li>
<li>?q=cadilac+brogham&amp;spellcheck.q=cadilac+brogham
<ul>
<pre class="brush:xml">&lt;result name="response" numFound="0" start="0" /&gt;
&lt;lst name="spellcheck"&gt;
  &lt;lst name="suggestions"&gt;
    &lt;lst name="cadilac"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;0&lt;/int&gt;
      &lt;int name="endOffset"&gt;7&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;cadillac&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
    &lt;lst name="brogham"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;8&lt;/int&gt;
      &lt;int name="endOffset"&gt;15&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;brougham&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
      &lt;str name="collation"&gt;cadillac brougham&lt;/str&gt;
    &lt;/lst&gt;
&lt;/lst&gt;</pre>
</ul>
</li>
<li>?q=ford+kapri&amp;spellcheck.q=ford+kapri
<ul>
<pre class="brush:xml">&lt;result name="response" numFound="0" start="0" /&gt;
&lt;lst name="spellcheck"&gt;
  &lt;lst name="suggestions"&gt;
    &lt;lst name="kapri"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;5&lt;/int&gt;
      &lt;int name="endOffset"&gt;10&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;capri&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
      &lt;str name="collation"&gt;ford capri&lt;/str&gt;
    &lt;/lst&gt;
&lt;/lst&gt;</pre>
</ul>
</li>
<li>?q=maseratti+coupe&amp;spellcheck.q=maseratti+coupe
<ul>
<pre class="brush:xml">&lt;result name="response" numFound="0" start="0" /&gt;
&lt;lst name="spellcheck"&gt;
  &lt;lst name="suggestions"&gt;
    &lt;lst name="maseratti"&gt;
      &lt;int name="numFound"&gt;1&lt;/int&gt;
      &lt;int name="startOffset"&gt;0&lt;/int&gt;
      &lt;int name="endOffset"&gt;9&lt;/int&gt;
      &lt;arr name="suggestion"&gt;
        &lt;str&gt;maserati&lt;/str&gt;
      &lt;/arr&gt;
    &lt;/lst&gt;
      &lt;str name="collation"&gt;maserati coupe&lt;/str&gt;
    &lt;/lst&gt;
&lt;/lst&gt;</pre>
</ul>
</li>
</ol>
<p>The spellcheck mechanism worked well, correcting all of the misspellings and generating the corrected queries. In the last two cases (4 and 5) we can see that the component left the correctly spelled words unchanged (4 – ford, 5 – coupe) but still used them to construct the collation.</p>
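<p>Acting on a suggestion is then a matter of sending the collation back as a new query. A sketch for the first example, assuming the application simply takes the value of the <em>collation</em> element and URL-encodes it:</p>
<pre class="brush:xml">?q=bugatti+veyron</pre>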
<h2>The end</h2>
<p>Our search engine now has yet another feature: a spell checking mechanism. Now all we have to do is wait for some comments … and maybe some improvements will follow <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/05/23/car-sale-application-spellcheckcomponent-did-you-really-mean-that-part-5/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;Car sale application&#8221; – Unicode Collation, sorting text in a language-sensitive way (part 4)</title>
		<link>https://solr.pl/en/2011/04/11/car-sale-application-unicode-collation-sorting-text-in-a-language-sensitive-way-part-4/</link>
					<comments>https://solr.pl/en/2011/04/11/car-sale-application-unicode-collation-sorting-text-in-a-language-sensitive-way-part-4/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Andrzejewski]]></dc:creator>
		<pubDate>Mon, 11 Apr 2011 18:38:34 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=258</guid>

					<description><![CDATA[In the third part of our ”Car sale” application related posts we added some location data and the information about the city that is related to every car. Shortly afterwards we added the possibility to sort using the city field]]></description>
										<content:encoded><![CDATA[<p>In the <a href="http://solr.pl/en/2011/03/14/car-sale-application-–-spatial-search-adding-location-data-part-3/" target="_blank" rel="noopener noreferrer">third part</a> of our “Car sale” application related posts we added some location data and the information about the city that is related to every car. Shortly afterwards we added the possibility to sort by the city field by simply modifying the schema:</p>
<p><span id="more-258"></span></p>
<pre class="brush:xml">&lt;field name="city_sort" type="lowercase" indexed="true" stored="false" /&gt;
...
&lt;copyField source="city" dest="city_sort"/&gt;</pre>
<p>It turned out that sorting by the city_sort field did not work as we expected, all because of the Polish characters appearing in the city names. What should we do about it?</p>
<h2>Requirements specification</h2>
<p>Let&#8217;s check whether sorting by the “city_sort” field really does not work well with Polish characters. When we enter the query:
</p>
<pre class="brush:xml">q=*:*&amp;fl=city&amp;sort=city_sort+asc</pre>
<p>we have the result:
</p>
<pre class="brush:xml">&lt;result name="response" numFound="6" start="0"&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Białystok&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Koszalin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Szczecin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Warszawa&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Świdnik&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Łowicz&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;</pre>
<p>That&#8217;s really not what we expect. We would like to have:
</p>
<pre class="brush:xml">&lt;result name="response" numFound="6" start="0"&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Białystok&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Koszalin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Łowicz&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Szczecin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Świdnik&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Warszawa&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;</pre>
<p>To make the sorting work correctly, we will use the “solr.CollationKeyFilter” filter.</p>
<h2>solr.CollationKeyFilter</h2>
<p>The solr.CollationKeyFilter filter is used at index time; it indexes special &#8220;sort keys&#8221; into the sort field. It allows us to choose a collator for the desired language and country. We can also choose the strength of the collation, which determines the minimum level of difference considered significant during comparison. For example:
</p>
<pre class="brush:xml">&lt;filter class="solr.CollationKeyFilterFactory" language="es" country="ES" strength="primary" /&gt;</pre>
<p>The example shows a configuration of the solr.CollationKeyFilterFactory that handles the Spanish language with the <a href="http://download.oracle.com/javase/1.5.0/docs/api/java/text/Collator.html#PRIMARY" target="_blank" rel="noopener noreferrer">primary</a> strength.</p>
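<p>The other strength levels are configured the same way. A sketch, assuming we wanted a stricter comparison than <em>primary</em> (which differences count at each level depends on the collation rules of the chosen locale):</p>
<pre class="brush:xml">&lt;filter class="solr.CollationKeyFilterFactory" language="es" country="ES" strength="secondary" /&gt;</pre>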
<h2>Schema.xml changes</h2>
<ol>
<li>New field types definitions:
<ul>
<pre class="brush:xml">&lt;fieldType name="polishLowercase" class="solr.TextField" positionIncrementGap="100"&gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
    &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
    &lt;filter class="solr.TrimFilterFactory" /&gt;
    &lt;filter class="solr.CollationKeyFilterFactory"  language="pl" country="PL" strength="primary" /&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>As you may notice, this is the existing “lowercase” type with the solr.CollationKeyFilter added to handle the Polish language. The type will be used for fields whose data contains Polish characters.</p>
</ul>
</li>
<li>New “city_sort” field definition:
<ul>
<li>let&#8217;s change the type of the “city_sort” field to “polishLowercase”:</li>
<pre class="brush:xml">&lt;field name="city_sort" type="polishLowercase" indexed="true" stored="false" /&gt;</pre>
</ul>
</li>
</ol>
<h2>Functional tests</h2>
<p>Before we check if the given field type change is just what we need, we must remember that the solr.CollationKeyFilter is used at index time, so we need to re-index all of the data.</p>
<p>Now let&#8217;s check our test query result:
</p>
<pre class="brush:xml">q=*:*&amp;fl=city&amp;sort=city_sort+asc</pre>
<p>It appears that the result is correct:
</p>
<pre class="brush:xml">&lt;result name="response" numFound="6" start="0"&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Białystok&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Koszalin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Łowicz&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Szczecin&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Świdnik&lt;/str&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Warszawa&lt;/str&gt;
   &lt;/doc&gt;
&lt;/result&gt;</pre>
<h2>The end</h2>
<p>Yet another reported problem has been solved. We improved the quality of sorting for values containing Polish characters by adding the solr.CollationKeyFilter, which fully met our needs. Now we can only wait for further reports and improvements <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/04/11/car-sale-application-unicode-collation-sorting-text-in-a-language-sensitive-way-part-4/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
