<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>howto &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/howto-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 08:19:47 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>&#8220;Car sale application&#8221; – Spatial Search, adding location data (part 3)</title>
		<link>https://solr.pl/en/2011/03/14/car-sale-application-spatial-search-adding-location-data-part-3/</link>
					<comments>https://solr.pl/en/2011/03/14/car-sale-application-spatial-search-adding-location-data-part-3/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 14 Mar 2011 08:19:01 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[schema]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=221</guid>

					<description><![CDATA[The amount of announcements in our database is so large, that our web site users started to look for another option to filter search results and another way of sorting them. We need to add the functionality, which allows us]]></description>
										<content:encoded><![CDATA[<p>The number of announcements in our database has grown so large that our website users have started asking for another way to filter search results and another way to sort them. We need to add functionality that lets us work with location data related to the cars.</p>
<p><span id="more-221"></span></p>
<h2>Requirements specification</h2>
<p lang="pl-PL">We would like to add two new features:</p>
<ol>
<li>Filtering the results to display only those announcements that are located no farther than x kilometres from a given place, where x = 50, 100, 200, 500 or 1000 km.</li>
<li>Sorting the results by the distance between a given place and the car&#8217;s location.</li>
</ol>
<p>To meet these requirements, we need to use the Solr functionality called “Spatial Search”, which is available in the Solr distribution from version 3.1. The changes involve modifying the schema.xml file and the input data, where we have to add location information for every car. Finally, we will create the appropriate requests.</p>
<h2>Schema.xml changes</h2>
<ol>
<li>New field type definitions:
<ul>
<li>the first definition is nothing more than another numerical type, based on the &#8220;solr.TrieDoubleField&#8221; class:</li>
<pre class="brush:xml">&lt;fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;</pre>
<li>the second definition uses the &#8220;solr.LatLonType&#8221; class, which lets us index location data; internally it keeps the coordinates in dynamic fields with the &#8220;_coordinate&#8221; suffix:</li>
<pre class="brush:xml">&lt;fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/&gt;</pre>
</ul>
</li>
<li>New field definitions:
<ul>
<li>a field that will hold the name of the city related to every car:</li>
<pre class="brush:xml">&lt;field name="city" type="string" indexed="true" stored="true" /&gt;</pre>
<li>the &#8220;loc&#8221; field, which will be used to index location data:</li>
<pre class="brush:xml">&lt;field name="loc" type="location" indexed="true" stored="false"/&gt;</pre>
<li>the dynamic field used internally to hold the coordinate values provided by the &#8220;loc&#8221; field:</li>
<pre class="brush:xml">&lt;dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/&gt;</pre>
</ul>
</li>
</ol>
<h2>Input data analysis</h2>
<p>To show how to modify the input data, let&#8217;s take five announcements from the following cities:</p>
<ol>
<li>Koszalin
<ul>
<li><em>latitude</em>: 54.12</li>
<li><em>longitude</em>: 16.11</li>
</ul>
</li>
<li>Białystok
<ul>
<li><em>latitude</em>: 53.08</li>
<li><em>longitude</em>: 23.09</li>
</ul>
</li>
<li>Szczecin
<ul>
<li><em>latitude</em>: 53.25</li>
<li><em>longitude</em>: 14.35</li>
</ul>
</li>
<li>Gdańsk
<ul>
<li><em>latitude</em>: 54.21</li>
<li><em>longitude</em>: 18.40</li>
</ul>
</li>
<li>Warszawa
<ul>
<li><em>latitude</em>: 52.15</li>
<li><em>longitude</em>: 21.00</li>
</ul>
</li>
</ol>
<p>We provide the location data by entering the latitude and longitude, separated by a comma, in the &#8220;loc&#8221; field. Our data might look like this:
</p>
<pre class="brush:xml">&lt;add&gt;
   &lt;doc&gt;
      &lt;field name="id"&gt;1&lt;/field&gt;
      &lt;field name="make"&gt;Audi&lt;/field&gt;
      &lt;field name="model"&gt;80&lt;/field&gt;
      &lt;field name="year"&gt;2008&lt;/field&gt;
      &lt;field name="price"&gt;9774&lt;/field&gt;
      &lt;field name="engine_size"&gt;2000&lt;/field&gt;
      &lt;field name="mileage"&gt;92467&lt;/field&gt;
      &lt;field name="colour"&gt;green&lt;/field&gt;
      &lt;field name="damaged"&gt;false&lt;/field&gt;
      &lt;field name="city"&gt;Koszalin&lt;/field&gt;
      &lt;field name="loc"&gt;54.12,16.11&lt;/field&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;field name="id"&gt;2&lt;/field&gt;
      &lt;field name="make"&gt;Audi&lt;/field&gt;
      &lt;field name="model"&gt;A8&lt;/field&gt;
      &lt;field name="year"&gt;2009&lt;/field&gt;
      &lt;field name="price"&gt;9078&lt;/field&gt;
      &lt;field name="engine_size"&gt;1000&lt;/field&gt;
      &lt;field name="mileage"&gt;31369&lt;/field&gt;
      &lt;field name="colour"&gt;black&lt;/field&gt;
      &lt;field name="damaged"&gt;false&lt;/field&gt;
      &lt;field name="city"&gt;Białystok&lt;/field&gt;
      &lt;field name="loc"&gt;53.08,23.09&lt;/field&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;field name="id"&gt;3&lt;/field&gt;
      &lt;field name="make"&gt;Audi&lt;/field&gt;
      &lt;field name="model"&gt;TT&lt;/field&gt;
      &lt;field name="year"&gt;1997&lt;/field&gt;
      &lt;field name="price"&gt;1109&lt;/field&gt;
      &lt;field name="engine_size"&gt;1299&lt;/field&gt;
      &lt;field name="mileage"&gt;116987&lt;/field&gt;
      &lt;field name="colour"&gt;silver&lt;/field&gt;
      &lt;field name="damaged"&gt;true&lt;/field&gt;
      &lt;field name="city"&gt;Szczecin&lt;/field&gt;
      &lt;field name="loc"&gt;53.25,14.35&lt;/field&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;field name="id"&gt;4&lt;/field&gt;
      &lt;field name="make"&gt;BMW&lt;/field&gt;
      &lt;field name="model"&gt;Seria 7&lt;/field&gt;
      &lt;field name="year"&gt;2007&lt;/field&gt;
      &lt;field name="price"&gt;140000&lt;/field&gt;
      &lt;field name="engine_size"&gt;3000&lt;/field&gt;
      &lt;field name="mileage"&gt;418000&lt;/field&gt;
      &lt;field name="colour"&gt;green&lt;/field&gt;
      &lt;field name="damaged"&gt;false&lt;/field&gt;
      &lt;field name="city"&gt;Gdańsk&lt;/field&gt;
      &lt;field name="loc"&gt;54.21,18.40&lt;/field&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;field name="id"&gt;5&lt;/field&gt;
      &lt;field name="make"&gt;Chevrolet&lt;/field&gt;
      &lt;field name="model"&gt;TrailBlazer&lt;/field&gt;
      &lt;field name="year"&gt;2007&lt;/field&gt;
      &lt;field name="price"&gt;140000&lt;/field&gt;
      &lt;field name="engine_size"&gt;3000&lt;/field&gt;
      &lt;field name="mileage"&gt;418000&lt;/field&gt;
      &lt;field name="colour"&gt;green&lt;/field&gt;
      &lt;field name="damaged"&gt;false&lt;/field&gt;
      &lt;field name="city"&gt;Warszawa&lt;/field&gt;
      &lt;field name="loc"&gt;52.15,21.00&lt;/field&gt;
   &lt;/doc&gt;
&lt;/add&gt;</pre>
<h2>Let&#8217;s create queries</h2>
<p>We have our location data in the index, so all we need now is to create queries that satisfy our requirements. Let&#8217;s imagine that we are searching for announcements from Białystok, which is located about 200 km from Warszawa, about 400 km from Gdańsk, about 550 km from Koszalin and about 650 km from Szczecin.</p>
<p>To fulfil the first requirement, we add a special filter query to our request:
</p>
<pre class="brush:xml">...&amp;fq={!geofilt sfield=loc}&amp;pt=53.08,23.09&amp;d=50</pre>
<p>where:</p>
<ul>
<li><em>sfield</em> &#8211; the name of the field in which our location data is indexed.</li>
<li><em>pt</em> &#8211; the location of the starting point, given as latitude,longitude; in our case it is Białystok.</li>
<li><em>d</em> &#8211; the distance, in kilometres, used to narrow the search results. Using the values 50, 100, 200, 500 and 1000 covers all our needs.</li>
</ul>
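<p>As far as we know, the same three parameters can also be passed as local params inside the filter itself, which keeps the filter self-contained when the request carries other spatial parameters. A minimal sketch, using the Białystok coordinates and an illustrative distance of 100 km:</p>
<pre class="brush:xml">...&amp;fq={!geofilt sfield=loc pt=53.08,23.09 d=100}</pre>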
<p>Example:</p>
<ol>
<li>Query:
<pre class="brush:xml">q=*:*&amp;fq={!geofilt sfield=loc}&amp;pt=53.08,23.09&amp;d=200</pre>
</li>
<li>Search results:</li>
<pre class="brush:xml">&lt;result name="response" numFound="2" start="0"&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Białystok&lt;/str&gt;
      &lt;str name="colour"&gt;black&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;1000&lt;/int&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="make"&gt;Audi&lt;/str&gt;
      &lt;int name="mileage"&gt;31369&lt;/int&gt;
      &lt;str name="model"&gt;A8&lt;/str&gt;
      &lt;float name="price"&gt;9078.0&lt;/float&gt;
      &lt;int name="year"&gt;2009&lt;/int&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Warszawa&lt;/str&gt;
      &lt;str name="colour"&gt;green&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;3000&lt;/int&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="make"&gt;Chevrolet&lt;/str&gt;
      &lt;int name="mileage"&gt;418000&lt;/int&gt;
      &lt;str name="model"&gt;TrailBlazer&lt;/str&gt;
      &lt;float name="price"&gt;140000.0&lt;/float&gt;
      &lt;int name="year"&gt;2007&lt;/int&gt;
   &lt;/doc&gt;
&lt;/result&gt;</pre>
</ol>
<p>That&#8217;s great: there are no announcements from Koszalin, Gdańsk or Szczecin, as these cities are located farther than 200 km from Białystok.</p>
<p>To fulfil the second requirement, we sort the search results using the geodist function. The query would look like this:
</p>
<pre class="brush:xml">...&amp;sfield=loc&amp;pt=53.08,23.09&amp;sort=geodist()+desc</pre>
<p>An example of sorting the search results by distance, starting from Białystok:</p>
<ol>
<li>Query:
<pre class="brush:xml">q=*:*&amp;sfield=loc&amp;pt=53.08,23.09&amp;sort=geodist()+asc</pre>
</li>
<li>Search results:</li>
<pre class="brush:xml">&lt;result name="response" numFound="5" start="0"&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Białystok&lt;/str&gt;
      &lt;str name="colour"&gt;black&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;1000&lt;/int&gt;
      &lt;str name="id"&gt;2&lt;/str&gt;
      &lt;str name="make"&gt;Audi&lt;/str&gt;
      &lt;int name="mileage"&gt;31369&lt;/int&gt;
      &lt;str name="model"&gt;A8&lt;/str&gt;
      &lt;float name="price"&gt;9078.0&lt;/float&gt;
      &lt;int name="year"&gt;2009&lt;/int&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Warszawa&lt;/str&gt;
      &lt;str name="colour"&gt;green&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;3000&lt;/int&gt;
      &lt;str name="id"&gt;5&lt;/str&gt;
      &lt;str name="make"&gt;Chevrolet&lt;/str&gt;
      &lt;int name="mileage"&gt;418000&lt;/int&gt;
      &lt;str name="model"&gt;TrailBlazer&lt;/str&gt;
      &lt;float name="price"&gt;140000.0&lt;/float&gt;
      &lt;int name="year"&gt;2007&lt;/int&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Gdańsk&lt;/str&gt;
      &lt;str name="colour"&gt;green&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;3000&lt;/int&gt;
      &lt;str name="id"&gt;4&lt;/str&gt;
      &lt;str name="make"&gt;BMW&lt;/str&gt;
      &lt;int name="mileage"&gt;418000&lt;/int&gt;
      &lt;str name="model"&gt;Seria 7&lt;/str&gt;
      &lt;float name="price"&gt;140000.0&lt;/float&gt;
      &lt;int name="year"&gt;2007&lt;/int&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Koszalin&lt;/str&gt;
      &lt;str name="colour"&gt;green&lt;/str&gt;
      &lt;bool name="damaged"&gt;false&lt;/bool&gt;
      &lt;int name="engine_size"&gt;2000&lt;/int&gt;
      &lt;str name="id"&gt;1&lt;/str&gt;
      &lt;str name="make"&gt;Audi&lt;/str&gt;
      &lt;int name="mileage"&gt;92467&lt;/int&gt;
      &lt;str name="model"&gt;80&lt;/str&gt;
      &lt;float name="price"&gt;9774.0&lt;/float&gt;
      &lt;int name="year"&gt;2008&lt;/int&gt;
   &lt;/doc&gt;
   &lt;doc&gt;
      &lt;str name="city"&gt;Szczecin&lt;/str&gt;
      &lt;str name="colour"&gt;silver&lt;/str&gt;
      &lt;bool name="damaged"&gt;true&lt;/bool&gt;
      &lt;int name="engine_size"&gt;1299&lt;/int&gt;
      &lt;str name="id"&gt;3&lt;/str&gt;
      &lt;str name="make"&gt;Audi&lt;/str&gt;
      &lt;int name="mileage"&gt;116987&lt;/int&gt;
      &lt;str name="model"&gt;TT&lt;/str&gt;
      &lt;float name="price"&gt;1109.0&lt;/float&gt;
      &lt;int name="year"&gt;1997&lt;/int&gt;
   &lt;/doc&gt;
&lt;/result&gt;</pre>
</ol>
<p>That&#8217;s correct! Mission accomplished.</p>
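<p>Both requirements can presumably be combined in a single request, because the filter and the geodist function can share the same sfield and pt parameters. A minimal sketch, assuming the schema and the sample data shown above:</p>
<pre class="brush:xml">q=*:*&amp;fq={!geofilt}&amp;sfield=loc&amp;pt=53.08,23.09&amp;d=200&amp;sort=geodist()+asc</pre>
<p>With our five sample documents we would expect this request to return the announcements from Białystok and Warszawa, in that order.</p>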
<h2>The end</h2>
<p>Once more we have met our website users&#8217; expectations. This time we have added features that allow our users to filter and sort the search results using location and distance data. Full success!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/03/14/car-sale-application-spatial-search-adding-location-data-part-3/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>”Car sale” application – WordDelimiterFilter and PatternReplaceFilter, helping to improve search results (part 2)</title>
		<link>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/</link>
					<comments>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 14 Feb 2011 08:09:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[schema.xml]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=196</guid>

					<description><![CDATA[In the first part of our ”Car sale” application related posts we created some standard index structure by properly configuring schema.xml configuration file. It didn&#8217;t take long to hear the first complains from the website users with this kind of]]></description>
										<content:encoded><![CDATA[<p>In the <a href="http://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/" target="_blank" rel="noopener noreferrer">first part of our ”Car sale” application</a> series we created a standard index structure by properly configuring the schema.xml file. It didn&#8217;t take long to hear the first complaints from the website users about this configuration. Why don&#8217;t I receive any search results when entering the &#8220;audi a&#8221; phrase? I would like to see some announcements with &#8220;Audi A6&#8221; and &#8220;Audi A8&#8221;, for example. I entered the phrase &#8220;Honda crv&#8221; – 0 results, &#8220;Suzuki maruti&#8221; – none. Are there no related offers in the announcement database? There are! But the current configuration of the searchable field type (field &#8220;content&#8221; – type &#8220;text&#8221;) does not allow us to find those offers using the queries we&#8217;ve entered. That&#8217;s why the WordDelimiterFilter and PatternReplaceFilter need to enter the battlefield.</p>
<p><span id="more-196"></span></p>
<h2>Requirements specification</h2>
<p>We need to analyze the data that is indexed in the &#8220;content&#8221; field. Let&#8217;s examine the sample data that will help us create the new &#8220;text&#8221; type configuration:</p>
<ul>
<li><em>Make</em>: Audi<br />
<em>Model</em>: 80, 90, A6, A8, TT</li>
</ul>
<ul>
<li><em>Make</em>: BMW<br />
<em>Model</em>: M3, M5, Series 7, Series 8, X1, X3</li>
</ul>
<ul>
<li><em>Make</em>: Chevrolet<br />
<em>Model</em>: TrailBlazer</li>
</ul>
<ul>
<li><em>Make</em>: Citroen<br />
<em>Model</em>: C-Crosser, C3 Pluriel, C4 Picasso</li>
</ul>
<ul>
<li><em>Make</em>: Ford<br />
<em>Model</em>: C-MAX, S-MAX</li>
</ul>
<ul>
<li><em>Make</em>: Honda<br />
<em>Model</em>: Accord, CR-V, FR-V, HR-V</li>
</ul>
<ul>
<li><em>Make</em>: Kia<br />
<em>Model</em>: Cee&#8217;d</li>
</ul>
<ul>
<li><em>Make</em>: Suzuki<br />
<em>Model</em>: Alto/Maruti</li>
</ul>
<p>Make names are simple words that are easily handled by the current configuration (WhitespaceTokenizer + LowerCaseFilter). The problem lies with the model names, as they contain additional characters and separators that we often ignore when entering a search phrase. Let&#8217;s put the sample data into groups, which will help us with the upcoming configuration:</p>
<ol>
<li>Model names that do not need to be processed by any additional filters (the current &#8220;text&#8221; type configuration is sufficient) &#8211; 80, 90, TT, Series 7, Series 8, Accord.</li>
<li>Model names that contain letters and numbers, where we want to split on letter-number transitions &#8211; A6, A8, M3, M5, X1, X3, C3 Pluriel, C4 Picasso. We would like to be able to find those models when entering only the letter, only the number, or the whole model name.</li>
<li>Model names that contain case transitions – TrailBlazer. We would like to find the document with this name when entering &#8220;trail&#8221;, &#8220;blazer&#8221;, &#8220;trailBlazer&#8221; or &#8220;trailblazer&#8221;.</li>
<li>Model names that contain intra-word delimiters, which we want to either ignore or split on &#8211; C-Crosser, C-MAX, S-MAX, CR-V, FR-V, HR-V, Alto/Maruti.<br />
Example: we would like to find the document with the model name &#8220;C-MAX&#8221; when entering the phrases &#8220;c&#8221;, &#8220;max&#8221;, &#8220;c-max&#8221; or &#8220;cmax&#8221;.</li>
<li>We intentionally omitted the &#8220;Cee&#8217;d&#8221; model name from the 4th point, as we would like to treat it a little differently. We don&#8217;t want this model to be found when entering the &#8220;cee&#8221; or &#8220;d&#8221; phrases. We treat the name only as a whole word &#8211; &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;.</li>
</ol>
<h2>WordDelimiterFilter configuration</h2>
<p>Based on the groups described above, we are going to set the WordDelimiterFilter attributes to satisfy our needs:</p>
<ol>
<li>No WordDelimiterFilter settings are needed in this case, as the current &#8220;text&#8221; type configuration (WhitespaceTokenizer + LowerCaseFilter) is sufficient.</li>
<li>To meet the 2nd point requirements we need to set the following attributes:
<ul>
<li> <em>generateWordParts=&#8221;1&#8243;</em> &#8211; the value must be set to &#8220;1&#8221; if we want to generate word parts</li>
<li><em>generateNumberParts=&#8221;1&#8243;</em> &#8211; the value must be set to &#8220;1&#8221; if we want to generate number parts</li>
<li><em>splitOnNumerics=&#8221;1&#8243;</em> &#8211; the value must be set to &#8220;1&#8221; if we want to split on letter =&gt; number transitions</li>
</ul>
</li>
<li>To meet the 3rd point requirements we need to set the following attributes:
<ul>
<li> <em>generateWordParts=&#8221;1&#8243;</em></li>
<li><em>splitOnCaseChange=&#8221;1&#8243;</em> &#8211; the value must be set to &#8220;1&#8221; if we want to split on lowercase =&gt; uppercase transitions</li>
</ul>
</li>
<li>To meet the 4th point requirements we need to set the following attributes:
<ul>
<li> <em>generateWordParts=&#8221;1&#8243;</em></li>
<li><em>catenateWords=&#8221;1&#8243; </em>&#8211; the value must be set to &#8220;1&#8221; if we want to ignore the intra-word delimiters by joining the subwords</li>
</ul>
</li>
</ol>
<p>So let&#8217;s take a look at our WordDelimiterFilter configuration:
</p>
<pre class="brush:xml">&lt;filter class="solr.WordDelimiterFilterFactory"
 splitOnCaseChange="1"
 splitOnNumerics="1"
 generateWordParts="1"
 generateNumberParts="1"
 catenateWords="1"
/&gt;</pre>
<p>Additionally, we may notice that the default value of both the &#8220;splitOnCaseChange&#8221; and &#8220;splitOnNumerics&#8221; attributes is &#8220;1&#8221;, so we can omit them. The &#8220;stemEnglishPossessive&#8221; attribute is also enabled by default, so we switch it off explicitly. Our configuration can therefore be written as:
</p>
<pre class="brush:xml">&lt;filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1"
 generateNumberParts="1"
 catenateWords="1"
 stemEnglishPossessive="0"
/&gt;</pre>
<p>What about the 5th point of our data specification? As we have stated, we don&#8217;t want to treat the &#8220;&#8216;&#8221; sign (the apostrophe) as an intra-word delimiter. So maybe we could use the protected=&#8221;protwords.txt&#8221; option of the WordDelimiterFilter, which would keep the word &#8220;Cee&#8217;d&#8221; unchanged? That would work, but we would also like to be able to find this model when entering the &#8220;ceed&#8221; phrase, so this option is not good enough for us. The best solution is to handle this case in a separate filter and leave the WordDelimiterFilter with nothing to do.</p>
<h2>PatternReplaceFilter configuration</h2>
<p>We are going to put the PatternReplaceFilter before the WordDelimiterFilter. Using the PatternReplaceFilter we can remove the &#8221; &#8216; &#8221; sign by replacing it with an empty string. Configured this way, the WordDelimiterFilter will receive the &#8220;Ceed&#8221; token and will not modify it. The configuration of the filters will be the same for indexing and searching, so a user will be able to find the offer with the &#8220;Cee&#8217;d&#8221; model when entering either &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;:
</p>
<pre class="brush:xml">&lt;filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" /&gt;</pre>
<h2>New &#8220;text&#8221; type configuration visualization</h2>
<p>Let&#8217;s take a look at our new &#8220;text&#8221; type:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
 &lt;analyzer&gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
   &lt;filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" /&gt;
   &lt;filter class="solr.WordDelimiterFilterFactory"
    generateWordParts="1"
    generateNumberParts="1"
    catenateWords="1"
    stemEnglishPossessive="0"
  /&gt;
  &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>We are going to use Solr&#8217;s administration panel to check whether the configuration we&#8217;ve created is correct:</p>
<p><a href="http://solr.pl/wp-content/uploads/2011/02/11.jpg"><img fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-840" src="http://solr.pl/wp-content/uploads/2011/02/11.jpg" alt="" width="725" height="70"></a></p>
<p><a href="http://solr.pl/wp-content/uploads/2011/02/2.jpg"><img decoding="async" class="aligncenter size-full wp-image-841" src="http://solr.pl/wp-content/uploads/2011/02/2.jpg" alt="" width="694" height="740"></a></p>
<ol>
<li> (Model: &#8220;80&#8221;) As we expected, our new filters don&#8217;t influence the data typical for the 1st point.</li>
<li>(Model: &#8220;A8&#8221;) WordDelimiterFilter split the value on the letter-number transition.</li>
<li>(Model: &#8220;TrailBlazer&#8221;) WordDelimiterFilter split the value on the case transition, generating the &#8220;trail&#8221; and &#8220;blazer&#8221; tokens. Additionally, the joined &#8220;trailblazer&#8221; token lets us enter the &#8220;trailblazer&#8221; phrase. Superb!</li>
<li>(Model: &#8220;CR-V&#8221;) WordDelimiterFilter ignored the intra-word delimiter by generating the subwords (&#8220;cr&#8221; and &#8220;v&#8221;) and additionally joining them (&#8220;crv&#8221;).</li>
<li>(Model: &#8220;Cee&#8217;d&#8221;) PatternReplaceFilter has replaced the &#8220;Cee&#8217;d&#8221; word with &#8220;Ceed&#8221; and the WordDelimiterFilter has simply passed the value through. That&#8217;s what we needed.</li>
</ol>
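<p>To double-check the configuration outside of the analysis page, we could send a few queries against the &#8220;content&#8221; field. The list below is only a sketch: it assumes that the sample models from the requirements specification have been indexed with the new &#8220;text&#8221; type, and each line is a separate request that we would expect to return at least one announcement:</p>
<pre class="brush:xml">q=content:crv
q=content:trailblazer
q=content:ceed
q=content:cee'd</pre>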
<h2>The end</h2>
<p>In this post we&#8217;ve shown how to configure two new filters, WordDelimiterFilter and PatternReplaceFilter, in order to improve the quality of the search results. Our website users are satisfied &#8230; for now.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;Car sale application&#8221; &#8211; schema.xml designing to gain what we really need (part 1)</title>
		<link>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/</link>
					<comments>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 31 Jan 2011 08:07:42 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=192</guid>

					<description><![CDATA[One of the fundamental solr&#8217;s configuration file is the schema.xml file. It is a kind of connector between what we need and what solr understands. If we want to have a search engine, that gives us search results we really]]></description>
										<content:encoded><![CDATA[<p>One of Solr&#8217;s fundamental configuration files is the schema.xml file. It is a kind of connector between what we need and what Solr understands. If we want a search engine that gives us the search results we really expect, then it is very important to design the schema.xml configuration file properly.<br />
We would like to introduce the first in a series of articles which will hopefully show how to design the schema.xml file and how to handle and modify all of its components.</p>
<p><span id="more-192"></span></p>
<h2>Requirements specification</h2>
<p>Imagine we would like to use Solr to provide our car sale website with a search engine. At the beginning, the functional part of our website is rather primitive and uses only a small piece of the information about every car:</p>
<ul>
<li>make</li>
<li>model</li>
<li>year of production</li>
<li>price</li>
<li>engine size</li>
<li>mileage</li>
<li>colour</li>
<li>damaged</li>
</ul>
<p>We would like to design a simple schema configuration file which will make it possible to index data from the given fields. But before we open the schema.xml file and start typing, let&#8217;s answer seven fundamental questions related to our fields:</p>
<h3>1. What is the field type?</h3>
<p>Let&#8217;s determine the type of every field:</p>
<ul>
<li>make &#8211; text field</li>
<li>model &#8211; text field</li>
<li>year of production &#8211; integer field</li>
<li>price &#8211; float field</li>
<li>engine size &#8211; integer field</li>
<li>mileage &#8211; integer field</li>
<li>colour &#8211; textual field</li>
<li>damaged &#8211; logical field</li>
</ul>
<h4>So what?</h4>
<p>So we will need some basic type definitions like string, boolean, int, float.</p>
<h3>2. Is the field used in the search process?</h3>
<p>We would like to use the data from some fields to enable our search engine to find the proper documents (car sale announcements). To accomplish that we are going to use 3 fields: make, model and year of production.</p>
<h4>So what?</h4>
<p>So we will need to create another field type, which will contain some filters to make finding the documents easy and efficient. We will create another field of the newly created type, where we will put all the data from the make, model and year of production fields.</p>
<h3>3. Is the field used in faceting or sorting operations?</h3>
<p>On our website we would like to sort search results using 4 fields: model, year of production, price and mileage. We would also like to be able to use faceting on the fields: make, model, year of production and colour.</p>
<h4>So what?</h4>
<p>When we create a field type for fields used for sorting or faceting, we need to remember that this type cannot contain tokenizers or filters that split the field values into multiple tokens. But we still want the values to be lowercased, so that letter case does not influence the sorting and faceting results. So that&#8217;s another field type we will need to create.</p>
<h3>4. Is the field used to filter search results?</h3>
<p>We would like to be able to filter search results using ranges on the fields: year of production, price, engine size and mileage.</p>
<h4>So what?</h4>
<p>So let&#8217;s use the field types which will accelerate range queries.</p>
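<p>As an illustration, a range filter on such fields could look like the request below. This is only a sketch: the field names &#8220;price&#8221; and &#8220;year&#8221; and the boundary values are assumptions made for the example:</p>
<pre class="brush:xml">q=*:*&amp;fq=price:[5000+TO+10000]&amp;fq=year:[2005+TO+*]</pre>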
<h3>5. Are there any fields which are not mentioned in questions 2, 3 or 4?</h3>
<p>There is a field &#8220;damaged&#8221; which is not supposed to be involved in any of the mentioned operations.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;indexed&#8221; attribute to &#8220;false&#8221;.</p>
<h3>6. Is the field required?</h3>
<p>We assume that 3 fields are required: make, model and year. We don&#8217;t want to have documents in the index (car sale announcements available in the search process) which do not have values in those fields.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;required&#8221; attribute to &#8220;true&#8221;.</p>
<h3>7. Do we need to retrieve the information from the field in its original form?</h3>
<p>We would like to retrieve the information from all of the fields mentioned in the requirements specification and present them directly on the website.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;stored&#8221; attribute to &#8220;true&#8221;.</p>
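<p>To make the last three answers concrete, here is a sketch of how the &#8220;indexed&#8221;, &#8220;required&#8221; and &#8220;stored&#8221; attributes could look on two fields. The field names &#8220;damaged&#8221; and &#8220;year&#8221; are assumed here, and the &#8220;boolean&#8221; and &#8220;tint&#8221; types are the ones we add in the next section:</p>
<pre class="brush:xml">&lt;!-- not searched, sorted, faceted or filtered; only returned with the results --&gt;
&lt;field name="damaged" type="boolean" indexed="false" stored="true" /&gt;
&lt;!-- required, used for range filtering, and returned with the results --&gt;
&lt;field name="year" type="tint" indexed="true" stored="true" required="true" /&gt;</pre>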
<h2>Let&#8217;s add field type definitions</h2>
<p>We&#8217;ve answered our questions and come to some conclusions, so let&#8217;s add the field types to the schema file:</p>
<p>We add the solr.StrField type, which is not analysed and can be used for example as the type for the unique document key.
</p>
<pre class="brush:xml">&lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;</pre>
<p>Add the boolean type:
</p>
<pre class="brush:xml">&lt;fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/&gt;</pre>
<p>Now the numerical types. Remember that we need types that can help us to accelerate range queries. So let&#8217;s use the tint and tfloat types:
</p>
<pre class="brush:xml">    &lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;
    &lt;fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;</pre>
<p>Now let&#8217;s create the text type, which will be the type of the catch-all field used for searching. For now, a type with the whitespace tokenizer and the lowercase filter will be just fine:
</p>
<pre class="brush:xml">    &lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;</pre>
<p>And last, but not least, the type for the sortable/facetable fields. What we need is a type that lowercases the entire field value while keeping it as a single token. KeywordTokenizer does no actual tokenizing, so it is the ideal tokenizer for our needs. The TrimFilterFactory removes any leading or trailing whitespace:
</p>
<pre class="brush:xml">    &lt;fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
        &lt;filter class="solr.TrimFilterFactory" /&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;</pre>
<h2>Time to add field definitions</h2>
<p>Document id:
</p>
<pre class="brush:xml">   &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;</pre>
<p>Make and model:
</p>
<pre class="brush:xml">   &lt;field name="make" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="model" type="text" indexed="false" stored="true" required="true" /&gt;</pre>
<p>Now why is the value of the &#8220;indexed&#8221; attribute set to &#8220;false&#8221;? As far as we know, we need those fields for search, sort and facet operations. That&#8217;s true &#8230; but note that for searching purposes we will copy the data from those fields to one catch-all field:
</p>
<pre class="brush:xml">   &lt;field name="content" type="text" indexed="true" stored="false" multiValued="true"/&gt;</pre>
<p>and for sorting/faceting purposes we will also copy the data to other fields of the type &#8220;lowercase&#8221;:
</p>
<pre class="brush:xml">   &lt;field name="make_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="model_sort" type="lowercase" indexed="true" stored="false" /&gt;</pre>
<p>So the make and model fields will not take part in those operations themselves, and we can set their &#8220;indexed&#8221; attribute to &#8220;false&#8221; to keep the index smaller.</p>
<p>The rest of the fields:
</p>
<pre class="brush:xml">   &lt;field name="year" type="tint" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="price" type="tfloat" indexed="true" stored="true" /&gt;
   &lt;field name="engine_size" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="mileage" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="colour" type="lowercase" indexed="true" stored="true" /&gt;</pre>
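<p>To make the range-query requirement concrete, here is a minimal sketch of how filters on these trie fields could look. It assumes a search handler defined in solrconfig.xml; the handler name &#8220;/carsale-range&#8221; and the range values are made up for illustration, and in a real application the ranges would normally arrive as fq parameters of the request rather than as fixed defaults:</p>
<pre class="brush:xml">&lt;!-- Hypothetical handler name and values, not part of the schema built in this post --&gt;
&lt;requestHandler name="/carsale-range" class="solr.SearchHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;!-- cars produced between 2005 and 2010, both ends inclusive --&gt;
    &lt;str name="fq"&gt;year:[2005 TO 2010]&lt;/str&gt;
    &lt;!-- any price up to 20000; * leaves the lower bound open --&gt;
    &lt;str name="fq"&gt;price:[* TO 20000]&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>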
<p>Remember to set the &#8220;indexed&#8221; attribute of the &#8220;damaged&#8221; field to &#8220;false&#8221;:
</p>
<pre class="brush:xml"> &lt;field name="damaged" type="boolean" indexed="false" stored="true" /&gt;</pre>
<h2>copyField &#8211; let&#8217;s index the same data differently</h2>
<p>We have mentioned the field values copying several times already so now let&#8217;s define copy fields.</p>
<p>Fields used for searching are copied to the catch-all &#8220;content&#8221; field. There is more than one source field, which is why the &#8220;content&#8221; field definition has the multiValued attribute set to &#8220;true&#8221;:
</p>
<pre class="brush:xml"> &lt;copyField source="make" dest="content"/&gt;
 &lt;copyField source="model" dest="content"/&gt;
 &lt;copyField source="year" dest="content"/&gt;</pre>
<p>Copying the sortable/facetable fields:
</p>
<pre class="brush:xml"> &lt;copyField source="make" dest="make_sort"/&gt;
 &lt;copyField source="model" dest="model_sort"/&gt;</pre>
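<p>With the copies in place, sorting and faceting can run on the &#8220;_sort&#8221; fields while the original make and model fields are only returned for display. A minimal sketch, assuming a hypothetical search handler in solrconfig.xml (the name &#8220;/carsale-browse&#8221; and the chosen field list are invented for illustration):</p>
<pre class="brush:xml">&lt;!-- Hypothetical handler name; the field names come from the schema in this post --&gt;
&lt;requestHandler name="/carsale-browse" class="solr.SearchHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;!-- sort on the lowercased single-token copies, not on make/model --&gt;
    &lt;str name="sort"&gt;make_sort asc, model_sort asc&lt;/str&gt;
    &lt;!-- count announcements per make --&gt;
    &lt;str name="facet"&gt;true&lt;/str&gt;
    &lt;str name="facet.field"&gt;make_sort&lt;/str&gt;
    &lt;!-- return only stored fields; the _sort copies are not stored --&gt;
    &lt;str name="fl"&gt;id,make,model,year,price&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>Keep in mind that facet values are the indexed terms, so with the &#8220;lowercase&#8221; type they come back lowercased and trimmed.</p>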
<h2>Anything else?</h2>
<p>We need to add three more elements to the schema:</p>
<p>The unique key of the document:
</p>
<pre class="brush:xml"> &lt;uniqueKey&gt;id&lt;/uniqueKey&gt;</pre>
<p>Default search field:
</p>
<pre class="brush:xml"> &lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;</pre>
<p>Default query parser operator. Let&#8217;s set it to &#8220;AND&#8221;.
</p>
<pre class="brush:xml"> &lt;solrQueryParser defaultOperator="AND"/&gt;</pre>
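<p>Both settings are only defaults: a request, or a handler, can override them with the standard df and q.op parameters. A minimal sketch, assuming a hypothetical handler named &#8220;/carsale-any&#8221; in solrconfig.xml:</p>
<pre class="brush:xml">&lt;!-- Hypothetical handler name; df and q.op are standard query parameters --&gt;
&lt;requestHandler name="/carsale-any" class="solr.SearchHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;!-- search this field when the query names no field --&gt;
    &lt;str name="df"&gt;content&lt;/str&gt;
    &lt;!-- match documents containing any of the query terms --&gt;
    &lt;str name="q.op"&gt;OR&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>With the schema default of &#8220;AND&#8221;, a query with two terms must match both of them in the &#8220;content&#8221; field; with q.op set to &#8220;OR&#8221;, matching either one is enough.</p>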
<p>It&#8217;s done! The schema.xml configuration file is ready and looks like this:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;

&lt;schema name="carsale" version="1.2"&gt;

  &lt;types&gt;
    &lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;

    &lt;fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/&gt;

    &lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;
    &lt;fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;

    &lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;

    &lt;fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
        &lt;filter class="solr.TrimFilterFactory" /&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;

 &lt;/types&gt;

 &lt;fields&gt;
   &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="make" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="model" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="make_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="model_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="year" type="tint" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="price" type="tfloat" indexed="true" stored="true" /&gt;
   &lt;field name="engine_size" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="mileage" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="colour" type="lowercase" indexed="true" stored="true" /&gt;
   &lt;field name="damaged" type="boolean" indexed="false" stored="true" /&gt;
   &lt;field name="content" type="text" indexed="true" stored="false" multiValued="true"/&gt;

 &lt;/fields&gt;

 &lt;uniqueKey&gt;id&lt;/uniqueKey&gt;

 &lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;

 &lt;solrQueryParser defaultOperator="AND"/&gt;

 &lt;copyField source="make" dest="content"/&gt;
 &lt;copyField source="model" dest="content"/&gt;
 &lt;copyField source="make" dest="make_sort"/&gt;
 &lt;copyField source="model" dest="model_sort"/&gt;
 &lt;copyField source="year" dest="content"/&gt;

&lt;/schema&gt;</pre>
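<p>To try the schema out, a document in Solr&#8217;s XML update format could look like the sketch below. All values are invented for illustration; only the field names come from the schema above. Note that content, make_sort and model_sort are not sent at all, because the copyField rules fill them in:</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc&gt;
    &lt;!-- required fields: id, make, model and year --&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="make"&gt;Audi&lt;/field&gt;
    &lt;field name="model"&gt;A4&lt;/field&gt;
    &lt;field name="year"&gt;2005&lt;/field&gt;
    &lt;!-- optional fields --&gt;
    &lt;field name="price"&gt;15000.0&lt;/field&gt;
    &lt;field name="engine_size"&gt;1900&lt;/field&gt;
    &lt;field name="mileage"&gt;85000&lt;/field&gt;
    &lt;field name="colour"&gt;Red&lt;/field&gt;
    &lt;!-- stored only, so it can be displayed but not searched --&gt;
    &lt;field name="damaged"&gt;false&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre>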
<h2>The end</h2>
<p>In today&#8217;s post we have created a simple schema.xml file, which allows us to index data so that we can support the search functionality of our car sale website. We still want to develop the website further, which will surely affect the schema &#8230; and not only the schema. In the next &#8220;car sale&#8221; post we will face some new requirements and make further modifications.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
