{"id":196,"date":"2011-02-14T09:09:28","date_gmt":"2011-02-14T08:09:28","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=196"},"modified":"2020-11-11T09:10:05","modified_gmt":"2020-11-11T08:10:05","slug":"car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/02\/14\/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2\/","title":{"rendered":"\u201dCar sale\u201d application \u2013 WordDelimiterFilter and PatternReplaceFilter, helping to improve search results (part 2)"},"content":{"rendered":"<p>In the <a href=\"http:\/\/solr.pl\/en\/2011\/01\/31\/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1\/\" target=\"_blank\" rel=\"noopener noreferrer\">first part of our \u201dCar sale\u201d application<\/a> series we created a standard index structure by configuring the schema.xml file. With this configuration, it didn&#8217;t take long to hear the first complaints from the website users. Why don&#8217;t I get any search results when I enter the phrase &#8220;audi a&#8221;? I would like to see announcements for &#8220;Audi A6&#8221; and &#8220;Audi A8&#8221;, for example. I entered the phrase &#8220;Honda crv&#8221; \u2013 0 results, &#8220;Suzuki maruti&#8221; \u2013 none. Are there no related offers in the announcement database? There are! But the current configuration of the searchable field type (field &#8220;content&#8221; \u2013 type &#8220;text&#8221;) does not allow us to find those offers with the queries we&#8217;ve entered. That&#8217;s why the WordDelimiterFilter and PatternReplaceFilter need to enter the battlefield.<\/p>\n\n\n<!--more-->\n\n\n<h2>Requirements specification<\/h2>\n<p>We need to analyze the data that is indexed in the &#8220;content&#8221; field.
Let&#8217;s examine the sample data that will help us create the new &#8220;text&#8221; type configuration:<\/p>\n<ul>\n<li><em>Make<\/em>: Audi<br>\n<em>Model<\/em>: 80, 90, A6, A8, TT<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: BMW<br>\n<em>Model<\/em>: M3, M5, Series 7, Series 8, X1, X3<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Chevrolet<br>\n<em>Model<\/em>: TrailBlazer<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Citroen<br>\n<em>Model<\/em>: C-Crosser, C3 Pluriel, C4 Picasso<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Ford<br>\n<em>Model<\/em>: C-MAX, S-MAX<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Honda<br>\n<em>Model<\/em>: Accord, CR-V, FR-V, HR-V<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Kia<br>\n<em>Model<\/em>: Cee&#8217;d<\/li>\n<\/ul>\n<ul>\n<li><em>Make<\/em>: Suzuki<br>\n<em>Model<\/em>: Alto\/Maruti<\/li>\n<\/ul>\n<p>Make names are simple words that are easily handled by the current configuration (WhitespaceTokenizer + LowerCaseFilter). The problem is with the model names, as they contain additional characters and separators that we often omit when entering the search phrase. Let&#8217;s put the sample data into groups that will help us with the upcoming configuration:<\/p>\n<ol>\n<li>Model names that do not need to be processed by any additional filters (the current &#8220;text&#8221; type configuration is sufficient) &#8211; 80, 90, TT, Series 7, Series 8, Accord<\/li>\n<li>Model names that contain letters and numbers, where we want to split on letter-number transitions &#8211; A6, A8, M3, M5, X1, X3, C3 Pluriel, C4 Picasso. We would like to be able to find those models when entering only the letter, only the number, or the whole model name.<\/li>\n<li>Model names that contain case transitions \u2013 TrailBlazer.
We would like to find the document with this name when entering &#8220;trail&#8221;, &#8220;blazer&#8221;, &#8220;trailBlazer&#8221; or &#8220;trailblazer&#8221;.<\/li>\n<li>Model names that contain intra-word delimiters, which we want to either ignore or split on &#8211; C-Crosser, C-MAX, S-MAX, CR-V, FR-V, HR-V, Alto\/Maruti.<br>\nExample: we would like to find the document with the model name &#8220;C-MAX&#8221; when entering the phrases &#8220;c&#8221;, &#8220;max&#8221;, &#8220;c-max&#8221; or &#8220;cmax&#8221;.<\/li>\n<li>We intentionally left the &#8220;Cee&#8217;d&#8221; model name out of the 4th point, as we would like to treat it a little differently. We don&#8217;t want to find this model when entering the phrases &#8220;cee&#8221; or &#8220;d&#8221;. We treat the name only as a whole word &#8211; &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;.<\/li>\n<\/ol>\n<h2>WordDelimiterFilter configuration<\/h2>\n<p>Given the requirements described above, we are going to set the WordDelimiterFilter attributes to satisfy our needs:<\/p>\n<ol>\n<li>WordDelimiterFilter is not needed in this case, as the current &#8220;text&#8221; type configuration (WhitespaceTokenizer + LowerCaseFilter) is sufficient.<\/li>\n<li>To meet the requirements of the 2nd point, we need to set the proper values of the following attributes:\n<ul>\n<li><em>generateWordParts=&quot;1&quot;<\/em> &#8211; must be set to &#8220;1&#8221; if we want to generate parts of words<\/li>\n<li><em>generateNumberParts=&quot;1&quot;<\/em> &#8211; must be set to &#8220;1&#8221; if we want to generate parts of numbers<\/li>\n<li><em>splitOnNumerics=&quot;1&quot;<\/em> &#8211; must be set to &#8220;1&#8221; if we want to split on letter =&gt; number transitions<\/li>\n<\/ul>\n<\/li>\n<li>To meet the requirements of the 3rd point, we need to set the proper values
of the following attributes:\n<ul>\n<li><em>generateWordParts=&quot;1&quot;<\/em><\/li>\n<li><em>splitOnCaseChange=&quot;1&quot;<\/em> &#8211; must be set to &#8220;1&#8221; if we want to split on lowercase =&gt; uppercase transitions<\/li>\n<\/ul>\n<\/li>\n<li>To meet the requirements of the 4th point, we need to set the proper values of the following attributes:\n<ul>\n<li><em>generateWordParts=&quot;1&quot;<\/em><\/li>\n<li><em>catenateWords=&quot;1&quot;<\/em> &#8211; must be set to &#8220;1&#8221; if we want to ignore the intra-word delimiters by joining the subwords<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>So let&#8217;s take a look at our WordDelimiterFilter configuration:\n<\/p>\n<pre class=\"brush:xml\">&lt;filter class=\"solr.WordDelimiterFilterFactory\"\n splitOnCaseChange=\"1\"\n splitOnNumerics=\"1\"\n generateWordParts=\"1\"\n generateNumberParts=\"1\"\n catenateWords=\"1\"\n\/&gt;<\/pre>\n<p>Additionally, note that the &#8220;splitOnCaseChange&#8221; and &#8220;splitOnNumerics&#8221; attributes default to &#8220;1&#8221;, so we can omit them. The &#8220;stemEnglishPossessive&#8221; attribute also defaults to &#8220;1&#8221;, so we switch it off explicitly. Our configuration can therefore be reduced to:\n<\/p>\n<pre class=\"brush:xml\">&lt;filter class=\"solr.WordDelimiterFilterFactory\"\n generateWordParts=\"1\"\n generateNumberParts=\"1\"\n catenateWords=\"1\"\n stemEnglishPossessive=\"0\"\n\/&gt;<\/pre>\n<p>What about the 5th point of our data specification? As stated, we don&#8217;t want to treat the apostrophe (&#8220;&#39;&#8221;) as an intra-word delimiter. So maybe we could use the protected=&quot;protwords.txt&quot; option of the WordDelimiterFilter, which would keep the word &#8220;Cee&#8217;d&#8221; unchanged? OK, but we would also like to find this model when entering the phrase &#8220;ceed&#8221;, so this option is not good enough for us.
The best solution is to handle this case in a separate filter and leave the WordDelimiterFilter with nothing to do.<\/p>\n<h2>PatternReplaceFilter configuration<\/h2>\n<p>We are going to put the PatternReplaceFilter before the WordDelimiterFilter. With the PatternReplaceFilter we can remove the apostrophe (&#8220;&#39;&#8221;) by replacing it with an empty string. Configured this way, the WordDelimiterFilter receives the &#8220;Ceed&#8221; token and does not modify it. The filter configuration is the same for indexing and searching, so a user will be able to find the offer with the &#8220;Cee&#8217;d&#8221; model when entering either &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;:\n<\/p>\n<pre>&lt;filter class=\"solr.PatternReplaceFilterFactory\" pattern=\"'\" replacement=\"\" replace=\"all\" \/&gt;<\/pre>\n<h2>New &#8220;text&#8221; type configuration visualization<\/h2>\n<p>Let&#8217;s take a look at our new &#8220;text&#8221; type:\n<\/p>\n<pre class=\"brush:xml\">&lt;fieldType name=\"text\" class=\"solr.TextField\" positionIncrementGap=\"100\"&gt;\n &lt;analyzer&gt;\n  &lt;tokenizer class=\"solr.WhitespaceTokenizerFactory\"\/&gt;\n  &lt;filter class=\"solr.PatternReplaceFilterFactory\" pattern=\"'\" replacement=\"\" replace=\"all\" \/&gt;\n  &lt;filter class=\"solr.WordDelimiterFilterFactory\"\n   generateWordParts=\"1\"\n   generateNumberParts=\"1\"\n   catenateWords=\"1\"\n   stemEnglishPossessive=\"0\"\n  \/&gt;\n  &lt;filter class=\"solr.LowerCaseFilterFactory\"\/&gt;\n &lt;\/analyzer&gt;\n&lt;\/fieldType&gt;<\/pre>\n<p>We are going to use Solr&#8217;s administration panel to check whether the configuration we&#8217;ve created is correct:<\/p>\n<p><a href=\"http:\/\/solr.pl\/wp-content\/uploads\/2011\/02\/11.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-840\" src=\"http:\/\/solr.pl\/wp-content\/uploads\/2011\/02\/11.jpg\" alt=\"\" width=\"725\"
height=\"70\"><\/a><\/p>\n<p><a href=\"http:\/\/solr.pl\/wp-content\/uploads\/2011\/02\/2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-841\" src=\"http:\/\/solr.pl\/wp-content\/uploads\/2011\/02\/2.jpg\" alt=\"\" width=\"694\" height=\"740\"><\/a><\/p>\n<ol>\n<li>(Model: &#8220;80&#8221;) As expected, our new filters don&#8217;t affect the data typical for the 1st point.<\/li>\n<li>(Model: &#8220;A8&#8221;) WordDelimiterFilter split the token on the letter-number transition.<\/li>\n<li>(Model: &#8220;TrailBlazer&#8221;) WordDelimiterFilter split the token on the case transition, generating the &#8220;trail&#8221; and &#8220;blazer&#8221; tokens. Additionally, we can enter the phrase &#8220;trailblazer&#8221;. Superb!<\/li>\n<li>(Model: &#8220;CR-V&#8221;) WordDelimiterFilter handled the intra-word delimiter by generating the subwords (&#8220;cr&#8221; and &#8220;v&#8221;) and additionally joining them (&#8220;crv&#8221;).<\/li>\n<li>(Model: &#8220;Cee&#8217;d&#8221;) PatternReplaceFilter replaced the word &#8220;Cee&#8217;d&#8221; with &#8220;Ceed&#8221; and the WordDelimiterFilter simply passed the value through. That&#8217;s what we needed.<\/li>\n<\/ol>\n<h2>The end<\/h2>\n<p>In this post we&#8217;ve shown how to configure two new filters, WordDelimiterFilter and PatternReplaceFilter, in order to improve the quality of the search results. Our website users are satisfied &#8230; for now.<\/p>","protected":false},"excerpt":{"rendered":"<p>In the first part of our \u201dCar sale\u201d application related posts we created some standard index structure by properly configuring schema.xml configuration file.
It didn&#8217;t take long to hear the first complaints from the website users with this kind of<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[288,178],"class_list":["post-196","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-howto-2","tag-schema-xml-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/196","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=196"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/196\/revisions"}],"predecessor-version":[{"id":197,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/196\/revisions\/197"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=196"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=196"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=196"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}