{"id":268,"date":"2011-05-23T20:46:24","date_gmt":"2011-05-23T18:46:24","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=268"},"modified":"2020-11-11T20:46:53","modified_gmt":"2020-11-11T19:46:53","slug":"car-sale-application-spellcheckcomponent-did-you-really-mean-that-part-5","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/05\/23\/car-sale-application-spellcheckcomponent-did-you-really-mean-that-part-5\/","title":{"rendered":"\u201cCar sale application\u201d \u2013 SpellCheckComponent \u2013 did you really mean that ? (part 5)"},"content":{"rendered":"<p>The time has come to add another important functionality to our car sale application. It will be the spell checking mechanism with the ability to construct a new query from the suggestions. It has become the main functionality of every search engine so we will also make use of it.<\/p>\n\n\n<!--more-->\n\n\n<h2>Requirements specification<\/h2>\n<p>Our car database is so large that it contains many different names of makes and models. 
Some of those names can be really hard to spell:<\/p>\n<ol>\n<li>\n<ul>\n<li><em>make<\/em>: Bugatti<\/li>\n<li><em>model<\/em>: Veyron<\/li>\n<\/ul>\n<\/li>\n<li>\n<ul>\n<li><em>make<\/em>: Daewoo<\/li>\n<li><em>model<\/em>: Lacetti<\/li>\n<\/ul>\n<\/li>\n<li>\n<ul>\n<li><em>make<\/em>: Cadillac<\/li>\n<li><em>model<\/em>: Brougham<\/li>\n<\/ul>\n<\/li>\n<li>\n<ul>\n<li><em>make<\/em>: Ford<\/li>\n<li><em>model<\/em>: Capri<\/li>\n<\/ul>\n<\/li>\n<li>\n<ul>\n<li><em>make<\/em>: Maserati<\/li>\n<li><em>model<\/em>: Coupe<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>Example queries in which misspelled words caused no search results to be returned:<\/p>\n<ol>\n<li>?q=bugati+weyron<\/li>\n<li>?q=daewo+laceti<\/li>\n<li>?q=cadilac+brogham<\/li>\n<li>?q=ford+kapri<\/li>\n<li>?q=maseratti+coupe<\/li>\n<\/ol>\n<p>We would like to add functionality that, when a user enters an incorrect name, suggests the phrase the user most likely intended. We can then use that suggestion to find the documents matching the correct phrase.<\/p>\n<h2>solrconfig.xml changes<\/h2>\n<p>The most important element to add to the solrconfig.xml configuration file is the <em>solr.SpellCheckComponent<\/em>. 
Let&#8217;s add a simple, standard configuration of this component and see how it works:\n<\/p>\n<pre class=\"brush:xml\">&lt;searchComponent name=\"spellcheck\" class=\"solr.SpellCheckComponent\"&gt;\n    &lt;lst name=\"spellchecker\"&gt;\n      &lt;str name=\"classname\"&gt;solr.IndexBasedSpellChecker&lt;\/str&gt;\n      &lt;str name=\"spellcheckIndexDir\"&gt;.\/spellchecker&lt;\/str&gt;\n      &lt;str name=\"field\"&gt;content&lt;\/str&gt;\n      &lt;str name=\"buildOnCommit\"&gt;true&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/searchComponent&gt;<\/pre>\n<p>Let&#8217;s explain the parameters used in this component:<\/p>\n<ol>\n<li>\n<ul>\n<li><em>classname<\/em> \u2013 the name of the class that implements our spellcheck mechanism. We are using the solr.IndexBasedSpellChecker class, which uses a spelling dictionary built from the Solr\/Lucene index.<\/li>\n<\/ul>\n<ul>\n<li><em>spellcheckIndexDir<\/em> \u2013 the name of the directory that holds the spellcheck index.<\/li>\n<\/ul>\n<ul>\n<li><em>field<\/em> \u2013 the name of the field defined in the schema.xml file that is used as the source for generating the spellcheck index. In our case it will be the \u201ccontent\u201d field (the reason is explained later).<\/li>\n<\/ul>\n<ul>\n<li><em>buildOnCommit<\/em> \u2013 if this parameter is set to <em>true<\/em>, the spellcheck index is rebuilt automatically after every commit to the main Solr index.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>Now that we have our component defined, let&#8217;s add it to a handler so that we can make use of it. The best option is to add it to our standard, default handler, so that a single request returns both the query results and the suggestions. 
Before the changes, our default handler looked like this:\n<\/p>\n<pre class=\"brush:xml\">&lt;requestHandler name=\"standard\" class=\"solr.SearchHandler\" default=\"true\"&gt;\n     &lt;lst name=\"defaults\"&gt;\n       &lt;str name=\"echoParams\"&gt;explicit&lt;\/str&gt;\n     &lt;\/lst&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p>After the change, it looks like this:<\/p>\n<pre class=\"brush:xml\">&lt;requestHandler name=\"standard\" class=\"solr.SearchHandler\" default=\"true\"&gt;\n     &lt;lst name=\"defaults\"&gt;\n       &lt;str name=\"echoParams\"&gt;explicit&lt;\/str&gt;\n       &lt;str name=\"spellcheck\"&gt;true&lt;\/str&gt;\n       &lt;str name=\"spellcheck.collate\"&gt;true&lt;\/str&gt;\n     &lt;\/lst&gt;\n     &lt;arr name=\"last-components\"&gt;\n       &lt;str&gt;spellcheck&lt;\/str&gt;\n     &lt;\/arr&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p>As we can see, we have added the spellcheck component and two more default values:<\/p>\n<ol>\n<li>\n<ul>\n<li><em>spellcheck<\/em> \u2013 when set to <em>true<\/em>, every request also generates spellcheck suggestions.<\/li>\n<\/ul>\n<ul>\n<li><em>spellcheck.collate<\/em> &#8211; when set to <em>true<\/em>, the mechanism chooses the best suggestion for every word entered and constructs a new query from the corrected words. If the spellchecker recognises a word as correct, it leaves it unchanged.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h2>schema.xml changes<\/h2>\n<p>The only change that might be needed in the schema.xml configuration file is to add a field that the <em>solr.SpellCheckComponent<\/em> component will use as the source of tokens for spell checking. The field should contain the data that we want to use when building the spellcheck index. The type of that field should ensure proper tokenization of the data. 
It should also avoid any stemming\/lemmatization filters, which could degrade the spellcheck results.<\/p>\n<p>Our schema already contains a field that fulfils all those requirements: the \u201ccontent\u201d field. As a reminder, it is the default search field used by our search engine. The current field and type definitions look like this:\n<\/p>\n<pre class=\"brush:xml\">&lt;field name=\"content\" type=\"text\" indexed=\"true\" stored=\"false\" multiValued=\"true\"\/&gt;<\/pre>\n<pre class=\"brush:xml\">&lt;fieldType name=\"text\" class=\"solr.TextField\" positionIncrementGap=\"100\"&gt;\n &lt;analyzer&gt;\n  &lt;tokenizer class=\"solr.WhitespaceTokenizerFactory\"\/&gt;\n   &lt;filter class=\"solr.PatternReplaceFilterFactory\" pattern=\"'\" replacement=\"\" replace=\"all\" \/&gt;\n   &lt;filter class=\"solr.WordDelimiterFilterFactory\"\n    generateWordParts=\"1\"\n    generateNumberParts=\"1\"\n    catenateWords=\"1\"\n    stemEnglishPossessive=\"0\"\n  \/&gt;\n  &lt;filter class=\"solr.LowerCaseFilterFactory\"\/&gt;\n &lt;\/analyzer&gt;\n&lt;\/fieldType&gt;<\/pre>\n<p>The values of three fields are copied to the &#8220;content&#8221; field: make, model and year:\n<\/p>\n<pre class=\"brush:xml\">&lt;copyField source=\"make\" dest=\"content\"\/&gt;\n&lt;copyField source=\"model\" dest=\"content\"\/&gt;\n&lt;copyField source=\"year\" dest=\"content\"\/&gt;<\/pre>\n<h2>Let\u2019s create queries<\/h2>\n<p>Let&#8217;s take the queries from the requirements specification that returned no results and add the spellcheck.q parameter, whose value is the same as the value of the q parameter. 
Now, with a single request, we get both the search results and the spellcheck suggestions:<\/p>\n<ol>\n<li>?q=bugati+weyron&amp;spellcheck.q=bugati+weyron\n<ul>\n<pre class=\"brush:xml\">&lt;result name=\"response\" numFound=\"0\" start=\"0\" \/&gt;\n&lt;lst name=\"spellcheck\"&gt;\n  &lt;lst name=\"suggestions\"&gt;\n    &lt;lst name=\"bugati\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;0&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;6&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;bugatti&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n    &lt;lst name=\"weyron\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;7&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;13&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;veyron&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n      &lt;str name=\"collation\"&gt;bugatti veyron&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/lst&gt;<\/pre>\n<p>The spellcheck mechanism has corrected the query tokens, and the collation functionality has generated a corrected query that can now be used to retrieve the proper search results. 
Let&#8217;s check the rest of the queries:<\/p>\n<\/ul>\n<\/li>\n<li>?q=daewo+laceti&amp;spellcheck.q=daewo+laceti\n<ul>\n<pre class=\"brush:xml\">&lt;result name=\"response\" numFound=\"0\" start=\"0\" \/&gt;\n&lt;lst name=\"spellcheck\"&gt;\n  &lt;lst name=\"suggestions\"&gt;\n    &lt;lst name=\"daewo\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;0&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;5&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;daewoo&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n    &lt;lst name=\"laceti\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;6&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;12&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;lacetti&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n      &lt;str name=\"collation\"&gt;daewoo lacetti&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/lst&gt;<\/pre>\n<\/ul>\n<\/li>\n<li>?q=cadilac+brogham&amp;spellcheck.q=cadilac+brogham\n<ul>\n<pre class=\"brush:xml\">&lt;result name=\"response\" numFound=\"0\" start=\"0\" \/&gt;\n&lt;lst name=\"spellcheck\"&gt;\n  &lt;lst name=\"suggestions\"&gt;\n    &lt;lst name=\"cadilac\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;0&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;7&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;cadillac&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n    &lt;lst name=\"brogham\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;8&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;15&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;brougham&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n      &lt;str name=\"collation\"&gt;cadillac brougham&lt;\/str&gt;\n    
&lt;\/lst&gt;\n&lt;\/lst&gt;<\/pre>\n<\/ul>\n<\/li>\n<li>?q=ford+kapri&amp;spellcheck.q=ford+kapri\n<ul>\n<pre class=\"brush:xml\">&lt;result name=\"response\" numFound=\"0\" start=\"0\" \/&gt;\n&lt;lst name=\"spellcheck\"&gt;\n  &lt;lst name=\"suggestions\"&gt;\n    &lt;lst name=\"kapri\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;5&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;10&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;capri&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n      &lt;str name=\"collation\"&gt;ford capri&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/lst&gt;<\/pre>\n<\/ul>\n<\/li>\n<li>?q=maseratti+coupe&amp;spellcheck.q=maseratti+coupe\n<ul>\n<pre class=\"brush:xml\">&lt;result name=\"response\" numFound=\"0\" start=\"0\" \/&gt;\n&lt;lst name=\"spellcheck\"&gt;\n  &lt;lst name=\"suggestions\"&gt;\n    &lt;lst name=\"maseratti\"&gt;\n      &lt;int name=\"numFound\"&gt;1&lt;\/int&gt;\n      &lt;int name=\"startOffset\"&gt;0&lt;\/int&gt;\n      &lt;int name=\"endOffset\"&gt;9&lt;\/int&gt;\n      &lt;arr name=\"suggestion\"&gt;\n        &lt;str&gt;maserati&lt;\/str&gt;\n      &lt;\/arr&gt;\n    &lt;\/lst&gt;\n      &lt;str name=\"collation\"&gt;maserati coupe&lt;\/str&gt;\n    &lt;\/lst&gt;\n&lt;\/lst&gt;<\/pre>\n<\/ul>\n<\/li>\n<\/ol>\n<p>The spellcheck mechanism has worked great, correcting all of the misspellings and generating the proper queries. In the last two cases (4, 5) we can see that the component left the correctly spelled words unchanged (4 \u2013 ford, 5 \u2013 coupe) and used them to construct the collated queries.<\/p>\n<h2>The end<\/h2>\n<p>Our search engine has gained yet another feature: the spell checking mechanism. 
Now all we have to do is to wait for some comments \u2026 and maybe some improvements can be provided \ud83d\ude42<\/p>","protected":false},"excerpt":{"rendered":"<p>The time has come to add another important functionality to our car sale application. It will be the spell checking mechanism with the ability to construct a new query from the suggestions. It has become the main functionality of every<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[],"class_list":["post-268","post","type-post","status-publish","format-standard","hentry","category-solr-en"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/268","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=268"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/268\/revisions"}],"predecessor-version":[{"id":269,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/268\/revisions\/269"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=268"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=268"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=268"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}