“Car sale” application – WordDelimiterFilter and PatternReplaceFilter, helping to improve search results (part 2)
In the first part of our “Car sale” application series we created a basic index structure by configuring the schema.xml file. It didn’t take long to hear the first complaints from website users about this configuration. Why don’t I receive any search results when entering the phrase “audi a”? I would like to see announcements for “Audi A6” and “Audi A8”, for example. I entered the phrase “Honda crv” – 0 results; “Suzuki maruti” – none. Are there no related offers in the announcement database? There are! But the current configuration of the searchable field type (field “content”, type “text”) does not allow us to find those offers with the queries we entered. That is why WordDelimiterFilter and PatternReplaceFilter need to enter the battlefield.
We need to analyze the data that is indexed in the “content” field. Let’s examine the sample data that will help us create the new “text” type configuration:
- Make: Audi
Model: 80, 90, A6, A8, TT
- Make: BMW
Model: M3, M5, Series 7, Series 8, X1, X3
- Make: Chevrolet
Model: TrailBlazer
- Make: Citroen
Model: C-Crosser, C3 Pluriel, C4 Picasso
- Make: Ford
Model: C-MAX, S-MAX
- Make: Honda
Model: Accord, CR-V, FR-V, HR-V
- Make: Kia
Model: Cee’d
- Make: Suzuki
Model: Alto/Maruti
Make names are simple words that are easily handled by the current configuration (WhitespaceTokenizer + LowerCaseFilter). The problem is with the model names, as they contain additional characters and separators that we often leave out when entering a search phrase. Let’s put the sample data into groups that will help us with the new configuration:
- Model names that do not need to be processed by any additional filters (the current “text” type configuration is sufficient) – 80, 90, TT, Series 7, Series 8, Accord.
- Model names that contain letters and numbers, where we want to split on letter-number transitions – A6, A8, M3, M5, X1, X3, C3 Pluriel, C4 Picasso. We would like to find those models when entering only the letter, only the number, or the whole model name.
- Model names that contain case transitions – TrailBlazer. We would like to find the document with this name when entering “trail”, “blazer”, “trailBlazer” or “trailblazer”.
- Model names that contain intra-word delimiters, which we want to either ignore or split on – C-Crosser, C-MAX, S-MAX, CR-V, FR-V, HR-V, Alto/Maruti.
Example: we would like to find the document with the model name “C-MAX” when entering the phrases “c”, “max”, “c-max” or “cmax”.
- We intentionally omitted the “Cee’d” model name from the 4th point, as we would like to treat it a little differently. We don’t want this model to be found when entering “cee” or “d”. We treat the name only as a whole word – “cee’d” or “ceed”.
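To make the groups concrete, here is a rough sketch of how one offer from each group could be posted to Solr in its XML update format. The “content” field is the searchable field from this post; the “id” field and its values are assumptions made only for this illustration, and a real offer would carry more fields.

```xml
<add>
  <!-- "id" values are hypothetical; "content" is the searchable field of type "text" -->
  <doc>
    <field name="id">1</field>
    <field name="content">Audi 80</field>            <!-- group 1: no extra processing needed -->
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="content">Audi A8</field>            <!-- group 2: letter-number transition -->
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="content">Chevrolet TrailBlazer</field> <!-- group 3: case transition -->
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="content">Honda CR-V</field>         <!-- group 4: intra-word delimiter -->
  </doc>
  <doc>
    <field name="id">5</field>
    <field name="content">Kia Cee'd</field>          <!-- group 5: apostrophe, whole word only -->
  </doc>
</add>
```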
With the requirements described above, we can now choose the WordDelimiterFilter attribute values that satisfy our needs:
- For the 1st point, WordDelimiterFilter is not needed, as the current “text” type configuration (WhitespaceTokenizer + LowerCaseFilter) is sufficient.
- To meet the 2nd point’s requirements we need to set the following attributes:
- generateWordParts="1" – must be set to “1” if we want to generate the word (letter) parts
- generateNumberParts="1" – must be set to “1” if we want to generate the number parts
- splitOnNumerics="1" – must be set to “1” if we want to split on letter => number transitions
- To meet the 3rd point’s requirements we need to set the following attribute:
- splitOnCaseChange="1" – must be set to “1” if we want to split on lowercase => uppercase transitions
- To meet the 4th point’s requirements we need to set the following attribute:
- catenateWords="1" – must be set to “1” if we want to ignore the intra-word delimiters by joining the subwords
So let’s take a look at our WordDelimiterFilter configuration:
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" />
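Additionally, note that the default value of the “splitOnCaseChange” and “splitOnNumerics” attributes is “1”, so both can be omitted. The “stemEnglishPossessive” attribute also defaults to “1”, so we switch it off explicitly, while the catenate attributes default to “0”. The “generateWordParts” and “generateNumberParts” attributes default to “1” as well, but we keep them explicit for readability. So our configuration can be reduced to: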
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" stemEnglishPossessive="0" />
What about the 5th point of our data specification? As we have stated, we don’t want to treat the apostrophe as an intra-word delimiter. So maybe we could use the protected="protwords.txt" option of the WordDelimiterFilter, which would keep the word “Cee’d” unchanged? OK, but we would also like to find this model when entering the phrase “ceed”, so this option is not good enough for us. The best solution is to handle this case in a separate filter and leave the WordDelimiterFilter with nothing to do.
We are going to put the PatternReplaceFilter before the WordDelimiterFilter. With the PatternReplaceFilter we can drop the apostrophe by replacing it with an empty string. Configured this way, the WordDelimiterFilter receives the “Ceed” token and does not modify it. The filter configuration is the same for indexing and searching, so a user will be able to find the offer with the “Cee’d” model when entering either “cee’d” or “ceed”:
<filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" />
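For comparison, the protected-words alternative that was ruled out above might look roughly like the sketch below. The “protected” attribute is a real WordDelimiterFilterFactory option; the contents of protwords.txt shown in the comment are an assumption for this example.

```xml
<!-- Rejected alternative (sketch only): protect "Cee'd" from being split.
     protwords.txt would list one protected token per line, for example:
       Cee'd
     The entry has to match the token as it reaches the filter, i.e. before lowercasing.
     The indexed token then stays "cee'd" after LowerCaseFilter, so a query for "ceed"
     has nothing to match. That is why the post uses PatternReplaceFilter instead. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        stemEnglishPossessive="0"
        protected="protwords.txt" />
```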
New “text” type configuration visualization
Let’s take a look at our new “text” type:
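<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drop apostrophes first, so "Cee'd" reaches the next filter as "Ceed" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" />
    <!-- Split on delimiters, case changes and letter-number transitions; also join word parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" stemEnglishPossessive="0" />
    <!-- Lowercase last, so case transitions are still visible to WordDelimiterFilter -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>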
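Because the analyzer element above has no “type” attribute, Solr applies the same chain at index time and at query time. The sketch below spells this out with two explicit analyzers. It is meant to be equivalent, and would only be worth using as a starting point if the two chains ever had to differ; the class="solr.TextField" attribute is the standard class for analyzed text fields.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- Chain applied to documents when they are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" stemEnglishPossessive="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Identical chain applied to the user's search phrase -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" stemEnglishPossessive="0" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```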
We are going to use Solr’s administration panel to check whether the configuration we’ve created is correct:
- (Model: “80”) As expected, our new filters don’t affect the data typical for the 1st point.
- (Model: “A8”) WordDelimiterFilter split the name on the letter-number transition.
- (Model: “TrailBlazer”) WordDelimiterFilter split the name on the case transition, generating the “trail” and “blazer” tokens. Additionally, we can still enter the phrase “trailblazer”. Superb!
- (Model: “CR-V”) WordDelimiterFilter ignored the intra-word delimiter by generating the subwords (“cr” and “v”) and additionally joining them (“crv”).
- (Model: “Cee’d”) PatternReplaceFilter has replaced the word “Cee’d” with “Ceed”, and the WordDelimiterFilter has only passed the value through. That’s what we needed.
In this post we’ve shown how to configure two new filters, WordDelimiterFilter and PatternReplaceFilter, in order to improve the quality of search results. Our website users are satisfied … for now.