Solr filters: KeepWordFilter

This time I decided to look at one of the less common filters available in the standard Solr distribution. The first one I picked is KeepWordFilter.

Let’s start

First, a few words about what this filter does. As the name suggests, its main purpose is to keep words: only tokens found on a configured list are passed on, and everything else is dropped. In other words, it does the opposite of StopFilter, which removes the listed words. So how does it work? I’ll get to that in a moment; let’s start with the definition of the field type in the schema.xml file:

<fieldtype name="keepwords" class="solr.TextField">
   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="words.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>

As the definition above shows, in addition to the standard class attribute, the filter has two more attributes:

words – the name of the file containing the list of words to keep
ignoreCase – true or false; whether tokens are compared against the list without regard to letter case.
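
To make use of the type, a field has to reference it. A minimal sketch, assuming a hypothetical field called category (the field name and the indexed/stored settings are illustrative and do not come from the original schema):

```xml
<!-- hypothetical field that uses the keepwords type defined above -->
<field name="category" type="keepwords" indexed="true" stored="true"/>
```

Keep in mind that analysis affects only the indexed terms; a stored value, if any, is returned exactly as it was sent.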

File contents

Let’s assume that the words.txt file contains the following words:

ala
ma
kota

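If you index the phrase “Ala ma kota, a kot ma Alę”, only tokens that match an entry in words.txt are written to the index. Note that with the definition above the whitespace tokenizer leaves the comma attached to “kota,”, so that token does not match “kota”, and ignoreCase affects only the comparison, not the token itself. The tokens that end up in the index are therefore “Ala”, “ma” and “ma”. You can verify this on the analysis page of the Solr administration panel.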
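
If punctuation and letter case should not get in the way of matching, one option is to normalise tokens before the keep-word step. A sketch of such a variant, assuming the stock StandardTokenizerFactory and LowerCaseFilterFactory; the type name keepwords_normalized is made up for illustration:

```xml
<!-- sketch: normalise tokens before the keep-word lookup -->
<fieldtype name="keepwords_normalized" class="solr.TextField">
   <analyzer>
      <!-- splits on word boundaries and drops punctuation such as commas -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- lower-cases every token, so ignoreCase is not needed below -->
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- keeps only tokens listed in words.txt (entries are lower-case) -->
      <filter class="solr.KeepWordFilterFactory" words="words.txt"/>
   </analyzer>
</fieldtype>
```

With this chain the phrase above should come out as “ala”, “ma”, “kota”, “ma”: “a” and “kot” are not on the list, and “alę” is a different form from “ala”.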

A few words at the end

Although I have never used this filter myself, it seems like a good fit when you need to store values of an enumerated type, or more generally when you are interested in a finite list of values (ideally small and known in advance), such as categories, and filtering the data at the application level is impossible or very difficult.

Solr.pl
