Developing Your Own Solr Filter

Sometimes Lucene and Solr out-of-the-box functionality is not enough. When such a time comes, we need to extend what Lucene and Solr give us and create our own plugin. In today’s post I’ll try to show how to develop a custom filter and use it in Solr.

Solr Version

The following code is based on Solr 3.6. We will publish an updated version of this post that matches Solr 4.0 after its release.


Let’s assume that we need a filter that will allow us to reverse every word we have in a given field. So, if the input is “solr.pl”, the output would be “lp.rlos”. It’s not the hardest example, but for the purpose of this entry it will be enough. One more thing – I decided to omit describing how to set up your IDE, how to compile your code, build a jar, and so on. We will focus only on the code.
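As a warm-up, the reversal itself can be sketched in plain Java, with no Lucene or Solr classes involved (the class and method names below are just for illustration):

```java
public class ReverseDemo {

    // Reverse the first `length` chars of `buffer` into a new array -
    // the same operation the filter will later perform on the term buffer.
    static char[] reverse(char[] buffer, int length) {
        char[] newBuffer = new char[length];
        for (int i = 0; i < length; i++) {
            newBuffer[i] = buffer[length - 1 - i];
        }
        return newBuffer;
    }

    public static void main(String[] args) {
        char[] term = "solr.pl".toCharArray();
        // prints "lp.rlos"
        System.out.println(new String(reverse(term, term.length)));
    }
}
```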

Additional Information

The code presented in this post was created using Solr 3.6 libraries, although you shouldn’t have many problems compiling it with Solr 4 binaries. Keep in mind, though, that some slight modifications may be needed (in case something changes before the Solr 4.0 release).

What We Need

In order for Solr to be able to use our filter, we need two classes. The first is the actual filter implementation, which will be responsible for the actual logic. The second is the filter factory, which will be responsible for creating instances of the filter. Let’s get it done then.


In order to implement our filter we will extend the TokenFilter class from the org.apache.lucene.analysis package and override the incrementToken method. This method returns a boolean value – if a token is still available for processing in the token stream, the method should return true; if there are no further tokens to analyze, it should return false. The implementation should look like the one below:

package pl.solr.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class ReverseFilter extends TokenFilter {
  private final CharTermAttribute charTermAttr;

  protected ReverseFilter(TokenStream ts) {
    super(ts);
    this.charTermAttr = addAttribute(CharTermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // No more tokens in the stream - nothing left to process.
    if (!input.incrementToken()) {
      return false;
    }
    // Reverse the term text into a new buffer.
    int length = charTermAttr.length();
    char[] buffer = charTermAttr.buffer();
    char[] newBuffer = new char[length];
    for (int i = 0; i < length; i++) {
      newBuffer[i] = buffer[length - 1 - i];
    }
    // Clear the attribute and copy the reversed contents back.
    charTermAttr.setEmpty();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
  }
}

Description of the Above Implementation

A few words about the key elements of the above implementation:

  • ReverseFilter class – a class which extends the TokenFilter class and is to be used as a filter must be declared final (a Lucene requirement).
  • charTermAttr – a token stream attribute which allows us to get and modify the text contents of the term. If we wanted, our filter could use more than a single stream attribute, for example an attribute for getting and changing the position in the token stream, or a payload attribute. The list of Attribute interface implementations can be found in the Lucene API documentation.
  • constructor – takes the token stream as an argument, passes it to the superclass, and adds the appropriate token stream attribute.
  • incrementToken – first checks whether a token is available for processing. If not, it returns false.
  • charTermAttr.length() – gets the size of the buffer whose contents we want to reverse.
  • charTermAttr.buffer() – gets the buffer in which we have the word we want to reverse. The term text is stored as a char array, so it is best to work on it directly rather than constructing a String object.
  • the for loop – creates a new buffer and writes the reversed contents of the original one into it.
  • charTermAttr.setEmpty() – cleans the original buffer (needed in case append methods were used).
  • charTermAttr.copyBuffer(...) – copies the changes we made back into the buffer of the token stream attribute.
  • return true – informs the caller that a token is available for further processing.

Filter Factory

As I wrote earlier, in order for Solr to be able to use our filter, we need to implement a filter factory class. Because we don’t have any special configuration values, the factory implementation is very simple. We will extend the BaseTokenFilterFactory class from the org.apache.solr.analysis package. The implementation can look like the following:

package pl.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ReverseFilterFactory extends BaseTokenFilterFactory {
  @Override
  public TokenStream create(TokenStream ts) {
    return new ReverseFilter(ts);
  }
}

As you can see, the filter factory implementation is simple – we only needed to override a single method, create, in which we instantiate our filter and return it.


After compiling and preparing the jar file, we copy the jar to a directory where Solr will be able to see it. We can do this by creating a lib directory in the Solr home directory and then adding the following entry to the solrconfig.xml file:

<lib dir="../lib/" regex=".*\.jar" />

Then we change the schema.xml file and add a new field type that uses our filter:

<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="pl.solr.analysis.ReverseFilterFactory" />
  </analyzer>
</fieldType>

It is worth noting that the class attribute of the filter tag must contain the full package and class name of the factory we created, not of the filter itself. It is important to remember this; otherwise Solr will throw errors.
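To make the expected behaviour of the text_reversed type concrete, here is a plain-Java stand-in for the whole chain – whitespace tokenization followed by the reversing filter. It only simulates the analysis result and does not use any Solr classes:

```java
import java.util.ArrayList;
import java.util.List;

public class AnalysisChainDemo {
    // Simulates the text_reversed field type: split on whitespace
    // (the WhitespaceTokenizerFactory step), then reverse every token
    // (the ReverseFilterFactory step).
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            tokens.add(new StringBuilder(token).reverse().toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [rlos, enecul]
        System.out.println(analyze("solr lucene"));
    }
}
```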

Does it Work ?

In order to show you that it works, here is a screenshot of the Solr administration panel:

To Sum Up

As you can see from the above example, creating your own filter is not complicated. Of course, the idea behind this filter was very simple, and thus its implementation was simple too. I hope this post will be helpful when the time comes for you to create your own filter for Solr.


This entry was posted on Monday, May 14th, 2012 at 07:27 and is filed under Solr.

23 Responses to “Developing Your Own Solr Filter”

  1. akw Says:

    Thanks for the tutorial. I have a few questions to followup.

    for the method incrementToken() is it overriding the method in TokenStream, which is being called during input.incrementToken()? If so, how do we know that we are at the end of the stream to return false?

    Thanks in advance!

  2. gr0 Says:

    Look at the code – at the beginning of incrementToken there is a conditional which checks if there is “something” in the token stream to be processed. If there is nothing, we just return false; if there is something, we process it.

    Is that what you are asking about? 🙂

  3. akw Says:

    Yeah, that’s the line I was pointing to. Suppose that we are overriding the same incrementToken() that input is calling. How are we actually checking to see if there are more tokens in the stream?

    Is incrementToken() called by input different from the incrementToken() that we are writing here?

    Sorry if the questions seem elementary, I am rather new at this.

    Thanks again.

  4. gr0 Says:

    We are overriding the method, but the “input” variable (which we have because we extend TokenFilter) will always be there and you can use it – it is the wrapped, upstream token stream.

    As for questions – they are good ones; you are asking how the analysis process is done in Lucene and Solr. I’ll make a note to myself to write a post about it 🙂

  5. Logik Says:

    How can one return more than one token from a single token? E.g. during stemming, when one token can be derived from multiple stems.
    I found a way to buffer the additional stems, but it breaks proximity searches: e.g. if A B (each with two possible stems) is rewritten to Astem Astem’ Bstem Bstem’, then Astem ends up three words away from Bstem’, so the
    query “Astem Bstem'”~1 fails. Is there any way to avoid this problem?

  6. gr0 Says:

    If you want more than a single token from the one you are currently processing, you probably want them at the same position. Use the position increment attribute and set it to 0 for the second token.
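    In Lucene this is done through the PositionIncrementAttribute: a position increment of 0 stacks a token on the same position as the previous one. The bookkeeping can be modeled without any Lucene classes by encoding each emitted token as “term/posInc” (a stand-in sketch, not a real Solr API):

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;

public class StackedTokensDemo {
    // Each inner list holds the alternative stems for one input word.
    // The first alternative gets position increment 1 (a new position);
    // every other alternative gets 0 (same position) - the rule a filter
    // would apply via PositionIncrementAttribute.
    static List<String> stack(List<List<String>> alternativesPerWord) {
        List<String> out = new ArrayList<>();
        for (List<String> alternatives : alternativesPerWord) {
            for (int i = 0; i < alternatives.size(); i++) {
                out.add(alternatives.get(i) + "/" + (i == 0 ? 1 : 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two words, each with two possible stems. Bstem stays only one
        // position after Astem, so a query like "Astem Bstem2"~1 still matches.
        List<List<String>> input = Arrays.asList(
            Arrays.asList("Astem", "Astem2"),
            Arrays.asList("Bstem", "Bstem2"));
        // prints [Astem/1, Astem2/0, Bstem/1, Bstem2/0]
        System.out.println(stack(input));
    }
}
```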

  7. deniz Says:

    Well, for Solr 4.0 the BaseTokenFilterFactory is now org.apache.lucene.analysis.util.TokenFilterFactory

    hopefully this will prevent other guys from going crazy while trying to find the BaseTokenFilterFactory class 😀

    thank you for the post by the way, seriously useful for me 🙂

  8. gr0 Says:

    Yes, Solr 4.0 is a bit different when it comes to some classes (actually many of them). I’ll make an update and state that this is for 3.6 🙂

  9. Chris Says:

    Hi gr0,

    Could you please expand further on returning more than one token for indexing? Using your example as a template, I want to return two words (not just the input reversed) – if you could help that would be great!

  10. gr0 Says:

    Sure, we’ll make such a post. However, if you need it fast, you can look at the Lucene/Solr code and see how that is done, for example for synonyms.

  11. Chris Says:

    I had a look at that source code and found it very complex. If you could show an example returning “word 1” and “word 2” that would be great!

  12. Chris Says:

    Hi gr0 did you have any luck?

  13. gr0 Says:

    We will try to create a post like this; however, we are quite busy right now. I hope that we will have some time at the weekend, so we will try to publish something like this on Monday.

  14. Chris Says:

    Thanks for your help 🙂

  15. Genius Says:

    Can anyone tell me how I can connect to MySQL from this user-created filter?

  16. gr0 Says:

    But why would you want to connect to the database from a filter? Keep in mind that the filter will be called very often, because it is a part of analysis. Whenever you send a document for indexing or send a query, it will probably use your custom filter for each token in the token stream (of course, only when using a field which uses the filter).

  17. bastjan Says:

    Can I also apply this filter to the query (and how)? If I change the text in my field, I should also alter the query so both match.

  18. gr0 Says:

    After you implement your filter, you need to include it in schema.xml, where you define your analyzers – the filter needs to be included in the analyzer chain used at query time as well.
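    A sketch of what this can look like in schema.xml – separate index-time and query-time analyzer chains, with the filter present in both so that indexed text and query text match:

```xml
<fieldType name="text_reversed" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="pl.solr.analysis.ReverseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="pl.solr.analysis.ReverseFilterFactory"/>
  </analyzer>
</fieldType>
```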

  19. john999926 Says:

    what if the filter has an attribute, for example,

  20. john999926 Says:

    <filter class="pl.solr.analysis.ReverseFilterFactory" maxSize="1000" />

  21. jochen Says:

    I tried your code in Solr 4.10.
    I’ve changed the TokenFilterFactory package and everything seems to work fine, but when I click on a filter the results are empty. Any idea? Thanks!

  22. Frédéric Baroz Says:


    Thanks for the post, very useful!
    I find that tokenizing on the basis of whitespace can sometimes be inaccurate. Sometimes we search for a specific entity which is represented by many tokens. Users don’t search for words, but for entire things…

    I would find it useful to work around this problem and I’m trying to find a way to combine multiple tokens when appropriate. I found “AutoPhraseTokenFilter” to be the first piece of a solution, but I need to refine it a bit.

    In incrementToken(), is there a way to see beyond the current token?

    Thanks in advance

  23. gr0 Says:

    You could try creating a queue in your filter implementation and buffering the tokens from the stream in that queue/list. That way you can see the whole token stream for a given field and document.
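    The queue idea can be sketched in plain Java (illustrative names, not a Lucene API): pull tokens from the upstream source into a buffer, peek as far ahead as needed, and emit from the front of the buffer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LookaheadStream {
    private final Iterator<String> upstream; // plays the role of the wrapped token stream
    private final List<String> pending = new ArrayList<>();

    public LookaheadStream(Iterator<String> upstream) {
        this.upstream = upstream;
    }

    // Look at the token n positions ahead without consuming anything.
    public String peek(int n) {
        while (pending.size() <= n && upstream.hasNext()) {
            pending.add(upstream.next());
        }
        return pending.size() > n ? pending.get(n) : null;
    }

    // Consume and return the next token, or null at end of stream.
    public String next() {
        return peek(0) == null ? null : pending.remove(0);
    }
}
```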