Solr Text Tagger – Quick Look into Solr 7.4

We haven’t looked at any Solr features on Solr.pl for a while, so it is time to look at one of the new ones: the text tagger. It works on top of a Solr index: documents sent to a new request handler are checked against that index, and the handler returns the occurrences of the indexed names together with their offsets and any other metadata that we added to the index. Keep in mind, however, that the Solr Text Tagger doesn’t do any kind of natural language processing, so we can only expect to get back what we put into the index.

Preparations

We will keep the setup as minimal as possible. We start by running our Solr instance:

And we create a test collection:
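As a rough sketch of these two steps, the Python below builds the two `bin/solr` invocations and runs them in order. The install directory `/opt/solr-7.4.0` and the collection name `tagger_test` are hypothetical, and starting in SolrCloud mode with `-c` is an assumption on our side; adjust all three to your environment.

```python
import subprocess
from pathlib import Path

# Hypothetical values, not taken from the post: adjust to your installation.
SOLR_HOME = Path("/opt/solr-7.4.0")
COLLECTION = "tagger_test"


def solr_commands(solr_home, collection):
    """Return the two bin/solr invocations: start Solr, then create the collection."""
    solr = str(solr_home / "bin" / "solr")
    return [
        # -c starts Solr in SolrCloud mode (an assumption for this sketch).
        [solr, "start", "-c"],
        # No configset is named, so Solr falls back to its default one.
        [solr, "create", "-c", collection],
    ]


def run_setup(solr_home=SOLR_HOME, collection=COLLECTION):
    """Run each command in order and stop at the first failure."""
    for command in solr_commands(solr_home, collection):
        subprocess.run(command, check=True)
```

Calling `run_setup()` executes both commands; `solr_commands(...)` on its own just returns them, which is handy if you prefer to copy them into a terminal.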

We will be using the standard data-driven schema. Although it is not recommended for production use, it is more than enough for our Text Tagger test.

Data For Tagging

Let’s focus on the data for a second. First of all, we need data that will be used for tagging. For the purpose of this blog post we will go with simple name tagging, for example:

A few names – you may recognize them 😉 We will use those names for tagging purposes.

To index them to Solr we will need a new field type, two fields, and a copy field, just like this:

Let’s stop here and see why we added those. First of all, we added a new field type called tag, which the text tagger needs in order to work. Two things are crucial there: the postingsFormat, which needs to be set to FST50, and the solr.ConcatenateGraphFilterFactory in the index-time analysis chain. Both settings are required.

Next, we need to keep the names themselves. The name field is a stored text_general field, so its value can be easily retrieved. The name_tag field uses our newly created tag field type. Finally, a copy field copies the value of the name field into the name_tag field, so that we don’t have to send the data twice.
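To make the schema changes described above more concrete, here is a minimal sketch that builds the Schema API commands in Python. Only the field names, the FST50 postings format and the ConcatenateGraphFilterFactory come from the description above; the tokenizer, the lowercase filter, the omitNorms and omitTermFreqAndPositions settings, the unstored name_tag field, the local Solr address and the collection name `tagger_test` are assumptions for illustration. The code only builds the request and does not send it.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed default local Solr address
COLLECTION = "tagger_test"               # hypothetical collection name


def tagger_schema_commands():
    """Schema API commands: the tag field type, both fields, and the copy field."""
    # The tokenizer and the lowercase filter are assumptions for this sketch.
    tokenizer = {"class": "solr.StandardTokenizerFactory"}
    lowercase = {"class": "solr.LowerCaseFilterFactory"}
    return {
        "add-field-type": {
            "name": "tag",
            "class": "solr.TextField",
            "postingsFormat": "FST50",  # required by the tagger
            "omitNorms": True,
            "omitTermFreqAndPositions": True,
            "indexAnalyzer": {
                "tokenizer": tokenizer,
                "filters": [
                    lowercase,
                    # Required by the tagger, at index time only.
                    {"class": "solr.ConcatenateGraphFilterFactory",
                     "preservePositionIncrements": False},
                ],
            },
            "queryAnalyzer": {"tokenizer": tokenizer, "filters": [lowercase]},
        },
        # Several fields can be added with one command by passing a list.
        "add-field": [
            {"name": "name", "type": "text_general", "stored": True},
            {"name": "name_tag", "type": "tag", "stored": False},
        ],
        "add-copy-field": {"source": "name", "dest": ["name_tag"]},
    }


def schema_request(commands):
    """Build (but do not send) the POST request to the collection's Schema API."""
    return urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/schema",
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With Solr running, `urllib.request.urlopen(schema_request(tagger_schema_commands()))` would apply the changes.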

Now let’s index our names by using the following request:
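A minimal sketch of such an indexing request in Python could look like this. The three names are placeholders chosen for illustration and are not the data set used in this post; the local Solr address and the collection name `tagger_test` are assumptions as well. Only the name field is sent, because the copy field fills name_tag for us.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed default local Solr address
COLLECTION = "tagger_test"               # hypothetical collection name

# Placeholder names for illustration only.
NAMES = ["George Washington", "John Adams", "Thomas Jefferson"]


def name_documents(names):
    """One document per name, with simple sequential string identifiers."""
    return [{"id": str(i), "name": name} for i, name in enumerate(names, start=1)]


def index_request(docs):
    """Build (but do not send) a JSON update request that also commits."""
    return urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(index_request(name_documents(NAMES)))` once Solr is up.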

Tagging Text

Now that the data used for tagging is in Solr, we can try tagging a real document. For that, we took a part of the Wikipedia article describing the life of George Washington. To send it to Solr for tagging, we first need to configure a new request handler, for example like this:
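As a minimal sketch, assuming a local Solr on port 8983 and a collection called `tagger_test` (a hypothetical name), the handler can be registered through the Config API. The Python below only builds that request; the handler name, class and default field match the ones used in this post.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed default local Solr address
COLLECTION = "tagger_test"               # hypothetical collection name


def tagger_handler_command():
    """Config API command that registers the tagger request handler."""
    return {
        "add-requesthandler": {
            "name": "/tagger",
            "class": "solr.TaggerRequestHandler",
            # The tagger reads its dictionary from this field.
            "defaults": {"field": "name_tag"},
        }
    }


def config_request(command):
    """Build (but do not send) the POST request to the collection's Config API."""
    return urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/config",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

`urllib.request.urlopen(config_request(tagger_handler_command()))` would send it to a running Solr.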

We added a new request handler called /tagger using solr.TaggerRequestHandler, and we set its defaults to use the field called name_tag, the one that we created the new field type for. Once that is done, we can finally look at tagging the text. We do that by sending our document to the newly added request handler:
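Here is a small sketch of what such a tagging request could look like when built in Python. The document text goes into the request body as plain text, and fl selects the fields returned for each matching name. The local Solr address, the collection name `tagger_test` and the `id,name` field list are assumptions for illustration.

```python
import urllib.request
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # assumed default local Solr address
COLLECTION = "tagger_test"               # hypothetical collection name


def tag_request(text, fl="id,name"):
    """Build (but do not send) a POST request to the /tagger handler."""
    query = urlencode({"fl": fl, "wt": "json"})
    return urllib.request.Request(
        f"{SOLR_URL}/{COLLECTION}/tagger?{query}",
        data=text.encode("utf-8"),  # the document to tag is the raw body
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )
```

For example, `urllib.request.urlopen(tag_request("George Washington was born in 1732."))` would submit a one-sentence document for tagging.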

The response returned by Solr looks as follows:

As you can see, the response contains the tags that were found along with their offsets, as well as the name and identifier of each tag. From the response that Solr gave us we see that three tags were found and where in the text they occur. Nice! 🙂
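To show how the offsets and identifiers fit together, here is a hedged sketch that post-processes a tagger response in Python. It assumes the JSON layout in which every tag arrives as a flat list of alternating keys and values, and it runs on a hand-made, single-tag sample rather than on the response from this post. Offsets are treated as character positions in the submitted text, which holds for plain ASCII input.

```python
def flat_to_dict(flat):
    """Turn ["startOffset", 0, "endOffset", 17, ...] into a dictionary."""
    return dict(zip(flat[0::2], flat[1::2]))


def resolve_tags(text, response):
    """Attach the matched text and the matching documents to every tag."""
    docs_by_id = {doc["id"]: doc for doc in response["response"]["docs"]}
    resolved = []
    for raw in response["tags"]:
        tag = raw if isinstance(raw, dict) else flat_to_dict(raw)
        start, end = tag["startOffset"], tag["endOffset"]
        resolved.append({
            "match": text[start:end],
            "start": start,
            "end": end,
            "docs": [docs_by_id[i] for i in tag["ids"] if i in docs_by_id],
        })
    return resolved


# Hand-made sample for illustration, not actual Solr output.
SAMPLE_TEXT = "George Washington was the first President of the United States."
SAMPLE_RESPONSE = {
    "tagsCount": 1,
    "tags": [["startOffset", 0, "endOffset", 17, "ids", ["1"]]],
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "1", "name": ["George Washington"]}],
    },
}
```

Calling `resolve_tags(SAMPLE_TEXT, SAMPLE_RESPONSE)` yields one entry whose match is the substring between the two offsets, together with the stored document for identifier 1.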

Tagger Parameters

Of course, you can control the tagger. As you have already seen, we used the fl parameter in the above request and the required field parameter when adding the new request handler. However, this is not everything. We can add filter queries, specify the maximum number of rows, choose an algorithm for handling overlapping tags, and so on. You can find the full list of options in the Solr documentation at https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html.
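As an illustration, the following sketch builds a query string with a few of those options in Python. The parameter names fq, rows, overlaps, tagsLimit and matchText and the three overlaps modes are given as we understand them from the documentation linked above; the helper function and its defaults are hypothetical.

```python
from urllib.parse import urlencode

# Modes for handling overlapping tags, as we understand them from the docs.
OVERLAP_MODES = {"ALL", "NO_SUB", "LONGEST_DOMINANT_RIGHT"}


def tagger_query(fl="id,name", fq=None, rows=None, overlaps=None,
                 tags_limit=None, match_text=False):
    """Build the query string for a tagger request; unset options are left out."""
    if overlaps is not None and overlaps not in OVERLAP_MODES:
        raise ValueError(f"unknown overlaps mode: {overlaps}")
    params = [("fl", fl), ("wt", "json")]
    if fq:
        params.append(("fq", fq))              # filter query on the indexed names
    if rows is not None:
        params.append(("rows", rows))          # maximum number of documents
    if overlaps:
        params.append(("overlaps", overlaps))  # overlapping tags algorithm
    if tags_limit is not None:
        params.append(("tagsLimit", tags_limit))
    if match_text:
        params.append(("matchText", "true"))   # include the matched text per tag
    return urlencode(params)
```

For example, `tagger_query(overlaps="NO_SUB", rows=5)` produces a query string that can be appended to the /tagger URL after a question mark.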

Performance Considerations

As usual, there are a few best practices for laying out the tagger collection and for tagging documents. The first problem is that in Solr 7.4.0 the Text Tagger doesn’t support batching, so you can only send documents one by one. This will probably change in the future, but for now you can combine multiple documents in a single request and put a dummy separator between them. In addition to that, you should consider a force merge. As you know, the fewer segments you have in your Lucene index (that is, in your shard), the faster the queries will be. This is also true for tagging: try to keep the number of segments to a minimum to improve the throughput and latency of your tagging queries.
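The batching workaround needs a bit of bookkeeping, because the offsets Solr returns refer to the combined text. The sketch below shows one way to do it in Python, under the assumption that the separator is a nonsense word that does not occur in the tag index, so that no name can match across a document boundary. The separator value and the helper names are hypothetical.

```python
from bisect import bisect_right

# Assumed nonsense word that is not present in the tag index.
SEPARATOR = " ZZYYXXAABBCC "


def combine(documents):
    """Join documents and record where each one starts in the combined text."""
    starts, position = [], 0
    for doc in documents:
        starts.append(position)
        position += len(doc) + len(SEPARATOR)
    return SEPARATOR.join(documents), starts


def locate(starts, start_offset, end_offset):
    """Map combined-text offsets to (document index, local start, local end)."""
    # The last document start that is <= start_offset owns the tag.
    index = bisect_right(starts, start_offset) - 1
    return index, start_offset - starts[index], end_offset - starts[index]
```

The combined text is sent to the tagger as a single document, and each returned startOffset and endOffset pair is passed through `locate` to find out which original document it belongs to and where in that document the match sits.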

Limitations

One last thing I wanted to mention: at the time of writing, so as of Solr 7.4.0, the text tagger did not support sharded indices. Maybe this will change in the future, but for now, if you want to use the text tagger functionality in Solr, you have to be aware of that limitation.
