Category Routed Aliases

Over the lifetime of Solr we were given the possibility to work with cores, then collections, and finally aliases, the alternative names for collections. Aliases let you give a collection a new, virtual name and group multiple collections under that single virtual name. This isolates the real collection name from the name that the client application uses, so you can change the collection in the background without bringing down the whole cluster and making your application or product unavailable. In Solr we can use two kinds of aliases:

  • standard aliases that group collections under a virtual name,
  • routed aliases that route your requests.

The routed aliases can be further divided into two categories – time routed aliases and category routed aliases. We will be looking into the second category in this blog post.

Why Do I Need Category Routed Aliases?

The problem with standard aliases is an indexing limitation. If you create an alias that covers two collections, you cannot really control where the data will be put. For example, in Solr 8.2 the first collection from the alias will be used. Let me demonstrate how that works.

First, let’s create two collections using API v2. We will start with the first collection:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/v2/c' -d '{ 
 "create" : {
  "name" : "test_1",
  "numShards" : 1,
  "replicationFactor" : 1
 }
}'

And then the second collection:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/v2/c' -d '{ 
 "create" : {
  "name" : "test_2",
  "numShards" : 1,
  "replicationFactor" : 1
 }
}'

So we have two collections, one called test_1 and the second one called test_2. Next, let’s create an alias called test that groups those two collections. We can do that using the following command:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/v2/c' -d '{ 
 "create-alias" : {
  "name" : "test",
  "collections" : [ "test_1", "test_2" ]
 }
}'

Now let’s index a document by using the following command:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test/update?commit=true' -d '[
 {
  "id" : 1,
  "name" : "Test indexing"
 }
]'

That command will be successful in Solr 8.2 and the document will be put into the test_1 collection. You can check it yourself 😉
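
To check this yourself, one option is to ask each collection directly how many documents it holds. The sketch below assumes the same local Solr instance at localhost:8983 used in the commands above; the helper names count_url and show_counts are made up for this example.

```shell
# Base URL of the local Solr node used throughout this post.
SOLR_URL='http://localhost:8983/solr'

# Build a match-all query URL that returns only the document count (rows=0).
count_url() {
  printf '%s/%s/select?q=*:*&rows=0\n' "$SOLR_URL" "$1"
}

# Query both collections behind the alias; needs a running cluster.
show_counts() {
  for collection in test_1 test_2; do
    echo "== $collection"
    curl -s "$(count_url "$collection")"
  done
}

# Run manually against your cluster:
# show_counts
```

With Solr 8.2 you would expect numFound to be 1 for test_1 and 0 for test_2.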

And of course this is not the only problem. If you would like to physically separate the data of multiple tenants, the application that sends data to Solr has to take care of that itself. You could still use the alias for the reading part, though.

To overcome that problem we can use category routed aliases and let SolrCloud do the work.

Category Routed Aliases

The idea behind category routed aliases is to manage the alias, and the collections grouped under it, using the value of a certain, defined field. In effect we get partitioning based on that field, for example on the company name. That way, each company that we index data for gets its own collection, and we do not have to handle that manually at indexing or query time.

Figure: simplified view of category routed aliases.

Creating Category Routed Aliases

Creating a category routed alias is slightly different from what we did with the standard aliases. We do not start by creating the collections, because those will be created automatically. Instead, we start with alias creation.

Let’s create our first category based alias by using the following command:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/v2/c' -d '{ 
 "create-alias" : {
  "name" : "test_cra_company",
  "router" : {
   "name" : "category",
   "field" : "company_name",
   "maxCardinality" : 2
  },
  "create-collection" : {
   "numShards" : "1",
   "replicationFactor" : "1",
   "config" : "_default"
  }
 }
}'

Let’s stop for a minute here. We’ve used the create-alias command to create a new alias called test_cra_company and provided a few properties. First of all, you can see the router object that includes three properties:

  • name – the name of the router to use. For now Solr supports time and category. The time router creates a time routed alias, while the category router creates a category routed alias.
  • field – the name of the field that will be used for routing.
  • maxCardinality – the maximum number of unique values that the field can take to avoid creating too many collections.

Because the routed aliases will create collections in the background we need to provide the properties for the collections. We do that using the create-collection object in our request. We provided the number of shards, the replication factor and the name of the configuration that will be used.

In the examples, for simplicity we are using the default, data-driven schema. This shouldn’t be used for production when using time or category routed aliases.

How Do Updates in Routed Aliases Work Internally?

When Solr processes an update request, an UpdateRequestProcessor is initialized. In SolrCloud, as in our case, that is the DistributedUpdateProcessor, at least when we use standard aliases. When time or category routed aliases are used, the RoutedAliasUpdateProcessor runs before the actual DistributedUpdateProcessor. This happens automatically, so you don’t have to set it up manually. The RoutedAliasUpdateProcessor is responsible for routing the data to the appropriate collection. As we saw, without any kind of routed alias Solr simply uses the first collection on the alias list to perform the update operation.

What you should also know is that a special placeholder collection called test_cra_company__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA__TEMP will be created and eventually deleted. The collections that are created in the background will be named test_cra_company__CRA__<VALUE_OF_THE_FIELD>, so for example test_cra_company__CRA__company1 if the company_name field has the value company1. This imposes naming limitations, but we will talk about that later in the blog post.
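
The naming scheme is simple enough to reproduce on the client side, which can be handy when you want to know which collection a given value ends up in. A minimal sketch, assuming the value is plain ASCII and is used as-is; the function name is hypothetical and any extra normalization Solr may apply to the value is not modeled here.

```shell
# Infix the post shows between the alias name and the field value.
CRA_INFIX='__CRA__'

# Print the expected collection name for a given alias and category value.
# Assumption: the value is plain ASCII and used unchanged.
cra_collection_name() {
  printf '%s%s%s\n' "$1" "$CRA_INFIX" "$2"
}

cra_collection_name test_cra_company company1
# prints test_cra_company__CRA__company1
```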

Indexing Data With Routed Aliases

Let’s now see the alias in action. To do that we will index two documents using the following commands:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test_cra_company/update?commit=true' -d '[
 {
  "id" : 1,
  "name" : "Test indexing",
  "company_name" : "company1"
 }
]'
curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test_cra_company/update?commit=true' -d '[
 {
  "id" : 2,
  "name" : "Test indexing",
  "company_name" : "company2"
 }
]'

Remember the maxCardinality property that we’ve set to 2? If we try to index a document with a third company name, for example using the following command:

curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test_cra_company/update?commit=true' -d '[
 {
  "id" : 3,
  "name" : "Test indexing",
  "company_name" : "company3"
 }
]'

Solr will get back to us with the following error:

{
  "responseHeader":{
    "status":400,
    "QTime":1},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Max cardinality 2 reached for Category Routed Alias: test_cra_company",
    "code":400}}

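An indexing client may want to tell this rejection apart from other failures. A minimal sketch, assuming the error message keeps the wording shown above (that wording is not a documented contract, so treat the match as fragile); the function name is hypothetical.

```shell
# Return success (0) if the Solr response body passed as the first argument
# looks like the max cardinality rejection shown above.
is_max_cardinality_error() {
  printf '%s' "$1" | grep -q 'Max cardinality .* reached for Category Routed Alias'
}

# Example with a trimmed copy of the response from the post.
response='{"error":{"msg":"Max cardinality 2 reached for Category Routed Alias: test_cra_company","code":400}}'

if is_max_cardinality_error "$response"; then
  echo "category limit reached, document was not indexed"
fi
```
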
Queries

We can also try to search on our data, for example using a match all query:

http://localhost:8983/solr/test_cra_company/select?q=*:*

Solr will return both of the indexed documents:

{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 12,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 1,
    "docs": [
      {
        "id": "1",
        "name": [
          "Test indexing"
        ],
        "company_name": [
          "company1"
        ],
        "_version_": 1647547697554522000
      },
      {
        "id": "2",
        "name": [
          "Test indexing"
        ],
        "company_name": [
          "company2"
        ],
        "_version_": 1647547721335177200
      }
    ]
  }
}

We can also filter our query as we would usually do:

http://localhost:8983/solr/test_cra_company/select?q=*:*&fq=company_name:company2

And the response will contain only a single document, which is expected:

{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 25,
    "params": {
      "q": "*:*",
      "fq": "company_name:company2"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1,
    "docs": [
      {
        "id": "2",
        "name": [
          "Test indexing"
        ],
        "company_name": [
          "company2"
        ],
        "_version_": 1647547721335177200
      }
    ]
  }
}

As we can see, everything works as intended.

Following the naming scheme described earlier, our simple test should have created the collections test_cra_company__CRA__company1 and test_cra_company__CRA__company2.
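
If you want to list them yourself, the Collections API LIST action returns the names of all collections in the cluster. A small sketch, assuming the same local instance as in the earlier commands; the helper names are made up for this example.

```shell
# Base URL of the local Solr node used throughout this post.
SOLR_URL='http://localhost:8983/solr'

# URL of the Collections API LIST action, asking for a JSON response.
list_collections_url() {
  printf '%s/admin/collections?action=LIST&wt=json\n' "$SOLR_URL"
}

# Fetch the list and print only the names that belong to our routed alias;
# needs a running cluster.
list_cra_collections() {
  curl -s "$(list_collections_url)" | grep -o 'test_cra_company__CRA__[A-Za-z0-9_]*'
}

# Run manually against your cluster:
# list_cra_collections
```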

Limitations

There are a few limitations when it comes to the routed aliases, especially the category routed aliases.

The first limitation is naming. The values of the field that we use for routing need to be ASCII based. Otherwise Solr will not be able to use the value in the collection name and we will run into issues. Keep that in mind.

The second limitation is the deletion of the alias or of the collections that it handles. There is no automated way of doing that, so you can’t easily remove a category. The procedure for removing a category would be:

  • Ensuring that no more documents with the category that we want to remove will be indexed.
  • Modifying the alias definition in ZooKeeper by removing the collection that is responsible for handling the data of the category that we want to remove.
  • Deleting the collection using the Solr API. You need to remove it from the alias first, otherwise Solr will fail to delete it.
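
The Solr API part of that procedure can be sketched with two small helpers: one to see how many documents the category still holds, and one for the Collections API DELETE action. This is only a sketch under the naming used in this post; the helper names are made up, and the ZooKeeper step is left out on purpose because it depends on how you manage your cluster.

```shell
# Base URL of the local Solr node and the alias used throughout this post.
SOLR_URL='http://localhost:8983/solr'
ALIAS='test_cra_company'

# URL that counts the documents still stored for a category
# (these are the documents that would go away with the collection).
category_count_url() {
  printf '%s/%s/select?q=company_name:%s&rows=0\n' "$SOLR_URL" "$ALIAS" "$1"
}

# URL of the Collections API DELETE action for the category's collection.
delete_collection_url() {
  printf '%s/admin/collections?action=DELETE&name=%s__CRA__%s\n' "$SOLR_URL" "$ALIAS" "$1"
}

# Run manually, in this order, against your cluster:
# curl -s "$(category_count_url company2)"
# ... remove the collection from the alias definition first ...
# curl -s "$(delete_collection_url company2)"
```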

Currently (in Solr 8.2 as of writing this post) the update is distributed to an appropriate collection based on the value of the category, while the query is run against all the collections defined by the alias. Improving the query time execution is mentioned as one of the possible improvements for the routed aliases feature in Solr.

Finally, remember that collection creation takes time, usually up to 3 seconds, depending on how loaded your SolrCloud cluster is. Keep that in mind when designing and implementing a system that will use Solr with routed aliases. Your system needs to be able to handle a longer delay when a document carries a new value for the first time, which leads to collection creation.

Summary

As you can see, category routed aliases provide a very nice and convenient way of automatically creating collections based on the value of a field. So if we need that and want Solr to take care of it for us, this is one of the ways to go, especially in the newer Solr versions. Is the feature perfect? No. Is there room for improvement? Yes, and possible improvements are already mentioned in the official Solr documentation. Hopefully we will see them in the next Solr versions.
