6 deadly sins in the context of query

In my work related to Lucene and Solr I have seen various queries. While in the case of Lucene, developer usually knows what he/she wants to achieve and use more or less optimal solution, but when it comes to Solr it is not always like this. Solr is a product which could theoretically be used by everyone, both the person who knows Java, one that does not have a broad and specialized technical knowledge, as well as programmer. Precisely because of that Solr is a product which is easy to run and use it, at least when it comes to simple functionalities. I suppose, that is why not many people are worried about reading Solr wiki or at least review the mailing list. As a result, sooner or later people tend to make mistakes. Those errors arise from various shortcomings – lack of knowledge about Solr, lack of skills, lack of experience or simply a lack of time and tight deadlines. Today I would like to show some major mistakes when submitting queries to Solr and how to avoid those mistakes.

1. Lack of filters

One of the fundamental errors that I encounter from time to time is the lack of filters, which in context of query means no fq parameter. Let us remember that filters are out friends 😉 Remember that because of filters Solr cache is used more optimally. Filter do not affect relevance of the document in the context of query and search results (score factor), and thus, we can perform filtering without fear of the change in the score value of individual documents (useful for example in e-commerce for product groups narrowing).

2. Logical conditions and q parameter

Another of the “sins” that I come across quite often is a one with close relationship with the previous point. It’s not a bug in the literal sense, but it is an area where a simple change will have a significant influence on performance. Assuming that the default logical operator is OR, imagine your query in the form of: q=(java+design+patterns)+AND+category:books+AND+promotion:true+AND+publisher:ABC. This query is correct from the perspective of the application logic, where we get the appropriate group search results. But what if we also want to optimally use Solr cache and thus boost performance. The anserw is quite simple – move some of the terms to filters. By changing out a query to q=java+design+patterns&fq=category:books&fq=promotion:true&fq=publisher:ABC Solr benefit from two types of cache – queryDocumentCache to retrieve documents for a query with parameter q and the filterCache for each of the filters. With the change of the query we were able to optimize the query to use two types of cache and in addition to optimize the entries of queryDocumentsCache (due to the shortening of the query parameter q).

3. Hugh numbers of facet queries

Another “sin” associated with handling groups of documents. Quite often, especially, in applications that can categorize products in many ways, I have met queries with a lot of facet.query parameters that correspond to the grouping of documents. Grouping by price, location, product groups, and so on. A good example is grouping by price where the business customer can set the price ranges for each category and then application must group products by those ranges. This leads to queries that have 100, 200 or more facet.query parameters added. Please remember that each facet.query has an impact on efficiency, not to mention 100 or 200 of them. If we are interested in a quick response from Solr we can not make such a queries. In such cases, I always propose modifying the index structure if needed, and they are needed in most cases. Some modifications (like defining ranges at index time) allows to eliminate tens or hundreds of facet.query for one facet.field parameter. But this method is associated to another problem – explanation for the customer, why “re-index button” must be pressed after range changes. As a rule, however, performance testing at high loads and large variety of queries speak for themselves.

4. Facet limits

The problem appears in the line where Solr meets business logic. An example of this “sin” is a simple list of categories that a customer wants to have displayed depending on user location on the website. When we have a small numbers of categories we do not have a problem, but what about thousands of categories. Very often, I met with the approach taken by the developers to retrieve all categories of Solr (with increased facet.limit parameter compared to the default value) and choose the right categories in the application that is using Solr. I think this approach can generate problems – first of all faceting requires memory, second aggregating of facet elements need time, and of course getting all of the 50.000 categories with it’s counts can be painful to Solr. If we want a fast queries, try to use the parameter facet.limit reasonable. If You need many facet results try to build your application so it can use facet.offset parameter and therfore use paging. If this is not possible, at least configure Your container to have enough memory to handle parallel queries and get ready to have queries that can perform longer when having high facet.limit value.

5. Downloading of unnecessary data

Very common problem is the retrieval of all information, not just those that we need. Of course, the problem does not apply to deployments where Sole offers information such as, for example, only the product ID. However, a large number of deployments that I’ve deal with was based almost entirely on Solr, and hence Solr index was made up of multiple stored fields. Developers using Solr very rarely used the parameter fl and the possibility of limiting the fields that are returned. In extreme cases, this led to problems with the amount of data that had to be sent over the network.

6. Many requests to obtain count of groups of documents

In some applications more important than the actual search capabilities were the navigation, where users can browse document repository by it’s feature, like department, category, subcategory, and so on. Very often, in addition to names there are also numbers displayed – the numbers of documents with this feature. I met cases were the number of documents were obtained using a separate query. Effect – 100 categories displayed on a web page led to 100 separate queries to Solr. Do not go this way, if You have to modify the Solr index to use the facet mechanism. Maybe at that time it will be more work, but certainly in the long run this is worth it.

A few words at the end

Please note that these are just examples that I think are fairly universal, at least, which I quite often encountered during my work. They are not all the errors that happen when using the Solr, but I hope to highlight some of the mistakes people tend to make and how to go around them.

This post is also available in: Polish

This entry was posted on Wednesday, August 11th, 2010 at 15:16 and is filed under About Solr. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.