“Car sale application” – schema.xml designing to gain what we really need (part 1)

One of Solr’s fundamental configuration files is schema.xml. It is a kind of connector between what we need and what Solr understands. If we want a search engine that returns the results we really expect, it is very important to design the schema.xml configuration file properly.
This is the first in a series of articles which will hopefully show how to design the schema.xml file and how to handle and modify all of its components.

Requirements specification

Imagine we would like to use Solr to provide our car sale website with a search engine. At the beginning the functional part of our website is rather primitive and uses only a small part of the information about each car:

  • make
  • model
  • year of production
  • price
  • engine size
  • mileage
  • colour
  • damaged

We would like to design a simple schema configuration file that makes it possible to index data from the given fields. But before we open the schema.xml file and start typing, let’s answer seven fundamental questions about our fields:

1. What is the field type?

Let’s determine the type of every field:

  • make – text field
  • model – text field
  • year of production – integer field
  • price – float field
  • engine size – integer field
  • mileage – integer field
  • colour – text field
  • damaged – boolean field

So what?

So we will need some basic type definitions like string, boolean, int, float.

2. Is the field used in the search process?

We would like to use the data from some fields to enable our search engine to find the proper documents (car sale announcements). To accomplish that we are going to use 3 fields: make, model and year of production.

So what?

So we will need to create another field type, which will contain some filters to make finding the documents easy and efficient. We will also create a field of this new type, into which we will put all the data from the make, model and year of production fields.

3. Is the field used for faceting or sorting?

On our website we would like to sort search results using 4 fields: model, year of production, price and mileage. We would also like to be able to facet on the fields make, model, year of production and colour.

So what?

A field type for fields used for sorting or faceting must not contain a tokenizer or filters that split a value into multiple tokens; the whole value has to stay a single token. But we still want the values to be lowercased, so that letter case does not influence the sorting and faceting results. So that is another field type we will need to create.

4. Is the field used to filter search results?

We would like to have the possibility to filter search results using ranges on fields: year of production, price, engine size and mileage.

So what?

So let’s use the field types which will accelerate range queries.

5. Are there any fields not mentioned in questions 2, 3 or 4?

There is a field “damaged” which is not supposed to be involved in any of the mentioned operations.

So what?

So we will set the value of the “indexed” attribute to “false”.

6. Is the field required?

We assume that 3 fields are required: make, model and year of production. We don’t want the index to contain documents (car sale announcements available in the search process) that have no values in those fields.

So what?

So we will set the value of the “required” attribute to “true”.

7. Do we need to retrieve the information from the field in the original state?

We would like to retrieve the information from all of the fields mentioned in the requirements specification and present them directly on the website.

So what?

So we will set the value of the “stored” attribute to “true”.

Let’s add field type definitions

We’ve answered our questions and come to some conclusions, so let’s add the field types to the schema file:

We add the solr.StrField type, which is not analysed and can be used for example as the type for the unique document key.
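A minimal sketch of such a definition, assuming the conventional type name `string` and the attributes found in Solr's example schema of that era:

```xml
<!-- Non-analysed string type; the name "string" and the extra attributes are assumptions -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
```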

Add the boolean type:
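As a sketch, assuming the type is simply named `boolean` and backed by `solr.BoolField`:

```xml
<!-- Boolean type for the "damaged" flag; the type name is an assumption -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
```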

Now the numerical types. Remember that we need types that can help us to accelerate range queries. So let’s use the tint and tfloat types:
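A sketch of how these two types might be declared, assuming the Trie-based classes and a `precisionStep` of 8 as in Solr's example schema (the exact value is an assumption; a lower step indexes more terms per value, which speeds up range queries at the cost of index size):

```xml
<!-- Trie-encoded numeric types, suited to range queries; precisionStep value is an assumption -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```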

Now let’s create the text type, which will be the type of the catch-all field used for searching. For now, a type with the whitespace tokenizer and the lowercase filter will be just fine:
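One way this could look, with the type name `text` chosen here purely for illustration:

```xml
<!-- Searchable text type; the name "text" is hypothetical -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the input on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase every token so matching ignores letter case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```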

And last, but not least, the type for the sortable/facetable fields. What we need is the type that lowercases the entire field value, keeping it as a single token. KeywordTokenizer does no actual tokenizing, so it is the ideal tokenizer for our need. The TrimFilterFactory removes any leading or trailing whitespace:
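A sketch of such a type, using the name `lowercase` that the article refers to later; the attribute values are assumptions:

```xml
<!-- Single-token, lowercased type for sorting and faceting -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the whole field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase that token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strip leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```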

Time to add field definitions

Document id:
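A possible definition, assuming the field is called `id` and uses the non-analysed string type:

```xml
<!-- Unique document identifier; the field name "id" is an assumption -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
```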

Make and model:
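Following the answers above (required, stored, not indexed), the two fields might be sketched like this, assuming the field names `make` and `model` and the `string` type:

```xml
<!-- Stored for display only; searching, sorting and faceting happen on copies -->
<field name="make" type="string" indexed="false" stored="true" required="true"/>
<field name="model" type="string" indexed="false" stored="true" required="true"/>
```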

Now why is the “indexed” attribute set to “false”? After all, we need those fields for searching, sorting and faceting. That’s true … but notice that for searching purposes we will copy the data from those fields to one catch-all field:
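A sketch of the catch-all field, named `content` as in the copyField section; the type name `text` and `stored="false"` are assumptions, the latter because this field is only searched and never displayed:

```xml
<!-- Catch-all search field; receives values from several source fields -->
<field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
```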

For sorting and faceting purposes, we will copy the data to further fields of the type “lowercase”:
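These could be declared as follows; the field names `make_lowercase` and `model_lowercase` are hypothetical, chosen only for this sketch:

```xml
<!-- Indexed copies used for sorting and faceting; not stored, since the originals are -->
<field name="make_lowercase" type="lowercase" indexed="true" stored="false"/>
<field name="model_lowercase" type="lowercase" indexed="true" stored="false"/>
```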

So the make and model fields will not take part in those operations themselves, and we can set their “indexed” attribute to “false” to keep the index small.

The rest of the fields:
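A sketch under several assumptions: the field names `year`, `price`, `engine_size`, `mileage` and `colour` are illustrative, and `colour` is given the `lowercase` type directly here, which is only one of the ways to make it facetable:

```xml
<!-- Numeric fields use Trie types so that range filters stay fast -->
<field name="year" type="tint" indexed="true" stored="true" required="true"/>
<field name="price" type="tfloat" indexed="true" stored="true"/>
<field name="engine_size" type="tint" indexed="true" stored="true"/>
<field name="mileage" type="tint" indexed="true" stored="true"/>
<!-- Faceted on a lowercased single token; the stored value keeps its original case -->
<field name="colour" type="lowercase" indexed="true" stored="true"/>
```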

Remember the “false” value of the “indexed” attribute on the “damaged” field:
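For example, assuming the field is named `damaged` and uses the `boolean` type:

```xml
<!-- Displayed on the website but never searched, sorted, faceted or filtered -->
<field name="damaged" type="boolean" indexed="false" stored="true"/>
```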

copyField – let’s index the same data differently

We have mentioned copying field values several times already, so now let’s define the copy fields.

Fields used for searching are copied to the catch-all “content” field. There is more than one source field, which is why the “content” field definition has the multiValued attribute set to “true”:
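A sketch of those declarations, assuming the source fields are named `make`, `model` and `year`:

```xml
<!-- Each source value is copied as submitted and then analysed by the "content" field type -->
<copyField source="make" dest="content"/>
<copyField source="model" dest="content"/>
<copyField source="year" dest="content"/>
```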

Copying the sortable/facetable fields:
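With the hypothetical destination names `make_lowercase` and `model_lowercase`, this might look like:

```xml
<!-- Lowercased single-token copies for sorting and faceting; destination names are hypothetical -->
<copyField source="make" dest="make_lowercase"/>
<copyField source="model" dest="model_lowercase"/>
```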

Anything else?

We shall add 3 more elements to the schema:

The unique key of the document:
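Assuming the identifier field is named `id`:

```xml
<!-- Names the field that uniquely identifies a document -->
<uniqueKey>id</uniqueKey>
```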

Default search field:
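In schema.xml versions that support this element, it would probably point at the catch-all field:

```xml
<!-- Queries without an explicit field name are run against "content" -->
<defaultSearchField>content</defaultSearchField>
```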

Default query parser operator. Let’s set it to “AND”.
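In schema.xml versions that support this element, the setting is a single line:

```xml
<!-- All query terms must match unless the query states otherwise -->
<solrQueryParser defaultOperator="AND"/>
```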

It’s done! The schema.xml configuration file is ready and looks like this:
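What follows is an assembled sketch rather than the exact original file. The schema name, the `version` value, the `types`/`fields` wrappers, and the names `id`, `text`, `year`, `engine_size`, `make_lowercase` and `model_lowercase` are all assumptions made for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: schema name, version and several field/type names are assumptions -->
<schema name="carsale" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- Display-only originals -->
    <field name="make" type="string" indexed="false" stored="true" required="true"/>
    <field name="model" type="string" indexed="false" stored="true" required="true"/>
    <field name="year" type="tint" indexed="true" stored="true" required="true"/>
    <field name="price" type="tfloat" indexed="true" stored="true"/>
    <field name="engine_size" type="tint" indexed="true" stored="true"/>
    <field name="mileage" type="tint" indexed="true" stored="true"/>
    <field name="colour" type="lowercase" indexed="true" stored="true"/>
    <field name="damaged" type="boolean" indexed="false" stored="true"/>
    <!-- Copy targets -->
    <field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
    <field name="make_lowercase" type="lowercase" indexed="true" stored="false"/>
    <field name="model_lowercase" type="lowercase" indexed="true" stored="false"/>
  </fields>

  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="AND"/>

  <copyField source="make" dest="content"/>
  <copyField source="model" dest="content"/>
  <copyField source="year" dest="content"/>
  <copyField source="make" dest="make_lowercase"/>
  <copyField source="model" dest="model_lowercase"/>
</schema>
```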

The end

In today’s post we have created a simple schema.xml file that allows us to index data and so support the search features of our car sale website. But we still want to develop the website further, which will surely affect the schema … and not only the schema. In the next “car sale” post we will face some new requirements and make further modifications.
