howto – Solr.pl

“Car sale application” – Spatial Search, adding location data (part 3)

Rafał Kuć — Mon, 14 Mar 2011 08:19:01 +0000

The number of announcements in our database has grown so large that our website users have started asking for another way to filter and sort search results. We need to add functionality that lets us work with location data related to the cars.

Requirements specification

We would like to add two new features:

Filtering the results to display only those announcements that are located no farther than x kilometres from a given place, where x = 50, 100, 200, 500 or 1000 km.
Sorting the results by the distance between a given place and the car’s location.

To meet these requirements, we need to use a Solr feature called “Spatial Search”, which is available in the Solr distribution from version 3.1. The changes we need to make involve modifying the schema.xml file and the input data, where we have to add location information for every car. Finally, we will create the proper requests.

Schema.xml changes

New field type definitions:
- the first definition is nothing more than another numerical type:
- the second definition uses the “solr.LatLonType” class, which allows us to index location data using a dynamic field with the “_coordinate” suffix:
New field definitions:
- a field that will store the name of the city related to every car:
- the “loc” field, which will be used to index location data:
- the dynamic field used internally to store the information provided by the “loc” field:
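
The definitions described above could be sketched as follows. This is only an illustration modelled on the example schema shipped with Solr 3.1: the type names “tdouble”, “location” and “string”, the field name “city” and all attribute values are assumptions. Only the “loc” field name, the “_coordinate” suffix and the solr.LatLonType class come from the text.

```xml
<!-- Assumed type names and attribute values, modelled on the Solr 3.1 example schema -->
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

<!-- "city" is a hypothetical field name; it assumes a "string" type already exists in the schema -->
<field name="city" type="string" indexed="true" stored="true"/>
<!-- "loc" is the field name used in this post -->
<field name="loc" type="location" indexed="true" stored="true"/>

<!-- Internal sub-fields of "loc": one holds the latitude, the other the longitude -->
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
```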

Input data analysis

To show how to modify the input data, let’s take 5 announcements from the following cities:

Koszalin
- latitude: 54.12
- longitude: 16.11
Białystok
- latitude: 53.08
- longitude: 23.09
Szczecin
- latitude: 53.25
- longitude: 14.35
Gdańsk
- latitude: 54.21
- longitude: 18.40
Warszawa
- latitude: 52.15
- longitude: 21.00

We provide the location data by entering the latitude and longitude, separated by a comma, in the “loc” field. Our data might look like this:


   
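id: 1; make: Audi; model: 80; year: 2008; price: 9774; engine size: 2000; mileage: 92467; colour: green; damaged: false; city: Koszalin; loc: 54.12,16.11
id: 2; make: Audi; model: A8; year: 2009; price: 9078; engine size: 1000; mileage: 31369; colour: black; damaged: false; city: Białystok; loc: 53.08,23.09
id: 3; make: Audi; model: TT; year: 1997; price: 1109; engine size: 1299; mileage: 116987; colour: silver; damaged: true; city: Szczecin; loc: 53.25,14.35
id: 4; make: BMW; model: Seria 7; year: 2007; price: 140000; engine size: 3000; mileage: 418000; colour: green; damaged: false; city: Gdańsk; loc: 54.21,18.40
id: 5; make: Chevrolet; model: TrailBlazer; year: 2007; price: 140000; engine size: 3000; mileage: 418000; colour: green; damaged: false; city: Warszawa; loc: 52.15,21.00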
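
In Solr’s XML update format, the first announcement could be sent roughly as shown below. This is a sketch only: the field names other than “id” and “loc” (for example “engine_size” and “city”) are assumptions made for the illustration.

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="make">Audi</field>
    <field name="model">80</field>
    <field name="year">2008</field>
    <field name="price">9774</field>
    <!-- "engine_size" and "city" are hypothetical field names -->
    <field name="engine_size">2000</field>
    <field name="mileage">92467</field>
    <field name="colour">green</field>
    <field name="damaged">false</field>
    <field name="city">Koszalin</field>
    <!-- latitude and longitude, separated by a comma -->
    <field name="loc">54.12,16.11</field>
  </doc>
</add>
```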

Let’s create queries

We have our location data in the index, so all we need now is to create queries that satisfy our requirements. Let’s imagine that we are searching for announcements while in Białystok, which is located about 200 km from Warszawa, about 400 km from Gdańsk, about 550 km from Koszalin and about 650 km from Szczecin.

To meet the first requirement, we add a special filter query to our request:

...&fq={!geofilt sfield=loc}&pt=53.08,23.09&d=50

where:

sfield – the name of the field in which our location data is indexed.
pt – the location of the starting point; in our case it is Białystok.
d – the distance, in kilometres, used to narrow the search results. By using the values 50, 100, 200, 500 and 1000 we can satisfy all our needs.

Example:

Query:

q=*:*&fq={!geofilt sfield=loc}&pt=53.08,23.09&d=200

Search results:


   
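city: Białystok; colour: black; damaged: false; engine size: 1000; id: 2; make: Audi; mileage: 31369; model: A8; price: 9078.0; year: 2009
city: Warszawa; colour: green; damaged: false; engine size: 3000; id: 5; make: Chevrolet; mileage: 418000; model: TrailBlazer; price: 140000.0; year: 2007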

That’s great: we don’t get any announcements from Koszalin, Gdańsk or Szczecin, as these cities are located farther than 200 km from Białystok.

To meet the second requirement, we sort the search results using the geodist function. The query would look like this:

...&sfield=loc&pt=53.08,23.09&sort=geodist()+desc

An example of sorting the search results by distance, starting from Białystok:

Query:

q=*:*&sfield=loc&pt=53.08,23.09&sort=geodist()+asc

Search results:


   
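city: Białystok; colour: black; damaged: false; engine size: 1000; id: 2; make: Audi; mileage: 31369; model: A8; price: 9078.0; year: 2009
city: Warszawa; colour: green; damaged: false; engine size: 3000; id: 5; make: Chevrolet; mileage: 418000; model: TrailBlazer; price: 140000.0; year: 2007
city: Gdańsk; colour: green; damaged: false; engine size: 3000; id: 4; make: BMW; mileage: 418000; model: Seria 7; price: 140000.0; year: 2007
city: Koszalin; colour: green; damaged: false; engine size: 2000; id: 1; make: Audi; mileage: 92467; model: 80; price: 9774.0; year: 2008
city: Szczecin; colour: silver; damaged: true; engine size: 1299; id: 3; make: Audi; mileage: 116987; model: TT; price: 1109.0; year: 1997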

That’s correct! Mission accomplished.

The end

Once more we have lived up to our website users’ expectations. This time we have added features that allow our users to filter and sort the search results using location and distance data. Full success!

“Car sale” application – WordDelimiterFilter and PatternReplaceFilter, helping to improve search results (part 2)

Rafał Kuć — Mon, 14 Feb 2011 08:09:28 +0000

In the first part of our “Car sale” application series we created a standard index structure by properly configuring the schema.xml configuration file. It didn’t take long to hear the first complaints from the website users about this configuration. Why don’t I receive any search results when entering the phrase “audi a”? I would like to see some announcements with “Audi A6” and “Audi A8”, for example. I entered the phrase “Honda crv” – 0 results, “Suzuki maruti” – none. Are there no related offers in the announcement database? There are! But the current configuration of the searchable field type (field “content” – type “text”) does not allow us to find those offers using the queries we’ve entered. That’s the reason why the WordDelimiterFilter and PatternReplaceFilter need to enter the battlefield.

Requirements specification

We need to analyze the data that is indexed in the “content” field. Let’s examine the sample data that will help us create the new “text” type configuration:

Make: Audi
Model: 80, 90, A6, A8, TT

Make: BMW
Model: M3, M5, Series 7, Series 8, X1, X3

Make: Chevrolet
Model: TrailBlazer

Make: Citroen
Model: C-Crosser, C3 Pluriel, C4 Picasso

Make: Ford
Model: C-MAX, S-MAX

Make: Honda
Model: Accord, CR-V, FR-V, HR-V

Make: Kia
Model: Cee’d

Make: Suzuki
Model: Alto/Maruti

Make names are simple words that are easily handled by the current configuration (WhitespaceTokenizer + LowerCaseFilter). The problem is with the model names, as they contain additional characters and separators that we often ignore when entering the search phrase. Let’s put the sample data into groups that will help us with the upcoming configuration:

1. Model names that do not need to be processed by any additional filters (the current “text” type configuration is sufficient) – 80, 90, TT, Series 7, Series 8, Accord.
2. Model names that contain letters and numbers, where we want to split on letter-number transitions – A6, A8, M3, M5, X1, X3, C3 Pluriel, C4 Picasso. We would like to be able to find those models when entering only the letter, only the number, or the whole model name.
3. Model names that have case transitions – TrailBlazer. We would like to find the document with this name when entering “trail”, “blazer”, “trailBlazer” or “trailblazer”.
4. Model names that contain intra-word delimiters, which we want to either ignore or split on – C-Crosser, C-MAX, S-MAX, CR-V, FR-V, HR-V, Alto/Maruti.
Example: we would like to find the document with the model name “C-MAX” when entering the phrases “c”, “max”, “c-max” or “cmax”.
5. We intentionally omitted the “Cee’d” model name from the 4th point, as we would like to treat it a little differently. We don’t want to be able to find this model when entering the phrases “cee” or “d”. We treat the name only as a whole word – “cee’d” or “ceed”.

WordDelimiterFilter configuration

With the requirements described above, we are going to set the WordDelimiterFilter attributes to satisfy our needs:

1. WordDelimiterFilter is not needed for the 1st point, as the current “text” type configuration (WhitespaceTokenizer + LowerCaseFilter) is sufficient.
2. To meet the 2nd point’s requirements we need to set the following attributes:
- generateWordParts="1" – the value must be set to “1” if we want to generate parts of words
- generateNumberParts="1" – the value must be set to “1” if we want to generate parts of number words
- splitOnNumerics="1" – the value must be set to “1” if we want to generate new parts on letter => number transitions
3. To meet the 3rd point’s requirements we need to set the following attributes:
- generateWordParts="1"
- splitOnCaseChange="1" – the value must be set to “1” if we want to split on lowercase => uppercase transitions
4. To meet the 4th point’s requirements we need to set the following attributes:
- generateWordParts="1"
- catenateWords="1" – the value must be set to “1” if we want to ignore the intra-word delimiters by joining the subwords

So let’s take a look at our WordDelimiterFilter configuration:
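
Putting the attributes from the four points together, the filter definition might look like the sketch below. Only the five attributes discussed above are shown; leaving all remaining attributes at their defaults is an assumption of this sketch.

```xml
<!-- WordDelimiterFilter with every attribute discussed above set explicitly -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        splitOnNumerics="1"
        splitOnCaseChange="1"
        catenateWords="1"/>
```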

Additionally, we may notice that the default value of the “splitOnCaseChange” and “splitOnNumerics” attributes is “1”. The rest of the WordDelimiterFilter’s attributes (except “stemEnglishPossessive”) have the default value set to “0”. So our configuration can be reduced to:
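
Assuming the defaults are as described above, the reduced definition could be sketched like this, keeping only the attributes that are not already “1” by default:

```xml
<!-- The two split attributes are omitted because, as stated above, they default to "1" -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"/>
```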

What about the 5th point of our data specification? As we have stated, we don’t want to treat the “'” sign as an intra-word delimiter. So maybe we could use the protected="protwords.txt" option of the WordDelimiterFilter, which would keep the word “Cee’d” unchanged? OK, but we would also like to be able to find this model when entering the phrase “ceed”, so this option is not good for us. The best solution is to take care of this case in a separate filter and leave the WordDelimiterFilter with nothing to do.

PatternReplaceFilter configuration

We are going to put the PatternReplaceFilter before the WordDelimiterFilter. Using the PatternReplaceFilter, we will be able to ignore the “'” sign by replacing it with an empty string. Configured this way, the WordDelimiterFilter will receive the “Ceed” token and will not modify it. The configuration of the filters will be the same for indexing and searching, so a user will be able to find the offer with the “Cee’d” model when entering the phrases “cee’d” and “ceed”:
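
A possible definition of this filter is sketched below. The pattern assumes that the sign is a plain ASCII apostrophe; if the data contains a typographic apostrophe, the pattern would have to match that character as well.

```xml
<!-- Removes every apostrophe from a token, so "Cee'd" becomes "Ceed" -->
<filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all"/>
```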

New “text” type configuration visualization

Let’s take a look at our new “text” type:
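
The listing below is a sketch of how the pieces described in this post could fit together, not necessarily the exact definition used here. The filter order follows the descriptions above (PatternReplaceFilter first, then WordDelimiterFilter, with lowercasing last so that case changes are still visible to the splitter); the positionIncrementGap value is an assumption borrowed from the Solr example schema.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer, used both at index time and at query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drop apostrophes first, so "Cee'd" reaches the next filter as "Ceed" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all"/>
    <!-- Split on delimiters, case changes and letter-number transitions; also join subwords -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
    <!-- Lowercase last, so the case-change split above still sees the original case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```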

We are going to use Solr’s administration panel to check whether the configuration we’ve created is correct:

1. (Model: “80”) As expected, our new filters don’t influence the data typical for the 1st point.
2. (Model: “A8”) WordDelimiterFilter split on the letter-number transition.
3. (Model: “TrailBlazer”) WordDelimiterFilter split on the case transition, generating the “trail” and “Blazer” tokens. Additionally, we have the opportunity to enter the phrase “trailblazer”. Superb!
4. (Model: “CR-V”) WordDelimiterFilter ignored the intra-word delimiter by generating the subwords (“cr” and “v”) and additionally joining them (“crv”).
5. (Model: “Cee’d”) PatternReplaceFilter has replaced the word “Cee’d” with “Ceed”, and the WordDelimiterFilter has simply passed the value through. That’s what we needed.

The end

In this post we’ve shown how to configure two new filters – WordDelimiterFilter and PatternReplaceFilter – in order to improve the quality of search results. Our website users are satisfied … for now.

“Car sale application” – schema.xml designing to gain what we really need (part 1)

Rafał Kuć — Mon, 31 Jan 2011 08:07:42 +0000

One of Solr’s fundamental configuration files is the schema.xml file. It is a kind of connector between what we need and what Solr understands. If we want a search engine that gives us the search results we really expect, it is very important to design the schema.xml configuration file properly.
We would like to introduce the first in a series of articles which will hopefully show how to design the schema.xml file and how to handle and modify all of its components.

Requirements specification

Imagine we would like to use Solr to provide our car sale website with a search engine. The functional part of our website is rather primitive at the beginning and takes advantage of only a small piece of the information about every car:

make
model
year of production
price
engine size
mileage
colour
damaged

We would like to design a simple schema configuration file that will make it possible to index data from the given fields. But before we open the schema.xml file and start typing, let’s answer seven fundamental questions related to our fields:

1. What is the field type?

Let’s determine the type of every field:

make – text field
model – text field
year of production – integer field
price – float field
engine size – integer field
mileage – integer field
colour – text field
damaged – boolean field

So what?

So we will need some basic type definitions like string, boolean, int, float.

2. Is the field used in the search process?

We would like to use the data from some fields in order to enable our search engine to find the proper documents (car sale announcements). To accomplish that we are going to use 3 fields: make, model, year of production.

So what?

So we will need to create another field type, which will contain some filters to make finding the documents easy and efficient. We will also create a field of this new type, where we will put all the data from the make, model and year of production fields.

3. Is the field used in faceting or sorting operations?

On our website we would like to sort search results using 4 fields: model, year of production, price and mileage. We would also like to be able to use faceting on the following fields: make, model, year of production and colour.

So what?

When we create a field type for fields used for sorting or faceting, we need to know that this type cannot contain tokenizers or filters that split the field values into multiple tokens. But we still want the values to be lowercased, so that letter case does not influence the sorting/faceting results. So that’s another field type we will need to create.

4. Is the field used to filter search results?

We would like to have the possibility to filter search results using ranges on fields: year of production, price, engine size and mileage.

So what?

So let’s use the field types which will accelerate range queries.

5. Are there any fields which are not mentioned in questions 2, 3 or 4?

There is a field “damaged” which is not supposed to be involved in any of the mentioned operations.

So what?

So we will set the value of the “indexed” attribute to “false”.

6. Is the field required?

We assume that 3 fields are required: make, model and year. We don’t want to have documents in the index (car sale announcements available in the search process) that do not have values in those fields.

So what?

So we will set the value of the “required” attribute to “true”.

7. Do we need to retrieve the information from the field in the original state?

We would like to retrieve the information from all of the fields mentioned in the requirements specification and present them directly on the website.

So what?

So we will set the value of the “stored” attribute to “true”.

Let’s add field type definitions

We’ve answered our questions and come to some conclusions, so let’s add the field types to the schema file:

We add the solr.StrField type, which is not analysed and can be used for example as the type for the unique document key.
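
For illustration, such a type could be declared as below. The type name “string” and the extra attributes follow the Solr example schema and are assumptions here.

```xml
<!-- Not analysed: the whole value is indexed as a single, unchanged token -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
```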

Add the boolean type:
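
A minimal sketch, again assuming the attribute values used in the Solr example schema:

```xml
<!-- Holds true/false values -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
```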

Now the numerical types. Remember that we need types that can help us to accelerate range queries. So let’s use the tint and tfloat types:
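
The two Trie-based types might be declared as in the sketch below. The precisionStep value is an assumption taken from the Solr example schema; a value greater than zero makes Solr index additional, lower-precision terms, which is what speeds up range queries.

```xml
<!-- Trie types: extra lower-precision terms are indexed to accelerate range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```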

Now let’s create the textual type, which will be a definition type for the catch-all field used for searching. For now, the type with the whitespace tokenizer and the lowercase filter will be just fine:
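
One way to write such a type is sketched here. The type name “text” is the one used later in this series; the positionIncrementGap value is an assumption.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the value on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```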

And last, but not least, the type for the sortable/facetable fields. What we need is the type that lowercases the entire field value, keeping it as a single token. KeywordTokenizer does no actual tokenizing, so it is the ideal tokenizer for our need. The TrimFilterFactory removes any leading or trailing whitespace:
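
The type described above could look like the following sketch. The name “lowercase” is the one this post uses for the type; the order of the two filters and the positionIncrementGap value are assumptions.

```xml
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keeps the entire field value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercases the token so letter case does not affect sorting or faceting -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removes leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```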

Time to add field definitions

Document id:
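
A sketch of the key field, assuming it uses the non-analysed “string” type and is required:

```xml
<!-- Unique document key -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
```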

Make and model:
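
The two fields might be defined as below. The “string” type is an assumption; the “indexed”, “stored” and “required” values follow the answers to the questions above.

```xml
<!-- Stored for display only; searching, sorting and faceting use copies of these fields -->
<field name="make" type="string" indexed="false" stored="true" required="true"/>
<field name="model" type="string" indexed="false" stored="true" required="true"/>
```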

Now why is the value of the “indexed” attribute set to “false”? After all, we need those fields for search, sort and facet operations. That’s true … but notice that for searching purposes we will copy the data from those fields to one catch-all field:
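
The catch-all field could be sketched as follows. The name “content” and the “text” type are the ones used in this series; not storing the field is an assumption, made because the original values are already stored in the source fields.

```xml
<!-- Catch-all search field; multiValued because several source fields are copied into it -->
<field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
```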

and for sorting/faceting purposes we will copy the data to yet other fields of the type “lowercase”:
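
These additional fields might look like the sketch below. The names “make_lc” and “model_lc” are hypothetical, chosen only for this illustration, as is the decision not to store them.

```xml
<!-- Hypothetical field names: lowercased single-token copies used for sorting and faceting -->
<field name="make_lc" type="lowercase" indexed="true" stored="false"/>
<field name="model_lc" type="lowercase" indexed="true" stored="false"/>
```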

So the make and model fields will not take part in those operations themselves, and we can set the “indexed” attribute to “false” to keep the index smaller.

The rest of the fields:
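
A sketch of the remaining definitions, based on the answers to the questions above. The field names “year” and “engine_size”, and the choice of the “lowercase” type for the colour field, are assumptions made for this illustration.

```xml
<!-- Numeric fields use Trie types so that range filters are fast -->
<field name="year" type="tint" indexed="true" stored="true" required="true"/>
<field name="price" type="tfloat" indexed="true" stored="true"/>
<!-- "engine_size" is a hypothetical field name -->
<field name="engine_size" type="tint" indexed="true" stored="true"/>
<field name="mileage" type="tint" indexed="true" stored="true"/>
<!-- Faceted field; assumed here to use the lowercasing single-token type directly -->
<field name="colour" type="lowercase" indexed="true" stored="true"/>
```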

Remember about the “false” value of the “indexed” attribute of the “damaged” field:
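
For example, assuming the “boolean” type name from the basic types mentioned earlier:

```xml
<!-- Only retrieved and displayed, never searched, sorted or filtered on -->
<field name="damaged" type="boolean" indexed="false" stored="true"/>
```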

copyField – let’s index the same data differently

We have mentioned the field values copying several times already so now let’s define copy fields.

Fields used for searching are copied to the catch-all “content” field. There is more than one source field, which is why the “content” field definition has the multiValued attribute set to “true”:
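
These copy instructions could be written as in the sketch below, assuming the year of production field is simply named “year”:

```xml
<!-- Everything that should be searchable ends up in the catch-all field -->
<copyField source="make" dest="content"/>
<copyField source="model" dest="content"/>
<copyField source="year" dest="content"/>
```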

Copying the sortable/facetable fields:
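
A possible form of these instructions is shown here. The destination names “make_lc” and “model_lc” are hypothetical names for fields of the “lowercase” type.

```xml
<!-- Hypothetical destination fields of the "lowercase" type -->
<copyField source="make" dest="make_lc"/>
<copyField source="model" dest="model_lc"/>
```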

Anything else?

We will add 3 more elements to the schema:

The unique key of the document:

<uniqueKey>id</uniqueKey>

Default search field:

<defaultSearchField>content</defaultSearchField>

Default query parser operator. Let’s set it to “AND” using the defaultOperator attribute of the solrQueryParser element.

It’s done! The schema.xml configuration file is ready.

The end

In today’s post we have created a simple schema.xml file that allows us to index data and support the search features of our car sale website. But we still want to develop our website, which will surely affect the schema … and not only the schema. In the next “car sale” post we will face some new requirements and make further modifications.