schema – Solr.pl

Solr 4.2: Index structure reading API

Rafał Kuć — Mon, 20 May 2013 11:58:51 +0000

With the release of Solr 4.2 we got the possibility to read the Solr index structure over HTTP. Of course, before Solr 4.2 one could fetch the schema.xml file, parse it and extract the needed information. Solr 4.2, however, brings a dedicated API that returns the information we need without parsing the whole schema.xml file.

Possibilities

Let’s look at the new API by example.

Getting information in XML format

Many Solr users are used to getting their data in the XML format, at least when using the Solr HTTP API. However, the schema API uses JSON as the default format. To get the data in XML in any of the examples below, append the wt=xml parameter to the call, for example:

$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes?wt=xml'
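
Since wt is an ordinary query-string parameter, such URLs can also be assembled in code. The Python sketch below is only an illustration: the schema_url helper and its arguments are made up for this example, and the host, port and collection1 core are simply the ones used in the command above.

```python
from urllib.parse import quote, urlencode

# Assumed local Solr instance and core, as in the curl command above.
BASE = "http://localhost:8983/solr/collection1"

def schema_url(base, resource, name=None, wt=None):
    """Build a schema API URL such as <base>/schema/fieldtypes?wt=xml."""
    url = base.rstrip("/") + "/schema/" + resource
    if name is not None:
        # Optional trailing name segment; '*' is kept literal for readability.
        url += "/" + quote(name, safe="*")
    if wt is not None:
        # The response format is passed as a normal query-string parameter.
        url += "?" + urlencode({"wt": wt})
    return url

print(schema_url(BASE, "fieldtypes", wt="xml"))
```

Without the wt argument the helper returns the plain URL, which yields the default JSON response.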

Defined fields information

Let’s start by looking at how to fetch information about the fields defined in Solr. We have the following possibilities:

Get information about all the fields defined in the index
Get information about a single, explicitly named field

In the first case we should use the following command:

$ curl 'http://localhost:8983/solr/collection1/schema/fields'

In the second case we should append the / character and the field name to the above command. For example, to get the information about the author field we should use the following command:

$ curl 'http://localhost:8983/solr/collection1/schema/fields/author'

The Solr response to the first command will be similar to the following:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "fields":[{
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"author",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"cat",
      "type":"string",
      "multiValued":true,
      "indexed":true,
      "stored":true},
    {
      "name":"category",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"url",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"weight",
      "type":"float",
      "indexed":true,
      "stored":true}]}

On the other hand the response for the second command would be as follows:

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"author",
    "type":"text_general",
    "indexed":true,
    "stored":true}}

Getting information about defined dynamic fields

Just as we can get information about the fields defined in schema.xml, we can get information about dynamic fields. Again we have two options:

Get information about all dynamic fields
Get information about a specific dynamic field pattern

To get information about all dynamic fields we should use the following command:

$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields'

To get information about a specific pattern we append the / character followed by the pattern, for example like this:

$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields/random_*'

Solr will return the following response for the first query:

{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "dynamicfields":[{
      "name":"*_coordinate",
      "type":"tdouble",
      "indexed":true,
      "stored":false},
    {
      "name":"ignored_*",
      "type":"ignored",
      "multiValued":true},
    {
      "name":"random_*",
      "type":"random"},
    {
      "name":"*_p",
      "type":"location",
      "indexed":true,
      "stored":true},
    {
      "name":"*_c",
      "type":"currency",
      "indexed":true,
      "stored":true}]}

And the following response will be returned for the second command:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicfield":{
    "name":"random_*",
    "type":"random"}}

Getting field types

As you can probably guess, in a way similar to the examples described above we can also get information about the field types defined in our schema.xml file. We can fetch the following information:

All the field types defined in the schema.xml file
A single type

To get all the defined field types we should run the following command:

$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes'

To get information about a single type we should again add the / character and append the field type name, for example like this:

$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes/text_gl'

Solr will return the following information in response to the first command:

{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "fieldTypes":[{
      "name":"alphaOnlySort",
      "class":"solr.TextField",
      "sortMissingLast":true,
      "omitNorms":true,
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.KeywordTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.TrimFilterFactory"},
          {
            "class":"solr.PatternReplaceFilterFactory",
            "replace":"all",
            "replacement":"",
            "pattern":"([^a-z])"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"boolean",
      "class":"solr.BoolField",
      "sortMissingLast":true,
      "fields":["inStock"],
      "dynamicFields":["*_bs",
        "*_b"]},
    {
      "name":"text_gl",
      "class":"solr.TextField",
      "positionIncrementGap":"100",
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.StopFilterFactory",
            "words":"lang/stopwords_gl.txt",
            "ignoreCase":"true",
            "enablePositionIncrements":"true"},
          {
            "class":"solr.GalicianStemFilterFactory"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"tlong",
      "class":"solr.TrieLongField",
      "precisionStep":"8",
      "positionIncrementGap":"0",
      "fields":[],
      "dynamicFields":["*_tl"]}]}

In response to the second command Solr will return the following:

{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "fieldType":{
    "name":"text_gl",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "class":"solr.TokenizerChain",
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory"},
      "filters":[{
          "class":"solr.LowerCaseFilterFactory"},
        {
          "class":"solr.StopFilterFactory",
          "words":"lang/stopwords_gl.txt",
          "ignoreCase":"true",
          "enablePositionIncrements":"true"},
        {
          "class":"solr.GalicianStemFilterFactory"}]},
    "fields":[],
    "dynamicFields":[]}}

As you can see, the amount of information is nice: we get all the information about the field types and, in addition, which fields use a given type (both dynamic and non-dynamic).
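
The fields and dynamicFields lists make it easy to spot types that nothing uses. The Python sketch below works on a trimmed copy of the response shown above (analyzer details left out) rather than on a live server; the unused_types helper is a name made up for this illustration.

```python
import json

# Trimmed copy of the /schema/fieldtypes response shown above
# (analyzer sections omitted for brevity).
RESPONSE = """
{"fieldTypes": [
  {"name": "alphaOnlySort", "class": "solr.TextField",
   "fields": [], "dynamicFields": []},
  {"name": "boolean", "class": "solr.BoolField",
   "fields": ["inStock"], "dynamicFields": ["*_bs", "*_b"]},
  {"name": "text_gl", "class": "solr.TextField",
   "fields": [], "dynamicFields": []},
  {"name": "tlong", "class": "solr.TrieLongField",
   "fields": [], "dynamicFields": ["*_tl"]}]}
"""

def unused_types(field_types):
    """Names of types used by neither a static nor a dynamic field."""
    return [t["name"] for t in field_types
            if not t.get("fields") and not t.get("dynamicFields")]

types = json.loads(RESPONSE)["fieldTypes"]
print(unused_types(types))
```

For the sample above this reports alphaOnlySort and text_gl, the two types whose lists are both empty.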

Retrieving information about copyFields

In addition to what we’ve discussed so far, we are able to get information about the copyField section of schema.xml. To do that one should run the following command:

$ curl 'http://localhost:8983/solr/collection1/schema/copyfields'

And in response we will get the following data:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "copyfields":[{
      "source":"author",
      "dest":"text"},
    {
      "source":"cat",
      "dest":"text"},
    {
      "source":"content",
      "dest":"text"},
    {
      "source":"content_type",
      "dest":"text"},
    {
      "source":"description",
      "dest":"text"},
    {
      "source":"features",
      "dest":"text"},
    {
      "source":"author",
      "dest":"author_s",
      "destDynamicBase":"*_s"}]}

The future

In Solr 4.3 the described API was improved and is being prepared to enable not only reading the index structure, but also modifying it with HTTP requests. We can expect that feature in one of the upcoming versions of Apache Solr, at least for changes that do not conflict with already indexed data. In my opinion it’s worth waiting for, at least for those who need it.

“Car sale application” – Spatial Search, adding location data (part 3)

Rafał Kuć — Mon, 14 Mar 2011 08:19:01 +0000

The number of announcements in our database is so large that our website users have started to look for additional ways to filter and sort search results. We need to add functionality that lets us work with the location data related to the cars.

Requirements specification

We would like to add two new features:

Filtering the results to display only those announcements that are located no farther than x kilometres from a given place, where x = 50, 100, 200, 500 or 1000 km.
Sorting the results by the distance between a given place and the car’s location.

To meet these requirements we need to use the Solr functionality called “Spatial Search”, which is available in the Solr distribution from version 3.1. The changes concern the schema.xml file and the input data, where we have to add the location of every car. In the end we will create the proper requests.

Schema.xml changes

New field type definitions:
- the first definition is nothing more than another numerical type:
- the second definition uses the “solr.LatLonType” class, which allows us to index location data using the dynamic field with the “_coordinate” suffix:
New field definitions:
- a field that will be used to store the name of the city related to every car:
- the “loc” field, which will be used to index location data:
- the dynamic field used internally to store the information provided by the “loc” field:

Input data analysis

To show how to modify the input data, let’s take 5 announcements from the following cities:

Koszalin
- latitude: 54.12
- longitude: 16.11
Białystok
- latitude: 53.08
- longitude: 23.09
Szczecin
- latitude: 53.25
- longitude: 14.35
Gdańsk
- latitude: 54.21
- longitude: 18.40
Warszawa
- latitude: 52.15
- longitude: 21.00

We provide the location data by entering the latitude and longitude, separated by a comma, in the “loc” field. Our data might look like this:


   
      1
      Audi
      80
      2008
      9774
      2000
      92467
      green
      false
      Koszalin
      54.12,16.11
   
   
      2
      Audi
      A8
      2009
      9078
      1000
      31369
      black
      false
      Białystok
      53.08,23.09
   
   
      3
      Audi
      TT
      1997
      1109
      1299
      116987
      silver
      true
      Szczecin
      53.25,14.35
   
   
      4
      BMW
      Seria 7
      2007
      140000
      3000
      418000
      green
      false
      Gdańsk
      54.21,18.40
   
   
      5
      Chevrolet
      TrailBlazer
      2007
      140000
      3000
      418000
      green
      false
      Warszawa
      52.15,21.00

Let’s create queries

We have our location data in the index, so all we need now is to create queries that satisfy our needs. Let’s imagine that we are searching for announcements while in Białystok, which is located about 200 km from Warszawa, about 400 km from Gdańsk, about 550 km from Koszalin and about 650 km from Szczecin.

To fulfil the first point of the requirements specification, we add a special filter query to our request:

...&fq={!geofilt sfield=loc}&pt=53.08,23.09&d=50

where:

sfield – the name of the field in which our location data is indexed.
pt – the location of the starting point; Białystok in our case.
d – the distance, in kilometres, used to narrow the search results. By using the values 50, 100, 200, 500 and 1000 we can satisfy all our needs.
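
The queries in this post are shown unencoded for readability. When the request is sent from code, characters such as {, }, ! and the comma have to be URL-encoded. A minimal Python sketch of that step follows; the geofilt_params helper is a name invented for this example, and only the parameter names come from the description above.

```python
from urllib.parse import urlencode

def geofilt_params(sfield, lat, lon, distance_km, q="*:*"):
    """Request parameters for a geofilt filter query (hypothetical helper)."""
    return {
        "q": q,
        "fq": "{!geofilt sfield=%s}" % sfield,  # the spatial filter itself
        "pt": "%s,%s" % (lat, lon),             # starting point: lat,lon
        "d": str(distance_km),                  # radius in kilometres
    }

params = geofilt_params("loc", 53.08, 23.09, 200)
query_string = urlencode(params)  # percent-encodes the special characters
print(query_string)
```

The resulting string can be appended after the ? of a select request; decoding it gives back exactly the four parameters described above.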

Example:

Query:

q=*:*&fq={!geofilt sfield=loc}&pt=53.08,23.09&d=200

Search results:


   
      Białystok
      black
      false
      1000
      2
      Audi
      31369
      A8
      9078.0
      2009
   
   
      Warszawa
      green
      false
      3000
      5
      Chevrolet 
      418000
      TrailBlazer
      140000.0
      2007

That’s great, we don’t have any announcements from Koszalin, Gdańsk or Szczecin, as these cities are located farther than 200 km from Białystok.

To fulfil the second point of the requirements specification, we sort the search results using the geodist function. The query would look like this:

...&sfield=loc&pt=53.08,23.09&sort=geodist()+asc

An example of sorting the search results by distance from Białystok:

Query:

q=*:*&sfield=loc&pt=53.08,23.09&sort=geodist()+asc

Search results:


   
      Białystok
      black
      false
      1000
      2
      Audi
      31369
      A8
      9078.0
      2009
   
   
      Warszawa
      green
      false
      3000
      5
      Chevrolet 
      418000
      TrailBlazer
      140000.0
      2007
   
   
      Gdańsk
      green
      false
      3000
      4
      BMW
      418000
      Seria 7
      140000.0
      2007
   
   
      Koszalin
      green
      false
      2000
      1
      Audi
      92467
      80
      9774.0
      2008
   
   
      Szczecin
      silver
      true
      1299
      3
      Audi
      116987
      TT
      1109.0
      1997

That’s correct! Mission accomplished.
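
Both results can be cross-checked outside Solr. The Python sketch below computes straight-line (great-circle) distances with the haversine formula for the five sample cities, then filters and sorts them. It is an independent approximation, not Solr’s implementation: the Earth radius constant and the helper names are assumptions of this example, and the distances it yields are shorter than the rough figures quoted earlier in the post.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch

# Coordinates (latitude, longitude) from the input data above.
CITIES = {
    "Koszalin": (54.12, 16.11),
    "Białystok": (53.08, 23.09),
    "Szczecin": (53.25, 14.35),
    "Gdańsk": (54.21, 18.40),
    "Warszawa": (52.15, 21.00),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def within(origin, d_km):
    """Cities no farther than d_km from origin (mimics the geofilt filter)."""
    return [name for name, point in CITIES.items()
            if haversine_km(origin, point) <= d_km]

def by_distance(origin):
    """Cities ordered from nearest to farthest (mimics geodist() asc)."""
    return sorted(CITIES, key=lambda name: haversine_km(origin, CITIES[name]))

bialystok = CITIES["Białystok"]
print(within(bialystok, 200))
print(by_distance(bialystok))
```

With Białystok as the starting point, the filter keeps only Białystok and Warszawa, and the ordering matches the sorted results returned by Solr above.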

The end

Once more we have met our website users’ expectations. This time we have added functionality that allows our users to filter and sort the search results using location and distance data. Full success!

5 sins of schema.xml modifications

Rafał Kuć — Mon, 30 Aug 2010 12:08:35 +0000

I made a promise and here it is: an entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the schema.xml file for your system. Feel free to read on.

Each of us knows what the schema.xml file is and what it is for (if not, I invite you to read the entry located at: http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en). What are the most frequently committed errors when creating or updating this file? I have personally met with the following:

1. Trash in the configuration

In my opinion the first principle is to keep the schema.xml file in the simplest possible form. Linked to this is a very important issue: this file should not be synonymous with chaos. In other words, do not clutter it with unnecessary comments, unwanted types, fields and so on. Order in the structure of the schema.xml file not only helps us maintain and modify the file with ease, but also assures us that no unnecessary information will be stored in the Solr index.

2. Cosmetic changes to the default configuration

How many of those who use Solr in their daily work took the default schema.xml file supplied with the Solr example and modified its contents only slightly, for example by changing only the names of the fields? I should raise my hand too, because I did it once. This is a pretty big mistake. Someone may ask why. Are you sure you need English stemming when implementing search for content written in Polish? I think not. The same applies to field and type attributes such as term vectors.

3. No updates

Sometimes I come across search-based applications where an update of Solr does not mean an update of the schema.xml file. If it is a conscious decision, dictated for example by a costly or even impossible re-indexing of all data, I understand the situation. But there are cases where an upgrade would bring only benefits and where the cost of such an upgrade would be minimal (e.g. an inexpensive re-index or slight changes in the application). Do not be afraid to update the schema.xml file, whether it is to update fields, update types or add newer features. A good example is the migration from Solr 1.3 to version 1.4: the newer version introduced significant changes to numeric types, and migrating to the new types would greatly increase the performance of queries using them (such as queries using value ranges).

4. “I’ll use it one day”

Adding new types without removing the ones that are no longer needed; the same goes for fields and copyField definitions. Most of us think that the old definition may be useful in the future, but remember that each type means some extra memory needed by Solr and each field takes up space in the index. My small advice: if you stop using a type, a field, or anything else you have in your configuration files (not only in schema.xml), simply remove it. Applying this principle throughout the life cycle of the applications using Solr will ensure that the index is in optimal condition, and a few months after implementing another feature you will not be left puzzled, digging into the application code to determine whether a field is still used in some forgotten code fragment.

5. Attributes, attributes and again attributes

Storing original values, adding term vectors and their properties are just examples of things we don’t need in every implementation. Sometimes the index holds more than the application requires. A larger index means lower performance, at least in some cases (e.g. indexing). It is worth considering whether we really need all the information that we tell Solr to calculate and store. Removing the information that is unnecessary from our point of view may surprise us. Sometimes it is worth a try ;)

Feel free to comment; I will eagerly read what else we should pay attention to when modifying the schema.xml file.

Finally, I think it is worth mentioning the article “The Seven Deadly Sins of Solr”, published on the LucidImagination website at: http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr. It describes bad practices when working with Solr. In my opinion it is an interesting read and I highly recommend it.

What is schema.xml?

Rafał Kuć — Mon, 16 Aug 2010 12:05:34 +0000

One of the configuration files that describe each Solr implementation is the schema.xml file. It describes one of the most important things in the implementation: the structure of the index. The information contained in this file allows you to control how Solr behaves when indexing data and when handling queries. Schema.xml is not only the structure of the index itself; it also holds detailed information about data types, which have a large influence on Solr’s behavior and are usually neglected. This entry tries to bring some insight into schema.xml.

The schema.xml file consists of several parts:

version,
type definitions,
field definitions,
copyField section,
additional definitions.

Version

The first thing we come across in the schema.xml file is the version. It tells Solr how to treat some of the attributes in the schema.xml file. The definition is as follows:

Please note that this is not the version of the schema from the perspective of your project. At this point Solr supports four versions of the schema.xml file:

1.0 – multiValued attribute does not exist, all fields are multivalued by default.
1.1 – introduced multiValued attribute, the default attribute value is false.
1.2 – introduced omitTermFreqAndPositions attribute, the default value is true for all fields, besides text fields.
1.3 – removed the possibility of an optional compression of fields.

Type definitions

Type definitions can be logically divided into two separate sections: simple types and complex types. Simple types, as opposed to complex types, do not have filters and a tokenizer defined.

Simple types

The first thing we see in the schema.xml file after the version is the type definitions. Each type is described by a number of attributes defining its behavior. First, the attributes that describe each type and are mandatory:

name – name of the type (required attribute).
class – class that is responsible for the implementation. Please note that classes delivered with the standard Solr packages have names with the ‘solr’ prefix.

Besides the two mentioned above, types can have the following optional attributes:

sortMissingLast – attribute specifying how documents without a value in a field of this type should be treated when sorting. When set to true, such documents will always be at the end of the results list regardless of sort order. The default value is false. The attribute can be used only for types that Lucene treats as strings.
sortMissingFirst – attribute specifying how documents without a value in a field of this type should be treated when sorting. When set to true, such documents will always be at the beginning of the results list regardless of sort order. The default value is false. The attribute can be used only for types that Lucene treats as strings.
omitNorms – attribute specifying whether norms (field length normalization) should be omitted.
omitTermFreqAndPositions – attribute specifying whether term frequencies and term positions should be omitted.
indexed – attribute specifying whether fields based on this type will be indexed.
stored – attribute specifying whether fields based on this type will keep their original values.
positionIncrementGap – attribute specifying how many positions Lucene should skip between the values of a multivalued field.

It is worth remembering that with the default settings of the sortMissingLast and sortMissingFirst attributes (both false), Lucene places documents with empty field values at the beginning of the list for ascending sorts and at the end for descending sorts.
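
The three behaviors described above can be mimicked in a few lines of Python. This is only a model of the rules as stated in this post, not Solr code; the sort_docs function and the sample documents are invented for the illustration.

```python
def sort_docs(docs, field, descending=False,
              missing_last=False, missing_first=False):
    """Order docs by field, placing documents without a value per the rules."""
    present = sorted((d for d in docs if d.get(field) is not None),
                     key=lambda d: d[field], reverse=descending)
    missing = [d for d in docs if d.get(field) is None]
    if missing_last:       # sortMissingLast=true: always at the end
        return present + missing
    if missing_first:      # sortMissingFirst=true: always at the beginning
        return missing + present
    # Default: first for ascending sorts, last for descending sorts.
    return present + missing if descending else missing + present

# Hypothetical documents; document 2 has no price.
docs = [{"id": 1, "price": 30}, {"id": 2}, {"id": 3, "price": 10}]
print([d["id"] for d in sort_docs(docs, "price")])
print([d["id"] for d in sort_docs(docs, "price", missing_last=True)])
```

In the first call document 2 comes first (default ascending behavior); in the second it moves to the end.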

One more option is available for simple types, but only for those based on the Trie*Field classes:

precisionStep – attribute specifying the precision step in bits. The smaller the step, the more terms are indexed for each value, which makes queries based on numerical ranges faster but also increases the size of the index. Set the attribute to 0 to disable indexing at various precisions.
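
To get a feeling for how the step value translates into the number of indexed terms, here is a simplified Python model. It assumes that a value is indexed once per shift of precisionStep bits, each time with the lowest bits dropped; the real Lucene term encoding is more involved, and the function names are made up for this sketch.

```python
def trie_shifts(value_bits, precision_step):
    """Bit shifts at which one numeric value is indexed (simplified model)."""
    if precision_step <= 0 or precision_step >= value_bits:
        return [0]  # 0 disables the extra precisions: full-precision term only
    return list(range(0, value_bits, precision_step))

def trie_terms(value, value_bits, precision_step):
    """(shift, prefix) pairs: the value with its lowest `shift` bits dropped."""
    return [(shift, value >> shift)
            for shift in trie_shifts(value_bits, precision_step)]

# A 64-bit long: step 8 gives 8 terms per value, step 4 gives 16.
print(len(trie_shifts(64, 8)), len(trie_shifts(64, 4)))
print(trie_terms(0x1234, 32, 8))
```

The coarser terms are what lets a range query cover a large span of values with a handful of terms instead of enumerating every value.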

An example of a simple type definition:

Complex types

In addition to simple types, the schema.xml file may include types consisting of a tokenizer and filters. The tokenizer is responsible for dividing the contents of the field into tokens, while the filters are responsible for further token analysis. For example, a type responsible for dealing with texts in Polish could consist of a tokenizer that splits text on whitespace, commas and periods. Filters for that type could be responsible for lowercasing the generated tokens, further division of tokens (for example on dashes), and then reducing tokens to their base form.

Complex types, like simple types, have a name (name attribute) and a class which is responsible for the implementation (class attribute). They can also have the other attributes described for simple types (on the same basis). In addition, however, complex types can define the tokenizer and filters to be used at the indexing stage and at the query stage. As most of you know, for a given phase (indexing or query) there can be many filters defined but only one tokenizer. For example, this is what the text type definition looks like in the example provided with Solr:

It is worth noting that there is an additional attribute for the text field type:

autoGeneratePhraseQueries

This attribute tells the query parser how to behave when a single chunk of query text is divided into several tokens. Some filters (such as WordDelimiterFilter) can divide a token into a set of tokens. Setting the attribute to true (the default value) will automatically generate phrase queries in such cases. For example, WordDelimiterFilter will divide the word “wi-fi” into two tokens, “wi” and “fi”. With autoGeneratePhraseQueries set to true the query sent to Lucene will look like field:"wi fi", while with the attribute set to false it will look like field:wi OR field:fi. Please note, however, that this attribute only behaves well with tokenizers based on whitespace.
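
The difference between the two settings can be written down as a tiny Python function. It only builds the query text for illustration and assumes OR as the default operator; build_query is a made-up name, not part of Solr.

```python
def build_query(field, tokens, auto_generate_phrase_queries=True):
    """Query text for the tokens that analysis produced from one word."""
    if len(tokens) == 1:
        return "%s:%s" % (field, tokens[0])
    if auto_generate_phrase_queries:
        # The tokens must appear next to each other: a phrase query.
        return '%s:"%s"' % (field, " ".join(tokens))
    # Otherwise each token becomes an independent term query.
    return " OR ".join("%s:%s" % (field, token) for token in tokens)

tokens = ["wi", "fi"]  # "wi-fi" after WordDelimiterFilter
print(build_query("field", tokens))         # field:"wi fi"
print(build_query("field", tokens, False))  # field:wi OR field:fi
```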

Returning to the type definition: as you can see, the example has two main sections:

<analyzer type="index">

and

<analyzer type="query">

The first section defines the analysis used when indexing documents, the second defines the analysis used for queries against fields based on this type. Note that if you want to use the same definition for the indexing and query phases, you can drop the split into two sections. Then our definition will look like this:

As I mentioned, the definition of each complex type contains a tokenizer and, optionally, a series of filters. I will not describe each filter and tokenizer available in Solr. This information is available at the following address: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

Finally, one important thing. Starting from Solr 1.4 the tokenizer does not need to be the first mechanism that deals with the analysis of the field. Solr 1.4 introduced CharFilters, which operate on the field content before the tokenizer and pass the result to it. It is worth knowing because it might come in useful.

Multi-dimensional types

I left a little addition for the end: a novelty in Solr 1.4, multi-dimensional fields, that is, fields consisting of a number of other fields. Generally speaking, the assumption behind this type of field was simple: to store pairs, triples or larger groups of related values in Solr, such as geographical point coordinates. In practice this is realized by means of dynamic fields, but let me not get into the implementation details. A sample definition of a type that consists of two fields:

In addition to the standard name and class attributes there are two others:

dimension – the number of dimensions (used by the solr.PointType class).
subFieldSuffix – the suffix that will be added to the dynamic fields created by this type. It is important to remember that a field based on the presented type will create three fields in the index: the actual field (for example named mylocation) and two additional dynamic fields.

Field Definitions

Field definitions are another section of the schema.xml file, the one that in theory should interest us the most when designing a Solr index. As a rule, we find two kinds of field definitions here:

Static fields
Dynamic fields

Solr treats these fields differently. Static fields are available under exactly one name. Dynamic fields are available under many names: their name is a simple wildcard pattern (a name starting or ending with the ‘*’ sign). Please note that Solr first tries the static fields and only then the dynamic fields. In addition, if a field name matches more than one dynamic field definition, Solr will select the one with the longer name pattern.
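
The lookup order described above (static name first, then the longest matching dynamic pattern) can be sketched in Python as follows. The field names and types in the two dictionaries are hypothetical, and the code models only the rule as stated here, not Solr’s actual implementation.

```python
from fnmatch import fnmatchcase

# Hypothetical schema contents used only for this illustration.
STATIC_FIELDS = {"id": "string", "author": "text_general"}
DYNAMIC_FIELDS = {"*_s": "string", "*_coordinate": "tdouble",
                  "random_*": "random"}

def resolve_type(name):
    """Return the type for a field name, or None if nothing matches."""
    if name in STATIC_FIELDS:              # static fields win
        return STATIC_FIELDS[name]
    matches = [pattern for pattern in DYNAMIC_FIELDS
               if fnmatchcase(name, pattern)]
    if not matches:
        return None
    # Several dynamic patterns match: take the longest one.
    return DYNAMIC_FIELDS[max(matches, key=len)]

print(resolve_type("author"))    # static field
print(resolve_type("author_s"))  # matches *_s only
print(resolve_type("random_s"))  # matches *_s and random_*, longer wins
```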

Returning to the definition of the fields (both static and dynamic), they consist of the following attributes:

name – the name of the field (required attribute).
type – type of field, which is one of the pre-defined types (required attribute).
indexed – whether the field is to be indexed (set to true if you want to search or sort on this field).
stored – whether to store the original values (set to true if you want to retrieve the original value of the field).
omitNorms – whether norms should be omitted (set to true for fields that need neither length normalization nor index-time boosts; full-text fields usually need norms).
termVectors – set to true if you want to keep so-called term vectors. The default value is false. Some features require setting this parameter to true (e.g. MoreLikeThis or FastVectorHighlighting).
termPositions – set to true if you want to keep term positions with the term vector. Setting it to true will increase the size of the index.
termOffsets – set to true if you want to keep term offsets with the term vector. Setting it to true will increase the size of the index.
default – the default value given to the field when the document has no value in this field.

Examples of field definitions:

And finally, one more thing to remember. In addition to the attributes listed above, in a field definition we can override the attributes that have been defined for the type (e.g. whether a field is to be multiValued, as in the above example for a field called timestamp). This can be useful if you need a specific field whose behavior differs only slightly from its type (as in the example, only in the multiValued attribute). Of course, keep in mind the limitations imposed on the individual attributes associated with types.

CopyField section

In short, this section is responsible for copying the contents of fields to other fields. We define the field whose value should be copied and the destination field. Please note that copying takes place before the field value is analyzed. An example copyField definition:

For the sake of accuracy, the attributes mean:

source – the source field,
dest – the destination field.
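
The effect of such rules on a single document can be modelled in Python as below. The sketch copies the raw, not yet analyzed values from each source to its destination, as described above. The function name, the sample document and the rules are invented for this illustration.

```python
def apply_copy_fields(doc, rules):
    """Return a new document with raw source values appended to each dest."""
    # Normalize every value to a list so destinations can collect many values.
    result = {name: list(value) if isinstance(value, list) else [value]
              for name, value in doc.items()}
    for source, dest in rules:
        values = doc.get(source)  # always read the original, unanalyzed input
        if values is None:
            continue
        if not isinstance(values, list):
            values = [values]
        result.setdefault(dest, []).extend(values)
    return result

doc = {"author": "Rafał Kuć", "cat": ["book", "solr"]}
rules = [("author", "text"), ("cat", "text")]  # (source, dest) pairs
print(apply_copy_fields(doc, rules)["text"])
```

Because several sources feed the same destination here, the destination ends up with several values, so in a real schema such a field would need to be multiValued.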

Additional definitions

1. Unique key definition

The unique key definition makes it possible to unambiguously identify a document. Defining a unique key is not necessary, but it is recommended. Sample definition:

<uniqueKey>id</uniqueKey>

2. Default search field definition

This section defines the default search field, which Solr uses when the query does not specify any field. Sample definition:

<defaultSearchField>content</defaultSearchField>

3. Default logical operator definition

This section defines the default logical operator that will be used. A sample definition looks as follows:

Possible values are: OR and AND.

4. Defining similarity

Finally we define the similarity that we will use. It is rather a topic for another post, but you should know that, if necessary, you can change the default similarity (currently there are already two similarity classes in Solr trunk). A sample definition is as follows:

A few words at the end

The information presented above should give some insight into what the schema.xml file is and what the different sections of this file are responsible for. Soon I will try to write about what you should avoid when designing the index.