<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>analysis &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/analysis/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 22:44:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr 4.0 and Polish language analysis</title>
		<link>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/</link>
					<comments>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 02 Apr 2012 21:43:51 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.0]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[hunspell]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[morfologik]]></category>
		<category><![CDATA[polish]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=449</guid>

					<description><![CDATA[Because Polish language analysis functionality is present in Lucene (and Solr) for some time I decided to take a look and compare it on the basis of upcoming Lucene and Solr 4.0. Options At the time of writing, the following]]></description>
					<content:encoded><![CDATA[<p>Because Polish language analysis functionality has been present in Lucene (and Solr) for some time, I decided to take a look and compare the available options on the basis of the upcoming Lucene and Solr 4.0.</p>
<p><span id="more-449"></span></p>
<h3>Options</h3>
<p>At the time of writing, the following options were present when it comes to analyzing Polish:</p>
<ul>
<li>Use Stempel library (available since Solr 3.1)</li>
<li>Use Hunspell and Polish dictionaries (available since Solr 3.5)</li>
<li>Use Morfologik library (will be available in Solr 4.0, <a href="https://issues.apache.org/jira/browse/SOLR-3272" target="_blank" rel="noopener noreferrer">SOLR-3272</a>).</li>
</ul>
<h3>Configuration</h3>
<p>Let&#8217;s look at how to configure each of the above options in Solr (please remember that all the following configuration examples are based on Solr 4.0).</p>
<h4>Stempel</h4>
<p>In order to add Polish stemming using the Stempel library, we just need to add the following filter to our type definition:
</p>
<pre class="brush:xml">&lt;filter class="solr.StempelPolishStemFilterFactory" /&gt;</pre>
<p>In addition to that, you need to add the <em>lucene-analyzers-stempel-4.0.jar</em> and <em>apache-solr-analysis-extras-4.0.jar</em> libraries to <em>SOLR_HOME/lib</em>. It&#8217;s also a good idea to use <em>solr.LowerCaseFilterFactory</em> before the Stempel filter.</p>
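<p>Putting this together, a complete field type using Stempel could look like the following (a minimal sketch &#8211; the type name and the tokenizer choice are my own, not taken from the Solr example schema):
</p>
<pre class="brush:xml">&lt;fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100"&gt;
   &lt;analyzer&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.StempelPolishStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>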
<h4>Hunspell</h4>
<p>Similar to the configuration above, to use Hunspell you need to add a new filter to your type definition, for example in the following way:
</p>
<pre class="brush:xml">&lt;filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff" ignoreCase="true" /&gt;</pre>
<p>The <em>dictionary</em> and <em>affix</em> parameters define which dictionary we want to use. The <em>ignoreCase</em> parameter set to <em>true</em> tells Hunspell to ignore character case. You can find Hunspell dictionaries at the following URL: <a href="http://wiki.services.openoffice.org/wiki/Dictionaries" target="_blank" rel="noopener noreferrer">http://wiki.services.openoffice.org/wiki/Dictionaries</a>.</p>
<h4>Morfologik</h4>
<p>Similar to the two examples above, all you need to change in your <em>schema.xml</em> is to add a new filter, this time in the following way:
</p>
<pre class="brush:xml">&lt;filter class="solr.MorfologikFilterFactory" dictionary="MORFOLOGIK" /&gt;</pre>
<p>The <em>dictionary</em> parameter tells Solr which dictionary you would like to use. You can choose one of the following three:</p>
<ul>
<li>MORFOLOGIK</li>
<li>MORFEUSZ</li>
<li>COMBINED</li>
</ul>
<p>In addition to that, you need to add the following libraries to <em>SOLR_HOME/lib</em>: <em>lucene-analyzers-morfologik-4.0.jar</em>, <em>apache-solr-analysis-extras-4.0.jar</em>, <em>morfologik-fsa-1.5.2.jar</em>, <em>morfologik-polish-1.5.2.jar</em> and <em>morfologik-stemming-1.5.2.jar</em>.</p>
<h3>Results Comparison</h3>
<p>Of course I wasn&#8217;t able to judge the analysis results of the above three filters on the whole Polish language corpus, so I decided to choose four words to see how each of the filters behaves. Those words are: &#8220;<em>urodzić urodzony urodzona urodzeni</em>&#8221; (these words are variations of the word <em>born</em> in Polish). The results are as follows:</p>
<h4>Stempel</h4>
<p>The terms I got from Stempel were the following ones:
</p>
<pre>[urodzić] [urodzo] [urodzona] [urodzeni]</pre>
<p>Not all of them are words, but you have to remember that Stempel is a stemmer, and because of that it produces stems, which can be different from the actual words or their root forms. What matters is that the words we are interested in are processed to the same tokens, which is what allows Lucene/Solr to find them. With that in mind, I have to say that the results of analysis using Stempel are not as good as I would like them to be. For example, by searching for the word <em>urodzić</em> you won&#8217;t be able to find documents with words like <em>urodzona</em> or <em>urodzony</em>.</p>
<h4>Hunspell</h4>
<p>The result of Hunspell analysis were as follows:
</p>
<pre>[urodzić, urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony, urodzenie]</pre>
<p>Comparing the results I got when using Hunspell to those Stempel produced, we can see the difference. Our sample query for the word <em>urodzić</em> would find documents with words like <em>urodzony</em>, <em>urodzona</em> and <em>urodzeni</em>, which is quite nice. You can also notice that for three of the words we got more than one term at the same position. The results I got when using Hunspell are OK and I think they should satisfy most users (they do satisfy me), but let&#8217;s have a look at the newly introduced filter in Lucene and Solr &#8211; Morfologik.</p>
<h4>Morfologik</h4>
<p>The results of Morfologik analysis were as follows:
</p>
<pre>[urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony]</pre>
<p>Again, if you compare those to the ones I got when using Hunspell, you can hardly see the difference (at least in this particular case). The only difference between Hunspell and Morfologik is the last term, for which we got different results. In my opinion the results achieved with Morfologik are satisfying.</p>
<h3>Performance</h3>
<p>The performance test was done in a simple manner &#8211; for each filter I&#8217;ve indexed 5 million documents, where all the text fields were based on Polish language analysis with the appropriate filter (in addition to some standard filters like stopwords, synonyms and so on). Every time the indexing was done on a clean Solr 4.0 instance. Because I was using the Data Import Handler, I&#8217;ve sent a commit every 100k documents. The index contained several fields, but the actual index structure was not crucial for the test, as I indexed the same set of documents every time. Following are the test results:</p>
[table “21” not found /]<br />

<p><strong>Warning:</strong> At the time of writing, according to the <a href="https://issues.apache.org/jira/browse/SOLR-3245">SOLR-3245</a> JIRA issue, there is a problem with Hunspell performance with Polish dictionaries and Solr 4.0. I&#8217;m almost certain that this situation will be resolved by the time Solr 4.0 is released, but right now the performance of Hunspell with Polish dictionaries and Solr 4.0 may not be sufficient.</p>
<h3>Short Summary</h3>
<p>Despite not having performance results for Hunspell (because I don&#8217;t count the ones I have right now as correct), we can see that Hunspell and Morfologik are good candidates for Polish language analysis. Looking at Morfologik, we have performance similar to Stempel, but Morfologik&#8217;s results are better in my opinion, and that will make your users happier.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr filters: PatternReplaceCharFilter</title>
		<link>https://solr.pl/en/2011/05/09/solr-filters-patternreplacecharfilter/</link>
					<comments>https://solr.pl/en/2011/05/09/solr-filters-patternreplacecharfilter/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 09 May 2011 18:45:16 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[configuration]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[filtering]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=266</guid>

					<description><![CDATA[Continuing the overview of the filters included in Solr today we look at the PatternReplaceCharFilter. As you might guess the task of the filter is to change the matching input stream parts that match the given regular expression. You have]]></description>
										<content:encoded><![CDATA[<p>Continuing the overview of the filters included in Solr, today we look at the PatternReplaceCharFilter.</p>
<p>As you might guess, the task of this filter is to change the parts of the input stream that match the given regular expression.</p>
<p><span id="more-266"></span></p>
<p>You have the following parameters:</p>
<ul>
<li><em>pattern</em> (required) – the regular expression describing the fragments to be changed</li>
<li><em>replacement</em> (default: &#8220;&#8221;) &#8211; the value that will be used as a replacement for the fragment that matched the regular expression</li>
<li><em>blockDelimiters</em> &#8211; characters that mark the boundaries of independently processed sections of the input (see the &#8220;Advanced Parameters&#8221; section below)</li>
<li><em>maxBlockChars</em> (default: 10000, must be greater than 0) – the size of the buffer used for matching</li>
</ul>
<h2>Use examples</h2>
<p>The use of a filter is simple &#8211; we add its definition to the type definition in schema.xml file, for example:
</p>
<pre class="brush:xml">&lt;fieldType name="textCharNorm" class="solr.TextField"&gt;
  &lt;analyzer&gt;
    &lt;charFilter class="solr.PatternReplaceCharFilterFactory" …/&gt;
    &lt;charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/&gt;
    &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>Below are examples of definitions for different cases.</p>
<h3>Cut pieces of text</h3>
<p>You just need to specify, in the pattern attribute, what you want to cut. Example:
</p>
<pre class="brush:xml">&lt;charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#TAG" /&gt;</pre>
<p>which will remove all occurrences of &#8220;#TAG&#8221; from the input data.</p>
<h3>Text fragments replacement</h3>
<p>A similar case to the one above, but this time we want to replace the matched text with other text.
</p>
<pre class="brush:xml">&lt;charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#TAG" replacement="[CENZORED]"/&gt;</pre>
<h3>Changing patterns</h3>
<p>The two cases above were trivial. The strength of this filter is its handling of regular expressions. (You use regular expressions, right?) The following example is simple &#8211; it hides all numbers by turning them into stars. It also handles numbers separated by hyphens, treating them as a single number.
</p>
<pre class="brush:xml">&lt;charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\\d+-*\\d+)+" replacement="*"/&gt;</pre>
<h3>Text Manipulation</h3>
<p>The replacement doesn&#8217;t have to be plain text. This filter supports back references, which allow you to refer to parts of the matched pattern &#8211; for details, refer to the documentation of regular expressions. In the following example, every doubled character is replaced with a single one.
</p>
<pre class="brush:xml">&lt;charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.)\\1" replacement="$1"/&gt;</pre>
<h2>Advanced Parameters</h2>
<p>So far I have not mentioned the following parameters: <em>blockDelimiters</em> and <em>maxBlockChars</em>. If you look at the source code, you will see that those parameters are related to the way the filter is implemented. A <em>CharFilter</em> operates on a single character, and pattern matching requires an internal buffer that holds more characters. <em>maxBlockChars</em> allows you to specify the size of that buffer. You do not have to worry about it if the pattern you defined does not match pieces of text larger than 10k characters. <em>blockDelimiters</em> can further optimize filling of the buffer. It can be used if the information in the analyzed field is somehow divided into sections (e.g., it is a CSV, sentences, etc.). It is a text that informs the scanner that a new section starts, and therefore parts matched in the previous section are no longer useful.</p>
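<p>For example, if the analyzed field contains sentences, a configuration like the following could treat the period as a section delimiter and use a smaller buffer (a hypothetical sketch &#8211; the pattern and parameter values are mine, chosen only for illustration):
</p>
<pre class="brush:xml">&lt;charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\d+" replacement="*" maxBlockChars="5000" blockDelimiters="."/&gt;</pre>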
<h2>Limits</h2>
<p>An important limitation of this filter is that it directly manipulates the input data and does not keep information related to the original text. This means that if the filter removes a portion of the string or adds a new fragment, the tokenizer will not notice that, and the positions of tokens relative to the original text will not be saved properly. You should be aware of that when using queries that operate on the relative positions of tokens, or when you use highlighting.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/05/09/solr-filters-patternreplacecharfilter/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>What is schema.xml?</title>
		<link>https://solr.pl/en/2010/08/16/what-is-schema-xml/</link>
					<comments>https://solr.pl/en/2010/08/16/what-is-schema-xml/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 16 Aug 2010 12:05:34 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[field]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[token]]></category>
		<category><![CDATA[tokenizer]]></category>
		<category><![CDATA[type]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=64</guid>

					<description><![CDATA[One of the configuration files that describe each implementation Solr is schema.xml file. It describes one of the most important things of the implementation &#8211; the structure of the data index. The information contained in this file allow you to]]></description>
										<content:encoded><![CDATA[<p>One of the configuration files that describes every Solr deployment is the <em>schema.xml</em> file. It describes one of the most important parts of the deployment &#8211; the structure of the index. The information contained in this file allows you to control how Solr behaves when indexing data or when handling queries. <em>Schema.xml</em> is not only the structure of the index itself; it also contains detailed information about data types, which have a large influence on Solr&#8217;s behavior and are usually treated with neglect. This entry will try to bring some insight into <em>schema.xml</em>.</p>
<p><span id="more-64"></span></p>
<p><em>Schema.xml</em> file consists of several parts:</p>
<ul>
<li>version,</li>
<li>type definitions,</li>
<li>field definitions,</li>
<li>copyField section,</li>
<li>additional definitions.</li>
</ul>
<h3>Version</h3>
<p>The first thing we come across in the <em>schema.xml</em> file is the version. This tells Solr how to treat some of the attributes in the <em>schema.xml</em> file. The definition is as follows:
</p>
<pre class="brush:xml">&lt;schema name="example" version="1.3"&gt;</pre>
<p>Please note that this is not the definition of the version from the perspective of your project. At this point Solr supports four versions of a <em>schema.xml</em> file:</p>
<ul>
<li>1.0 &#8211; <em>multiValued </em>attribute does not exist, all fields are multivalued by default.</li>
<li>1.1 &#8211; introduced <em>multiValued </em>attribute, the default attribute value is <em>false</em>.</li>
<li>1.2 &#8211; introduced the <em>omitTermFreqAndPositions </em>attribute, the default value is <em>true</em> for all fields except text fields.</li>
<li>1.3 &#8211; removed the possibility of an optional compression of fields.</li>
</ul>
<h3>Type definitions</h3>
<p>Type definitions can be logically divided into two separate sections &#8211; simple types and complex types. Simple types, as opposed to complex types, do not have defined filters and a tokenizer.</p>
<p><strong>Simple types</strong></p>
<p>The first thing we see in the <em>schema.xml</em> file after the version are the type definitions. Each type is described by a number of attributes defining the behavior of that type. First, the attributes that describe each type and are mandatory:</p>
<ul>
<li><em>name </em>&#8211; name of the type (required attribute).</li>
<li><em>class </em>&#8211; class that is responsible for the implementation. Please note that classes delivered with the standard Solr packages have names with the &#8216;solr.&#8217; prefix.</li>
</ul>
<p>Besides the two mentioned above, types can have the following optional attributes:</p>
<ul>
<li> <em>sortMissingLast </em>&#8211; attribute specifying how values in a field based on this type should be treated when sorting. When set to <em>true</em>, documents without a value in a field of this type will always be at the end of the results list, regardless of sort order. The default attribute value is <em>false</em>. The attribute can be used only for types that are treated by Lucene as strings.</li>
<li><em>sortMissingFirst </em>&#8211; attribute specifying how values in a field based on this type should be treated when sorting. When set to <em>true</em>, documents without a value in a field of this type will always be at the first positions of the results list, regardless of sort order. The default attribute value is <em>false</em>. The attribute can be used only for types that are treated by Lucene as strings.</li>
<li><em>omitNorms </em>&#8211; attribute specifying whether field normalization should take place.</li>
<li><em>omitTermFreqAndPositions </em>&#8211; attribute specifying whether term frequency and term positions should be calculated.</li>
<li><em>indexed </em>&#8211; attribute specifying whether fields based on this type will be indexed (and thus searchable).</li>
<li><em>positionIncrementGap </em>&#8211; attribute specifying how many positions Lucene should leave between consecutive values of a multivalued field.</li>
</ul>
<p>It is worth remembering that with the default settings of the <em>sortMissingLast </em>and <em>sortMissingFirst</em> attributes, Lucene will place documents with blank field values at the beginning of the results list for ascending sorting, and at the end of the list for descending sorting.</p>
<p>One more option for simple types, but only those based on the <em>Trie*Field</em> classes:</p>
<ul>
<li><em>precisionStep</em> &#8211; attribute specifying the number of bits of precision per indexing step. The lower the value, the faster queries based on numerical ranges will be; this, however, also increases the size of the index, as more terms are indexed for each value. Set the attribute value to 0 to disable indexing at multiple precisions.</li>
</ul>
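<p>For example, a numeric type tuned for range queries could be defined as follows (a sketch &#8211; the type name and the precisionStep value are mine, chosen for illustration):
</p>
<pre class="brush:xml">&lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;</pre>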
<p>An example of a simple type defined:
</p>
<pre class="brush:xml">&lt;fieldType name="string" class="solr.StrField" sortMissingLast="<em>true</em>" omitNorms="<em>true</em>"/&gt;</pre>
<p><strong>Complex types</strong></p>
<p>In addition to simple types, the <em>schema.xml</em> file may include types consisting of a tokenizer and filters. The tokenizer is responsible for dividing the contents of the field into tokens, while the filters are responsible for further token analysis. For example, a type that is responsible for dealing with texts in Polish would consist of a tokenizer in charge of dividing words based on whitespace, commas and periods. Filters for that type could be responsible for bringing the generated tokens to lowercase, further dividing tokens (for example on the basis of dashes), and then bringing tokens to their basic form.</p>
<p>Complex types, like simple types, have their name (<em>name </em>attribute) and the class which is responsible for the implementation (<em>class </em>attribute). They can also be characterized by the other attributes described in the case of simple types (on the same basis). In addition, however, complex types can have a definition of the tokenizer and filters to be used at the indexing stage and at the query stage. As most of you know, for a given phase (indexing or query) there can be many filters defined, but only one tokenizer. For example, this is what the text type definition looks like in the example schema provided with Solr:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="<em>true</em>"&gt;
   &lt;analyzer type="index"&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.StopFilterFactory" ignoreCase="<em>true</em>" words="stopwords.txt" enablePositionIncrements="<em>true</em>" /&gt;
      &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
      &lt;filter class="solr.PorterStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
   &lt;analyzer type="query"&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="<em>true</em>" expand="<em>true</em>"/&gt;
      &lt;filter class="solr.StopFilterFactory" ignoreCase="<em>true</em>" words="stopwords.txt" enablePositionIncrements="<em>true</em>" /&gt;
      &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
      &lt;filter class="solr.PorterStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>It is worth noting that there is an additional attribute for the text field type:</p>
<ul>
<li> <em>autoGeneratePhraseQueries</em></li>
</ul>
<p>This attribute is responsible for telling Solr how to build queries when filters divide tokens. Some filters (such as <em>WordDelimiterFilter</em>) can divide a token into a set of tokens. Setting the attribute to <em>true</em> (the default value) will automatically generate phrase queries. This means that <em>WordDelimiterFilter </em>will divide the word &#8220;wi-fi&#8221; into the two tokens &#8220;wi&#8221; and &#8220;fi&#8221;. With autoGeneratePhraseQueries set to <em>true</em>, the query sent to Lucene will look like <code>"field:wi fi"</code>, while with it set to <em>false</em>, the Lucene query will look like <code>field:wi OR field:fi</code>. However, please note that this attribute only behaves well with whitespace-based tokenizers.</p>
<p>Returning to the type definition &#8211; as you can see, the example I gave has two main sections:
</p>
<pre class="brush:xml">&lt;analyzer type="index"&gt;</pre>
<p>and
</p>
<pre class="brush:xml">&lt;analyzer type="query"&gt;</pre>
<p>The first section is responsible for the type definition which will be used when indexing documents; the second section is responsible for the definition used for queries against fields based on this type. Note that if you want to use the same definition for both the indexing and query phases, you can use a single <em>analyzer</em> section without the <em>type</em> attribute. Then our definition will look like this:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="<em>true</em>"&gt;
   &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
   &lt;filter class="solr.StopFilterFactory" ignoreCase="<em>true</em>" words="stopwords.txt" enablePositionIncrements="<em>true</em>" /&gt;
   &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/&gt;
   &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
   &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
   &lt;filter class="solr.PorterStemFilterFactory"/&gt;
&lt;/fieldType&gt;</pre>
<p>As I mentioned, in the definition of each complex type there is a tokenizer and a series of filters (though not necessarily). I will not describe each filter and tokenizer available in Solr; this information is available at the following address: <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters" target="_blank" rel="noopener noreferrer">http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters</a>.</p>
<p>At the end, I want to add one important thing. Starting from Solr 1.4, the tokenizer does not need to be the first mechanism that takes part in the analysis of a field. Solr 1.4 introduced new filters &#8211; <em>CharFilters</em> &#8211; that operate on the field contents before the tokenizer and transmit the result to the tokenizer. It is worth knowing, because it might come in useful.</p>
<p><strong>Multi-dimensional types</strong></p>
<p>At the end I left myself a little addition &#8211; a novelty in Solr 1.4 &#8211; multi-dimensional fields: fields consisting of a number of other fields. Generally speaking, the assumption behind this type of field is simple &#8211; to store in Solr pairs of values, triples, or more related data, such as geographical point coordinates. In practice this is realized by means of dynamic fields, but let me not get into the implementation details. A sample type definition that will consist of two fields:
</p>
<pre class="brush:xml">&lt;fieldType name="location" class="solr.PointType" dimension="2" subFieldSuffix="_d"/&gt;</pre>
<p>In addition to standard attributes: name and class there are two others:</p>
<ul>
<li> dimension &#8211; the number of dimensions (used by the <em>solr.PointType</em> class).</li>
<li>subFieldSuffix &#8211; the suffix which will be added to the dynamic fields created by that type. It is important to remember that a field based on the presented type will create three fields in the index &#8211; the actual field (for example, named mylocation) and two additional dynamic fields.</li>
</ul>
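<p>To illustrate, a field based on this type could be declared as follows (a sketch &#8211; the field name and the dynamic field definition are mine, assuming the subFieldSuffix="_d" definition above):
</p>
<pre class="brush:xml">&lt;field name="mylocation" type="location" indexed="true" stored="true"/&gt;
&lt;dynamicField name="*_d" type="double" indexed="true" stored="false"/&gt;</pre>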
<h3><strong>Field Definitions</strong></h3>
<p>Field definitions are another section in the <em>schema.xml</em> file &#8211; the section which, in theory, should interest us the most during the design of a Solr index. As a rule, we find here two kinds of field definitions:</p>
<ol>
<li>Static Fields</li>
<li>Dynamic Fields</li>
</ol>
<p>These two kinds of fields are treated differently by Solr. The first kind are fields that are available under a single name. Dynamic fields are fields that are available under many names &#8211; their names are actually simple patterns (a name starting or ending with the &#8216;*&#8217; sign). Please note that Solr first tries to match a static field, then a dynamic field. In addition, if a field name matches more than one definition, Solr will select the field with the longer name pattern.</p>
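<p>For example, with the two hypothetical definitions below, a field named book_price_i matches both patterns, but Solr will pick the *_price_i definition, because its name pattern is longer:
</p>
<pre class="brush:xml">&lt;dynamicField name="*_i" type="int" indexed="true" stored="true"/&gt;
&lt;dynamicField name="*_price_i" type="int" indexed="true" stored="false"/&gt;</pre>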
<p>Returning to the definition of the fields (both static and dynamic), they consist of the following attributes:</p>
<ul>
<li><em>name </em>&#8211; the name of the field (required attribute).</li>
<li><em>type </em>&#8211; type of field, which is one of the pre-defined types (required attribute).</li>
<li><em>indexed </em>&#8211; if a field is to be indexed (set to <em>true</em>, if you want to search or sort on this field).</li>
<li><em>stored </em>&#8211; whether you want to store the original values (set to <em>true</em>, if we want to retrieve the original value of the field).</li>
<li><em>omitNorms </em>&#8211; whether norms should be ignored (set to <em>true</em> for fields that do not need length normalization &#8211; typically fields not used for full-text search).</li>
<li><em>termVectors </em>&#8211; set to <em>true</em> when we want to keep so-called term vectors. The default parameter value is <em>false</em>. Some features require setting this parameter to <em>true</em> (e.g. <em>MoreLikeThis </em>or <em>FastVectorHighlighting</em>).</li>
<li><em>termPositions </em>&#8211; set to <em>true</em> if you want to keep term positions with the term vector. Setting it to <em>true</em> will increase the index size.</li>
<li><em>termOffsets </em>&#8211; set to <em>true</em> if you want to keep term offsets together with the term vector. Setting it to <em>true</em> will increase the index size.</li>
<li><em>default </em>&#8211; the default value to be given to the field when the document does not provide any value for it.</li>
</ul>
<p>The following are examples of field definitions:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="<em>true</em>" stored="<em>true</em>" required="<em>true</em>" /&gt;
&lt;field name="includes" type="text" indexed="<em>true</em>" stored="<em>true</em>" termVectors="<em>true</em>" termPositions="<em>true</em>" termOffsets="<em>true</em>" /&gt;
&lt;field name="timestamp" type="date" indexed="<em>true</em>" stored="<em>true</em>" default="NOW" multiValued="<em>false</em>"/&gt;
&lt;dynamicField name="*_i" type="int" indexed="<em>true</em>" stored="<em>true</em>"/&gt;</pre>
<p>And finally, additional information to remember. In addition to the attributes listed above, in a field definition we can override attributes that have been defined for the type (e.g. whether a field is to be multiValued &#8211; see the above example for the field called timestamp). Sometimes this functionality can be useful if you need a specific field whose type differs slightly from the others (as in the example &#8211; only the multiValued attribute). Of course, keep in mind the limitations imposed on the individual attributes associated with types.</p>
<h3>CopyField section</h3>
<p>In short, this section is responsible for copying the contents of fields to other fields. We define the field whose value should be copied, and the destination field. Please note that copying takes place before the field value is analyzed. An example copyField definition:
</p>
<pre class="brush:xml">&lt;copyField source="category" dest="text"/&gt;</pre>
<p>For the sake of completeness, the attributes mean:</p>
<ul>
<li>source &#8211; the source field,</li>
<li>dest &#8211; the destination field.</li>
</ul>
<h3>Additional definitions</h3>
<p><strong>1. Unique key definition</strong></p>
<p>The definition of a unique key makes it possible to unambiguously identify a document. Defining a unique key is not necessary, but it is recommended. Sample definition:
</p>
<pre class="brush:xml">&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;</pre>
<p><strong>2. Default search field definition</strong></p>
<p>This section is responsible for defining the default search field, which Solr uses when you have not specified any field in the query. Sample definition:
</p>
<pre class="brush:xml">&lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;</pre>
<p><strong>3. Default logical operator definition</strong></p>
<p>This section is responsible for the definition of the default logical operator that will be used in queries. A sample definition looks as follows:
</p>
<pre class="brush:xml">&lt;solrQueryParser defaultOperator="OR" /&gt;</pre>
<p>Possible values are: <em>OR </em>and <em>AND</em>.</p>
<p><strong>4. Defining similarity</strong></p>
<p>Finally, we define the similarity that we will use. It is rather a topic for another post, but you should know that, if necessary, you can change the default similarity (currently in Solr trunk there are already two similarity classes). A sample definition is as follows:
</p>
<pre class="brush:xml">&lt;similarity class="pl.solr.similarity.CustomSimilarity" /&gt;</pre>
<h3>A few words at the end</h3>
<p>The information presented above should give some insight into what the <em>schema.xml</em> file is and what the different sections of this file correspond to. Soon I will try to write about what you should avoid when designing the index.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/16/what-is-schema-xml/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
