<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>schema.xml &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/schema-xml-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Thu, 12 Nov 2020 12:59:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr 4.2: Index structure reading API</title>
		<link>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/</link>
					<comments>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 20 May 2013 11:58:51 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.2]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema api]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=554</guid>

					<description><![CDATA[With the release of Solr 4.2 we&#8217;ve got the possibility to use the HTTP protocol to get information about Solr index structure. Of course, if one wanted to do that prior to Solr 4.2 it could be achieved by fetching]]></description>
										<content:encoded><![CDATA[<p>With the release of Solr 4.2 we gained the ability to use the HTTP protocol to get information about the Solr index structure. Of course, this was possible before Solr 4.2 by fetching the <em>schema.xml</em> file, parsing it, and extracting the needed information. Solr 4.2, however, introduced a dedicated API that returns the information we need without parsing the whole <em>schema.xml</em> file.</p>
<p><span id="more-554"></span></p>
<h3>Possibilities</h3>
<p>Let&#8217;s look at the new API by example.</p>
<h4>Getting information in XML format</h4>
<p>Many Solr users are used to getting their data in XML format, at least when using the Solr HTTP API. The schema API, however, uses JSON as its default format. To get XML in any of the examples below, append the <em>wt=xml</em> parameter to the call, for example:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/fieldtypes?wt=xml'</pre>
<h4>Defined fields information</h4>
<p>Let&#8217;s start by looking at how to fetch information about the fields that are defined in Solr. In order to do that we have the following possibilities:</p>
<ol>
<li>Get information about all the fields defined in the index</li>
<li>Get information about a single, explicitly named field</li>
</ol>
<p>In the first case we should use the following command:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/fields'</pre>
<p>In the second case we add the <em>/</em> character and the field name to the above command. For example, to get information about the <em>author</em> field we should use the following command:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/fields/author'</pre>
<p>The Solr response to the first command will be similar to the following:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "fields":[{
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"author",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"cat",
      "type":"string",
      "multiValued":true,
      "indexed":true,
      "stored":true},
    {
      "name":"category",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"url",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"weight",
      "type":"float",
      "indexed":true,
      "stored":true}]}</pre>
<p>On the other hand the response for the second command would be as follows:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"author",
    "type":"text_general",
    "indexed":true,
    "stored":true}}</pre>
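<p>Once the JSON is fetched, it is easy to work with programmatically. The following is a minimal Python sketch, not part of Solr itself: it parses a trimmed copy of the sample response above and looks up the unique key and the multi-valued fields. The helper names <em>fields_by_name</em> and <em>unique_key</em> are made up for this illustration.</p>

```python
import json

# Trimmed copy of the sample /schema/fields response shown above.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "fields": [
    {"name": "_version_", "type": "long", "indexed": true, "stored": true},
    {"name": "author", "type": "text_general", "indexed": true, "stored": true},
    {"name": "cat", "type": "string", "multiValued": true, "indexed": true, "stored": true},
    {"name": "id", "type": "string", "multiValued": false, "indexed": true,
     "required": true, "stored": true, "uniqueKey": true}
  ]
}
"""


def fields_by_name(response_text):
    """Index the "fields" array of a schema API response by field name."""
    payload = json.loads(response_text)
    if payload["responseHeader"]["status"] != 0:
        raise ValueError("Solr reported a non-zero status")
    return {field["name"]: field for field in payload["fields"]}


def unique_key(fields):
    """Return the name of the field flagged as uniqueKey, or None."""
    for name, field in fields.items():
        if field.get("uniqueKey"):
            return name
    return None


fields = fields_by_name(SAMPLE_RESPONSE)
multi_valued = sorted(name for name, field in fields.items() if field.get("multiValued"))
print(unique_key(fields), multi_valued)
```

<p>In a real client the response text would come from the HTTP call shown earlier instead of the embedded sample.</p>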
<h4>Getting information about defined dynamic fields</h4>
<p>Just as we can get information about the fields defined in <em>schema.xml</em>, we can get information about dynamic fields. Again we have two options:</p>
<ol>
<li>Get information about all dynamic fields</li>
<li>Get information about a specific dynamic field pattern</li>
</ol>
<p>To get information about all dynamic fields we should use the following command:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/dynamicfields'</pre>
<p>To get information about a specific pattern we append the <em>/</em> character followed by the pattern, for example:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/dynamicfields/random_*'</pre>
<p>Solr will return the following response for the first query:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "dynamicfields":[{
      "name":"*_coordinate",
      "type":"tdouble",
      "indexed":true,
      "stored":false},
    {
      "name":"ignored_*",
      "type":"ignored",
      "multiValued":true},
    {
      "name":"random_*",
      "type":"random"},
    {
      "name":"*_p",
      "type":"location",
      "indexed":true,
      "stored":true},
    {
      "name":"*_c",
      "type":"currency",
      "indexed":true,
      "stored":true}]}</pre>
<p>And the following response will be returned for the second command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicfield":{
    "name":"random_*",
    "type":"random"}}</pre>
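<p>A client that has the list of dynamic field patterns can use it to predict which type a new field name would get. The Python sketch below is a simplified illustration using the patterns from the sample response. The rule of preferring longer patterns is an assumption made for this example, and Solr's own matching logic may differ in details. The function name <em>resolve_type</em> is hypothetical.</p>

```python
from fnmatch import fnmatchcase

# Patterns and types copied from the sample /schema/dynamicfields response.
DYNAMIC_FIELDS = [
    ("*_coordinate", "tdouble"),
    ("ignored_*", "ignored"),
    ("random_*", "random"),
    ("*_p", "location"),
    ("*_c", "currency"),
]


def resolve_type(field_name, dynamic_fields=DYNAMIC_FIELDS):
    """Return the type of the first matching pattern, or None if nothing matches."""
    # Assumption for this sketch: longer patterns are treated as more specific.
    ordered = sorted(dynamic_fields, key=lambda item: len(item[0]), reverse=True)
    for pattern, field_type in ordered:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


for name in ("random_seed", "home_p", "store_0_coordinate", "title"):
    print(name, resolve_type(name))
```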
<h4>Getting field types</h4>
<p>As you have probably guessed, in a way similar to the examples described above, we can also get information about the field types defined in our <em>schema.xml</em> file. We can fetch the following information:</p>
<ol>
<li>All the field types defined in the <em>schema.xml</em> file</li>
<li>A single type</li>
</ol>
<p>To get all the defined field types we should run the following command:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/fieldtypes'</pre>
<p>To get information about a single type we should again add the <em>/</em> character and append the field type name to it, for example:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/fieldtypes/text_gl'</pre>
<p>Solr will return the following information in response to the first command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "fieldTypes":[{
      "name":"alphaOnlySort",
      "class":"solr.TextField",
      "sortMissingLast":true,
      "omitNorms":true,
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.KeywordTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.TrimFilterFactory"},
          {
            "class":"solr.PatternReplaceFilterFactory",
            "replace":"all",
            "replacement":"",
            "pattern":"([^a-z])"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"boolean",
      "class":"solr.BoolField",
      "sortMissingLast":true,
      "fields":["inStock"],
      "dynamicFields":["*_bs",
        "*_b"]},
    {
      "name":"text_gl",
      "class":"solr.TextField",
      "positionIncrementGap":"100",
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.StopFilterFactory",
            "words":"lang/stopwords_gl.txt",
            "ignoreCase":"true",
            "enablePositionIncrements":"true"},
          {
            "class":"solr.GalicianStemFilterFactory"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"tlong",
      "class":"solr.TrieLongField",
      "precisionStep":"8",
      "positionIncrementGap":"0",
      "fields":[],
      "dynamicFields":["*_tl"]}]}</pre>
<p>In response to the second command Solr will return the following:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "fieldType":{
    "name":"text_gl",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "class":"solr.TokenizerChain",
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory"},
      "filters":[{
          "class":"solr.LowerCaseFilterFactory"},
        {
          "class":"solr.StopFilterFactory",
          "words":"lang/stopwords_gl.txt",
          "ignoreCase":"true",
          "enablePositionIncrements":"true"},
        {
          "class":"solr.GalicianStemFilterFactory"}]},
    "fields":[],
    "dynamicFields":[]}}</pre>
<p>As you can see, the response is quite detailed: we get all the information about each field type and, in addition, the list of fields that use the given type (both dynamic and non-dynamic).</p>
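<p>Because the analyzer definition comes back as nested JSON, a short script can flatten it into a readable analysis chain. The sketch below is an illustration only: it uses a trimmed copy of the <em>text_gl</em> response from above, and the helper name <em>analysis_chain</em> is made up for this example.</p>

```python
import json

# Trimmed copy of the sample /schema/fieldtypes/text_gl response shown above.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 2},
  "fieldType": {
    "name": "text_gl",
    "class": "solr.TextField",
    "analyzer": {
      "class": "solr.TokenizerChain",
      "tokenizer": {"class": "solr.StandardTokenizerFactory"},
      "filters": [
        {"class": "solr.LowerCaseFilterFactory"},
        {"class": "solr.StopFilterFactory", "words": "lang/stopwords_gl.txt"},
        {"class": "solr.GalicianStemFilterFactory"}
      ]
    },
    "fields": [],
    "dynamicFields": []
  }
}
"""


def analysis_chain(field_type):
    """List the tokenizer class followed by the filter classes, in order."""
    analyzer = field_type.get("analyzer")
    if analyzer is None:
        # Types such as solr.BoolField carry no analyzer section.
        return []
    chain = [analyzer["tokenizer"]["class"]]
    chain.extend(item["class"] for item in analyzer.get("filters", []))
    return chain


field_type = json.loads(SAMPLE_RESPONSE)["fieldType"]
chain = analysis_chain(field_type)
print(" -> ".join(chain))
```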
<h4>Retrieving information about copyFields</h4>
<p>In addition to what we&#8217;ve discussed so far, we can get information about the copyField section of <em>schema.xml</em>. To do that, run the following command:
</p>
<pre class="brush:bash">curl 'http://localhost:8983/solr/collection1/schema/copyfields'</pre>
<p>And in response we will get the following data:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "copyfields":[{
      "source":"author",
      "dest":"text"},
    {
      "source":"cat",
      "dest":"text"},
    {
      "source":"content",
      "dest":"text"},
    {
      "source":"content_type",
      "dest":"text"},
    {
      "source":"description",
      "dest":"text"},
    {
      "source":"features",
      "dest":"text"},
    {
      "source":"author",
      "dest":"author_s",
      "destDynamicBase":"*_s"}]}</pre>
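<p>The flat list of copy rules is often easier to read when grouped by destination. Here is a small Python sketch over a subset of the sample data above. The function name <em>sources_by_destination</em> is hypothetical.</p>

```python
from collections import defaultdict

# Subset of the "copyfields" array from the sample response shown above.
COPY_FIELDS = [
    {"source": "author", "dest": "text"},
    {"source": "cat", "dest": "text"},
    {"source": "content", "dest": "text"},
    {"source": "author", "dest": "author_s", "destDynamicBase": "*_s"},
]


def sources_by_destination(copy_fields):
    """Group copyField sources under the destination field they feed."""
    grouped = defaultdict(list)
    for rule in copy_fields:
        grouped[rule["dest"]].append(rule["source"])
    return dict(grouped)


grouped = sources_by_destination(COPY_FIELDS)
for dest, sources in grouped.items():
    print(dest, sources)
```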
<h3>The future</h3>
<p>In Solr 4.3 the described API was improved and is being prepared to support not only reading the index structure, but also modifying it through HTTP requests. We can expect that feature in one of the upcoming versions of Apache Solr, at least for changes that do not conflict with already indexed data, so in my opinion it is worth waiting for if you need it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>”Car sale” application – WordDelimiterFilter and PatternReplaceFilter, helping to improve search results (part 2)</title>
		<link>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/</link>
					<comments>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 14 Feb 2011 08:09:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[schema.xml]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=196</guid>

					<description><![CDATA[In the first part of our ”Car sale” application related posts we created some standard index structure by properly configuring schema.xml configuration file. It didn&#8217;t take long to hear the first complains from the website users with this kind of]]></description>
										<content:encoded><![CDATA[<p>In the <a href="http://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/" target="_blank" rel="noopener noreferrer">first part of our &#8220;Car sale&#8221; application</a> series we created a standard index structure by configuring the schema.xml file. It didn&#8217;t take long to hear the first complaints from the website users about this configuration. Why don&#8217;t I receive any search results when entering the &#8220;audi a&#8221; phrase? I would like to see announcements with &#8220;Audi A6&#8221; and &#8220;Audi A8&#8221;, for example. I entered the phrase &#8220;Honda crv&#8221; – 0 results, &#8220;Suzuki maruti&#8221; – none. Are there no related offers in the announcement database? There are! But the current configuration of the searchable field type (field &#8220;content&#8221; – type &#8220;text&#8221;) does not allow us to find those offers using the queries we&#8217;ve entered. That&#8217;s why the WordDelimiterFilter and PatternReplaceFilter need to enter the battlefield.</p>
<p><span id="more-196"></span></p>
<h2>Requirements specification</h2>
<p>We need to analyze the data that is indexed in the &#8220;content&#8221; field. Let&#8217;s examine the sample data that will help us create the new &#8220;text&#8221; type configuration:</p>
<ul>
<li><em>Make</em>: Audi<br />
<em>Model</em>: 80, 90, A6, A8, TT</li>
</ul>
<ul>
<li><em>Make</em>: BMW<br />
<em>Model</em>: M3, M5, Series 7, Series 8, X1, X3</li>
</ul>
<ul>
<li><em>Make</em>: Chevrolet<br />
<em>Model</em>: TrailBlazer</li>
</ul>
<ul>
<li><em>Make</em>: Citroen<br />
<em>Model</em>: C-Crosser, C3 Pluriel, C4 Picasso</li>
</ul>
<ul>
<li><em>Make</em>: Ford<br />
<em>Model</em>: C-MAX, S-MAX</li>
</ul>
<ul>
<li><em>Make</em>: Honda<br />
<em>Model</em>: Accord, CR-V, FR-V, HR-V</li>
</ul>
<ul>
<li><em>Make</em>: Kia<br />
<em>Model</em>: Cee&#8217;d</li>
</ul>
<ul>
<li><em>Make</em>: Suzuki<br />
<em>Model</em>: Alto/Maruti</li>
</ul>
<p>Make names are simple words that are easily handled by the current configuration (WhitespaceTokenizer + LowerCaseFilter). The problem is with the model names, as they contain additional characters and separators that we often omit when entering the search phrase. Let&#8217;s put the sample data into groups that will help us with the upcoming configuration:</p>
<ol>
<li>Model names that do not need to be processed by any additional filters (the current &#8220;text&#8221; type configuration is sufficient) &#8211; 80, 90, TT, Series 7, Series 8, Accord</li>
<li>Model names that contain letters and numbers, where we want to split on letter-number transitions &#8211; A6, A8, M3, M5, X1, X3, C3 Pluriel, C4 Picasso. We would like to be able to find those models when entering only the letter, only the number, or the whole model name.</li>
<li>Model names that contain case transitions – TrailBlazer. We would like to find the document with this name when entering &#8220;trail&#8221;, &#8220;blazer&#8221;, &#8220;trailBlazer&#8221; or &#8220;trailblazer&#8221;.</li>
<li>Model names that contain intra-word delimiters, which we want to either ignore or split on &#8211; C-Crosser, C-MAX, S-MAX, CR-V, FR-V, HR-V, Alto/Maruti.<br />
Example: we would like to find the document with the model name &#8220;C-MAX&#8221; when entering the phrases &#8220;c&#8221;, &#8220;max&#8221;, &#8220;c-max&#8221; or &#8220;cmax&#8221;.</li>
<li>We intentionally omitted the &#8220;Cee&#8217;d&#8221; model name from the 4th point, as we would like to treat it a little differently. We don&#8217;t want to find this model when entering the &#8220;cee&#8221; or &#8220;d&#8221; phrases. We treat the name only as a whole word &#8211; &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;.</li>
</ol>
<h2>WordDelimiterFilter configuration</h2>
<p>With the requirements described above, we are going to set the WordDelimiterFilter attributes to satisfy our needs, point by point:</p>
<ol>
<li>The WordDelimiterFilter is not needed in this case, as the current &#8220;text&#8221; type configuration (WhitespaceTokenizer + LowerCaseFilter) is sufficient.</li>
<li>To meet the requirements of the 2nd point we need to set the following attributes:
<ul>
<li><em>generateWordParts=&quot;1&quot;</em> &#8211; must be set to &#8220;1&#8221; if we want to generate word parts</li>
<li><em>generateNumberParts=&quot;1&quot;</em> &#8211; must be set to &#8220;1&#8221; if we want to generate number parts</li>
<li><em>splitOnNumerics=&quot;1&quot;</em> &#8211; must be set to &#8220;1&#8221; if we want to generate new parts on letter =&gt; number transitions</li>
</ul>
</li>
<li>To meet the requirements of the 3rd point we need to set the following attributes:
<ul>
<li><em>generateWordParts=&quot;1&quot;</em></li>
<li><em>splitOnCaseChange=&quot;1&quot;</em> &#8211; must be set to &#8220;1&#8221; if we want to split on lowercase =&gt; uppercase transitions</li>
</ul>
</li>
<li>To meet the requirements of the 4th point we need to set the following attributes:
<ul>
<li><em>generateWordParts=&quot;1&quot;</em></li>
<li><em>catenateWords=&quot;1&quot;</em> &#8211; must be set to &#8220;1&#8221; if we want to ignore the intra-word delimiters by joining the subwords</li>
</ul>
</li>
</ol>
<p>So let&#8217;s take a look at our WordDelimiterFilter configuration:
</p>
<pre class="brush:xml">&lt;filter class="solr.WordDelimiterFilterFactory"
 splitOnCaseChange="1"
 splitOnNumerics="1"
 generateWordParts="1"
 generateNumberParts="1"
 catenateWords="1"
/&gt;</pre>
<p>Additionally, note that the default value of both the &#8220;splitOnCaseChange&#8221; and &#8220;splitOnNumerics&#8221; attributes is &#8220;1&#8221;, so we can leave them out. We keep the other attributes we rely on explicit and switch off &#8220;stemEnglishPossessive&#8221;, which is enabled by default. So our configuration can be reduced to:
</p>
<pre class="brush:xml">&lt;filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1"
 generateNumberParts="1"
 catenateWords="1"
 stemEnglishPossessive="0"
/&gt;</pre>
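<p>To get a feel for what these attributes do, here is a toy Python model of the splitting and catenation rules. It is a rough approximation written for this post, not Lucene's actual algorithm: it handles ASCII letters and digits only, ignores token positions, and lowercases at the end to mimic the LowerCaseFilter. The function names <em>split_token</em> and <em>analyze</em> are made up.</p>

```python
import re


def split_token(token, split_on_case_change=True, split_on_numerics=True):
    """Split one token into sub-words (a toy model, not Lucene's algorithm)."""
    parts = []
    # Any run of non-alphanumeric characters acts as an intra-word delimiter.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        if not chunk:
            continue
        # splitOnNumerics: break at letter/digit transitions (A6 gives A and 6).
        pieces = re.findall(r"[A-Za-z]+|[0-9]+", chunk) if split_on_numerics else [chunk]
        for piece in pieces:
            if split_on_case_change and piece.isalpha():
                # splitOnCaseChange: break where lowercase is followed by uppercase.
                parts.extend(re.findall(r"[A-Z]*[a-z]+|[A-Z]+", piece))
            else:
                parts.append(piece)
    return parts


def analyze(token, catenate_words=True):
    """Return the lowercased sub-words plus, optionally, their catenation."""
    parts = split_token(token)
    tokens = list(parts)
    # catenateWords: join adjacent all-letter sub-words (C and MAX give CMAX).
    if catenate_words and len(parts) > 1 and all(part.isalpha() for part in parts):
        tokens.append("".join(parts))
    return [item.lower() for item in tokens]


for model in ("80", "A6", "TrailBlazer", "C-MAX", "Alto/Maruti", "Cee'd"):
    print(model, analyze(model))
```

<p>Note that in this toy model &#8220;Cee'd&#8221; is split into &#8220;cee&#8221; and &#8220;d&#8221;, which is exactly the behaviour we want to avoid for the 5th point.</p>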
<p>What about the 5th point of our data specification? As we have stated, we don&#8217;t want to treat the apostrophe as an intra-word delimiter. So maybe we could use the protected=&quot;protwords.txt&quot; option of the WordDelimiterFilter, which would keep the word &#8220;Cee&#8217;d&#8221; unchanged? OK, but we would also like to be able to find this model when entering the &#8220;ceed&#8221; phrase, so this option is not good for us. The best solution is to handle this case in a separate filter and leave the WordDelimiterFilter with nothing to do.</p>
<h2>PatternReplaceFilter configuration</h2>
<p>We are going to put the PatternReplaceFilter before the WordDelimiterFilter. Using the PatternReplaceFilter we can ignore the apostrophe by replacing it with an empty string. With the filter configured this way, the WordDelimiterFilter will receive the &#8220;Ceed&#8221; token and will not modify it. The configuration of the filters is the same for indexing and searching, so a user will be able to find the offer with the &#8220;Cee&#8217;d&#8221; model when entering either &#8220;cee&#8217;d&#8221; or &#8220;ceed&#8221;:
</p>
<pre>&lt;filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" /&gt;</pre>
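<p>The effect of this filter is easy to mimic. The Python sketch below is an illustration only, with a made-up helper name: it applies the same replacement to both the indexed value and the possible queries, followed by lowercasing, and shows that they all end up as the same token.</p>

```python
import re


def strip_apostrophes(token):
    """Mimic pattern="'" replacement="" replace="all" on a single token."""
    return re.sub("'", "", token)


indexed = strip_apostrophes("Cee'd").lower()
queries = [strip_apostrophes(query).lower() for query in ("cee'd", "ceed", "CEE'D")]
print(indexed, queries)
```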
<h2>New &#8220;text&#8221; type configuration visualization</h2>
<p>Let&#8217;s take a look at our new &#8220;text&#8221; type:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
 &lt;analyzer&gt;
  &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
  &lt;filter class="solr.PatternReplaceFilterFactory" pattern="'" replacement="" replace="all" /&gt;
  &lt;filter class="solr.WordDelimiterFilterFactory"
   generateWordParts="1"
   generateNumberParts="1"
   catenateWords="1"
   stemEnglishPossessive="0"
  /&gt;
  &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
 &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>We are going to use Solr&#8217;s administration panel to check that the configuration we&#8217;ve created is correct:</p>
<p><a href="http://solr.pl/wp-content/uploads/2011/02/11.jpg"><img fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-840" src="http://solr.pl/wp-content/uploads/2011/02/11.jpg" alt="" width="725" height="70"></a></p>
<p><a href="http://solr.pl/wp-content/uploads/2011/02/2.jpg"><img decoding="async" class="aligncenter size-full wp-image-841" src="http://solr.pl/wp-content/uploads/2011/02/2.jpg" alt="" width="694" height="740"></a></p>
<ol>
<li>(Model: &#8220;80&#8221;) As expected, our new filters don&#8217;t influence the data typical for the 1st point.</li>
<li>(Model: &#8220;A8&#8221;) The WordDelimiterFilter split the token on the letter-number transition.</li>
<li>(Model: &#8220;TrailBlazer&#8221;) The WordDelimiterFilter split the token on the case change, generating the &#8220;trail&#8221; and &#8220;blazer&#8221; tokens. Additionally, we can enter the &#8220;trailblazer&#8221; phrase. Superb!</li>
<li>(Model: &#8220;CR-V&#8221;) The WordDelimiterFilter ignored the intra-word delimiter by generating the subwords (&#8220;cr&#8221; and &#8220;v&#8221;) and additionally joining them (&#8220;crv&#8221;).</li>
<li>(Model: &#8220;Cee&#8217;d&#8221;) The PatternReplaceFilter replaced the &#8220;Cee&#8217;d&#8221; word with &#8220;Ceed&#8221; and the WordDelimiterFilter only passed the value through. That&#8217;s what we needed.</li>
</ol>
<h2>The end</h2>
<p>In this post we&#8217;ve shown how to configure two new filters, WordDelimiterFilter and PatternReplaceFilter, in order to improve the quality of search results. Our website users are satisfied &#8230; for now.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/02/14/car-sale-application-worddelimiterfilter-and-patternreplacefilter-helping-to-improve-search-results-part-2/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;Car sale application&#8221; &#8211; schema.xml designing to gain what we really need (part 1)</title>
		<link>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/</link>
					<comments>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 31 Jan 2011 08:07:42 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=192</guid>

					<description><![CDATA[One of the fundamental solr&#8217;s configuration file is the schema.xml file. It is a kind of connector between what we need and what solr understands. If we want to have a search engine, that gives us search results we really]]></description>
										<content:encoded><![CDATA[<p>One of Solr&#8217;s fundamental configuration files is the schema.xml file. It is a kind of connector between what we need and what Solr understands. If we want a search engine that gives us the search results we really expect, then it is very important to design the schema.xml configuration file properly.<br />
We would like to introduce the first in a series of articles which will hopefully show how to design the schema.xml file and how to handle and modify all of the file&#8217;s components.</p>
<p><span id="more-192"></span></p>
<h2>Requirements specification</h2>
<p>Imagine we would like to use Solr to provide our car sale website with a search engine. The functional part of our website is rather primitive at the beginning and uses only a small part of the information about every car:</p>
<ul>
<li>make</li>
<li>model</li>
<li>year of production</li>
<li>price</li>
<li>engine size</li>
<li>mileage</li>
<li>colour</li>
<li>damaged</li>
</ul>
<p>We would like to design a simple schema file which will make it possible to index data from the given fields. But before we open the schema.xml file and start typing, let&#8217;s answer seven fundamental questions related to our fields:</p>
<h3>1. What is the field type?</h3>
<p>Let&#8217;s determine the type of every field:</p>
<ul>
<li>make &#8211; text field</li>
<li>model &#8211; text field</li>
<li>year of production &#8211; integer field</li>
<li>price &#8211; float field</li>
<li>engine size &#8211; integer field</li>
<li>mileage &#8211; integer field</li>
<li>colour &#8211; textual field</li>
<li>damaged &#8211; logical field</li>
</ul>
<h4>So what?</h4>
<p>So we will need some basic type definitions like string, boolean, int, float.</p>
<h3>2. Is the field used in the search process?</h3>
<p>We would like to use the data from some fields to let our search engine find the proper documents (car sale announcements). To accomplish that we are going to use 3 fields: make, model, year of production.</p>
<h4>So what?</h4>
<p>So we will need to create another field type, which will contain some filters to make finding the documents easy and efficient. We will also create a field of this new type, into which we will put all the data from the make, model and year of production fields.</p>
<h3>3. Is the field used in faceting or sorting operations?</h3>
<p>On our website we would like to sort search results using 4 fields: model, year of production, price and mileage. We would also like to be able to use faceting on the fields: make, model, year of production and colour.</p>
<h4>So what?</h4>
<p>When we create a field type for fields used for sorting/faceting, we need to remember that this type cannot contain tokenizers or filters that split the field value into multiple tokens. But we still want the values to be lowercased, so that letter case does not influence the sorting/faceting results. So that&#8217;s another field type we will need to create.</p>
<h3>4. Is it the field used to filter search results?</h3>
<p>We would like to have the possibility to filter search results using ranges on fields: year of production, price, engine size and mileage.</p>
<h4>So what?</h4>
<p>So let&#8217;s use the field types which will accelerate range queries.</p>
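<p>To make the idea of range filtering concrete, here is a small Python sketch that builds range filter clauses in Solr's standard <em>field:[low TO high]</em> syntax. The field names <em>year</em> and <em>price</em>, the <em>content:audi</em> query and the helper <em>range_filter</em> are assumptions made for this illustration.</p>

```python
from urllib.parse import urlencode


def range_filter(field, low=None, high=None):
    """Build an inclusive range clause; None means an open bound (*)."""
    lower = "*" if low is None else str(low)
    upper = "*" if high is None else str(high)
    return f"{field}:[{lower} TO {upper}]"


# Hypothetical request parameters: one query plus two range filters.
params = [
    ("q", "content:audi"),
    ("fq", range_filter("year", 2005, 2010)),
    ("fq", range_filter("price", high=20000)),
]
query_string = urlencode(params)
print(query_string)
```

<p>The resulting query string could be appended to a search request URL. Each <em>fq</em> parameter narrows the result set without affecting scoring.</p>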
<h3>5. Are there any fields which are not mentioned in questions 2, 3 or 4?</h3>
<p>There is the &#8220;damaged&#8221; field, which is not supposed to be involved in any of the mentioned operations.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;indexed&#8221; attribute to &#8220;false&#8221;.</p>
<h3>6. Is the field required?</h3>
<p>We assume that 3 fields are required: make, model and year. We don&#8217;t want to have documents in the index (car sale announcements available in the search process) which do not have values in those fields.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;required&#8221; attribute to &#8220;true&#8221;.</p>
<h3>7. Do we need to retrieve the information from the field in the original state?</h3>
<p>We would like to retrieve the information from all of the fields mentioned in the requirements specification and present them directly on the website.</p>
<h4>So what?</h4>
<p>So we will set the value of the &#8220;stored&#8221; attribute to &#8220;true&#8221;.</p>
<h2>Let&#8217;s add field type definitions</h2>
<p>We&#8217;ve answered our questions, we&#8217;ve come to some conclusions so let&#8217;s add field types to the schema file:</p>
<p>We add the solr.StrField type, which is not analysed and can be used for example as the type for the unique document key.
</p>
<pre class="brush:xml">&lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;</pre>
<p>Add the boolean type:
</p>
<pre class="brush:xml">&lt;fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/&gt;</pre>
<p>Now the numerical types. Remember that we need types that can help us to accelerate range queries. So let&#8217;s use the tint and tfloat types:
</p>
<pre class="brush:xml">    &lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;
    &lt;fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;</pre>
<p>Now let&#8217;s create the textual type, which will be a definition type for the catch-all field used for searching. For now, the type with the whitespace tokenizer and the lowercase filter will be just fine:
</p>
<pre class="brush:xml">    &lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;</pre>
<p>And last, but not least, the type for the sortable/facetable fields. What we need is the type that lowercases the entire field value, keeping it as a single token. KeywordTokenizer does no actual tokenizing, so it is the ideal tokenizer for our need. The TrimFilterFactory removes any leading or trailing whitespace:
</p>
<pre class="brush:xml">    &lt;fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
        &lt;filter class="solr.TrimFilterFactory" /&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;</pre>
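<p>The effect of the &#8220;lowercase&#8221; type on sorting can be approximated in a few lines of Python. This is only a sketch of the idea (one token per value, lowercased and trimmed), not of how Solr sorts internally. The sample values are made up.</p>

```python
def sort_key(value):
    """Approximate KeywordTokenizer + LowerCaseFilter + TrimFilter: one token."""
    return value.lower().strip()


makes = [" honda", "Audi", "bmw ", "Citroen"]

# Sorting on raw values is skewed by letter case and stray whitespace.
raw_order = sorted(makes)
# Sorting on the normalized single token gives the order users expect.
normalized_order = sorted(makes, key=sort_key)
print(raw_order)
print(normalized_order)
```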
<h2>Time to add field definitions</h2>
<p>Document id:
</p>
<pre class="brush:xml">   &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;</pre>
<p>Make and model:
</p>
<pre class="brush:xml">   &lt;field name="make" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="model" type="text" indexed="false" stored="true" required="true" /&gt;</pre>
<p>Now why is the value of the &#8220;indexed&#8221; attribute set to &#8220;false&#8221;? After all, we need those fields for search, sort and facet operations. That&#8217;s true &#8230; but notice that for searching purposes we will copy the data from those fields to one catch-all field:
</p>
<pre class="brush:xml">   &lt;field name="content" type="text" indexed="true" stored="false" multiValued="true"/&gt;</pre>
<p>and for sorting/faceting purposes we will copy the data to yet other fields, of the type &#8220;lowercase&#8221;:
</p>
<pre class="brush:xml">   &lt;field name="make_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="model_sort" type="lowercase" indexed="true" stored="false" /&gt;</pre>
<p>So the make and model fields will not take part in those operations themselves, and we can set the &#8220;indexed&#8221; attribute to &#8220;false&#8221; to keep the index as small as possible.</p>
<p>The rest of the fields:
</p>
<pre class="brush:xml">   &lt;field name="year" type="tint" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="price" type="tfloat" indexed="true" stored="true" /&gt;
   &lt;field name="engine_size" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="mileage" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="colour" type="lowercase" indexed="true" stored="true" /&gt;</pre>
<p>Remember the &#8220;false&#8221; value of the &#8220;indexed&#8221; attribute of the &#8220;damaged&#8221; field:
</p>
<pre class="brush:xml"> &lt;field name="damaged" type="boolean" indexed="false" stored="true" /&gt;</pre>
<h2>copyField &#8211; let&#8217;s index the same data differently</h2>
<p>We have mentioned copying field values several times already, so now let&#8217;s define the copy fields.</p>
<p>Fields used for searching are copied to the catch-all &#8220;content&#8221; field. There is more than one source field, which is why the &#8220;content&#8221; field definition has the multiValued attribute set to &#8220;true&#8221;:
</p>
<pre class="brush:xml"> &lt;copyField source="make" dest="content"/&gt;
 &lt;copyField source="model" dest="content"/&gt;
 &lt;copyField source="year" dest="content"/&gt;</pre>
<p>Copying the sortable/facetable fields:
</p>
<pre class="brush:xml"> &lt;copyField source="make" dest="make_sort"/&gt;
 &lt;copyField source="model" dest="model_sort"/&gt;</pre>
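<p>To see how these definitions work together, here is a sketch of a document as an indexing client might send it in Solr&#8217;s XML update format. All values are made up for illustration. Note that the client sends only the source fields; &#8220;content&#8221;, &#8220;make_sort&#8221; and &#8220;model_sort&#8221; are filled in by Solr through the copyField definitions above:</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc&gt;
    &lt;!-- required fields --&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="make"&gt;Audi&lt;/field&gt;
    &lt;field name="model"&gt;A4&lt;/field&gt;
    &lt;field name="year"&gt;2008&lt;/field&gt;
    &lt;!-- optional fields --&gt;
    &lt;field name="price"&gt;15000.0&lt;/field&gt;
    &lt;field name="colour"&gt;Black&lt;/field&gt;
    &lt;field name="damaged"&gt;false&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre>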
<h2>Anything else ?</h2>
<p>We shall add 3 more elements to the schema:</p>
<p>The unique key of the document:
</p>
<pre class="brush:xml"> &lt;uniqueKey&gt;id&lt;/uniqueKey&gt;</pre>
<p>Default search field:
</p>
<pre class="brush:xml"> &lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;</pre>
<p>Default query parser operator. Let&#8217;s set it to &#8220;AND&#8221;.
</p>
<pre class="brush:xml"> &lt;solrQueryParser defaultOperator="AND"/&gt;</pre>
<p>It&#8217;s done! The schema.xml configuration file is ready and looks like this:
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8" ?&gt;

&lt;schema name="carsale" version="1.2"&gt;

  &lt;types&gt;
    &lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;

    &lt;fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/&gt;

     &lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;
    &lt;fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;

    &lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;

    &lt;fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"&gt;
      &lt;analyzer&gt;
        &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt;
        &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
        &lt;filter class="solr.TrimFilterFactory" /&gt;
      &lt;/analyzer&gt;
    &lt;/fieldType&gt;

 &lt;/types&gt;

 &lt;fields&gt;
   &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="make" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="model" type="text" indexed="false" stored="true" required="true" /&gt;
   &lt;field name="make_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="model_sort" type="lowercase" indexed="true" stored="false" /&gt;
   &lt;field name="year" type="tint" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="price" type="tfloat" indexed="true" stored="true" /&gt;
   &lt;field name="engine_size" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="mileage" type="tint" indexed="true" stored="true" /&gt;
   &lt;field name="colour" type="lowercase" indexed="true" stored="true" /&gt;
   &lt;field name="damaged" type="boolean" indexed="false" stored="true" /&gt;
   &lt;field name="content" type="text" indexed="true" stored="false" multiValued="true"/&gt;

 &lt;/fields&gt;

 &lt;uniqueKey&gt;id&lt;/uniqueKey&gt;

 &lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;

 &lt;solrQueryParser defaultOperator="AND"/&gt;

 &lt;copyField source="make" dest="content"/&gt;
 &lt;copyField source="model" dest="content"/&gt;
 &lt;copyField source="make" dest="make_sort"/&gt;
 &lt;copyField source="model" dest="model_sort"/&gt;
 &lt;copyField source="year" dest="content"/&gt;

&lt;/schema&gt;</pre>
<h2>The end</h2>
<p>In today&#8217;s post we have created a simple schema.xml file which allows us to index data so that we can meet the search requirements of our car sale website. But we still want to develop our website, which will surely affect the schema &#8230; and not only the schema. In the next &#8220;car sale&#8221; related post we will try to face some new requirements and make further modifications.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/31/car-sale-application-schema-xml-designing-to-gain-what-we-really-need-part-1/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5 sins of schema.xml modifications</title>
		<link>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/</link>
					<comments>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 Aug 2010 12:08:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[attribute]]></category>
		<category><![CDATA[attributes]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index structure]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[mistake]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=71</guid>

					<description><![CDATA[I made a promise and here it is &#8211; the entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the schema.xml file for your system implementation. Feel free to read on 😉]]></description>
										<content:encoded><![CDATA[<p>I made a promise and here it is &#8211; the entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the <em>schema.xml</em> file for your system implementation. Feel free to read on <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><span id="more-71"></span></p>
<p>Each of us knows what the schema.xml file is and what it is for (if not, I invite you to read the entry located at: <a href="http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en" target="_blank" rel="noopener noreferrer">http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en</a>). What are the most frequently committed errors when creating or updating this file? I have personally come across the following:</p>
<h3>1. Trash in the configuration</h3>
<p>I confess that my first principle is to keep the <em>schema.xml</em> file in the simplest possible form. Linked to this is a very important issue &#8211; this file should not be synonymous with chaos. In other words, do not clutter it with unnecessary comments, unwanted types, fields and so on. Order in the structure of the <em>schema.xml</em> file not only helps us maintain and modify it with ease, but also assures us that no unnecessary information will be stored in the Solr index.</p>
<h3>2. Cosmetic changes to the default configuration</h3>
<p>How many of those who use Solr in their daily work took the default <em>schema.xml</em> file supplied with the Solr example implementation and only slightly modified its contents &#8211; for example, changing only the names of the fields? I should raise my hand too, because I did it once. This is a pretty big mistake. Someone may ask why. Are you sure you need English stemming when implementing search for content written in Polish? I think not. The same applies to field and type attributes such as term vectors.</p>
<h3>3. No updates</h3>
<p>Sometimes I come across a search-based application where an update of Solr does not mean an update of the <em>schema.xml</em> file. If it is a conscious decision, dictated for example by a costly or even impossible re-indexing of all data, I understand the situation. But there are cases where an upgrade would bring only benefits, and where the costs of such an upgrade would be minimal (e.g. an inexpensive re-index or slight changes in the application). Do not be afraid to update the <em>schema.xml</em> file &#8211; whether it is to update the fields, update the types, or add newer features. A good example is the migration from Solr 1.3 to version 1.4 &#8211; the newer version introduced significant changes to numeric types, and migrating to the new types can greatly improve the performance of queries using those types (such as range queries).</p>
<h3>4. &#8220;I&#8217;ll use it one day&#8221;</h3>
<p>Adding new types without removing the ones that are no longer needed; the same goes for fields and <em>copyField</em> definitions. Most of us think that an old definition may be useful in the future, but remember that each type means some extra memory needed by Solr, and each field takes space in the index. My small advice &#8211; if you stop using a type, a field, or anything else you have in your configuration files (not only in the <em>schema.xml</em>), simply remove it. Applying this principle throughout the life cycle of an application using Solr will keep the index in optimal condition, and a few months after implementing another feature you will not be left puzzled, digging into the application code to determine whether a field is still used in some forgotten code fragment.</p>
<h3>5. Attributes, attributes and again attributes</h3>
<p>Storing original values and adding term vectors and their properties are just examples of things we don&#8217;t need in every implementation. Sometimes the index holds more than the application requires. A larger index means lower performance, at least in some cases (e.g. indexing). It is worth considering whether we really need all the information that we tell Solr to calculate and store. The effect of removing information that is unnecessary from our point of view may surprise us. Sometimes it is worth a try ;)</p>
<p>Feel free to comment &#8211; I will eagerly read about what else we should pay attention to when modifying the schema.xml file.</p>
<p>Finally, I think it is worth mentioning the article <em>&#8220;The Seven Deadly Sins of Solr&#8221;</em>, published on the LucidImagination website at: <a href="http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr" target="_blank" rel="noopener noreferrer">http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr</a>. It describes bad practices when working with Solr. In my opinion it is an interesting read, and I highly recommend it.</p>
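<p>As an illustration, compare two alternative definitions of the same field. This is only a sketch: the field name &#8220;description&#8221; is hypothetical and the &#8220;text&#8221; type is assumed to be defined elsewhere in the schema. The first definition asks Solr to keep everything, the second keeps only what a plain search needs:</p>
<pre class="brush:xml">&lt;!-- heavy: original value stored, plus term vectors with positions and offsets --&gt;
&lt;field name="description" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" /&gt;

&lt;!-- lean alternative: searchable only, nothing stored, no term vectors --&gt;
&lt;field name="description" type="text" indexed="true" stored="false" /&gt;</pre>
<p>The lean variant is only appropriate if the application never needs to display this field and does not rely on features that require term vectors.</p>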
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>What is schema.xml?</title>
		<link>https://solr.pl/en/2010/08/16/what-is-schema-xml/</link>
					<comments>https://solr.pl/en/2010/08/16/what-is-schema-xml/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 16 Aug 2010 12:05:34 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[field]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[token]]></category>
		<category><![CDATA[tokenizer]]></category>
		<category><![CDATA[type]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=64</guid>

					<description><![CDATA[One of the configuration files that describe each implementation Solr is schema.xml file. It describes one of the most important things of the implementation &#8211; the structure of the data index. The information contained in this file allow you to]]></description>
										<content:encoded><![CDATA[<p>One of the configuration files that describes each Solr implementation is the <em>schema.xml</em> file. It describes one of the most important things of the implementation &#8211; the structure of the index. The information contained in this file allows you to control how Solr behaves when indexing data or when making queries. <em>Schema.xml</em> is not only the structure of the index itself, it is also detailed information about data types, which have a large influence on the behavior of Solr and are usually treated with neglect. This entry will try to bring some insight into <em>schema.xml</em>.</p>
<p><span id="more-64"></span></p>
<p><em>Schema.xml</em> file consists of several parts:</p>
<ul>
<li>version,</li>
<li>type definitions,</li>
<li>field definitions,</li>
<li>copyField section,</li>
<li>additional definitions.</li>
</ul>
<h3>Version</h3>
<p>The first thing we come across in the <em>schema.xml</em> file is the version. It tells Solr how to treat some of the attributes in the <em>schema.xml</em> file. The definition is as follows:
</p>
<pre class="brush:xml">&lt;schema name="example" version="1.3"&gt;</pre>
<p>Please note that this is not the version of the schema from the perspective of your project. At this point Solr supports four versions of the <em>schema.xml</em> file:</p>
<ul>
<li>1.0 &#8211; <em>multiValued </em>attribute does not exist, all fields are multivalued by default.</li>
<li>1.1 &#8211; introduced <em>multiValued </em>attribute, the default attribute value is <em>false</em>.</li>
<li>1.2 &#8211; introduced <em>omitTermFreqAndPositions </em>attribute, the default value is <em>true</em> for all fields, besides text fields.</li>
<li>1.3 &#8211; removed the possibility of an optional compression of fields.</li>
</ul>
<h3>Type definitions</h3>
<p>Type definitions can be logically divided into two separate sections &#8211; simple types and complex types. Simple types, as opposed to complex types, do not have filters and a tokenizer defined.</p>
<p><strong>Simple types</strong></p>
<p>The first thing we see in the <em>schema.xml</em> file after the version are the type definitions. Each type is described by a number of attributes defining the behavior of that type. First, the two attributes that describe each type and are mandatory:</p>
<ul>
<li><em>name </em>&#8211; name of the type (required attribute).</li>
<li><em>class </em>&#8211; class that is responsible for the implementation. Please note that classes delivered with the standard Solr packages have names with the &#8216;solr&#8217; prefix.</li>
</ul>
<p>Besides the two mentioned above, types can have the following optional attributes:</p>
<ul>
<li><em>sortMissingLast </em>&#8211; attribute specifying how documents without a value in a field based on this type should be treated when sorting. When set to <em>true</em>, documents without a value in a field of this type will always be at the end of the results list regardless of sort order. The default attribute value is <em>false</em>. The attribute can be used only for types that are considered by Lucene as a string.</li>
<li><em>sortMissingFirst </em>&#8211; attribute specifying how documents without a value in a field based on this type should be treated when sorting. When set to <em>true</em>, documents without a value in a field of this type will always be at the first positions of the results list regardless of sort order. The default attribute value is <em>false</em>. The attribute can be used only for types that are considered by Lucene as a string.</li>
<li><em>omitNorms </em>&#8211; attribute specifying whether norms (length normalization and index-time boosts) should be omitted for fields of this type.</li>
<li><em>omitTermFreqAndPositions </em>&#8211; attribute specifying whether term frequencies and term positions should be omitted from the index.</li>
<li><em>indexed </em>&#8211; attribute specifying whether fields based on this type will be indexed, that is, whether they can be searched and sorted on (keeping the original values is controlled by the separate <em>stored</em> attribute).</li>
<li><em>positionIncrementGap </em>&#8211; attribute specifying how many positions Lucene should skip between consecutive values of a multivalued field, which prevents phrase queries from matching across value boundaries.</li>
</ul>
<p>It is worth remembering that with the default settings (both <em>sortMissingLast </em>and <em>sortMissingFirst</em> set to <em>false</em>) Lucene places documents with blank field values at the beginning of the results list for ascending sort, and at the end of the list for descending sort.</p>
<p>One more option for simple types, but only those based on <em>Trie*Field</em> classes:</p>
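<p>For example, a sketch of two string types that override this default behavior (the type names are hypothetical):</p>
<pre class="brush:xml">&lt;!-- documents without a value always come last, regardless of sort order --&gt;
&lt;fieldType name="string_missing_last" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;

&lt;!-- documents without a value always come first, regardless of sort order --&gt;
&lt;fieldType name="string_missing_first" class="solr.StrField" sortMissingFirst="true" omitNorms="true"/&gt;</pre>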
<ul>
<li><em>precisionStep</em> &#8211; attribute specifying the precision step, expressed in bits. The smaller the (non-zero) value, the more terms are indexed per value and the faster the queries based on numerical ranges. This, however, also increases the size of the index. Set the attribute value to 0 to disable indexing at various precisions.</li>
</ul>
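<p>A sketch of two integer types that differ only in this attribute (similar definitions can be found in the example schema shipped with Solr; treat the type names as assumptions):</p>
<pre class="brush:xml">&lt;!-- precisionStep="0": a single term per value, smallest index, slower range queries --&gt;
&lt;fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/&gt;

&lt;!-- precisionStep="8": additional lower-precision terms per value, larger index, faster range queries --&gt;
&lt;fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/&gt;</pre>
<p>A reasonable rule of thumb is to use the first kind of type for fields that are only matched or sorted on, and the second for fields that are queried by range.</p>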
<p>An example of a simple type definition:
</p>
<pre class="brush:xml">&lt;fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/&gt;</pre>
<p><strong>Complex types</strong></p>
<p>In addition to simple types, the <em>schema.xml</em> file may include types consisting of a tokenizer and filters. The tokenizer is responsible for dividing the contents of the field into tokens, while the filters are responsible for further token analysis. For example, a type responsible for handling texts in Polish could consist of a tokenizer that divides the text into words based on whitespace, commas and periods. Filters for that type could be responsible for lowercasing the generated tokens, further dividing tokens (for example on dashes), and then reducing tokens to their base form.</p>
<p>Complex types, like simple types, have a name (<em>name </em>attribute) and a class responsible for the implementation (<em>class </em>attribute). They can also be characterized by the other attributes described for simple types (on the same basis). In addition, however, complex types can define the tokenizer and filters to be used at indexing time and at query time. As most of you know, for a given phase (indexing or query) there can be many filters defined but only one tokenizer. For example, this is what the text type definition looks like in the example provided with Solr:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"&gt;
   &lt;analyzer type="index"&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /&gt;
      &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
      &lt;filter class="solr.PorterStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
   &lt;analyzer type="query"&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/&gt;
      &lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /&gt;
      &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
      &lt;filter class="solr.PorterStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>It is worth noting that there is an additional attribute for the text field type:</p>
<ul>
<li> <em>autoGeneratePhraseQueries</em></li>
</ul>
<p>This attribute tells the query parser how to behave when a single piece of query text is divided into several tokens. Some filters (such as <em>WordDelimiterFilter</em>) can divide a token into a set of tokens. Setting the attribute to <em>true</em> (default value) will automatically generate phrase queries. For example, <em>WordDelimiterFilter </em>will divide the word &#8220;wi-fi&#8221; into two tokens, &#8220;wi&#8221; and &#8220;fi&#8221;. With autoGeneratePhraseQueries set to <em>true</em> the query sent to Lucene will look like <code>field:"wi fi"</code>, while with the attribute set to <em>false</em> the Lucene query will look like <code>field:wi OR field:fi</code>. However, please note that this attribute only behaves well with tokenizers based on white spaces.</p>
<p>Returning to the type definition. As you can see, I gave an example which has two main sections:
</p>
<pre class="brush:xml">&lt;analyzer type="index"&gt;</pre>
<p>and
</p>
<pre class="brush:xml">&lt;analyzer type="query"&gt;</pre>
<p>The first section defines the analysis used when indexing documents, while the second section defines the analysis used for queries to fields based on this type. Note that if you want to use the same definition for the indexing and query phases, you do not need two separate sections; a single analyzer element without the type attribute is enough. Then our definition will look like this:
</p>
<pre class="brush:xml">&lt;fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"&gt;
   &lt;analyzer&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /&gt;
      &lt;filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
      &lt;filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/&gt;
      &lt;filter class="solr.PorterStemFilterFactory"/&gt;
   &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>As I mentioned, in the definition of each complex type there is a tokenizer and (though not necessarily) a series of filters. I will not describe each filter and tokenizer available in Solr. This information is available at the following address: <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters" target="_blank" rel="noopener noreferrer">http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters</a>.</p>
<p>At the end I wanted to add an important thing. Starting from Solr 1.4, the tokenizer does not need to be the first mechanism that deals with the analysis of the field. Solr 1.4 introduced new filters &#8211; <em>CharFilters </em>&#8211; that operate on the field content before the tokenizer and pass the result to it. It is worth knowing, because it might come in useful.</p>
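<p>A minimal sketch of a type that uses a CharFilter. The type name is hypothetical, and <em>HTMLStripCharFilterFactory</em> is used here simply as one of the CharFilters available in Solr:</p>
<pre class="brush:xml">&lt;fieldType name="text_html" class="solr.TextField" positionIncrementGap="100"&gt;
   &lt;analyzer&gt;
      &lt;!-- works on the raw character stream, before the tokenizer sees it --&gt;
      &lt;charFilter class="solr.HTMLStripCharFilterFactory"/&gt;
      &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
      &lt;filter class="solr.LowerCaseFilterFactory"/&gt;
   &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>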
<p><strong>Multi-dimensional types</strong></p>
<p>At the end I left myself a little addition &#8211; a novelty in Solr 1.4 &#8211; multi-dimensional fields &#8211; fields consisting of a number of other fields. Generally speaking, the idea behind this type of field is simple &#8211; to store in Solr pairs, triples or larger groups of related values, such as geographical point coordinates. In practice this is realized by means of dynamic fields, but let me not get into the implementation details. A sample type definition that consists of two fields:
</p>
<pre class="brush:xml">&lt;fieldType name="location" class="solr.PointType" dimension="2" subFieldSuffix="_d"/&gt;</pre>
<p>In addition to standard attributes: name and class there are two others:</p>
<ul>
<li>dimension &#8211; the number of dimensions (used by the <em>solr.PointType</em> class).</li>
<li>subFieldSuffix &#8211; suffix which will be added to the dynamic fields created by that type. It is important to remember that a field based on the presented type will create three fields in the index &#8211; the actual field (for example named mylocation) and two additional dynamic fields.</li>
</ul>
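<p>A sketch of how such a type might be used. The field name mylocation follows the example above, and the &#8220;double&#8221; type is assumed to be defined elsewhere in the schema:</p>
<pre class="brush:xml">&lt;fieldType name="location" class="solr.PointType" dimension="2" subFieldSuffix="_d"/&gt;

&lt;!-- the actual field, to which the document sends both coordinates as one value --&gt;
&lt;field name="mylocation" type="location" indexed="true" stored="true"/&gt;

&lt;!-- dynamic field definition matching the sub-fields created for each dimension --&gt;
&lt;dynamicField name="*_d" type="double" indexed="true" stored="false"/&gt;</pre>
<p>The dynamic field pattern has to match the suffix given in subFieldSuffix, otherwise Solr has no definition to use for the two sub-fields.</p>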
<h3><strong>Field Definitions</strong></h3>
<p>Field definitions are another section of the <em>schema.xml</em> file, the section which in theory should interest us the most when designing a Solr index. As a rule, we find here two kinds of field definitions:</p>
<ol>
<li>Static Fields</li>
<li>Dynamic Fields</li>
</ol>
<p>These fields are treated differently by Solr. Static fields are fields that are available under one name. Dynamic fields are fields that are available under many names &#8211; their name is actually a simple wildcard pattern (a name starting or ending with a &#8216;*&#8217; sign). Please note that Solr first looks for a matching static field, and only then for a dynamic field. In addition, if the field name matches more than one dynamic field definition, Solr will select the one with the longer name pattern.</p>
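<p>A small sketch of these rules (all field names are hypothetical and the &#8220;int&#8221; type is assumed to be defined elsewhere in the schema):</p>
<pre class="brush:xml">&lt;!-- static field: chosen for a document field named exactly price_i --&gt;
&lt;field name="price_i" type="int" indexed="true" stored="true"/&gt;

&lt;!-- dynamic field: chosen for e.g. count_i, which has no static definition --&gt;
&lt;dynamicField name="*_i" type="int" indexed="true" stored="true"/&gt;

&lt;!-- dynamic field with a longer pattern: chosen for e.g. year_sort_i, which matches both patterns --&gt;
&lt;dynamicField name="*_sort_i" type="int" indexed="true" stored="false"/&gt;</pre>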
<p>Returning to the definition of the fields (both static and dynamic), they consist of the following attributes:</p>
<ul>
<li><em>name </em>&#8211; the name of the field (required attribute).</li>
<li><em>type </em>&#8211; type of field, which is one of the pre-defined types (required attribute).</li>
<li><em>indexed </em>&#8211; whether the field is to be indexed (set to <em>true</em> if you want to search or sort on this field).</li>
<li><em>stored </em>&#8211; whether you want to store the original values (set to <em>true</em> if you want to retrieve the original value of the field).</li>
<li><em>omitNorms </em>&#8211; whether you want norms to be omitted (set to <em>true</em> for fields that do not need length normalization or index-time boosting, which usually means fields that are not used for full-text search).</li>
<li><em>termVectors </em>&#8211; set to <em>true</em> if you want to keep so-called term vectors. The default parameter value is <em>false</em>. Some features require setting this parameter to <em>true</em> (e.g. <em>MoreLikeThis </em>or <em>FastVectorHighlighting</em>).</li>
<li><em>termPositions </em>&#8211; set to <em>true</em> if you want to keep term positions with the term vector. Setting it to <em>true</em> will increase the size of the index.</li>
<li><em>termOffsets </em>&#8211; set to <em>true</em> if you want to keep term offsets together with the term vector. Setting it to <em>true</em> will increase the size of the index.</li>
<li><em>default </em>&#8211; the default value to be given to the field when the document does not provide any value for this field.</li>
</ul>
<p>The following are examples of field definitions:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
&lt;field name="includes" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" /&gt;
&lt;field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/&gt;
&lt;dynamicField name="*_i" type="int" indexed="true" stored="true"/&gt;</pre>
<p>And finally, some additional information to remember. In addition to the attributes listed above, in a field definition we can override the attributes that have been defined for the type (e.g. whether a field is to be multiValued, as in the above example for the field called timestamp). This can be useful if you need a specific field whose behavior differs slightly from its type (in the example, only in the multiValued attribute). Of course, keep in mind the limitations imposed on the individual attributes associated with types.</p>
<h3>CopyField section</h3>
<p>In short, this section is responsible for copying the contents of fields to other fields. We define the field whose value should be copied, and the destination field. Please note that copying takes place before the field value is analyzed. Example copyField definition:
</p>
<pre class="brush:xml">&lt;copyField source="category" dest="text"/&gt;</pre>
<p>For the sake of accuracy, the attributes mean:</p>
<ul>
<li>source &#8211; the source field,</li>
<li>dest &#8211; the destination field.</li>
</ul>
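<p>A sketch of a common pattern, in which several source fields feed one destination. The &#8220;name&#8221; source field is hypothetical, and the destination is declared as multiValued because it receives more than one value per document:</p>
<pre class="brush:xml">&lt;!-- destination: indexed for searching, not stored, accepts many values --&gt;
&lt;field name="text" type="text" indexed="true" stored="false" multiValued="true"/&gt;

&lt;copyField source="category" dest="text"/&gt;
&lt;copyField source="name" dest="text"/&gt;</pre>
<p>Because copying happens before analysis, the destination field analyzes the raw values with its own type, independently of the source fields.</p>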
<h3>Additional definitions</h3>
<p><strong>1. Unique key definition</strong></p>
<p>The definition of a unique key that makes it possible to unambiguously identify a document. Defining a unique key is not necessary, but it is recommended. Sample definition:
</p>
<pre class="brush:xml">&lt;uniqueKey&gt;id&lt;/uniqueKey&gt;</pre>
<p><strong>2. Default search field definition</strong></p>
<p>This section is responsible for defining the default search field, which Solr uses when the query does not specify any field. Sample definition:
</p>
<pre class="brush:xml">&lt;defaultSearchField&gt;content&lt;/defaultSearchField&gt;</pre>
<p><strong>3. Default logical operator definition</strong></p>
<p>This section is responsible for defining the default logical operator that will be used in queries. A sample definition looks as follows:
</p>
<pre class="brush:xml">&lt;solrQueryParser defaultOperator="OR" /&gt;</pre>
<p>Possible values are: <em>OR </em>and <em>AND</em>.</p>
<p><strong>4. Defining similarity</strong></p>
<p>Finally we define the similarity that we will use. It is rather a topic for another post, but you must know that if necessary You can change the default similarity (currently in Solr trunk there are already two classes of similarity). The sample definition is as follows:
</p>
<pre class="brush:xml">&lt;similarity class="pl.solr.similarity.CustomSimilarity" /&gt;</pre>
<h3>A few words at the end</h3>
<p>The information presented above should give some insight into what the <em>schema.xml</em> file is and what the different sections in this file are responsible for. Soon I will try to write about what you should avoid when designing the index.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/16/what-is-schema-xml/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
