<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>index &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/index-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Thu, 12 Nov 2020 12:59:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr 4.2: Index structure reading API</title>
		<link>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/</link>
					<comments>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 20 May 2013 11:58:51 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.2]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema api]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=554</guid>

					<description><![CDATA[With the release of Solr 4.2 we&#8217;ve got the possibility to use the HTTP protocol to get information about Solr index structure. Of course, if one wanted to do that prior to Solr 4.2 it could be achieved by fetching]]></description>
										<content:encoded><![CDATA[<p>With the release of Solr 4.2 we&#8217;ve got the possibility to use the HTTP protocol to get information about the Solr index structure. Of course, if one wanted to do that prior to Solr 4.2, it could be achieved by fetching the <em>schema.xml</em> file, parsing it and then extracting the needed information. However, Solr 4.2 introduced a dedicated API which can return the information we need without the need to parse the whole <em>schema.xml</em> file.</p>
<p><span id="more-554"></span></p>
<h3>Possibilities</h3>
<p>Let&#8217;s look at the new API by example.</p>
<h4>Getting information in XML format</h4>
<p>Many Solr users are used to getting their data in the XML format, at least when using the Solr HTTP API. However, the schema API uses JSON as the default format. In order to get the data in the XML format in all the examples below, you&#8217;ll need to append the <em>wt=xml</em> parameter to the call, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes?wt=xml'</pre>
<h4>Defined fields information</h4>
<p>Let&#8217;s start by looking at how to fetch information about the fields that are defined in Solr. In order to do that we have the following possibilities:</p>
<ol>
<li>Get information about all the fields defined in the index</li>
<li>Get information for a single, explicitly specified field</li>
</ol>
<p>In the first case we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fields'</pre>
<p>In the second case we should add the <em>/</em> character and the field name to the above command. For example, in order to get the information about the <em>author</em> field we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fields/author'</pre>
<p>Solr response for the first command will be similar to the following one:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "fields":[{
      "name":"_version_",
      "type":"long",
      "indexed":true,
      "stored":true},
    {
      "name":"author",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"cat",
      "type":"string",
      "multiValued":true,
      "indexed":true,
      "stored":true},
    {
      "name":"category",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"id",
      "type":"string",
      "multiValued":false,
      "indexed":true,
      "required":true,
      "stored":true,
      "uniqueKey":true},
    {
      "name":"url",
      "type":"text_general",
      "indexed":true,
      "stored":true},
    {
      "name":"weight",
      "type":"float",
      "indexed":true,
      "stored":true}]}</pre>
<p>On the other hand the response for the second command would be as follows:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"author",
    "type":"text_general",
    "indexed":true,
    "stored":true}}</pre>
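<p>As a side note, responses of this shape are easy to consume programmatically. Below is a minimal Python sketch (Python is used here purely for illustration; the JSON is a trimmed copy of the sample response above, where in practice it would be fetched from the <em>/schema/fields</em> endpoint):</p>

```python
import json

# A trimmed copy of the /schema/fields response shown above.
fields_response = json.loads("""
{
  "responseHeader": {"status": 0, "QTime": 1},
  "fields": [
    {"name": "author", "type": "text_general", "indexed": true, "stored": true},
    {"name": "cat", "type": "string", "multiValued": true, "indexed": true, "stored": true},
    {"name": "id", "type": "string", "indexed": true, "required": true, "stored": true, "uniqueKey": true}
  ]
}
""")

# Index the field definitions by name for easy lookup.
fields_by_name = {f["name"]: f for f in fields_response["fields"]}

print(sorted(fields_by_name))                        # ['author', 'cat', 'id']
print(fields_by_name["id"].get("uniqueKey", False))  # True
```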
<h4>Getting information about defined dynamic fields</h4>
<p>Similar to the information we can get about the fields defined in <em>schema.xml</em>, we can get information about dynamic fields. Again we have two options:</p>
<ol>
<li>Get information about all dynamic fields</li>
<li>Get information about specific dynamic field pattern</li>
</ol>
<p>In order to get all the information about dynamic fields we should use the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields'</pre>
<p>In order to get information about a specific pattern we append the <em>/</em> character followed by the pattern, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/dynamicfields/random_*'</pre>
<p>Solr will return the following response for the first query:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "dynamicfields":[{
      "name":"*_coordinate",
      "type":"tdouble",
      "indexed":true,
      "stored":false},
    {
      "name":"ignored_*",
      "type":"ignored",
      "multiValued":true},
    {
      "name":"random_*",
      "type":"random"},
    {
      "name":"*_p",
      "type":"location",
      "indexed":true,
      "stored":true},
    {
      "name":"*_c",
      "type":"currency",
      "indexed":true,
      "stored":true}]}</pre>
<p>And the following response will be returned for the second command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicfield":{
    "name":"random_*",
    "type":"random"}}</pre>
<h4>Getting field types</h4>
<p>As you can probably guess, in a way similar to the examples described above, we can also get information about the field types defined in our <em>schema.xml</em> file. We can fetch the following information:</p>
<ol>
<li>All the field types defined in the <em>schema.xml</em> file</li>
<li>A single type</li>
</ol>
<p>To get all the defined field types we should run the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes'</pre>
<p>To get information about a single type we should again add the <em>/</em> character followed by the field type name, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/fieldtypes/text_gl'</pre>
<p>Solr will return the following information in response to the first command:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "fieldTypes":[{
      "name":"alphaOnlySort",
      "class":"solr.TextField",
      "sortMissingLast":true,
      "omitNorms":true,
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.KeywordTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.TrimFilterFactory"},
          {
            "class":"solr.PatternReplaceFilterFactory",
            "replace":"all",
            "replacement":"",
            "pattern":"([^a-z])"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"boolean",
      "class":"solr.BoolField",
      "sortMissingLast":true,
      "fields":["inStock"],
      "dynamicFields":["*_bs",
        "*_b"]},
    {
      "name":"text_gl",
      "class":"solr.TextField",
      "positionIncrementGap":"100",
      "analyzer":{
        "class":"solr.TokenizerChain",
        "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
        "filters":[{
            "class":"solr.LowerCaseFilterFactory"},
          {
            "class":"solr.StopFilterFactory",
            "words":"lang/stopwords_gl.txt",
            "ignoreCase":"true",
            "enablePositionIncrements":"true"},
          {
            "class":"solr.GalicianStemFilterFactory"}]},
      "fields":[],
      "dynamicFields":[]},
    {
      "name":"tlong",
      "class":"solr.TrieLongField",
      "precisionStep":"8",
      "positionIncrementGap":"0",
      "fields":[],
      "dynamicFields":["*_tl"]}]}</pre>
<p>In response to the second command Solr will return the following:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "fieldType":{
    "name":"text_gl",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "class":"solr.TokenizerChain",
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory"},
      "filters":[{
          "class":"solr.LowerCaseFilterFactory"},
        {
          "class":"solr.StopFilterFactory",
          "words":"lang/stopwords_gl.txt",
          "ignoreCase":"true",
          "enablePositionIncrements":"true"},
        {
          "class":"solr.GalicianStemFilterFactory"}]},
    "fields":[],
    "dynamicFields":[]}}</pre>
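<p>The nested <em>analyzer</em> definition can be walked in the same way. A small Python sketch (illustrative only, using abbreviated <em>text_gl</em> data from the response above) that prints the tokenizer and filter chain:</p>

```python
import json

# The "fieldType" object from the /schema/fieldtypes/text_gl response above (abbreviated).
field_type = json.loads("""
{
  "name": "text_gl",
  "class": "solr.TextField",
  "analyzer": {
    "class": "solr.TokenizerChain",
    "tokenizer": {"class": "solr.StandardTokenizerFactory"},
    "filters": [
      {"class": "solr.LowerCaseFilterFactory"},
      {"class": "solr.StopFilterFactory", "words": "lang/stopwords_gl.txt"},
      {"class": "solr.GalicianStemFilterFactory"}
    ]
  }
}
""")

# The analysis chain is the tokenizer followed by the filters, in order.
analyzer = field_type["analyzer"]
chain = [analyzer["tokenizer"]["class"]] + [f["class"] for f in analyzer["filters"]]
print(" -> ".join(chain))
```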
<p>As you can see, the response is quite rich: we get all the information about the field types and, in addition, the list of fields (both dynamic and non-dynamic) that use a given field type.</p>
<h4>Retrieving information about copyFields</h4>
<p>In addition to what we&#8217;ve discussed so far, we are able to get information about the copyFields section from <em>schema.xml</em>. In order to do that one should run the following command:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/collection1/schema/copyfields'</pre>
<p>And in response we will get the following data:
</p>
<pre class="brush:plain">{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "copyfields":[{
      "source":"author",
      "dest":"text"},
    {
      "source":"cat",
      "dest":"text"},
    {
      "source":"content",
      "dest":"text"},
    {
      "source":"content_type",
      "dest":"text"},
    {
      "source":"description",
      "dest":"text"},
    {
      "source":"features",
      "dest":"text"},
    {
      "source":"author",
      "dest":"author_s",
      "destDynamicBase":"*_s"}]}</pre>
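<p>One handy thing to derive from this data is a map of copy targets to their sources. A small Python sketch (illustrative only, using an abbreviated copy of the response above):</p>

```python
import json
from collections import defaultdict

# An abbreviated "copyfields" list from the response above.
copyfields = json.loads("""
[
  {"source": "author", "dest": "text"},
  {"source": "cat", "dest": "text"},
  {"source": "author", "dest": "author_s", "destDynamicBase": "*_s"}
]
""")

# Group sources by destination: which fields feed each copy target?
sources_by_dest = defaultdict(list)
for rule in copyfields:
    sources_by_dest[rule["dest"]].append(rule["source"])

print(dict(sources_by_dest))  # {'text': ['author', 'cat'], 'author_s': ['author']}
```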
<h3>The future</h3>
<p>In Solr 4.3 the described API was improved and is being prepared to allow not only reading the index structure, but also modifying it with HTTP requests. We can therefore expect that in one of the upcoming versions of Apache Solr we will get the ability to easily change the index structure this way, at least for changes that do not conflict with already indexed data. In my opinion it is worth waiting for, at least for those who need it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2013/05/20/solr-4-2-index-structure-reading-api/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Backing Up Your Index</title>
		<link>https://solr.pl/en/2012/08/13/backing-up-your-index/</link>
					<comments>https://solr.pl/en/2012/08/13/backing-up-your-index/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 13 Aug 2012 21:51:12 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[handler]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[replication]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=472</guid>

					<description><![CDATA[Did you ever wonder if you can create a backup of your index with the tools available in Solr? For example after every commit or optimize operation? Or maybe you would like to create backups with the HTTP]]></description>
										<content:encoded><![CDATA[<p>Did you ever wonder if you can create a backup of your index with the tools available in Solr? For example after every <em>commit</em> or <em>optimize</em> operation? Or maybe you would like to create backups with an HTTP API call? Let&#8217;s see what possibilities Solr has to offer.</p>
<p><span id="more-472"></span></p>
<h3>The Beginning</h3>
<p>We decided to write about index backups even though this functionality is fairly simple. We noticed that many people tend to forget about it, not only when it comes to Apache Solr. We hope that this blog entry will help you remember about the backup creation functionality when you need it. But now, let&#8217;s start from the beginning &#8211; before we started the tests, we looked at the directory where Solr keeps its indices and this is what we saw:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:17 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<h3>Manual Backup</h3>
<p>In order to create a backup of your index with the HTTP API, you need to have the replication handler configured. If you do, you need to send the <em>command</em> parameter with the <em>backup</em> value to the master server&#8217;s replication handler, for example like this:
</p>
<pre class="brush:bash">$ curl 'http://localhost:8983/solr/replication?command=backup'</pre>
<p>The above will tell Solr to create a new backup of the current index. Let&#8217;s now look at how the directory where the indices live looks after running the above command:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:18 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:19 snapshot.20120812201917
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<p>As you can see, there is a new directory &#8211; <em>snapshot.20120812201917</em>. We can assume that we got what we wanted 🙂</p>
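<p>Note that the timestamp in the snapshot directory name encodes the backup creation time, so it can be parsed if you want to sort or age out backups yourself. A small Python sketch (the helper is purely illustrative, not part of Solr):</p>

```python
from datetime import datetime

def snapshot_time(dirname):
    """Parse the timestamp Solr appends to backup directory names,
    e.g. snapshot.20120812201917 -> 2012-08-12 20:19:17."""
    stamp = dirname.split("snapshot.", 1)[1]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

print(snapshot_time("snapshot.20120812201917"))  # 2012-08-12 20:19:17
```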
<h3>Automatic Backup</h3>
<p>In addition to manual backup creation, you can also configure Solr to create backups after a <em>commit</em> or <em>optimize</em> operation. Please remember, though, that if your index changes rapidly it is usually a bad idea to create a backup after each commit operation. But let&#8217;s get back to automatic backups. In order to configure Solr to create backups for us, you need to add the following line to the replication handler configuration:
</p>
<pre class="brush:xml">&lt;str name="backupAfter"&gt;commit&lt;/str&gt;</pre>
<p>So, the full replication handler configuration (on the <em>master</em> server) would look like this:
</p>
<pre class="brush:xml">&lt;requestHandler name="/replication" &gt;
 &lt;lst name="master"&gt;
  &lt;str name="replicateAfter"&gt;commit&lt;/str&gt;
  &lt;str name="replicateAfter"&gt;startup&lt;/str&gt;
  &lt;str name="confFiles"&gt;schema.xml,stopwords.txt&lt;/str&gt;
  &lt;str name="backupAfter"&gt;commit&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>After sending two <em>commit</em> operations our directory with indices looks like this:
</p>
<pre class="brush:bash">drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 index
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 snapshot.20120812211203
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 21:12 snapshot.20120812211216
drwxrwxr-x 2 gr0 gr0 4096 2012-08-12 20:16 spellchecker</pre>
<p>As you can see, Solr did what we wanted.</p>
<h3>Keeping Order</h3>
<p>It is possible to control the maximum number of backups that are stored on disk. In order to configure that number you need to add the following line to your replication handler configuration:
</p>
<pre class="brush:xml">&lt;str name="maxNumberOfBackups"&gt;10&lt;/str&gt;</pre>
<p>The above configuration value tells Solr to keep a maximum of ten backups of your index. Of course, you can delete created backups manually if you don&#8217;t need them anymore.</p>
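<p>That kind of cleanup can also be scripted if you prefer to manage it yourself. A Python sketch of the idea (the <em>prune_backups</em> helper is hypothetical, not part of Solr; it relies on the fixed-width timestamp making lexical order equal chronological order):</p>

```python
import os
import shutil
import tempfile

def prune_backups(data_dir, keep=10):
    """Delete all but the `keep` newest snapshot.* directories.
    Fixed-width timestamps make lexical sort == chronological sort."""
    snaps = sorted(d for d in os.listdir(data_dir) if d.startswith("snapshot."))
    doomed = snaps[:-keep] if keep else snaps
    for old in doomed:
        shutil.rmtree(os.path.join(data_dir, old))
    return sorted(d for d in os.listdir(data_dir) if d.startswith("snapshot."))

# Demonstration on a throwaway directory with fake snapshot dirs.
demo = tempfile.mkdtemp()
for stamp in ("20120812201917", "20120812211203", "20120812211216"):
    os.mkdir(os.path.join(demo, "snapshot." + stamp))
print(prune_backups(demo, keep=2))  # the two newest remain
shutil.rmtree(demo)
```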
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/08/13/backing-up-your-index/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Deep paging problem</title>
		<link>https://solr.pl/en/2011/07/18/deep-paging-problem/</link>
					<comments>https://solr.pl/en/2011/07/18/deep-paging-problem/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 18 Jul 2011 19:45:36 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[deep]]></category>
		<category><![CDATA[deep paging]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[paging]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=359</guid>

					<description><![CDATA[Imagine the following problem &#8211; we have an application that expects Solr to return the results sorted on the basis of some field. Those results will then be paged in the GUI. However, if the person using the GUI application]]></description>
										<content:encoded><![CDATA[<p>Imagine the following problem &#8211; we have an application that expects Solr to return results sorted on the basis of some field. Those results will then be paged in the GUI. However, if the person using the GUI application immediately selects the tenth, twentieth, or fiftieth page of search results there is a problem &#8211; the wait time. Is there anything we can do about this? Yes, we can help Solr a bit.</p>
<p><span id="more-359"></span></p>
<h3>A few numbers at the beginning</h3>
<p>Let&#8217;s start with the query and statistics. Imagine that we have the following query, which is sent to Solr to get the five hundredth page of search results:
</p>
<pre class="brush:xml">q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=50000</pre>
<p>What must Solr do to retrieve and render the results list? Of course, read the documents from the Lucene index. The question is how many documents have to be read from the index. Is it 100? Unfortunately, no. Solr must collect 50,100 sorted documents from the Lucene index, because we want 100 documents starting from the 50,000th one. Kinda scary. Now let&#8217;s compare how long it takes Solr to return the first page of search results (the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=0</code>) and how long it takes to render the last page of search results (i.e. the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=999900</code>). The test was performed on an index containing a million documents, consisting of four fields: <code>id</code> (<code>string</code>), <code>name</code> (<code>text</code>), <code>description</code> (<code>text</code>), <code>price</code> (<code>long</code>). Before every test iteration Solr was started, the query was run, and Solr was turned off. These steps were repeated 100 times for each of the queries, and the times seen in the table are the arithmetic mean of the query execution times. Here are the test results:</p>
<p><em>(Test results table missing.)</em></p>

<h3>Typical solutions</h3>
<p>Of course we can try tuning the caches or the <em>queryResultWindowSize</em> value, but there is the problem of how large to make them; there may be situations where they are insufficient or the relevant entry is not in Solr&#8217;s memory, and then the wait time for the n-th search page will be very long. We can also try adding warming queries, but we won&#8217;t be able to prepare all the combinations, and even if we could, the cache would have to be huge. So we won&#8217;t be able to achieve the desired results with any of these solutions.</p>
<h3>Filters, filters and filter again</h3>
<p>This behavior of Solr (and other applications based on Lucene too) is caused by the queue of documents, a so-called priority queue, which in order to display the last page of results must hold all documents matching the query, so that the ones located on the desired page can be returned. In our case, if we want the first page of search results the queue will have 100 entries. However, if we want the last page, Solr will have to put one million documents in the queue. Of course, this is a big simplification.</p>
<p>The idea is to limit the number of documents Lucene must put in the queue. How do we do it? We will use filters to help us, so in Solr we will use the <em>fq</em> parameter. Using a filter limits the number of search results. The ideal queue size would be the one passed in the <em>rows</em> parameter of the query. However, that situation is ideal and not achievable in most cases. An additional problem is that when sending a query with a filter we cannot determine the number of results in advance, because we do not know how many documents the filter will match. The solution is to make two queries instead of one &#8211; the first to see how limiting our filter is (using <code>rows=0</code> and <code>start=0</code>), and the second with an appropriately recalculated <em>start</em> value (example below).</p>
<p>The maximum price of a product in the test data is 10,000 and the minimum is 0. So to the first query we will add the following bracket: <code>&lt;0; 1000&gt;</code>, and to the second query we will add the following bracket: <code>&lt;9000; 10000&gt;</code>.</p>
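<p>The recalculated <em>start</em> value for the second query follows from simple arithmetic on the <em>numFound</em> returned by the first (<code>rows=0</code>) query. A Python sketch of that calculation (the numbers are hypothetical, chosen only to be consistent with the <em>start=100037</em> used in this post, not taken from an actual run):</p>

```python
def filtered_start(global_start, total_docs, filter_num_found):
    """Translate an offset into the full sorted result set into an
    offset into the filtered result set that covers its tail."""
    # Documents the filter cut off the front of the sort order:
    skipped = total_docs - filter_num_found
    return global_start - skipped

# Hypothetical numbers: 1M docs total, the price filter matching
# 100,137 of them, and we want the page starting at offset 999,900.
print(filtered_start(999900, 1000000, 100137))  # 100037
```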
<h3>Disadvantages of solution based on filters</h3>
<p>There is one downside to the filter-based solution, and it is quite significant. It may happen that the number of results to which the filter limits the query is too small. What then? We should widen the chosen bracket for the filter. Of course, we can calculate the optimal brackets on the basis of our data, but that depends on the data and the queries, which is why I won&#8217;t go into it at this point.</p>
<h3>What is the performance after the change?</h3>
<p>So let&#8217;s repeat the tests, but now with the filter-based approach. The first query will just return the first page of results (the query <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=0&amp;fq=price:[0+TO+1000]</code>). The second case (actually two queries) will first check the number of results and then fetch those results (the two queries: <code>q=*:*&amp;sort=price+asc&amp;rows=0&amp;start=0&amp;fq=price:[9000+TO+10000]</code> and <code>q=*:*&amp;sort=price+asc&amp;rows=100&amp;start=100037&amp;fq=price:[9000+TO+10000]</code>). Note the changed <em>start</em> parameter, due to the fact that we get fewer search results (this is caused by the <em>fq</em> parameter). This test was carried out in a similar way to the previous one &#8211; start Solr, run the query (or queries), and shut down Solr. The numbers seen in the table are the arithmetic mean of the query execution times.</p>
<p><em>(Test results table missing.)</em></p>

<p>As you can see, the query performance improved. We can therefore conclude that we succeeded. Of course, we could be tempted by further optimizations, but for now let&#8217;s say that we are satisfied with the results. I suspect, however, that you may ask the following question:</p>
<h3>How is this handled by other search engines ?</h3>
<p>Good question, but the answer is actually trivial &#8211; one of the methods is simply to prevent very deep paging. Google, for example, uses this method. Try searching for the word &#8220;ask&#8221; and going past the 91st page of search results. Google didn&#8217;t let me 😉</p>
<h3>Conclusions</h3>
<p>As you can see, deep paging performance after our changes increased significantly. We can now let users page through search results without worrying that it will kill our Solr instance and the machine it runs on. Of course, this method is not without flaws, because we need to know certain parameters (e.g., in our case, the maximum price for the sort), but it is a solution that lets you provide search results with relatively low latency compared to the pre-optimization method.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/07/18/deep-paging-problem/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Index &#8211; delete or update?</title>
		<link>https://solr.pl/en/2011/02/16/index-delete-or-update/</link>
					<comments>https://solr.pl/en/2011/02/16/index-delete-or-update/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Wed, 16 Feb 2011 08:16:33 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[delete]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=213</guid>

					<description><![CDATA[From time to time, in working with Solr there is a problem &#8211; how to update Solr index structure. There are various reasons for these changes &#8211; the new functional requirements, optimization, or anything else &#8211; it is not important.]]></description>
										<content:encoded><![CDATA[<p>From time to time, when working with Solr, a problem arises &#8211; how to update the Solr index structure. There are various reasons for such changes &#8211; new functional requirements, optimization, or anything else &#8211; it is not important. What is important is the question that arises &#8211; should we remove the index, or simply change the structure and do a full indexing? Contrary to appearances, the answer to this question depends on the changes we made to the structure of the index.</p>
<p><span id="more-213"></span></p>
<p>Personally, I am an advocate of solutions that have the smallest chance of causing problems &#8211; I just like to sleep at night. I think that removing the index after updating its structure and then doing a full re-indexation of the data is one of those solutions, at least in my opinion. I am aware, however, that this type of solution is not always acceptable. So when are we not forced to remove the index, and when does not doing so expose us to potential problems with Solr?</p>
<p>The answer to the question depends on what changed in the structure of the index. Such changes can be divided into three areas covering most of the changes that we make in the structure of the index:</p>
<ul>
<li><strong>Adding / removing a field</strong></li>
<li><strong>Similarity modification</strong></li>
<li><strong>Field modification</strong></li>
</ul>
<h3>Adding / removing a field</h3>
<p>In the case of the first type of modification the matter is quite simple &#8211; if we <strong>add or remove a field</strong> in <em>schema.xml</em> there is no need to remove the entire index before re-indexing. Solr handles adding a new field to the current index. Of course, you should be aware that documents which are not re-indexed after this operation will not be updated automatically.</p>
<h3>Similarity modification</h3>
<p>In the second case &#8211; changing the class that is responsible for <em>Similarity</em> &#8211; we are also not forced to delete the index after the change. But unlike the previous example, if we want Solr to correctly calculate the <em>score</em>, and thus sort in the correct order, we will be forced to re-index all documents previously present in the index.</p>
<h3>Field modification</h3>
<p>Let&#8217;s stop for a minute on the third case. Suppose that we slightly modify a field in the index for a prosaic reason &#8211; we are no longer interested in the normalization of its length. We set <em>omitNorms=&#8221;true&#8221;</em> (I assume that the previous setting was <em>omitNorms=&#8221;false&#8221;</em>). If we only re-index all the documents, the Lucene index, in the merged segments, will still have information about length normalization of the field. Something went wrong. This is precisely the case when it is necessary to delete the index after the change to its structure, and prior to full indexation. At first glance this seems like a very small change, but thinking further, it has some side effects. It is worth remembering that some of the field properties are overridden by others, as in the case of length normalization &#8211; if one segment has length normalization and the second does not, when you merge the segments the newly created one will have length normalization.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/02/16/index-delete-or-update/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>CheckIndex for the rescue</title>
		<link>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/</link>
					<comments>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 17 Jan 2011 08:06:58 +0000</pubDate>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[check]]></category>
		<category><![CDATA[check index]]></category>
		<category><![CDATA[checkindex]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[rescue]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=190</guid>

					<description><![CDATA[While using Lucene and Solr we are used to a very high reliability of these products. However, there may come the day when Solr will inform us that our index is corrupted, and we need to do something about it.]]></description>
										<content:encoded><![CDATA[<p>While using Lucene and Solr we are used to the very high reliability of these products. However, there may come a day when Solr informs us that our index is corrupted, and we need to do something about it. Is restoring from a backup or doing a full indexation the only way to repair the index? Not necessarily &#8211; there is hope in the form of the CheckIndex tool.</p>
<p><span id="more-190"></span></p>
<h3>What is CheckIndex ?</h3>
<p>CheckIndex is a tool available in the Lucene library which allows you to check the index files and create new segments that do not contain problematic entries. This means that this tool, with a small loss of data, is able to repair a broken index, and thus save us from having to restore the index from a backup (if we have one) or do a full indexing of all documents that were stored in Solr.</p>
<h3>Where do I start?</h3>
<p>Please note that, according to the Javadocs, this tool is experimental and may change in the future. Therefore, before starting to work with it we should create a copy of the index. In addition, it is worth knowing that the tool analyzes the index byte by byte, so for large indexes the analysis and repair may take a long time. It is important not to run the tool with the <em>-fix</em> option while the index is in use by Solr or another application based on the Lucene library. Finally, be aware that running the tool in repair mode may result in the removal of some or all documents stored in the index.</p>
<h3>How to run it?</h3>
<p>To run the utility, go to the directory where the Lucene library files are located and run the following command:
</p>
<pre class="brush:bash">java -cp lucene-core.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex INDEX_PATH -fix</pre>
<p>In my case, it looked as follows:
</p>
<pre class="brush:bash">java -cp lucene-core-2.9.3.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex E:\\Solr\\solr\\data\\index\\ -fix</pre>
<p>After a while I got the following information:
</p>
<pre class="brush:bash">Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [15 fields]
test: field norms.........OK [15 fields]
test: terms, freq, prox...OK [900 terms; 1517 terms/docs pairs; 1707 tokens]
test: stored fields.......OK [232 total field count; avg 12,211 fields per doc]
test: term vectors........OK [3 total vector count; avg 0,158 term/freq vector fields per doc]

No problems were detected with this index.</pre>
<p>This means that the index is correct and there was no need for any corrective action. Additionally, you can learn some interesting things about the index <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h3>Broken index</h3>
<p>But what happens in the case of a broken index? There is only one way to find out &#8211; let&#8217;s try. So, I broke one of the index files and ran the CheckIndex tool again. The following appeared on the console:
</p>
<pre class="brush:bash">Opening index @ E:\Solr\solr\data\index\

Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_0 docCount=19
compound=false
hasProx=true
numFiles=11
size (MB)=0,018
diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file "_0.fnm": read 150 vs size 152
at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:370)
at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:71)
at org.apache.lucene.index.SegmentReader$CoreReaders.&lt;init&gt;(SegmentReader.java:119)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 19 documents) detected
WARNING: 19 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 19 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
5...
4...
3...
2...
1...
Writing...
OK
Wrote new segments file "segments_3"</pre>
<p>As you can see, all 19 documents that were in the index have been removed. This is an extreme case, but you should be aware that the tool can behave this way.</p>
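<p>For completeness, the same check-and-repair cycle can also be run from Java code. The sketch below assumes the Lucene 2.9.x API used in this post (class and method names differ in later versions &#8211; for instance <em>fixIndex</em> was later renamed) and, like the command line, should only ever be pointed at a copy of the index:</p>

```java
import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class RepairIndex {
  public static void main(String[] args) throws Exception {
    // Open the index directory given as the first argument.
    FSDirectory dir = FSDirectory.open(new File(args[0]));
    CheckIndex checker = new CheckIndex(dir);

    // Analyze all segments; status.clean is true when nothing is broken.
    CheckIndex.Status status = checker.checkIndex();
    if (status.clean) {
      System.out.println("No problems were detected with this index.");
    } else {
      // Programmatic equivalent of the -fix switch: writes a new segments
      // file that drops the broken segments (and the documents in them).
      checker.fixIndex(status);
    }
    dir.close();
  }
}
```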
<h3>The end</h3>
<p>If you keep in mind the basic assumptions associated with using the CheckIndex tool, you may one day find yourself in a situation where it comes in handy and you will not have to ask yourself questions like &#8220;When was the last backup made?&#8221;.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Quick look &#8211; IndexSorter</title>
		<link>https://solr.pl/en/2010/10/04/quick-look-indexsorter/</link>
					<comments>https://solr.pl/en/2010/10/04/quick-look-indexsorter/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 04 Oct 2010 12:14:20 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index sorter]]></category>
		<category><![CDATA[index sorting]]></category>
		<category><![CDATA[indexsorter]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[sorting]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=82</guid>

					<description><![CDATA[At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately the tool he mentioned was not]]></description>
										<content:encoded><![CDATA[<p>At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Białecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately the tool he mentioned was not available in Solr &#8211; but that has changed.</p>
<p><span id="more-82"></span></p>
<p>At the time of writing, the described tool is available only in the branch named <em>branch_3x</em> in SVN, but it is planned to migrate this functionality to version 4.x.</p>
<h3>But what is it?</h3>
<p>When using techniques that terminate the search after a predetermined time, without regard to the number of results gathered, at some point we run into the problem of search result quality. Instead of receiving the best results in the context of the query, we get results in a seemingly random fashion (or at least they may look random). This means that we are not able to ensure that the user of the system gets the best matching results. Of course, we are talking about the situation when the search is terminated after a predetermined period of time, which is why Solr cannot gather all the documents that match the query.</p>
<h3>Is it useful for me?</h3>
<p>When may ending a search after a predetermined time be useful? There are many use cases for such a search. Imagine that our deployment is composed of many separate shards, each operating on a large amount of data. When making a distributed query, each of the shards present in the search system must be queried for relevant documents, and then all results must be gathered and displayed to the end user (of course, this need not be a person; it may be an application). But what if each of the shards needs a very long time to process all search results, and we are, for example, only interested in those added recently (e.g. in the last week)? This is where the possibility of early termination of the search query comes in &#8211; assuming that we are more interested in documents added the day before than in those added two weeks ago.</p>
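<p>As a side note, Solr exposes this kind of time limit through the <em>timeAllowed</em> query parameter (in milliseconds); the host and field name in the example below are assumptions for illustration. When the limit is hit, Solr returns the documents collected so far and flags the response header with <em>partialResults=true</em>:</p>

```
http://localhost:8983/solr/select?q=*:*&sort=added+desc&timeAllowed=500
```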
<h3>How to achieve it?</h3>
<p>The example above illustrates a case where we can use a search that is terminated after a specified time. However, looking further into the search results we come to a problem &#8211; to sort search results Solr must collect them all. So when making a query with a sort parameter like <code>sort=added+desc</code>, to get the documents sorted correctly each of the shards would have to return all search results &#8211; does this mean that we can&#8217;t use early termination of the search? Not really. To help us, Solr provides a tool &#8211; IndexSorter, which until now was available only in the Apache Nutch project, but was recently committed to Lucene and Solr. With this tool, we can pre-sort the index by the parameter that we need. Thus, with an index sorted in descending order by the date a document was added, Solr would read the most recently added documents first, and we would be able to use early termination.</p>
<h3>Using IndexSorter</h3>
<p>What do we need to do to use the IndexSorter tool? To tell the truth, it&#8217;s not that complicated. Note, however, that at the time of publication of this entry the tool is only available in <em>branch_3x</em> of the Lucene/Solr project. To sort an index on the basis of a field, run the following command from the command line (of course keeping in mind the appropriate location of the <em>lucene-misc-3.1.jar</em> library &#8211; after building the project you will find it in the <em>lucene/build/contrib/misc</em> directory):
</p>
<pre class="brush:bash">java IndexSorter SOURCE_DIRECTORY TARGET_DIRECTORY FIELD_NAME</pre>
<p>The parameters mean:</p>
<ul>
<li><em>SOURCE_DIRECTORY</em> &#8211; the directory containing the index you want to sort,</li>
<li><em>TARGET_DIRECTORY</em> &#8211; the directory where the sorted index will be saved,</li>
<li><em>FIELD_NAME</em> &#8211; the field on the basis of which the index will be sorted.</li>
</ul>
<p>If everything goes correctly, you should see something like this:
</p>
<pre class="brush:bash">IndexSorter: done, 896 total milliseconds</pre>
<h3>The end</h3>
<p>In my opinion, Lucene and Solr have just gained a very interesting feature, which can be used, for example, wherever the amount of data is very large, where response time cannot exceed a certain limit, or where the results beyond the first ones (the first 100 or 1000) are not significant. All who are interested in index sorting and early termination techniques should look at the slides of the presentation titled &#8220;<em>Munching and Crunching: Lucene Index Post-Processing</em>&#8221; (<a href="http://lucene-eurocon.org/slides/Munching-&amp;-crunching-Lucene-index-post-processing-and-applications_Andrzej-Bialecki.pdf" target="_blank" rel="noopener noreferrer">slides</a>) given by Andrzej Bialecki during the Lucene Eurocon 2010 conference, which discusses these topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/10/04/quick-look-indexsorter/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5 sins of schema.xml modifications</title>
		<link>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/</link>
					<comments>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 30 Aug 2010 12:08:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[attribute]]></category>
		<category><![CDATA[attributes]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[index structure]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[mistake]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[schema.xml]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[structure]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=71</guid>

					<description><![CDATA[I made a promise and here it is &#8211; an entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the schema.xml file for your system. Feel free to read on 😉]]></description>
										<content:encoded><![CDATA[<p>I made a promise and here it is &#8211; an entry on the most common mistakes made when designing a Solr index, that is, when you create or modify the <em>schema.xml</em> file for your system. Feel free to read on <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><span id="more-71"></span></p>
<p>Each of us knows what the schema.xml file is and what it is for (if not, I invite you to read the entry at: <a href="http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en" target="_blank" rel="noopener noreferrer">http://solr.pl/2010/08/16/what-is-schema-xml/?lang=en</a>). What are the most frequently committed errors when creating or updating this file? I have personally come across the following:</p>
<h3>1. Trash in the configuration</h3>
<p>I admit that my first principle is to keep the <em>schema.xml</em> file in the simplest possible form. Linked to this is a very important issue &#8211; this file should not be synonymous with chaos. In other words, do not clutter it with unnecessary comments, unwanted types, fields and so on. Order in the structure of the <em>schema.xml</em> file not only helps us maintain and modify it with ease, but also assures us that no unnecessary information will be stored in the Solr index.</p>
<h3>2. Cosmetic changes to the default configuration</h3>
<p>How many of those who use Solr in their daily work took the default <em>schema.xml</em> file supplied with the Solr example and only slightly modified its contents &#8211; for example, changing only the names of the fields? I should raise my hand too, because I did it once. This is a pretty big mistake. Someone may ask why. Are you sure you need English stemming when implementing search for content written in Polish? I think not. The same applies to field and type attributes like term vectors.</p>
<h3>3. No updates</h3>
<p>Sometimes I come across search-based applications where an upgrade of Solr does not mean an update of the <em>schema.xml</em> file. If it is a conscious decision, dictated by, say, costly or even impossible re-indexing of all data, I understand the situation. But there are cases where an upgrade would bring only benefits, and where its costs would be minimal (e.g. an inexpensive re-index or slight changes in the application). Do not be afraid to update the <em>schema.xml</em> file &#8211; whether it is updating fields, updating types, or adding newer features. A good example is the migration from Solr 1.3 to version 1.4 &#8211; the newer version introduced significant changes to numeric types, where migrating to the new types results in a great increase in the performance of queries using those types (such as queries using value ranges).</p>
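<p>To make the Solr 1.4 example concrete, here is a sketch of the two kinds of numeric type definitions (the type names are illustrative &#8211; use whatever naming your schema already has):</p>

```xml
<!-- Solr 1.3-era numeric type: a range query has to examine every term -->
<fieldType name="sint" class="solr.SortableIntField" omitNorms="true"/>

<!-- Solr 1.4 trie-based type: precisionStep indexes additional lower-precision
     terms, so range queries such as price:[10 TO 100] visit far fewer terms -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
```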
<h3>4. &#8220;I`ll use it one day&#8221;</h3>
<p>Adding new types while not removing ones that are no longer necessary &#8211; and the same with fields or <em>copyField</em> definitions. Most of us think that an old definition may be useful in the future, but remember that each type is an extra portion of memory needed by Solr, and each field takes up space in the index. My small piece of advice &#8211; if you stop using a type, a field, or anything else in your configuration files (not only in <em>schema.xml</em>), simply remove it. Applying this principle throughout the life cycle of an application using Solr ensures that the index stays in optimal condition, and a few months after implementing yet another feature you will not need to dig into the application code to determine whether a field is still used in some forgotten fragment.</p>
<h3>5. Attributes, attributes and again attributes</h3>
<p>Preserving original values and adding term vectors and their properties are just examples of things we don&#8217;t need in every implementation. Sometimes the index holds more than the application requires. A larger index means lower performance, at least in some cases (e.g. during indexing). It is worth considering whether you really need all the information that we tell Solr to calculate and store. Removing some information that is, from our point of view, unnecessary may surprise us. Sometimes it is worth a try ;)</p>
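<p>As an illustration of this point (the field name is invented), the difference is all in the attributes &#8211; every attribute switched on below costs index space and indexing time:</p>

```xml
<!-- "Everything on": stored copy, term vectors with positions and offsets -->
<field name="description" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Often all that plain search actually needs -->
<field name="description" type="text" indexed="true" stored="false"/>
```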
<p>Feel free to comment &#8211; I will eagerly read what else we should pay attention to when modifying the schema.xml file.</p>
<p>Finally, I think it is worth mentioning the article <em>&#8220;The Seven Deadly Sins of Solr&#8221;</em> published by LucidImagination at: <a href="http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr" target="_blank" rel="noopener noreferrer">http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr</a>. It describes bad practices when working with Solr. In my opinion, it is an interesting read. I highly recommend it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/08/30/5-sins-of-schema-xml-modifications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
