{"id":82,"date":"2010-10-04T14:14:20","date_gmt":"2010-10-04T12:14:20","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=82"},"modified":"2020-11-10T14:14:59","modified_gmt":"2020-11-10T13:14:59","slug":"quick-look-indexsorter","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2010\/10\/04\/quick-look-indexsorter\/","title":{"rendered":"Quick look &#8211; IndexSorter"},"content":{"rendered":"<p>At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Bia\u0142ecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. Unfortunately, the tool he mentioned was not available in Solr at the time &#8211; but that has changed.<\/p>\n\n\n<!--more-->\n\n\n<p>At the time of writing, the described tool is available only in the SVN branch named <em>branch_3x<\/em>, but there are plans to migrate this functionality to version 4.x.<\/p>\n<h3>But what is it?<\/h3>\n<p>When we terminate a search after a predetermined time, regardless of how many results have been collected, we sooner or later run into a problem with the quality of the search results. Instead of receiving the best results for the query, we get them in a random fashion (or at least they may look random). This means that we can&#8217;t guarantee that the user of the system gets the best matching results. Of course, this applies only when the search is terminated after a predetermined period of time, so that Solr can&#8217;t gather all the documents that match the query.<\/p>\n<h3>Is it useful for me?<\/h3>\n<p>When can ending a search after a predetermined time be useful? There are many use cases for such a search. Imagine that our deployment is composed of many separate shards, each of which operates on a large amount of data. 
When making a distributed query, each of the shards in the search system must be queried for relevant documents, and then all the results must be gathered and returned to the end user (who, of course, need not be a person; it may be an application). But what if each of the shards needs a very long time to process all of its search results, and we are, for example, interested only in documents added recently (e.g. within the last week)? This is where early termination of the search query becomes an option &#8211; assuming that we are more interested in documents added the day before than in those added two weeks ago.<\/p>\n<h3>How to achieve it?<\/h3>\n<p>The example above illustrates a case where we can use a search that is terminated after a specified time. However, on closer inspection we come to a problem &#8211; to sort search results, Solr must collect them all. So when making a query with a sort parameter like <code>sort=added+desc<\/code>, each of the shards would have to collect all matching documents in order to return them sorted correctly. Does this mean that we can&#8217;t use early termination? Not really. To help us, Solr provides a tool &#8211; IndexSorter, which until now was available only in the Apache Nutch project, but was recently committed to Lucene and Solr. With this tool, we can pre-sort the index by the field that we need. Thus, with an index sorted in descending order by the date a document was added, Solr would first get the documents that were added most recently, and we would be able to use early termination.<\/p>\n<h3>Using IndexSorter<\/h3>\n<p>What do we need to do to use the IndexSorter tool? To tell you the truth, it&#8217;s not that complicated. Note, however, that at the time of publication of this entry the tool is available only in <em>branch_3x<\/em> of the Lucene\/Solr project. 
To sort an index on the basis of a field, run the following command from the command line (keeping in mind, of course, the location of the <em>lucene-misc-3.1.jar<\/em> library &#8211; after building the project we find it in the <em>lucene\/build\/contrib\/misc<\/em> directory):\n<\/p>\n<pre class=\"brush:bash\">java IndexSorter SOURCE_DIRECTORY TARGET_DIRECTORY FIELD_NAME<\/pre>\n<p>The parameters mean:<\/p>\n<ul>\n<li><em>SOURCE_DIRECTORY <\/em>&#8211; the directory containing the index that you want to sort,<\/li>\n<li><em>TARGET_DIRECTORY<\/em> &#8211; the directory where the sorted index will be saved,<\/li>\n<li><em>FIELD_NAME <\/em>&#8211; the field by which the index will be sorted.<\/li>\n<\/ul>\n<p>If everything goes well, you should see something like this:\n<\/p>\n<pre class=\"brush:bash\">IndexSorter: done, 896 total milliseconds<\/pre>\n<h3>The end<\/h3>\n<p>In my opinion, Lucene and Solr just got a very interesting feature, which can be used, for example, wherever the amount of data is very large, when the response time cannot exceed a certain limit, or when the results beyond the first ones (the first 100 or 1000) are not significant. Anyone interested in the subject of index sorting and early termination techniques should take a look at the slides of the presentation titled &#8220;<em>Munching and Crunching: Lucene Index Post-Processing<\/em>&#8221; (<a href=\"http:\/\/lucene-eurocon.org\/slides\/Munching-&amp;-crunching-Lucene-index-post-processing-and-applications_Andrzej-Bialecki.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">slides<\/a>), given by Andrzej Bia\u0142ecki at the Lucene Eurocon 2010 conference, in which he discussed these topics.<\/p>","protected":false},"excerpt":{"rendered":"<p>At the Apache Lucene Eurocon 2010 conference, which took place in May this year, Andrzej Bia\u0142ecki talked in his presentation about how to obtain satisfactory search results when using early termination search techniques. 
Unfortunately, the tool he mentioned was not<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[195,223,222,224,162,164,225],"class_list":["post-82","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-index-2","tag-index-sorter-2","tag-index-sorting","tag-indexsorter-2","tag-lucene-2","tag-solr-2","tag-sorting-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/82","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=82"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/82\/revisions"}],"predecessor-version":[{"id":83,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/82\/revisions\/83"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=82"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=82"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=82"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}