<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>cell &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/cell-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 22:42:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Simple photo search</title>
		<link>https://solr.pl/en/2012/02/20/simple-photo-search/</link>
					<comments>https://solr.pl/en/2012/02/20/simple-photo-search/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 20 Feb 2012 22:41:31 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[cell]]></category>
		<category><![CDATA[exif]]></category>
		<category><![CDATA[extract]]></category>
		<category><![CDATA[photo]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[solr cell]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=441</guid>

					<description><![CDATA[Recently we had a chance to help with a non-commercial project which included search as its part. One of the assumptions, although not the key ones, was the photo search functionality, so that the user could find the pictures fast]]></description>
										<content:encoded><![CDATA[<p>Recently we had a chance to help with a non-commercial project which included search as one of its parts. One of the assumptions, although not a key one, was photo search functionality, so that users could find pictures quickly and accurately. Because the search had to work with the metadata of JPEG files, the idea was simple &#8211; use Apache Solr with Apache Tika.</p>
<p><span id="more-441"></span></p>
<h3>Assumptions</h3>
<p>The assumptions were quite simple &#8211; the user should be able to find photos by their file name, author and other data available in EXIF, like aperture, shutter speed, focal length or ISO value. Another thing was that Solr should take care of extracting the metadata from JPEG files, so this was definitely something we wanted to use Solr Cell for. As you can see, those assumptions were simple.</p>
<h3>Index structure</h3>
<p>The index structure was very simple and contained only the most needed fields. The fields section of the <em>schema.xml</em> file looked as follows:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
&lt;field name="name" type="text" indexed="true" stored="true" /&gt;
&lt;field name="author" type="text" indexed="true" stored="true" /&gt;
&lt;field name="iso" type="text" indexed="true" stored="true" multiValued="true" /&gt;
&lt;field name="iso_string" type="text" indexed="true" stored="true" multiValued="true" /&gt;
&lt;field name="aperture" type="double" indexed="true" stored="true" /&gt;
&lt;field name="exposure" type="string" indexed="true" stored="true" /&gt;
&lt;field name="exposure_time" type="double" indexed="true" stored="true" /&gt;
&lt;field name="focal" type="string" indexed="true" stored="true" /&gt;
&lt;field name="focal_35" type="string" indexed="true" stored="true" /&gt;
&lt;dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" /&gt;</pre>
<p>The dynamic field was added to ignore the data we weren&#8217;t interested in. A <em>copyField</em> was also introduced to copy the <em>iso</em> field value to the <em>iso_string</em> field to enable faceting.</p>
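<p>A minimal sketch of that <em>copyField</em> definition (placed in <em>schema.xml</em> next to the fields section) could look like this:</p>
<pre class="brush:xml">&lt;copyField source="iso" dest="iso_string" /&gt;</pre>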
<h3>Solr configuration</h3>
<p>The following handler definition was added to <em>solrconfig.xml </em>file:
</p>
<pre class="brush:xml">&lt;requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler"&gt;
 &lt;lst name="defaults"&gt;
  &lt;str name="uprefix"&gt;ignored_&lt;/str&gt;
  &lt;str name="lowernames"&gt;true&lt;/str&gt;
  &lt;str name="captureAttr"&gt;true&lt;/str&gt;
  &lt;str name="fmap.stream_name"&gt;name&lt;/str&gt;
  &lt;str name="fmap.artist"&gt;author&lt;/str&gt;
  &lt;str name="fmap.exif_isospeedratings"&gt;iso&lt;/str&gt;
  &lt;str name="fmap.exif_fnumber"&gt;aperture&lt;/str&gt;
  &lt;str name="fmap.exposure_time"&gt;exposure&lt;/str&gt;
  &lt;str name="fmap.exif_exposuretime"&gt;exposure_time&lt;/str&gt;
  &lt;str name="fmap.focal_length"&gt;focal&lt;/str&gt;
  &lt;str name="fmap.focal_length_35"&gt;focal_35&lt;/str&gt;
 &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>A few words about the configuration. The <em>uprefix</em> parameter tells Solr which prefix it should use for fields that were not mentioned explicitly in the handler configuration. In the above case, such fields will be prefixed with the <em>ignored_</em> word. That means they will be matched by the dynamic field and thus won&#8217;t be indexed (<em>stored=&#8221;false&#8221;</em> and <em>indexed=&#8221;false&#8221;</em>). The <em>lowernames</em> parameter with the value of <em>true</em> causes all field names to be lowercased. The <em>captureAttr</em> parameter tells Solr to capture file attributes. The remaining parameters in the above configuration are mapping definitions between fields returned by Tika and fields in the index. For example, <em>fmap.exif_fnumber</em> with the value of <em>aperture</em> tells Solr to place the value of the Tika <em>exif_fnumber</em> field in the <em>aperture</em> index field.</p>
<h4>Additional required libraries</h4>
<p>In order for the above configuration to work we need some additional libraries (similar to the ones described in <a href="http://solr.pl/en/2012/01/23/document-language-identification/" target="_blank" rel="noopener noreferrer">language identification</a>). From the <em>dist</em> directory that is available in the Solr distribution we copy the <em>apache-solr-cell-3.5.0.jar</em> file to the <em>tikaDir</em> directory, which should be created at the same level as the <em>webapps</em> directory in the Solr deployment (of course this is just an example). Next we add the following line to the <em>solrconfig.xml</em> file:
</p>
<pre class="brush:xml">&lt;lib dir="../tikaLib/" /&gt;</pre>
<p>The above tells Solr to include all the libraries from the given directory. Next we need to copy all the jar files from the <em>contrib/extraction/</em> Solr distribution directory to the created <em>tikaDir</em> directory. No additional <em>solrconfig.xml</em> changes are needed.</p>
<h3>Data indexation</h3>
<p>The assumption was that there would be about 10,000 new photos a week that would need to be indexed. Those photos would be stored in a shared file system location. A simple bash script was responsible for choosing the files that needed to be indexed, and during its work it ran the following command for each file:
</p>
<pre class="brush:bash">curl "http://solrmaster:8983/solr/photos/update/extract?literal.id=9926&amp;commit=true" -F "myfile=@Wisla_2011_10_10.JPG"</pre>
<p>The above command sends a file named Wisla_2011_10_10.JPG to the <em>/update/extract</em> handler and tells Solr to run the <em>commit</em> command after processing it. In addition, the unique id of the file is set (the <em>literal.id</em> parameter).</p>
<h3>Queries</h3>
<p>In addition to some standard filtering by author or other attributes of the photo, it was also desired for the search to just work. Yeah, just work <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> We decided that if we were the users of the application, we would like fields like author or file name to be important. So we decided to start with the following query:
</p>
<pre class="brush:xml">q=jan+kowalski+wisla&amp;qf=name^100+author^1000+iso+aperture+exposure_time+focal&amp;defType=dismax</pre>
<p>As you can see, the query is simple. Two fields in the index are more valuable than the others &#8211; the name of the photo and its author. The weight of those fields was set by adding query-time boosts. The rest of the fields are without boost, so the default boost of 1 applies.</p>
<h3>To sum up</h3>
<p>The described deployment is really simple. The application works, and so does the search <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The next steps will be JVM and Solr tuning. One of the most important things will be looking at user behavior and tuning the searches to make the search experience as good as possible. But let&#8217;s leave that for another solr.pl post.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/02/20/simple-photo-search/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr: data indexing for fun and profit</title>
		<link>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/</link>
					<comments>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 06 Sep 2010 12:10:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[acf]]></category>
		<category><![CDATA[cell]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lcf]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=73</guid>

					<description><![CDATA[Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of]]></description>
										<content:encoded><![CDATA[<p>Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to feed our data to the search server and take care of keeping it updated.</p>
<p><span id="more-73"></span></p>
<p>There are a few ways to import data:</p>
<ul>
<li>Update Handler</li>
<li>CSV Request Handler</li>
<li>Data Import Handler</li>
<li>Extracting Request Handler (Solr Cell)</li>
<li>Client libraries (for example Solrj)</li>
<li>Apache Connector Framework (formerly Lucene Connector Framework)</li>
<li>Apache Nutch</li>
</ul>
<p>In addition to the methods mentioned above, you can stream your data to the search server. As you can see, there is some confusion here, and it is not obvious at first glance which method is best in a particular case.</p>
<h2>Update Handler</h2>
<p>Perhaps the most popular method because of its simplicity. It requires preparing a corresponding XML file and sending it via HTTP to the Solr server. It enables boosting of documents and individual fields.</p>
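<p>A minimal sketch of such an update file (the field names and boost values here are hypothetical) could look like this:</p>
<pre class="brush:xml">&lt;add&gt;
 &lt;doc boost="2.0"&gt;
  &lt;field name="id"&gt;1&lt;/field&gt;
  &lt;field name="name" boost="10.0"&gt;Example document&lt;/field&gt;
 &lt;/doc&gt;
&lt;/add&gt;</pre>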
<h2>CSV Request Handler</h2>
<p>When we have data in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format, this option may be the most convenient. Unfortunately, in contrast to the Update Handler, it is not possible to boost documents or fields.</p>
<h2>Data Import Handler</h2>
<p>This method is less common and requires additional, sometimes quite complicated, configuration, but it allows direct linking to the data source. Using DIH we do not need any additional scripts for exporting data from a source to the format required by Solr. What we get out of the box is: integration with databases (based on JDBC), integration with sources available as XML (for example RSS), e-mail integration (via the IMAP protocol) and integration with documents which can be parsed by Apache Tika (like OpenOffice documents, Microsoft Word, RTF, HTML, and many, many more). In addition it is possible to develop your own sources and transformations.</p>
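<p>A minimal sketch of a DIH configuration for a JDBC source (the driver, connection data and table names below are hypothetical) could look like this:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
 &lt;dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/shop" user="solr" password="secret" /&gt;
 &lt;document&gt;
  &lt;entity name="product" query="SELECT id, name FROM products"&gt;
   &lt;field column="id" name="id" /&gt;
   &lt;field column="name" name="name" /&gt;
  &lt;/entity&gt;
 &lt;/document&gt;
&lt;/dataConfig&gt;</pre>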
<h2>Extracting Request Handler (Solr Cell)</h2>
<p>A specialized handler for indexing the content of documents stored in files of different formats. The list of supported formats is quite extensive and the parsing is performed by Apache Tika. The drawback of this method is the need to build additional solutions that provide Solr with information about the document and its identifier, and that there is no support for providing additional metadata external to the document.</p>
<h2>Client Libraries</h2>
<p>Solr provides client libraries for many programming languages. Their capabilities differ, but if the data are generated by the application itself and the time after which the data must be available for searching is very short, this way of indexing is often the only available option.</p>
<h2>Apache Connector Framework</h2>
<p>ACF is a relatively new project, which was revealed to a wider audience in early 2010. It started as an internal project of the company MetaCarta, was donated to the open source community and is currently being developed within the Apache incubator. The idea is to build a system that allows connecting to data sources with the help of a series of plug-ins. At the moment there is no published version, but the system itself is already worth considering when you need to integrate with systems such as: FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Patriarch (Memex), Meridio (Autonomy), Windows shares (Microsoft) and SharePoint (Microsoft).</p>
<h2>Apache Nutch</h2>
<p>Nutch is in fact a separate Apache project (previously under Apache Lucene, now a top-level project). For a person using Solr, Nutch is interesting because it allows crawling Web pages and indexing them in Solr.</p>
<h2>Word about streaming</h2>
<p>Streaming means the ability to tell Solr where to download the data to be indexed from. This avoids unnecessary data transmission over the network if the data is on the same server as the indexer, or double data transmission (from the source to the importer and from the importer to Solr).</p>
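<p>For example (a sketch; the id and path below are hypothetical, and remote streaming must be enabled in <em>solrconfig.xml</em>), instead of posting the file contents you can point Solr at the file:</p>
<pre class="brush:bash">curl "http://localhost:8983/solr/update/extract?literal.id=1&amp;stream.file=/data/docs/report.pdf"</pre>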
<h2>And a word about security</h2>
<p>Solr, by design, is intended to be used in an architecture that assumes a safe environment. It is very important to control who is able to query Solr and how. While the returned data can be easily restricted by forcing the use of filters in the definition of the handler, in the case of indexing it is not so easy. In particular, the most dangerous seems to be Solr Cell &#8211; it will not only allow reading any file to which Solr has access (eg. files with passwords), but will also provide a convenient method of searching in those files <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Other options</h2>
<p>I tried to mention all the methods that do not require any additional work to make indexing work. The problem may be the definition of this additional work, because sometimes it is easier to write an additional plug-in than to break through numerous configuration options and create a giant XML file. Therefore, the choice of methods was guided by my own sense, which resulted in skipping some methods (like fetching data from WWW pages with the use of Apache Droids or Heritrix, or solutions based on Open Pipeline or Open Pipe).</p>
<p>Certainly in this short article I managed to miss some interesting methods. If so, please comment, and I&#8217;ll be glad to update this entry <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
