<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tika &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/tika-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 20:59:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Document language identification</title>
		<link>https://solr.pl/en/2012/01/23/document-language-identification/</link>
					<comments>https://solr.pl/en/2012/01/23/document-language-identification/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 23 Jan 2012 20:59:03 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[3.5]]></category>
		<category><![CDATA[document]]></category>
		<category><![CDATA[identification]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=396</guid>

					<description><![CDATA[One of the features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today&#8217;s entry we will see how Apache Solr works together with Apache Tika to identify the]]></description>
										<content:encoded><![CDATA[<p>One of the features of the latest Solr version (<a href="http://solr.pl/en/2011/11/27/apache-lucene-and-solr-3-5/" target="_blank" rel="noopener noreferrer">3.5</a>) is the ability to identify the language of a document during indexing. In today&#8217;s entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.</p>
<p><span id="more-396"></span></p>
<h3>At the beginning</h3>
<p>You should remember that the described functionality was introduced in Solr 3.5.</p>
<h3>Assumptions</h3>
<p>We will be using two fields to identify the document language:&nbsp;<em>title</em>&nbsp;and&nbsp;<em>body</em>. We want to store the information of the detected language in the <em>lang</em>&nbsp;field.</p>
<h3>Index structure</h3>
<p>The structure of our index is of course simplified and contains only the fields needed for the test. So the field definition part of the <em>schema.xml</em>&nbsp;file looks like this:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
&lt;field name="title" type="text_ws" indexed="true" stored="true" /&gt;
&lt;field name="body" type="text_ws" indexed="true" stored="true" /&gt;
&lt;field name="lang" type="string" indexed="true" stored="true" /&gt;</pre>
<p>All the fields are marked as&nbsp;<em>stored=&#8221;true&#8221;</em>&nbsp;for simplicity.</p>
<h3>Update request processor configuration</h3>
<p>In order to use the language identification feature we need to configure a Solr update request processor. We will be using the one based on Apache Tika (there is a second implementation based on&nbsp;<a href="http://code.google.com/p/language-detection/">http://code.google.com/p/language-detection/</a>). To configure the processor we add the following to the <em>solrconfig.xml</em>&nbsp;file:
</p>
<pre class="brush:xml">&lt;updateRequestProcessorChain name="langid"&gt;
  &lt;processor name="langid" class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"&gt;
    &lt;lst name="defaults"&gt;
      &lt;str name="langid.fl"&gt;title,body&lt;/str&gt;
      &lt;str name="langid.langField"&gt;lang&lt;/str&gt;
    &lt;/lst&gt;
  &lt;/processor&gt;
  &lt;processor class="solr.LogUpdateProcessorFactory" /&gt;
  &lt;processor class="solr.RunUpdateProcessorFactory" /&gt;
&lt;/updateRequestProcessorChain&gt;</pre>
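<p>For very short documents the detection may fail, so it can be useful to define a fallback language. A minimal sketch of such a configuration, assuming we want to fall back to English, uses the <em>langid.fallback</em>&nbsp;parameter:</p>
<pre class="brush:xml">&lt;lst name="defaults"&gt;
  &lt;str name="langid.fl"&gt;title,body&lt;/str&gt;
  &lt;str name="langid.langField"&gt;lang&lt;/str&gt;
  &lt;str name="langid.fallback"&gt;en&lt;/str&gt;
&lt;/lst&gt;</pre>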
<p>Other parameters of the <em>TikaLanguageIdentifierUpdateProcessorFactory</em>&nbsp;are described on the Apache Solr wiki at:&nbsp;<a href="http://wiki.apache.org/solr/LanguageDetection">http://wiki.apache.org/solr/LanguageDetection</a>.</p>
<h3>Additional libraries</h3>
<p>In order for the update request processor to work we need some additional libraries. From the <em>dist</em>&nbsp;directory of the Apache Solr distribution we copy&nbsp;<em>apache-solr-langid-3.5.0.jar</em>&nbsp;to a&nbsp;<em>tikaLib</em>&nbsp;directory (for example), which we create on the same level as the <em>webapps</em>&nbsp;directory. Then we add the following line to the&nbsp;<em>solrconfig.xml </em>file:
</p>
<pre class="brush:xml">&lt;lib dir="../tikaLib/" regex="apache-solr-langid-\d.*\.jar" /&gt;</pre>
<p>The next library we will need is the Tika jar with all the goodies (<em>tika-app-1.0.jar</em>), which we can download from: <a href="http://tika.apache.org/">http://tika.apache.org/</a>. We place it in the same <em>tikaLib</em>&nbsp;directory and then add the following entry to the <em>solrconfig.xml</em>&nbsp;file<em>:</em>
</p>
<pre class="brush:xml">&lt;lib dir="../tikaLib/" regex="tika-app-1.0.jar" /&gt;</pre>
<h3>Test documents</h3>
<p>For testing purposes I prepared three documents: the first in English, the second in Polish, and the third in German. Their content was taken from Wikipedia. They look as follows:</p>
<h4>tika_en.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;1&lt;/field&gt;
  &lt;field name="title"&gt;Water&lt;/field&gt;
  &lt;field name="body"&gt;Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h4>tika_pl.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;2&lt;/field&gt;
  &lt;field name="title"&gt;Woda&lt;/field&gt;
  &lt;field name="body"&gt;Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h4>tika_de.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;3&lt;/field&gt;
  &lt;field name="title"&gt;Wasser&lt;/field&gt;
  &lt;field name="body"&gt;Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h3>More testing</h3>
<p>To index the data I used the following shell commands:
</p>
<pre class="brush:xml">curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_pl.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_en.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_de.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary '&lt;commit/&gt;' -H 'Content-type:application/xml'</pre>
<p>It is worth noticing the additional <em>update.chain=langid</em>&nbsp;parameter added to the request. This parameter tells Solr which update processor chain to use when indexing the data. In the example we told Solr to use the chain we defined.</p>
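<p>If you don&#8217;t want to pass the <em>update.chain</em>&nbsp;parameter with every request, you can make the chain the default for the update handler. A sketch of such a configuration in <em>solrconfig.xml</em>&nbsp;(assuming the standard XML update handler) could look like this:</p>
<pre class="brush:xml">&lt;requestHandler name="/update" class="solr.XmlUpdateRequestHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;str name="update.chain"&gt;langid&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>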
<h3>Indexed data</h3>
<p>So let&#8217;s have a look at the indexed data. We will do that by running the following query: <em>q=*:*&amp;indent=true</em>.
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;0&lt;/int&gt;
  &lt;lst name="params"&gt;
    &lt;str name="indent"&gt;true&lt;/str&gt;
    &lt;str name="q"&gt;*:*&lt;/str&gt;
  &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="3" start="0"&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.&lt;/str&gt;
    &lt;str name="id"&gt;2&lt;/str&gt;
    &lt;str name="lang"&gt;pl&lt;/str&gt;
    &lt;str name="title"&gt;Woda&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.&lt;/str&gt;
    &lt;str name="id"&gt;1&lt;/str&gt;
    &lt;str name="lang"&gt;en&lt;/str&gt;
    &lt;str name="title"&gt;Water&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.&lt;/str&gt;
    &lt;str name="id"&gt;3&lt;/str&gt;
    &lt;str name="lang"&gt;de&lt;/str&gt;
    &lt;str name="title"&gt;Wasser&lt;/str&gt;
  &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>As you can see, Solr, with the use of Tika, was able to identify the languages of the indexed documents. Of course, let&#8217;s not be too optimistic: mistakes happen, especially when dealing with multi-language documents, but that&#8217;s understandable.</p>
<h3>To sum up</h3>
<p>You should remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the documents, the better it will work. Of course, one problem is that we can&#8217;t use language identification at query time, but that&#8217;s not a limitation of Solr and Tika alone. You can deal with it by identifying your user, their web browser, or the place they are located in.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/01/23/document-language-identification/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr and Tika integration (part 1 &#8211; basics)</title>
		<link>https://solr.pl/en/2011/03/21/solr-and-tika-integration-part-1-basics/</link>
					<comments>https://solr.pl/en/2011/03/21/solr-and-tika-integration-part-1-basics/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 21 Mar 2011 08:19:50 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=223</guid>

					<description><![CDATA[Indexing the so-called &#8220;rich documents&#8221;, i.e. files like PDF, DOC, RTF, and so on (or binary files), has always required some additional work on the developer&#8217;s side, at least to get the contents of the file and prepare it in a]]></description>
										<content:encoded><![CDATA[<p>Indexing the so-called &#8220;rich documents&#8221;, i.e. files like PDF, DOC, RTF, and so on (or binary files), has always required some additional work on the developer&#8217;s side, at least to get the contents of the file and prepare it in a format understood by the search engine, in this case Solr. To minimize this work I decided to look at <a title="http://tika.apache.org/" href="http://tika.apache.org/" target="_blank" rel="noopener noreferrer">Apache Tika</a> and the integration of this library with Solr.</p>
<p><span id="more-223"></span></p>
<h3>Introduction</h3>
<p>First, a few words about the possibilities we have when we choose Apache Tika. Apache Tika is a framework designed to extract information from the so-called &#8220;rich documents&#8221;: documents such as PDF files, files in Microsoft Office formats, RTF, and more. Using Apache Tika we can also extract information from compressed documents, HTML files, images (e.g. jpg, png, gif), audio files (e.g. mp3, midi, wav), and compiled Java bytecode files. In addition, Apache Tika can detect the type of file being processed, which further simplifies working with such documents. It is worth mentioning that the described framework is based on libraries such as PDFBox, Apache POI, and NekoHTML, which indirectly guarantee very good quality of the extracted data.</p>
<h3>Sample index structure</h3>
<p>I&#8217;ll skip how to manually run content extraction with Apache Tika and focus on the integration of this framework with Solr and how trivial it is. Assume that we are interested in the ID, title, and contents of the documents we have to index. Thus we create a simple schema.xml file describing the index structure, which could look like this:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
&lt;field name="tytul" type="text" indexed="true" stored="true"/&gt;
&lt;field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/&gt;</pre>
<h3>Configuration</h3>
<p>To the solrconfig.xml file we add the following entry which defines a handler that will handle the indexing of documents using Apache Tika:
</p>
<pre class="brush:xml">&lt;requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler"&gt;
   &lt;lst name="defaults"&gt;
      &lt;str name="fmap.Last-Modified"&gt;last_modified&lt;/str&gt;
      &lt;str name="uprefix"&gt;ignored_&lt;/str&gt;
    &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
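<p>To illustrate, an indexing request can be sent with curl. This is only a sketch: the file name and identifier below are made up, and we use the <em>literal.id</em>&nbsp;parameter to provide the document identifier, followed by a commit sent to the standard update handler:</p>
<pre class="brush:xml">curl 'http://localhost:8983/solr/update/extract?literal.id=doc1' -F 'myfile=@sample.pdf'
curl 'http://localhost:8983/solr/update' --data-binary '&lt;commit/&gt;' -H 'Content-type:application/xml'</pre>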
<p>All update requests sent to the<em> /update/extract</em> address will be handled by Apache Tika. Of course, remember to send the commit command to the update handler after sending the documents to the handler using Apache Tika. Otherwise, your documents won&#8217;t be visible. In the standard Solr deployment you should send the commit command to the handler located under <em>/update</em>.</p>
<p>In the configuration we told the extraction handler to put the Last-Modified attribute into the <em>last_modified</em> field and to prefix fields that are not defined in the schema with <em>ignored_</em>, so that they can be ignored.</p>
<h3>Additional notes</h3>
<p>If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:
</p>
<pre class="brush:xml">&lt;requestDispatcher handleSelect="true"&gt;
  &lt;requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="10240" /&gt;</pre>
<h3>The end</h3>
<p>All parameters of the ExtractingRequestHandler can be found at: <a title="http://wiki.apache.org/solr/ExtractingRequestHandler" href="http://wiki.apache.org/solr/ExtractingRequestHandler" target="_blank" rel="noopener noreferrer">http://wiki.apache.org/solr/ExtractingRequestHandler</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2011/03/21/solr-and-tika-integration-part-1-basics/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Solr: data indexing for fun and profit</title>
		<link>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/</link>
					<comments>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/#respond</comments>
		
		<dc:creator><![CDATA[Marek Rogoziński]]></dc:creator>
		<pubDate>Mon, 06 Sep 2010 12:10:35 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[acf]]></category>
		<category><![CDATA[cell]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lcf]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=73</guid>

					<description><![CDATA[Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of]]></description>
										<content:encoded><![CDATA[<p>Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that we have prepared the configuration files, what remains for us is to share our data with the search server and take care of keeping it updated.</p>
<p><span id="more-73"></span></p>
<p>There are a few ways to import data:</p>
<ul>
<li>Update Handler</li>
<li>CSV Request Handler</li>
<li>Data Import Handler</li>
<li>Extracting Request Handler (Solr Cell)</li>
<li>Client libraries (for example Solrj)</li>
<li>Apache Connector Framework (formerly Lucene Connector Framework)</li>
<li>Apache Nutch</li>
</ul>
<p>In addition to the methods mentioned above you can stream your data to the search server. As you can see, there is some confusion here and it is hard to tell at first glance which method is best in a particular case.</p>
<h2>Update Handler</h2>
<p>Perhaps the most popular method because of its simplicity. It requires preparing a corresponding XML file and then sending it via HTTP to the Solr server. It enables boosting of documents and individual fields.</p>
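<p>As a sketch, a document with boosts for the whole document and for a single field could look like this (the field names here are only examples and must exist in your schema):</p>
<pre class="brush:xml">&lt;add&gt;
  &lt;doc boost="2.0"&gt;
    &lt;field name="id"&gt;1&lt;/field&gt;
    &lt;field name="title" boost="1.5"&gt;Example title&lt;/field&gt;
  &lt;/doc&gt;
&lt;/add&gt;</pre>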
<h2>CSV Request Handler</h2>
<p>When we have data in CSV format (Comma Separated Values) or in TSV format (Tab Separated Values) this option may be the most convenient. Unfortunately, in contrast to the Update Handler, it is not possible to boost documents or fields.</p>
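<p>A minimal sketch of sending such a file (assuming the CSV handler is mapped to the standard <em>/update/csv</em>&nbsp;path; the file name is made up) could look like this:</p>
<pre class="brush:xml">curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @data.csv -H 'Content-type:text/plain; charset=utf-8'</pre>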
<h2>Data Import Handler</h2>
<p>This method is less common and requires additional, sometimes quite complicated, configuration, but it allows connecting directly to the data source. Using DIH we do not need any additional scripts for exporting data from a source to the format required by Solr. What we get out of the box is: integration with databases (based on JDBC), integration with sources available as XML (for example RSS), e-mail integration (via the IMAP protocol), and integration with documents which can be parsed by Apache Tika (like OpenOffice documents, Microsoft Word, RTF, HTML, and many, many more). In addition it is possible to develop your own sources and transformations.</p>
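<p>A minimal sketch of a DIH <em>data-config.xml</em>&nbsp;for a database source (the driver, connection data, table, and column names below are of course made up) could look like this:</p>
<pre class="brush:xml">&lt;dataConfig&gt;
  &lt;dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db" user="user" password="pass" /&gt;
  &lt;document&gt;
    &lt;entity name="doc" query="SELECT id, name FROM documents"&gt;
      &lt;field column="id" name="id" /&gt;
      &lt;field column="name" name="title" /&gt;
    &lt;/entity&gt;
  &lt;/document&gt;
&lt;/dataConfig&gt;</pre>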
<h2>Extracting Request Handler (Solr Cell)</h2>
<p>A specialized handler for indexing the content of documents stored in files of different formats. The list of supported formats is quite extensive and the extraction is performed by Apache Tika. The drawback of this method is the need to build additional solutions that provide Solr with the information about the document and its identifier, and the lack of support for providing additional metadata external to the document.</p>
<h2>Client Libraries</h2>
<p>Solr provides client libraries for many programming languages. Their capabilities differ, but if the data is generated by the application itself and the time after which the data must be available for searching is very low, this way of indexing is often the only available option.</p>
<h2>Apache Connector Framework</h2>
<p>ACF is a relatively new project, which was revealed to a wider audience in early 2010. It started as an internal project of the company MetaCarta, was donated to the open source community, and is currently being developed within the Apache incubator. The idea is to build a system that allows connecting to data sources with the help of a series of plug-ins. At the moment there is no published version, but the system itself is already worth interest if you need to integrate with systems such as: FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Patriarch (Memex), Meridio (Autonomy), Windows shares (Microsoft), and SharePoint (Microsoft).</p>
<h2>Apache Nutch</h2>
<p>Nutch is in fact a separate project run by Apache (previously under Apache Lucene, now a top level project). For a person using Solr, Nutch is interesting because it can crawl Web pages and have them indexed by Solr.</p>
<h2>Word about streaming</h2>
<p>Streaming means the ability to tell Solr where to fetch the data to be indexed. This avoids unnecessary data transmission over the network if the data is on the same server as the indexer, or double data transmission (from the source to the importer and from the importer to Solr).</p>
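<p>As a sketch, with remote streaming enabled (<em>enableRemoteStreaming="true"</em>&nbsp;in solrconfig.xml) you can point Solr at a file on its own disk with the <em>stream.file</em>&nbsp;parameter instead of sending the content over HTTP; the file path and identifier below are made up:</p>
<pre class="brush:xml">curl 'http://localhost:8983/solr/update/extract?stream.file=/data/sample.pdf&amp;literal.id=doc1&amp;commit=true'</pre>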
<h2>And a word about security</h2>
<p>Solr, by design, is intended to be used in an architecture assuming a safe environment. It is very important to control who is able to query Solr and how. While the returned data can be easily restricted by forcing the use of filters in the definition of the handler, in the case of indexing it is not so easy. In particular, the most dangerous seems to be Solr Cell: it will not only allow reading any file to which Solr has access (e.g. files with passwords), but will also provide a convenient method of searching in those files <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Other options</h2>
<p>I tried to mention all the methods that do not require any additional work to make indexing work. The problem may be the definition of this additional work, because sometimes it is easier to write an additional plug-in than to break through numerous configuration options and create a giant XML file. Therefore, the choice of methods was guided by my own judgment, which resulted in skipping some methods (like fetching data from WWW pages with the use of Apache Droids or Heritrix, or solutions based on OpenPipeline or OpenPipe).</p>
<p>Certainly in this short article I managed to miss some interesting methods. If so, please comment and I&#8217;ll be glad to update this entry <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2010/09/06/solr-data-indexing-for-fun-and-profit/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
