<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>language &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/language/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Wed, 11 Nov 2020 22:44:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr 4.0 and Polish language analysis</title>
		<link>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/</link>
					<comments>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 02 Apr 2012 21:43:51 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[4.0]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[hunspell]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[morfologik]]></category>
		<category><![CDATA[polish]]></category>
		<category><![CDATA[solr]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=449</guid>

					<description><![CDATA[Because Polish language analysis functionality has been present in Lucene (and Solr) for some time, I decided to take a look and compare the available options on the basis of the upcoming Lucene and Solr 4.0. Options: At the time of writing, the following]]></description>
										<content:encoded><![CDATA[<p>Because Polish language analysis functionality has been present in Lucene (and Solr) for some time, I decided to take a look and compare the available options on the basis of the upcoming Lucene and Solr 4.0.</p>
<p><span id="more-449"></span></p>
<h3>Options</h3>
<p>At the time of writing, the following options were available for analyzing Polish:</p>
<ul>
<li>Use the Stempel library (available since Solr 3.1)</li>
<li>Use Hunspell with Polish dictionaries (available since Solr 3.5)</li>
<li>Use the Morfologik library (will be available in Solr 4.0, <a href="https://issues.apache.org/jira/browse/SOLR-3272" target="_blank" rel="noopener noreferrer">SOLR-3272</a>).</li>
</ul>
<h3>Configuration</h3>
<p>Let&#8217;s look at how to configure all of the above options in Solr (please remember that all the following configuration examples are based on Solr 4.0).</p>
<h4>Stempel</h4>
<p>In order to add Polish stemming using the Stempel library, we just need to add the following filter to our type definition:
</p>
<pre class="brush:xml">&lt;filter class="solr.StempelPolishStemFilterFactory" /&gt;</pre>
<p>In addition to that, you need to add the <em>lucene-analyzers-stempel-4.0.jar</em> and <em>apache-solr-analysis-extras-4.0.jar</em> libraries to&nbsp;<em>SOLR_HOME/lib</em>. It&#8217;s also a good idea to put <em>solr.LowerCaseFilterFactory</em> before the Stempel filter.</p>
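<p>As a rough illustration of how these pieces fit together, a complete field type could look like the sketch below. The type name <em>text_pl_stempel</em>, the choice of <em>solr.StandardTokenizerFactory</em> and the <em>positionIncrementGap</em> value are assumptions made for this example and are not taken from the configuration used in the tests:</p>
<pre class="brush:xml">&lt;!-- Hypothetical field type: the name and the tokenizer are illustrative assumptions --&gt;
&lt;fieldType name="text_pl_stempel" class="solr.TextField" positionIncrementGap="100"&gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.StandardTokenizerFactory" /&gt;
    &lt;!-- Lowercase first, so that the stemmer receives normalized tokens --&gt;
    &lt;filter class="solr.LowerCaseFilterFactory" /&gt;
    &lt;filter class="solr.StempelPolishStemFilterFactory" /&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;</pre>
<p>The same skeleton should work for the other filters, with only the stemming filter line swapped.</p>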
<h4>Hunspell</h4>
<p>Similar to the configuration above, to use Hunspell you need to add a new filter to your type definition, for example in the following way:
</p>
<pre class="brush:xml">&lt;filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff" ignoreCase="true" /&gt;</pre>
<p>The <em>dictionary</em> and <em>affix</em> parameters point to the dictionary and affix files that we want to use. The <em>ignoreCase</em> parameter set to <em>true</em> tells Hunspell to ignore character case. You can find Hunspell dictionaries at the following URL: <a href="http://wiki.services.openoffice.org/wiki/Dictionaries" target="_blank" rel="noopener noreferrer">http://wiki.services.openoffice.org/wiki/Dictionaries</a>.</p>
<h4>Morfologik</h4>
<p>As in the two examples above, all you need to change in your <em>schema.xml</em> is to add a new filter, this time in the following way:
</p>
<pre class="brush:xml">&lt;filter class="solr.MorfologikFilterFactory" dictionary="MORFOLOGIK" /&gt;</pre>
<p>The <em>dictionary</em> parameter tells Solr which dictionary you would like to use. You can choose one of the following three:</p>
<ul>
<li>MORFOLOGIK</li>
<li>MORFEUSZ</li>
<li>COMBINED</li>
</ul>
<p>In addition to that, you need to add the following libraries to <em>SOLR_HOME/lib</em>: <em>lucene-analyzers-morfologik-4.0.jar</em>, <em>apache-solr-analysis-extras-4.0.jar</em>, <em>morfologik-fsa-1.5.2.jar</em>, <em>morfologik-polish-1.5.2.jar</em> and <em>morfologik-stemming-1.5.2.jar</em>.</p>
<h3>Results Comparison</h3>
<p>Of course I wasn&#8217;t able to judge the results of analysis from the above three filters on the whole Polish language corpus, which is why I decided to choose four words to see how each of the filters behaves. Those words are: &#8220;<em>urodzić urodzony urodzona urodzeni</em>&#8221; (these words are variations of the word <em>born</em> in Polish). The results are as follows:</p>
<h4>Stempel</h4>
<p>The terms I got from Stempel were the following ones:
</p>
<pre>[urodzić] [urodzo] [urodzona] [urodzeni]</pre>
<p>Not all of them are words, but you have to remember that Stempel is a stemmer, and because of that it produces stems which can differ from the actual words or their root forms. What matters is that the words we are interested in are reduced to the same tokens, which is what allows Lucene/Solr to find them. With that in mind, I have to say that the results of analysis using Stempel are not as good as I would like them to be. For example, by searching for the word <em>urodzić</em> you won&#8217;t be able to find documents with words like <em>urodzona</em> or <em>urodzeni</em>.</p>
<h4>Hunspell</h4>
<p>The results of Hunspell analysis were as follows:
</p>
<pre>[urodzić, urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony, urodzenie]</pre>
<p>Comparing the results I got with Hunspell to those produced by Stempel, we can see the difference. Our sample query for the word <em>urodzić</em> would find documents with words like <em>urodzony</em>, <em>urodzona</em> and <em>urodzeni</em>, which is quite nice. You can also notice that for three of the words we got more than one term at the same position. The results I got when using Hunspell are OK and I think they should satisfy most users (they do satisfy me), but let&#8217;s have a look at the filter newly introduced in Lucene and Solr &#8211; Morfologik.</p>
<h4>Morfologik</h4>
<p>The results of Morfologik analysis were as follows:
</p>
<pre>[urodzić] [urodzony, urodzić] [urodzić] [urodzić, urodzony]</pre>
<p>Again, if you compare those to the ones I got when using Hunspell, you can hardly see the difference (in this particular case, of course). Apart from the duplicated <em>urodzić</em> term Hunspell produced for the first word, the only difference between Hunspell and Morfologik is the last word, for which Hunspell additionally returned <em>urodzenie</em>. In my opinion the results achieved with Morfologik are satisfying.</p>
<h3>Performance</h3>
<p>The performance test was done in a simple manner &#8211; for each filter I indexed 5 million documents, where all the text fields were based on Polish language analysis with the appropriate filter (plus some standard filters like stopwords, synonyms and so on). Every time, the indexing was done on a clean Solr 4.0 instance. Because I was using Data Import Handler, I sent a commit every 100k documents. The index contained several fields, but the actual index structure was not crucial for the test, as I indexed the same set of documents every time. The test results are as follows:</p>
[table “21” not found /]<br />

<p><strong>Warning:</strong> At the time of writing, according to the <a href="https://issues.apache.org/jira/browse/SOLR-3245">SOLR-3245</a> JIRA issue, there is a problem with Hunspell performance with Polish dictionaries in Solr 4.0. I&#8217;m almost certain that this will be resolved by the time Solr 4.0 is released, but right now the performance of Hunspell with Polish dictionaries in Solr 4.0 may not be sufficient.</p>
<h3>Short Summary</h3>
<p>Despite not having performance results for Hunspell (because I don&#8217;t count the ones I have right now as correct), we can see that Hunspell and Morfologik are good candidates for Polish language analysis. Morfologik offers performance similar to Stempel, but in my opinion its results are better, and that will make your users happier.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Document language identification</title>
		<link>https://solr.pl/en/2012/01/23/document-language-identification/</link>
					<comments>https://solr.pl/en/2012/01/23/document-language-identification/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 23 Jan 2012 20:59:03 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[3.5]]></category>
		<category><![CDATA[document]]></category>
		<category><![CDATA[identification]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tika]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=396</guid>

					<description><![CDATA[One of the features of the latest Solr version (3.5) is the ability to identify the language of a document during indexing. In today's entry we will see how Apache Solr works together with Apache Tika to identify the]]></description>
										<content:encoded><![CDATA[<p>One of the features of the latest Solr version (<a href="http://solr.pl/en/2011/11/27/apache-lucene-and-solr-3-5/" target="_blank" rel="noopener noreferrer">3.5</a>) is the ability to identify the language of a document during indexing. In today&#8217;s entry we will see how Apache Solr works together with Apache Tika to identify the language of documents.</p>
<p><span id="more-396"></span></p>
<h3>At the beginning</h3>
<p>You should remember that the described functionality was introduced in Solr 3.5.</p>
<h3>Assumptions</h3>
<p>We will be using two fields to identify the document language:&nbsp;<em>title</em>&nbsp;and&nbsp;<em>body</em>. We want to store the detected language in the <em>lang</em>&nbsp;field.</p>
<h3>Index structure</h3>
<p>The structure of our index is of course simplified and contains only the fields needed for the test. The field definition part of the <em>schema.xml</em>&nbsp;file looks like this:
</p>
<pre class="brush:xml">&lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
&lt;field name="title" type="text_ws" indexed="true" stored="true" /&gt;
&lt;field name="body" type="text_ws" indexed="true" stored="true" /&gt;
&lt;field name="lang" type="string" indexed="true" stored="true" /&gt;</pre>
<p>All the fields are marked as&nbsp;<em>stored=&#8221;true&#8221;</em>&nbsp;for simplicity.</p>
<h3>Update request processor configuration</h3>
<p>In order to use the language identification feature we need to configure a Solr update request processor. We will be using the one based on Apache Tika (there is a second implementation based on&nbsp;<a href="http://code.google.com/p/language-detection/">http://code.google.com/p/language-detection/</a>). To configure the processor we add the following to the <em>solrconfig.xml</em>&nbsp;file:
</p>
<pre class="brush:xml">&lt;updateRequestProcessorChain name="langid"&gt;
  &lt;processor name="langid" class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"&gt;
    &lt;lst name="defaults"&gt;
      &lt;str name="langid.fl"&gt;title,body&lt;/str&gt;
      &lt;str name="langid.langField"&gt;lang&lt;/str&gt;
    &lt;/lst&gt;
  &lt;/processor&gt;
  &lt;processor class="solr.LogUpdateProcessorFactory" /&gt;
  &lt;processor class="solr.RunUpdateProcessorFactory" /&gt;
&lt;/updateRequestProcessorChain&gt;</pre>
<p>Other parameters of the <em>TikaLanguageIdentifierUpdateProcessorFactory</em>&nbsp;are described on Apache Solr wiki pages available at the following URL address:&nbsp;<a href="http://wiki.apache.org/solr/LanguageDetection">http://wiki.apache.org/solr/LanguageDetection</a>.</p>
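<p>To give an idea of what those other parameters can look like, here is a minimal sketch of the same chain with a few of them set. To the best of my knowledge <em>langid.whitelist</em>, <em>langid.fallback</em> and <em>langid.threshold</em> are among the parameters listed on that wiki page, but the values shown (the three languages, the <em>en</em> fallback and the <em>0.8</em> threshold) are assumptions chosen for illustration and were not used in this test:</p>
<pre class="brush:xml">&lt;updateRequestProcessorChain name="langid"&gt;
  &lt;processor name="langid" class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"&gt;
    &lt;lst name="defaults"&gt;
      &lt;str name="langid.fl"&gt;title,body&lt;/str&gt;
      &lt;str name="langid.langField"&gt;lang&lt;/str&gt;
      &lt;!-- Illustrative values: accept only these language codes --&gt;
      &lt;str name="langid.whitelist"&gt;en,pl,de&lt;/str&gt;
      &lt;!-- Illustrative value: code to store when no language could be detected --&gt;
      &lt;str name="langid.fallback"&gt;en&lt;/str&gt;
      &lt;!-- Illustrative value: minimum certainty required to accept a detection --&gt;
      &lt;str name="langid.threshold"&gt;0.8&lt;/str&gt;
    &lt;/lst&gt;
  &lt;/processor&gt;
  &lt;processor class="solr.LogUpdateProcessorFactory" /&gt;
  &lt;processor class="solr.RunUpdateProcessorFactory" /&gt;
&lt;/updateRequestProcessorChain&gt;</pre>
<p>Restricting the set of accepted languages may help with short documents, where detection is the least reliable.</p>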
<h3>Additional libraries</h3>
<p>For the update request processor to work we need some additional libraries. From the <em>dist</em>&nbsp;directory of the Apache Solr distribution we copy&nbsp;<em>apache-solr-langid-3.5.0.jar</em>&nbsp;to a directory named, for example,&nbsp;<em>tikaLib</em>, which we create on the same level as the <em>webapps</em>&nbsp;directory. Then we add the following line to the&nbsp;<em>solrconfig.xml</em> file:
</p>
<pre class="brush:xml">&lt;lib dir="../tikaLib/" regex="apache-solr-langid-\d.*\.jar" /&gt;</pre>
<p>The next library we will need is the Tika jar with all the goodies (<em>tika-app-1.0.jar</em>), which we can download from the following URL: <a href="http://tika.apache.org/">http://tika.apache.org/</a>. We place it in the same <em>tikaLib</em>&nbsp;directory and then add the following entry to the <em>solrconfig.xml</em>&nbsp;file:
</p>
<pre class="brush:xml">&lt;lib dir="../tikaLib/" regex="tika-app-1.0.jar" /&gt;</pre>
<h3>Test documents</h3>
<p>For the testing purposes I decided to prepare three documents. The first was in English, the second one in Polish and the third one in German. Their content was downloaded from Wikipedia. They look as follows:</p>
<h4>tika_en.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;1&lt;/field&gt;
  &lt;field name="title"&gt;Water&lt;/field&gt;
  &lt;field name="body"&gt;Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h4>tika_pl.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;2&lt;/field&gt;
  &lt;field name="title"&gt;Woda&lt;/field&gt;
  &lt;field name="body"&gt;Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h4>tika_de.xml</h4>
<pre class="brush:xml">&lt;add&gt;
&lt;doc&gt;
  &lt;field name="id"&gt;3&lt;/field&gt;
  &lt;field name="title"&gt;Wasser&lt;/field&gt;
  &lt;field name="body"&gt;Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.&lt;/field&gt;
&lt;/doc&gt;
&lt;/add&gt;</pre>
<h3>More testing</h3>
<p>To index the data I used the following shell commands:
</p>
<pre class="brush:xml">curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_pl.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_en.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary @tika_de.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' --data-binary '&lt;commit/&gt;' -H 'Content-type:application/xml'</pre>
<p>It is worth noting the additional <em>update.chain=langid</em>&nbsp;parameter added to the request. This parameter tells Solr which update request processor chain to use when indexing the data. In the example we told Solr to use the chain we defined.</p>
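<p>If you would rather not pass <em>update.chain</em> with every request, one option is to set it as a default on the update handler. The fragment below is only a sketch: it assumes the XML update handler is registered under <em>/update</em> with the <em>solr.XmlUpdateRequestHandler</em> class, as in the example <em>solrconfig.xml</em> shipped with Solr 3.x, so adjust it to your own handler definition:</p>
<pre class="brush:xml">&lt;!-- Assumed handler definition; only the defaults section is the point here --&gt;
&lt;requestHandler name="/update" class="solr.XmlUpdateRequestHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;!-- Run every update sent to this handler through the langid chain --&gt;
    &lt;str name="update.chain"&gt;langid&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
<p>With such a default in place, the <em>curl</em> commands above should work without the extra parameter.</p>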
<h3>Indexed data</h3>
<p>So let&#8217;s have a look at the indexed data. We will do that by running the following query: <em>q=*:*&amp;indent=true</em>.
</p>
<pre class="brush:xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;response&gt;
&lt;lst name="responseHeader"&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;0&lt;/int&gt;
  &lt;lst name="params"&gt;
    &lt;str name="indent"&gt;true&lt;/str&gt;
    &lt;str name="q"&gt;*:*&lt;/str&gt;
  &lt;/lst&gt;
&lt;/lst&gt;
&lt;result name="response" numFound="3" start="0"&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Woda (tlenek wodoru; nazwa systematyczna IUPAC: oksydan) – związek chemiczny o wzorze H2O, występujący w warunkach standardowych w stanie ciekłym. W stanie gazowym wodę określa się mianem pary wodnej, a w stałym stanie skupienia – lodem. Słowo woda jako nazwa związku chemicznego może się odnosić do każdego stanu skupienia.&lt;/str&gt;
    &lt;str name="id"&gt;2&lt;/str&gt;
    &lt;str name="lang"&gt;pl&lt;/str&gt;
    &lt;str name="title"&gt;Woda&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Water is a chemical substance with the chemical formula H2O. A water molecule contains one oxygen and two hydrogen atoms connected by covalent bonds. Water is a liquid at ambient conditions, but it often co-exists on Earth with its solid state, ice, and gaseous state (water vapor or steam). Water also exists in a liquid crystal state near hydrophilic surfaces.[1][2] Under nomenclature used to name chemical compounds, Dihydrogen monoxide is the scientific name for water, though it is almost never used.&lt;/str&gt;
    &lt;str name="id"&gt;1&lt;/str&gt;
    &lt;str name="lang"&gt;en&lt;/str&gt;
    &lt;str name="title"&gt;Water&lt;/str&gt;
  &lt;/doc&gt;
  &lt;doc&gt;
    &lt;str name="body"&gt;Wasser (H2O) ist eine chemische Verbindung aus den Elementen Sauerstoff (O) und Wasserstoff (H). Wasser ist die einzige chemische Verbindung auf der Erde, die in der Natur in allen drei Aggregatzuständen vorkommt. Die Bezeichnung Wasser wird dabei besonders für den flüssigen Aggregatzustand verwendet. Im festen (gefrorenen) Zustand spricht man von Eis, im gasförmigen Zustand von Wasserdampf.&lt;/str&gt;
    &lt;str name="id"&gt;3&lt;/str&gt;
    &lt;str name="lang"&gt;de&lt;/str&gt;
    &lt;str name="title"&gt;Wasser&lt;/str&gt;
  &lt;/doc&gt;
&lt;/result&gt;
&lt;/response&gt;</pre>
<p>As you can see, Solr, with the help of Tika, was able to identify the languages of the indexed documents. Of course, let&#8217;s not be too optimistic: mistakes happen, especially when dealing with multi-language documents, but that&#8217;s understandable.</p>
<h3>To sum up</h3>
<p>You should remember that the language identification feature is not perfect and can make mistakes. Also remember that the longer the documents, the better it will work. Of course, the problem is that we can&#8217;t use language identification at query time, but that is not a problem specific to Solr and Tika. You can deal with it by identifying your user, their web browser or the place they are located in.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2012/01/23/document-language-identification/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
