<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tagger &#8211; Solr.pl</title>
	<atom:link href="https://solr.pl/en/tag/tagger-2/feed/" rel="self" type="application/rss+xml" />
	<link>https://solr.pl/en/</link>
	<description>All things to be found - Blog related to Apache Solr &#38; Lucene projects - https://solr.apache.org</description>
	<lastBuildDate>Sat, 14 Nov 2020 14:25:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>
	<item>
		<title>Solr Text Tagger &#8211; Quick Look into Solr 7.4</title>
		<link>https://solr.pl/en/2018/09/17/solr-text-tagger-quick-look-into-solr-7-4/</link>
					<comments>https://solr.pl/en/2018/09/17/solr-text-tagger-quick-look-into-solr-7-4/#respond</comments>
		
		<dc:creator><![CDATA[Rafał Kuć]]></dc:creator>
		<pubDate>Mon, 17 Sep 2018 13:25:28 +0000</pubDate>
				<category><![CDATA[Solr]]></category>
		<category><![CDATA[request handler]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[tagger]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[text]]></category>
		<guid isPermaLink="false">http://sematext.solr.pl/?p=969</guid>

					<description><![CDATA[We haven&#8217;t looked into any Solr functionalities on Solr.pl for a while so it is time to look into one of the new features &#8211; the text tagger. It works on the basis of Solr index and is able to]]></description>
										<content:encoded><![CDATA[
<p>We haven&#8217;t looked into any Solr functionality on Solr.pl for a while, so it is time to look into one of the new features &#8211; the text tagger. It works on top of a Solr index: it checks documents sent to a new handler and returns occurrences of names, together with their offsets and any other metadata that we added to the index. Keep in mind, however, that the Solr Text Tagger doesn&#8217;t do any kind of natural language processing, so it can only find what we have indexed &#8211; what we give it is what we can expect to get back.</p>



<span id="more-969"></span>



<h3 class="wp-block-heading">Preparations</h3>



<p>We will keep it as minimal as it can be. We start with running our Solr instance:</p>



<pre class="wp-block-preformatted brush:xml">$ bin/solr start -c -f</pre>



<p>And we create a test collection:</p>



<pre class="wp-block-preformatted brush:xml">$ bin/solr create -c test</pre>



<p>We will be using the standard data-driven schema. Though it is not suggested to be used in production, it is more than enough for our Text Tagger test.</p>



<h3 class="wp-block-heading">Data For Tagging</h3>



<p>Let&#8217;s focus on the data for a second. First of all, we need data that will be used for tagging. For the purpose of the blog post we will just go with simple name tagging, for example:</p>



<pre class="wp-block-preformatted brush:xml">Thomas Jefferson
Alexander Hamilton
George Washington
John Adams
James Wilkinson</pre>



<p>A few names &#8211; you may recognize them <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> We will use those names for tagging purposes.</p>



<p>To index them to Solr we will need a new field type, two fields, and a copy field, just like this:</p>



<pre class="wp-block-preformatted brush:xml">$ curl -XPOST -H 'Content-type:application/json'  'http://localhost:8983/solr/test/schema' -d '{
  "add-field-type":{
    "name":"tag",
    "class":"solr.TextField",
    "postingsFormat":"FST50",
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "indexAnalyzer":{
      "tokenizer":{ "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.LowerCaseFilterFactory"},
        {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false }
      ]},
    "queryAnalyzer":{
      "tokenizer":{
         "class":"solr.StandardTokenizerFactory" },
      "filters":[
        {"class":"solr.LowerCaseFilterFactory"}
      ]}
    },
  "add-field":{"name":"name", "type":"text_general"},
  "add-field":{"name":"name_tag", "type":"tag", "stored":false},
  "add-copy-field":{"source":"name", "dest":["name_tag"]}
}'</pre>



<p>Let&#8217;s stop here and see why we added those. First of all, we added a new field type called <em>tag</em>, which the text tagger needs in order to work. Two things are crucial there &#8211; the <em>postingsFormat</em>, which needs to be set to <em>FST50</em>, and the <em>solr.ConcatenateGraphFilterFactory</em> in the index-time analysis chain. Both settings are required.</p>



<p>Next &#8211; we need to keep the names themselves. We keep the name in a <em>text_general</em> field that is stored and can be easily retrieved. The <em>name_tag</em> field uses our newly created <em>tag</em> field type. Finally, the copy field copies the value from the <em>name</em> field to the <em>name_tag</em> field, so that we don&#8217;t have to send the data twice.</p>



<p>Now let&#8217;s index those names that we had by using the following request:</p>



<pre class="wp-block-preformatted brush:xml">$ curl -XPOST -H 'Content-type:application/json'  'http://localhost:8983/solr/test/update' -d '[
 {"name":"Thomas Jefferson"},
 {"name":"Alexander Hamilton"},
 {"name":"George Washington"},
 {"name":"John Adams"},
 {"name":"James Wilkinson"}
]'</pre>



<h3 class="wp-block-heading">Tagging Text</h3>



<p>Now that we have our data that will be used for tagging sent to Solr we can try tagging a real document. For that, we took a part of the document from Wikipedia describing the life of George Washington. To send it to Solr for tagging we first need to create the configuration for a new handler in Solr, for example like this:</p>



<pre class="wp-block-preformatted brush:xml">$ curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test/config' -d '{
 "add-requesthandler" : {
  "name": "/tagger",
  "class": "solr.TaggerRequestHandler",
  "defaults": {
   "field":"name_tag"
  }
 }
}'</pre>



<p>We added a new request handler called <em>/tagger</em> using the <em>solr.TaggerRequestHandler</em> and we set its defaults to use the <em>name_tag</em> field &#8211; the one that we created the new field type for. Once that is done, we can finally look at tagging the text. We do that by sending our document to the newly added request handler:</p>



<pre class="wp-block-preformatted brush:xml">$ curl -X POST 'http://localhost:8983/solr/test/tagger?fl=id,name&amp;wt=json&amp;indent=on' \
  -H 'Content-Type:text/plain' -d 'George Washington (February 22, 
1732 December 14, 1799) was the first President of the United States 
(1789-1797), and was among the nation`s Founding Fathers. From 1775 
to 1783, he led the Patriot forces to victory over the British in 
the American Revolutionary War. He presided over the Constitutional 
Convention which formed the basis of the new federal government in 
Since the late 1780s, Washington has been known as the "Father 
of His Country" by compatriots. He is ranked by scholars among 
the top presidents in history. Washington was born to a moderately 
prosperous family of planters, who owned slaves in colonial 
Virginia. He had early opportunities in education, learned 
mathematics and quickly launched a successful career as a surveyor, 
which in turn enabled him to make considerable land investments. He 
then joined the Virginia militia and fought in the French and Indian
War. During the American Revolutionary War, Washington was 
appointed commander-in-chief of the Continental Army, taking 
initiative in raids like those at Trenton, commanding in 
conventional battles such as the Battle of Monmouth, and 
coordinating a combined French-Patriot allied campaign at the Siege 
of Yorktown ending the conflict. His devotion to American 
Republicanism impelled him to decline further power after victory, 
and he resigned as commander-in-chief in 1783. Washington was born 
to a moderately prosperous family of planters, who owned slaves in 
colonial Virginia. He had early opportunities in education, learned 
mathematics and quickly launched a successful career as a surveyor, 
which in turn enabled him to make considerable land investments. He 
then joined the Virginia militia and fought in the French and Indian 
War. During the American Revolutionary War, Washington was appointed 
commander-in-chief of the Continental Army, taking initiative in 
raids like those at Trenton, commanding in conventional battles such 
as the Battle of Monmouth, and coordinating a combined French-Patriot 
allied campaign at the Siege of Yorktown ending the conflict. His 
devotion to American Republicanism impelled him to decline further 
power after victory, and he resigned as 
commander-in-chief in 1783. Washington was among the country’s premier statesmen
and was unanimously chosen as president by the Electoral 
College in the first two national elections. He promoted and oversaw 
the implementation of a strong, well-financed national government. 
He remained impartial in the fierce rivalry between two cabinet 
secretaries, Thomas Jefferson and Alexander Hamilton, though he 
adopted Hamilton economic plans. When the French Revolution plunged 
Europe into war, Washington assumed a policy of neutrality 
to protect American ships–although the controversial Jay Treaty of 
1795 normalized trade relations with Great Britain. Washington set 
precedents still in use today, such as the Cabinet advisory system, 
the inaugural address, the title "Mr. President", and a two-term 
limit. In his Farewell Address he gave a primer on civic virtue, 
warning of partisanship, sectionalism, and involvement in foreign 
wars. Washington inherited slaves at age eleven, prospered from slavery 
most of his life, and as late as 1793 officially supported other 
slaveholders. Eventually he became troubled with slavery and in his 
final will in 1799 he freed all his slaves. He is renowned for his 
religious toleration; his personal religion and devotion to 
Freemasonry have been debated. Upon his death, Washington was famously 
eulogized as "first in war, first in peace, and first in 
the hearts of his countrymen". He has been widely memorialized by 
monuments, public works, places, stamps, and currency. The nations capital, Washington D.C., and the state of Washington bear his name; and since 1932 the quarter dollar has carried his effigy.'</pre>



<p>The response returned by Solr looks as follows:</p>



<pre class="wp-block-preformatted brush:xml">{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "tagsCount":15,
  "tags":[[
      "startOffset",0,
      "endOffset",6,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",7,
      "endOffset",17,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",398,
      "endOffset",408,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",534,
      "endOffset",544,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",1693,
      "endOffset",1699,
      "ids",["7dad41c3-e607-4b6f-a86d-6d6490f48e4f"]],
    [
      "startOffset",1700,
      "endOffset",1709,
      "ids",["7dad41c3-e607-4b6f-a86d-6d6490f48e4f"]],
    [
      "startOffset",1714,
      "endOffset",1723,
      "ids",["5796481a-a698-4aa4-9666-9f5e4e565996"]],
    [
      "startOffset",1724,
      "endOffset",1732,
      "ids",["5796481a-a698-4aa4-9666-9f5e4e565996"]],
    [
      "startOffset",1752,
      "endOffset",1760,
      "ids",["5796481a-a698-4aa4-9666-9f5e4e565996"]],
    [
      "startOffset",1831,
      "endOffset",1841,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",1992,
      "endOffset",2002,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",2278,
      "endOffset",2288,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",2651,
      "endOffset",2661,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",2875,
      "endOffset",2885,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]],
    [
      "startOffset",2909,
      "endOffset",2919,
      "ids",["d8d14516-7637-48c3-8042-ecf68b3822ef"]]],
  "response":{"numFound":3,"start":0,"docs":[
      {
        "name":["Thomas Jefferson"],
        "id":"7dad41c3-e607-4b6f-a86d-6d6490f48e4f"},
      {
        "name":["Alexander Hamilton"],
        "id":"5796481a-a698-4aa4-9666-9f5e4e565996"},
      {
        "name":["George Washington"],
        "id":"d8d14516-7637-48c3-8042-ecf68b3822ef"}]
  }}</pre>



<p>As you can see, we got the tags that were found along with their offsets, plus the name and identifier of each matching document. The response shows 15 tag occurrences (<em>tagsCount</em>) that point to three of our indexed names (<em>numFound</em>), and where in the text each of them was found &#8211; nice! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
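


<p>To work with such a response in code, the offsets can be used to slice the original text and the ids can be joined with the returned documents. Below is a minimal Python sketch of that idea. The function names and the small hand-written sample response are assumptions made for illustration and are not part of Solr; the sketch also assumes the flat key/value layout of the <em>tags</em> entries shown above and a multi-valued <em>name</em> field.</p>

```python
# Sketch: turn the tagger's flat "tags" entries into readable matches.
# The helper names and the tiny sample response are illustrative assumptions.

def pairs_to_dict(flat):
    """Convert ["k1", v1, "k2", v2, ...] into {"k1": v1, "k2": v2, ...}."""
    return dict(zip(flat[0::2], flat[1::2]))


def resolve_tags(text, response):
    """Return (matched text, indexed names) for every tag in the response."""
    # Map each document id to the first value of its multi-valued name field.
    names = {doc["id"]: doc["name"][0] for doc in response["response"]["docs"]}
    resolved = []
    for flat in response["tags"]:
        tag = pairs_to_dict(flat)
        # endOffset is exclusive, so a plain slice returns the matched text.
        matched = text[tag["startOffset"]:tag["endOffset"]]
        resolved.append((matched, [names.get(doc_id) for doc_id in tag["ids"]]))
    return resolved


# Hand-written sample in the same shape as the response above (not real Solr output).
sample_text = "Thomas Jefferson met Alexander Hamilton."
sample_response = {
    "tags": [
        ["startOffset", 0, "endOffset", 16, "ids", ["doc-1"]],
        ["startOffset", 21, "endOffset", 39, "ids", ["doc-2"]],
    ],
    "response": {"numFound": 2, "start": 0, "docs": [
        {"id": "doc-1", "name": ["Thomas Jefferson"]},
        {"id": "doc-2", "name": ["Alexander Hamilton"]},
    ]},
}

for matched, indexed_names in resolve_tags(sample_text, sample_response):
    print(matched, indexed_names)
```

<p>Keeping the conversion of the flat lists in a separate helper makes it easy to adapt if you request a different JSON layout from Solr.</p>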



<h3 class="wp-block-heading">Tagger Parameters</h3>



<p>Of course, you can control the tagger. As you have already seen, we used the <em>fl</em> parameter in the above request and the required <em>field</em> property when adding the new request handler. However, this is not everything. We can add filter queries, specify the maximum number of rows, choose an algorithm for handling overlapping tags and so on. You can find the full list of options in the Solr documentation at <a href="https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html">https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html</a>.</p>
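


<p>As a rough illustration, the Python sketch below builds a tagger URL that carries a few such options. The parameter names (<em>fq</em>, <em>rows</em>, <em>overlaps</em>, <em>matchText</em>, <em>tagsLimit</em>) are given as we recall them from the reference guide, so please verify them against the page linked above. The helper name, the filter query and all the values are hypothetical.</p>

```python
# Sketch: build a tagger request URL with extra parameters.
# Parameter names should be checked against the Solr reference guide;
# the helper name and all values below are illustrative assumptions.
from urllib.parse import urlencode


def tagger_url(base_url, collection, **params):
    """Return the /tagger URL for a collection with URL-encoded parameters."""
    return f"{base_url}/{collection}/tagger?{urlencode(params)}"


url = tagger_url(
    "http://localhost:8983/solr", "test",
    fl="id,name",          # fields returned for the matching documents
    fq="name:washington",  # hypothetical filter query limiting the tag candidates
    rows=10,               # maximum number of documents to return
    overlaps="NO_SUB",     # how overlapping tags are handled
    matchText="true",      # ask for the matched text in each tag
    tagsLimit=100,         # maximum number of tags to return
    wt="json",
)
print(url)
```

<p>The text to tag would still be sent as the request body, exactly as in the curl example earlier.</p>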



<h3 class="wp-block-heading">Performance Considerations</h3>



<p>As usual, there are a few best practices to keep in mind, both for the layout of the tagger collection and for the tagging of documents. The first problem is that in Solr 7.4.0 the Text Tagger doesn&#8217;t support batching, so you can only send documents one by one. This will probably change in the future, but for now you can combine multiple documents into a single request and put a dummy separator character between them. In addition to that, you should consider a force merge. As you know, the fewer segments you have in your Lucene index (that is, in your shard), the faster the queries will be. This is also true for tagging &#8211; try to keep the number of segments to a minimum to improve the throughput and latency of your tagging queries.</p>
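


<p>If you combine several documents into one request, the offsets returned by the tagger refer to the combined text, so they have to be mapped back to the individual documents. The Python sketch below shows one possible way to do that. The separator token, the function names and the sample documents are assumptions made for illustration; the separator should be something that can never be part of an indexed name, so that no tag can span two documents.</p>

```python
# Sketch: join documents with a separator and map tagger offsets back.
# The separator token and helper names are illustrative assumptions.
from bisect import bisect_right

SEPARATOR = " XXSEPXX "  # assumed token that never occurs in the indexed names


def combine(docs, sep=SEPARATOR):
    """Join docs into one text and record where each document starts."""
    starts, position = [], 0
    for doc in docs:
        starts.append(position)
        position += len(doc) + len(sep)
    return sep.join(docs), starts


def locate(offset, starts):
    """Map an offset in the combined text to (document index, local offset)."""
    index = bisect_right(starts, offset) - 1
    return index, offset - starts[index]


docs = ["John Adams spoke.", "Then George Washington replied."]
combined, starts = combine(docs)

# Pretend the tagger reported a tag starting at this offset of the combined text.
reported = combined.index("George Washington")
doc_index, local_offset = locate(reported, starts)
print(doc_index, local_offset)
```

<p>Both the start and the end offset of a tag can be translated with the same function, because a tag never crosses the separator.</p>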



<h3 class="wp-block-heading">Limitations</h3>



<p>One last thing that I wanted to mention is that at the time of writing, so as of Solr 7.4.0, the text tagger did not support sharded indices. Maybe this will change in the future, but for now, if you want to use the text tagger functionality in Solr, you have to be aware of that limitation.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://solr.pl/en/2018/09/17/solr-text-tagger-quick-look-into-solr-7-4/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
