{"id":266,"date":"2011-05-09T20:45:16","date_gmt":"2011-05-09T18:45:16","guid":{"rendered":"http:\/\/sematext.solr.pl\/?p=266"},"modified":"2020-11-11T20:45:58","modified_gmt":"2020-11-11T19:45:58","slug":"solr-filters-patternreplacecharfilter","status":"publish","type":"post","link":"https:\/\/solr.pl\/en\/2011\/05\/09\/solr-filters-patternreplacecharfilter\/","title":{"rendered":"Solr filters: PatternReplaceCharFilter"},"content":{"rendered":"<p>Continuing the overview of the filters included in Solr, today we look at PatternReplaceCharFilter.<\/p>\n<p>As you might guess, the task of this filter is to replace the parts of the input stream that match a given regular expression.<\/p>\n\n\n<!--more-->\n\n\n<p>The filter accepts the following parameters:<\/p>\n<ul>\n<li><em>pattern<\/em> (required) \u2013 the regular expression that matches the text to be changed<\/li>\n<li><em>replacement<\/em> (default: &#8220;&#8221;) &#8211; the value that will be used as a replacement for the fragment that matched the regular expression<\/li>\n<li><em>blockDelimiters<\/em><\/li>\n<li><em>maxBlockChars<\/em> (default: 10000, must be greater than 0) \u2013 the size of the buffer used for matching<\/li>\n<\/ul>\n<h2>Usage examples<\/h2>\n<p>Using the filter is simple &#8211; we add its definition to the field type definition in the schema.xml file, for example:\n<\/p>\n<pre class=\"brush:xml\">&lt;fieldType name=\"textCharNorm\" class=\"solr.TextField\"&gt;\n  &lt;analyzer&gt;\n    &lt;charFilter class=\"solr.PatternReplaceCharFilterFactory\" \u2026\/&gt;\n    &lt;charFilter class=\"solr.MappingCharFilterFactory\" mapping=\"mapping-ISOLatin1Accent.txt\"\/&gt;\n    &lt;tokenizer class=\"solr.WhitespaceTokenizerFactory\"\/&gt;\n  &lt;\/analyzer&gt;\n&lt;\/fieldType&gt;<\/pre>\n<p>Below are example definitions for different cases.<\/p>\n<h3>Cut pieces of text<\/h3>\n<p>You just need to specify, in the pattern attribute, what you want to 
cut. Example:\n<\/p>\n<pre class=\"brush:xml\">&lt;charFilter class=\"solr.PatternReplaceCharFilterFactory\" pattern=\"#TAG\" \/&gt;<\/pre>\n<p>This removes every occurrence of &#8220;#TAG&#8221; from the input.<\/p>\n<h3>Text fragments replacement<\/h3>\n<p>This case is similar to the one above, but here we replace one piece of text with another.\n<\/p>\n<pre class=\"brush:xml\">&lt;charFilter class=\"solr.PatternReplaceCharFilterFactory\" pattern=\"#TAG\" replacement=\"[CENSORED]\"\/&gt;<\/pre>\n<h3>Changing patterns<\/h3>\n<p>The two cases above were trivial. The real strength of this filter is its handling of regular expressions. (You use regular expressions, right?) The following example is simple &#8211; it hides all numbers by turning each of them into a star. It also handles numbers separated by hyphens, treating them as a single number.\n<\/p>\n<pre class=\"brush:xml\">&lt;charFilter class=\"solr.PatternReplaceCharFilterFactory\" pattern=\"\\d+(-+\\d+)*\" replacement=\"*\"\/&gt;<\/pre>\n<h3>Text Manipulation<\/h3>\n<p>The replacement doesn&#8217;t have to be plain text. This filter supports group references, which allow you to refer to parts of the matched text. For details, refer to the documentation of regular expressions. In the following example, every doubled character is replaced with a single one.\n<\/p>\n<pre class=\"brush:xml\">&lt;charFilter class=\"solr.PatternReplaceCharFilterFactory\" pattern=\"(.)\\1\" replacement=\"$1\"\/&gt;<\/pre>\n<h2>Advanced Parameters<\/h2>\n<p>So far I have not mentioned the following parameters: <em>blockDelimiters<\/em> and <em>maxBlockChars<\/em>. If you look at the source code, you will see that these parameters are related to the way the filter is implemented. A <em>CharFilter<\/em> operates on one character at a time, while pattern matching requires an internal buffer that holds more characters. <em>maxBlockChars<\/em> allows you to specify the size of that buffer. 
You do not have to worry about it if the pattern you defined does not match pieces of text larger than 10k characters. <em>blockDelimiters<\/em> can further optimize the filling of the buffer. It can be used if the information in the analyzed field is somehow divided into sections (e.g., CSV values, sentences, etc.). It is a piece of text that tells the scanner that a new section starts, so the text buffered from the previous section is no longer needed.<\/p>\n<h2>Limits<\/h2>\n<p>An important limitation of the filter is that it directly manipulates the input data and does not keep information related to the original text. This means that if the filter removes a portion of the string, or adds a new fragment, the tokenizer will not notice that and the positions of tokens in the original field will not be saved properly. You should be aware of that when using queries that operate on the relative positions of tokens or if you use highlighting.<\/p>","protected":false},"excerpt":{"rendered":"<p>Continuing the overview of the filters included in Solr, today we look at PatternReplaceCharFilter. 
As you might guess, the task of this filter is to replace the parts of the input stream that match a given regular expression.<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27],"tags":[180,208,181,290,164],"class_list":["post-266","post","type-post","status-publish","format-standard","hentry","category-solr-en","tag-analysis","tag-configuration","tag-filter","tag-filtering","tag-solr-2"],"_links":{"self":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/266","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/comments?post=266"}],"version-history":[{"count":1,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/266\/revisions"}],"predecessor-version":[{"id":267,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/posts\/266\/revisions\/267"}],"wp:attachment":[{"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/media?parent=266"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/categories?post=266"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/solr.pl\/en\/wp-json\/wp\/v2\/tags?post=266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}