Sunday, 10 February 2019

We Do not Use CommonGramsFilterFactory for StopWords filtering in Solr


Gopesh Sharma February 10, 2019 SOLR

The Common Grams filter creates word shingles by combining common tokens (taken from the stop words list) with the regular tokens next to them. This is only useful when you want to run phrase queries that contain common words. For example, if the stop words list contains 'the', the text "the good" produces the tokens the, the_good and good.

Note that the Whitespace Tokenizer runs before the Common Grams filter.
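To make the shingling concrete, here is a minimal Python sketch of the idea. It is a toy model, not Lucene's implementation: the function name common_grams and the "_" separator argument are assumptions chosen for illustration, and the rule it applies (keep every token, add a bigram when either token of an adjacent pair is a common word) is a simplified reading of what the filter does.

```python
def common_grams(tokens, common_words, separator="_"):
    """Toy model of a common-grams filter (hypothetical helper, not Solr code).

    Every input token is kept, and a bigram is added whenever either token
    of an adjacent pair is in the common-words list.
    """
    common = {w.lower() for w in common_words}
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok.lower() in common or nxt.lower() in common:
                out.append(tok + separator + nxt)
    return out


# Whitespace tokenization first, then the common-grams step.
tokens = "the good".split()
print(common_grams(tokens, {"the"}))  # ['the', 'the_good', 'good']
```

A pair with no common word in it, such as "good day", passes through unchanged in this sketch.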

Our index-time analyzer is shown below. It runs the Whitespace Tokenizer first, then the Synonym filter, then CommonGramsFilterFactory, then the Word Delimiter filter and finally the Lower Case filter.

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

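The problem shows up when you search for a single word such as good: documents that contain a stop word next to it get a drastically higher score. With the analyzer above, the text "of good" first becomes of, of_good, good, and the Word Delimiter filter then splits the shingle of_good at the underscore, so the indexed terms are "of|of|ofgood|good|good". A search for "good" therefore sees a term frequency of 2 instead of 1. In our case we do not need phrase queries on common words, so we do not use CommonGramsFilterFactory. If you do need them, keep CommonGramsFilterFactory and set omitTermFreqAndPositions on the field so that the inflated term frequency no longer influences the score.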
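The term frequency of 2 can be reproduced with a small Python sketch of the analyzer chain. This is a rough simulation under stated assumptions, not Solr code: common_grams, word_delimiter and index_terms are hypothetical helpers, the word delimiter step only models generateWordParts and catenateWords (splitting on non-alphanumeric characters), and the synonym step is left out. The order of the emitted terms may differ from what Solr shows, but the counts are what matter here.

```python
import re
from collections import Counter


def common_grams(tokens, common_words):
    # Keep each token; add "a_b" when either neighbour is a common word.
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (
            tok in common_words or tokens[i + 1] in common_words
        ):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


def word_delimiter(tokens):
    # Rough stand-in for generateWordParts=1 and catenateWords=1:
    # split on non-alphanumeric characters, emit the parts, and for a
    # token that was split, also emit the parts joined back together.
    out = []
    for tok in tokens:
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", tok) if p]
        out.extend(parts)
        if len(parts) > 1:
            out.append("".join(parts))
    return out


def index_terms(text, common_words):
    # Whitespace tokenizer -> common grams -> word delimiter -> lowercase.
    tokens = text.split()
    tokens = common_grams(tokens, common_words)
    tokens = word_delimiter(tokens)
    return [t.lower() for t in tokens]


with_grams = Counter(index_terms("of good", {"of"}))
without_grams = Counter(index_terms("of good", set()))
print(with_grams["good"])     # 2
print(without_grams["good"])  # 1
```

Passing an empty common-words set stands in for removing the Common Grams step, which brings the term frequency of "good" back to 1 in this model.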

Coding Defined

a blog for those who code
