a blog for those who code

Sunday 10 February 2019

We Do Not Use CommonGramsFilterFactory for Stop Word Filtering in Solr

The Common Grams Filter creates word shingles by combining common tokens (the ones listed in the stop words file) with the regular tokens next to them. This is only useful when I want to run a phrase query that contains common words. For example, if my stop words contain the word 'the', the text "the good" produces the tokens the, the_good, good.




Note that the Whitespace Tokenizer runs before the Common Grams Filter.
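
As a minimal sketch of the behavior described above, the chain below contains only the two components involved. It assumes stopwords.txt contains the word "the"; the token listing in the closing comment is simply the output described above for the text "the good".

```xml
<!-- Minimal sketch: Whitespace Tokenizer followed by the Common Grams Filter.
     Assumes stopwords.txt contains "the". -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
</analyzer>
<!-- Input text:    the good
     Output tokens: the, the_good, good -->
```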

Our index-time analyzer is shown below. The Whitespace Tokenizer comes first, followed by the Synonym filter, CommonGramsFilterFactory, the Word Delimiter filter factory and finally the LowerCase filter.

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
          ignoreCase="true" expand="true"/>
  <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
          words="stopwords.txt" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
          generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>


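The problem shows up when you search for good: a document that has a stop word next to it gets a drastically higher score. The Word Delimiter filter runs after the Common Grams Filter and splits the gram of_good at the underscore, so the text "of good" is indexed as "of|of|ofgood|good|good" and the term frequency of "good" is 2 instead of 1. We do not actually need phrase queries on common words, so we do not use CommonGramsFilterFactory. If you do need them, keep CommonGramsFilterFactory and set omitTermFreqAndPositions on the field so that the duplicated term does not inflate the score.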
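
The two choices mentioned above (dropping the Common Grams Filter, or keeping it together with omitTermFreqAndPositions) could look roughly like this in the schema. This is a sketch, not our actual configuration: the field name content and the type name text_common_grams are hypothetical, and using StopFilterFactory as the replacement is an assumption about how plain stop word removal would be configured. Keep in mind that omitTermFreqAndPositions also drops position data, so test any position-dependent queries on that field before relying on it.

```xml
<!-- Option 1 (assumption): remove stop words instead of building grams.
     StopFilterFactory takes the place of CommonGramsFilterFactory in the chain. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

<!-- Option 2: keep CommonGramsFilterFactory and omit term frequencies on the field.
     "content" and "text_common_grams" are hypothetical names. -->
<field name="content" type="text_common_grams" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>
```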
