Brazilian Portuguese now part of the stopword module

Nice! But what is a stopword list, you ask?

Espen Klem
norch
2 min read · Sep 10, 2017


Most search engines need a list of words that you don’t want indexed. These words bear little or no meaning, e.g. “a”, “the”, “for”, “but” etc. Each document going into the search index gets these words removed. If they are not removed, your result sets will include way too many documents, queries will take longer to process, and the search index files will get really big.
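
To make that concrete, here’s a minimal sketch of index-time stopword removal. This is a hypothetical example, not the actual Search-index internals: tokenise the document, then drop every token that appears in the stopword set before it reaches the index.

```javascript
// Hypothetical index-time stopword removal, not Search-index internals.
const stopwords = new Set(['a', 'the', 'for', 'but', 'and', 'of'])

function tokensForIndexing (documentText) {
  return documentText
    .toLowerCase()
    .split(/\W+/)
    .filter(token => token.length > 0 && !stopwords.has(token))
}

console.log(tokensForIndexing('The quick brown fox jumps for the lazy dog'))
// -> [ 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog' ]
```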

Our stopword module for Search-index and Norch now also has Brazilian Portuguese as an option, thanks to Micael Levi! Open source is just brilliant =)
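
Using it looks roughly like the sketch below. It assumes the module’s removeStopwords function and a Brazilian Portuguese list export, shown here as sw.br — check the README for the exact identifier in the version you install.

```javascript
// Hedged usage sketch of the stopword module.
// sw.br is assumed to be the Brazilian Portuguese list; verify against the README.
const sw = require('stopword')

const tokens = 'a rápida raposa marrom salta sobre o cão preguiçoso'.split(' ')
const withoutStopwords = sw.removeStopwords(tokens, sw.br)

console.log(withoutStopwords)
```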

Stopwords for English. Copyright © 2011, Chris Umbel

Which languages are included in stopword so far?

  • Modern Standard Arabic
  • Bengali
  • Brazilian Portuguese
  • Danish
  • German
  • English
  • Spanish
  • Farsi
  • French
  • Hindi
  • Italian
  • Japanese
  • Dutch
  • Norwegian
  • Polish
  • Portuguese
  • Russian
  • Swedish
  • Chinese Simplified

How to go about creating a stopword list?

First of all, you need a big enough corpus of text documents that somewhat represents the type of text you’ll be adding to your index. Then you need to find the words that bear little or no meaning. The easiest way is to count how many times each word is used. Simplified, we can say that the words used the most bear the least meaning. When you have a list of all the words used, sorted from most to least used, you need to find the breaking point between meaningless and meaningful words. Some of the stopword lists in the stopword module are based on the most used words in movie subtitles for a given year, like our Danish stopword list.
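
A rough sketch of that counting approach: tally word frequencies across the corpus, sort from most to least used, and take the top N words as stopword candidates. The breaking point N is something you have to choose yourself by inspecting the list.

```javascript
// Count word frequencies across a corpus and return the topN most used
// words as stopword candidates. topN is the "breaking point" you pick
// by hand after looking at the sorted list.
function frequencyStopwordCandidates (corpusDocuments, topN = 100) {
  const counts = new Map()
  for (const doc of corpusDocuments) {
    for (const word of doc.toLowerCase().split(/\W+/)) {
      if (word) counts.set(word, (counts.get(word) || 0) + 1)
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([word]) => word)
}
```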

A paper on how to create a Punjabi stopword list discusses a better and more accurate way of creating a stopword list. Basically, you take the frequency of each word in your text corpus and combine it with the mean probability of the same word occurring in a single document. You can check it out for yourself: Automated Stopwords Identification in Punjabi Documents, by Rajeev Puri, Dr. R.P.S. Bedi and Dr. Vishal Goyal.
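
Here’s a rough sketch of that idea as I read it — not the paper’s exact formulation: score each word by its total corpus frequency multiplied by the mean probability of the word occurring in a single document, then rank by that score.

```javascript
// Sketch of the frequency × mean-probability idea (my interpretation,
// not the paper's exact method). The mean is taken over all documents here.
function scoreStopwordCandidates (corpusDocuments) {
  const totalCounts = new Map()
  const probabilitySums = new Map()

  for (const doc of corpusDocuments) {
    const words = doc.toLowerCase().split(/\W+/).filter(Boolean)
    const perDocCounts = new Map()
    for (const word of words) {
      perDocCounts.set(word, (perDocCounts.get(word) || 0) + 1)
      totalCounts.set(word, (totalCounts.get(word) || 0) + 1)
    }
    for (const [word, count] of perDocCounts) {
      // probability of the word within this one document
      probabilitySums.set(word, (probabilitySums.get(word) || 0) + count / words.length)
    }
  }

  const numDocs = corpusDocuments.length
  return [...totalCounts.entries()]
    .map(([word, freq]) => ({ word, score: freq * (probabilitySums.get(word) / numDocs) }))
    .sort((a, b) => b.score - a.score)
}
```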

A stopword list trainer module?

Based on Rajeev Puri, Dr. R.P.S. Bedi and Dr. Vishal Goyal’s paper, I’m looking into creating a stopword trainer. The obvious reason is that it will improve some of our existing lists. We will also be able to get stopword lists for more languages. And last, people will be able to create their own specialized stopword lists that fit their content, whether that’s an intranet, a forum, or a genre, and also choose how many words the list should contain.

Shout out if you need a stopword list for a language that isn’t in the library!
