A stopword list for any language in the world

Espen Klem
Published in norch, Apr 14, 2018

As described before, lists of stopwords can be useful for natural language processing and information retrieval. We have stopword, a library of stopword lists for many different languages, but not for every language in the world. We also have stopword-trainer, a library for “training”, or creating, lists of stopwords. But to train a stopword list you need a large document corpus, and that is the main show-stopper for creating a stopword list for any language in the world.

Here Wikipedia comes in handy

Test-crawling the Norwegian Wikipedia to check that it works fine as data for the stopword-trainer.

We’re now in the process of making a stopword list for any language that has a Wikipedia site. Wikipedia has sites for approximately 200 languages, and each of them is a good document corpus for training or generating a stopword list.
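To make the idea concrete, here is a minimal sketch of turning a document corpus into a frequency-ranked word list. It is not how stopword-trainer is implemented; the `documents` sample, the `rank_by_frequency` helper, and the naive regex tokenizer are all assumptions made for illustration. Real Wikipedia text, and scripts that rely on combining marks, would need a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical stand-in for text extracted from Wikipedia articles.
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang in the tree.",
]

def rank_by_frequency(docs):
    """Return words sorted from most used to least used across all docs."""
    counts = Counter()
    for doc in docs:
        # Naive tokenization: lowercase, then take runs of word characters.
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common()]

ranked = rank_by_frequency(documents)
print(ranked[:2])  # ['the', 'cat']
```

With a corpus this small the ranking is mostly noise. The point of using a whole Wikipedia is that the words at the top of the list become stable enough to be worth reviewing.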

A nice side effect

Stopword-trainer generates a list of words sorted by frequency, from most used to least used. With this ordering, instead of an alphabetical one, you can easily set a threshold for how many different stopwords should be removed. This lets you fit the list to your needs: an aggressive approach with lots of different words removed, or just the bare minimum?
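A small sketch of what such a threshold could look like in practice. The `ranked_words` list and the `remove_stopwords` helper are hypothetical and only illustrate the idea; they are not part of the stopword or stopword-trainer APIs.

```python
# Hypothetical frequency-sorted list, most used word first.
ranked_words = ["the", "of", "and", "in", "to", "is", "search", "language"]

def remove_stopwords(tokens, ranked, threshold):
    """Drop tokens that are among the `threshold` most frequent words."""
    stopwords = set(ranked[:threshold])
    return [t for t in tokens if t not in stopwords]

tokens = "the language of search is the topic".split()

# Bare minimum: only the two most frequent words are removed.
minimal = remove_stopwords(tokens, ranked_words, 2)
# Aggressive: the six most frequent words are removed.
aggressive = remove_stopwords(tokens, ranked_words, 6)

print(minimal)     # ['language', 'search', 'is', 'topic']
print(aggressive)  # ['language', 'search', 'topic']
```

Because the list is ordered by frequency, changing one number is enough to move between the two approaches. An alphabetical list would not allow that.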

We need a little help to make this happen

Okay! We have the tools we need to do this quite effectively, but it is still a lot of work. And there is one part of that work we can’t do well enough ourselves: validating a stopword list for a new language. For all the languages we don’t understand, we need someone who can read that language to tell us if the result looks okay. So if you need a stopword list for your language, and have the time to look through and validate the end result, do the following:

  • Create an issue over at the stopword module and tell us which language you need a stopword list for, along with the URL of the matching Wikipedia site.
  • Read through the stopword list we’ve generated. We’ll ping you when we’re ready and need your input.

The first thing we’re going to do is to try to generate a stopword list for Punjabi, both for Gurmukhi and Shahmukhi.

So, what language do you need a stopword list for? Let us know!
