A stopword list for any language in the world
As described before, lists of stopwords are useful in natural language processing and information retrieval. We have stopword, a library of stopword lists for many different languages, but not for every language in the world. We also have stopword-trainer, a library for “training”, or creating, lists of stopwords. But to train a stopword list you need a large document corpus, and that’s the main show-stopper for creating a stopword list for any language in the world.
Here Wikipedia comes in handy
We’re now in the process of making a stopword list for any language that has a Wikipedia site. Wikipedia has sites for approximately 200 languages, and each of them is a ready-made document corpus for training or generating a stopword list.
A nice side effect
Stopword-trainer generates a list of words sorted by frequency, from most used to least used. Because the list is sorted by frequency instead of alphabetically, you can easily set a threshold for how many different stopwords should be removed. That lets you fit the list to your needs: an aggressive approach with lots of different words removed, or just the bare minimum?
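To make the threshold idea concrete, here is a minimal sketch in Python. It is not stopword-trainer's API or its actual algorithm: the function names `frequency_sorted_words` and `remove_stopwords`, the tiny English corpus, and the naive whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def frequency_sorted_words(documents):
    """Return the distinct words of a corpus, most used first (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        # Naive whitespace tokenization; a real corpus needs language-aware tokenizing.
        counts.update(text.lower().split())
    # Descending count, ties broken alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked]


def remove_stopwords(tokens, ranked_words, threshold):
    """Drop every token found among the `threshold` most frequent words."""
    stopwords = set(ranked_words[:threshold])
    return [token for token in tokens if token.lower() not in stopwords]


# Toy corpus, made up for this example.
corpus = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog sat on a log",
]
ranked = frequency_sorted_words(corpus)
sentence = "the cat sat on a mat".split()

# Bare minimum: only the two most frequent words are treated as stopwords.
minimal = remove_stopwords(sentence, ranked, threshold=2)
# Aggressive: the six most frequent words are treated as stopwords.
aggressive = remove_stopwords(sentence, ranked, threshold=6)
```

With a corpus this small the aggressive threshold already eats content words such as "cat", which is why a large corpus and a human check of the result both matter.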
We need a little help to make this happen
Okay! We have the tools we need to do this quite effectively, but it is still a lot of work. And one part of that work we can’t do well enough ourselves: validating a stopword list for a new language. For all the languages we don’t understand, we need someone who can read that language to tell us if the result looks okay. So if you need a stopword list for your language, and have the time to look through and validate the end result, do the following:
- Create an issue over at the stopword module and tell us which language you need a stopword list for and a URL for the matching Wikipedia site.
- Read through the stopword list we’ve generated. We’ll ping you when we’re ready and need your input.
The first thing we’re going to try is generating a stopword list for Punjabi, in both the Gurmukhi and Shahmukhi scripts.
So, what language do you need a stopword list for? Let us know!