Stopword list for Punjabi Gurmukhi published

Espen Klem
Published in norch
2 min read · Oct 17, 2018

Yay! We now have a stopword list for Punjabi Gurmukhi, our first since I wrote the “A stopword list for every language in the world” blog post. Thanks to Manmeet Singh and Wikipedia (Punjabi Gurmukhi).

How it’s generated

I’m using the wikipedia-stopword-crawler in a two-step process. First, the Punjabi Gurmukhi Wikipedia site is crawled for article URLs. Then these URLs are crawled, and the content of all paragraph tags within each article is stored. These paragraphs are fed to the stopword-trainer, which identifies stopwords and sorts them by frequency.
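To make the idea concrete, here is a minimal Python sketch of the last two steps: pulling text out of paragraph tags and ranking words by how often they occur. It is not the code of wikipedia-stopword-crawler or stopword-trainer; the names `ParagraphExtractor`, `extract_paragraphs` and `rank_by_frequency` are made up for illustration, and plain whitespace splitting plus raw counts is an assumption about what a simple trainer could do.

```python
from collections import Counter
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> tags (menus, footers) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts of one article page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


def rank_by_frequency(paragraphs):
    """Return all words, most frequent first (assumed ranking rule)."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(paragraph.split())  # naive whitespace tokenizer
    return [word for word, _ in counts.most_common()]
```

Feeding `rank_by_frequency` the paragraphs of every crawled article would give one long, frequency-sorted candidate list, which still needs proofreading by someone who knows the language.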

The list is really long. Since it is sorted by frequency, you can cut it from the bottom to make stopword removal less aggressive.
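Cutting from the bottom can be sketched like this in Python. The function names `trim_stopwords` and `remove_stopwords` are hypothetical and not part of the Stopword module; the sketch only assumes that the list is ordered with the most frequent word first.

```python
def trim_stopwords(ranked_stopwords, keep):
    """Keep only the `keep` most frequent entries of a frequency-sorted list."""
    return ranked_stopwords[:keep]


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list."""
    blocked = set(stopwords)  # set lookup keeps filtering fast for long lists
    return [token for token in tokens if token not in blocked]
```

A smaller `keep` value removes fewer words from your text, so it is worth trying a few cut-off points against your own content.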

Test for Punjabi Gurmukhi stopword removal.

Minor flaws

The full stop in Punjabi Gurmukhi is different from “.”, so it wasn’t stripped before the calculation. As a result, some words with a full stop at the end were counted as separate words. I’ve removed these manually. The words themselves are still in the list, but the ranking is a little off since the frequency count was split between [word] and [word][full stop].
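One way to repair the split counts, sketched in Python: strip the trailing full stop from every token and add the counts together. This assumes the full stop in question is the danda “।” (U+0964), and `merge_danda_variants` is a made-up name for illustration, not something the stopword-trainer provides.

```python
from collections import Counter

DANDA = "\u0964"         # "।", assumed to be the Gurmukhi full stop
DOUBLE_DANDA = "\u0965"  # "॥", included in case it shows up in the text


def merge_danda_variants(counts):
    """Fold [word][full stop] counts into the count for [word]."""
    merged = Counter()
    for token, count in counts.items():
        word = token.rstrip(DANDA + DOUBLE_DANDA + ".")
        if word:  # skip tokens that were nothing but punctuation
            merged[word] += count
    return merged
```

Running this before sorting would let a word that often ends a sentence get its full frequency, instead of two smaller ones.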

Some of the first articles, which describe different years in history, are really short and share a lot of the same words. These words normally carry meaning and wouldn’t count as stopwords, but they have ended up higher on the list.

Along the same lines, Wikipedia is an encyclopedia with its own tribal language, which doesn’t fully match the text you want to remove stopwords from. This is true for every use case of stopword removal.

Next language?

At first I thought Punjabi was one language. After a little research, I figured out that the written language is split into two: Punjabi Shahmukhi and Punjabi Gurmukhi. The latter is now done, so next I need to get going on Punjabi Shahmukhi. Anyone up for helping out?

Need a stopword list for your language?

Step over to the Stopword module and create an issue. I’ll do most of the work, but since I probably don’t know the language, I need someone to proofread the results of the different steps of the process.

Espen Klem, editor for norch. Designing - Creating - Dismantling - Socialising - Nerding. Interaction Designer at Knowit. Tinkering with search when I can.