An analysis of using Wikipedia to generate stopword lists

Espen Klem · Published in norch · Mar 11, 2019

A quick analysis of how to generate stopword lists that are good enough for natural language processing. Basically: how many documents do you need to analyse before the list is good enough for, say, a search engine, and which manual steps do you still need to do?

Or, in other words: how many documents do you need to crawl from a Wikipedia site before you have enough for a good stopword analysis?

Stopwords are words that bear little or no meaning, and for that reason you want to remove them before you use the text in some form of natural language processing, like search.

I’ve created a small program for generating lists of stopwords based on a document corpus. It’s called stopword-trainer, and it calculates “stopwordiness” based on how many times a word is used in total, combined with how many documents the word is found in.

stopWordiness = (termInCorpus / totDocs) * (1 / (Math.log(totDocs/(termInDocs - 1))))
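
To make the formula a bit more concrete, here is a minimal sketch of how it could be applied to a whole corpus. This is not the actual stopword-trainer code, just an illustration of the formula above, with variable names matching the ones in the formula:

// Minimal sketch (not the actual stopword-trainer code) of ranking every
// term in a corpus by the stopwordiness formula above.
// docs is an array of already tokenised documents (arrays of words).
function rankByStopwordiness(docs) {
  const totDocs = docs.length;
  const termInCorpus = {}; // total occurrences of each term
  const termInDocs = {};   // number of documents each term occurs in

  for (const doc of docs) {
    const seen = new Set();
    for (const term of doc) {
      termInCorpus[term] = (termInCorpus[term] || 0) + 1;
      if (!seen.has(term)) {
        termInDocs[term] = (termInDocs[term] || 0) + 1;
        seen.add(term);
      }
    }
  }

  // Same formula as above. In JavaScript a term found in only one document
  // gives totDocs / 0 = Infinity inside the log, so its score becomes 0.
  const scores = Object.keys(termInCorpus).map(term => [
    term,
    (termInCorpus[term] / totDocs) *
      (1 / Math.log(totDocs / (termInDocs[term] - 1)))
  ]);

  return scores.sort((a, b) => b[1] - a[1]); // highest stopwordiness first
}

// Toy example: frequent, widely spread words float to the top.
const docs = [
  ['og', 'det', 'er', 'en', 'asteroide'],
  ['det', 'er', 'og', 'blir', 'slik'],
  ['en', 'og', 'er', 'to', 'tre']
];
console.log(rankByStopwordiness(docs).slice(0, 4));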

The test

For this test I’ve crawled 108,501 documents from the main Norwegian Wikipedia site and made five differently sized batches of the document corpus:

  • 100 documents
  • 1,000 documents
  • 10,000 documents
  • 40,000 documents
  • 108,501 documents

I’ve only extracted paragraphs of text, excluding titles, sub-titles, lists and links. The idea is that the paragraphs are the chunks of text with the lowest ratio of words that bear any meaning, and therefore the densest source of stopword candidates.
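
As a rough illustration of that extraction step (a sketch, not the crawler I actually used), you can keep only what is inside the <p> elements and strip the remaining markup:

// Rough sketch of the extraction idea: given an article's HTML, keep only
// the text inside <p> elements and drop titles, lists and links.
// A naive regex approach for illustration, not production-grade parsing.
function extractParagraphText(html) {
  const paragraphs = html.match(/<p[^>]*>[\s\S]*?<\/p>/gi) || [];
  return paragraphs
    .map(p => p.replace(/<[^>]+>/g, ' '))   // strip tags, including links
    .map(p => p.replace(/\s+/g, ' ').trim())
    .filter(p => p.length > 0)
    .join('\n');
}

const page = '<h1>Tittel</h1><p>Dette er et <a href="#">avsnitt</a>.</p><ul><li>liste</li></ul>';
console.log(extractParagraphText(page)); // "Dette er et avsnitt ."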

Peculiarities of Wikipedia to look out for

The batches with 100 and 1,000 documents generated stopword lists with a lot of noise. They were kind of weak. Nothing strange there.

What was more surprising was that analysing the 10,000-document batch gave me the word “hovedbelteasteroide”, or “main-belt asteroid” in English, as the word with the 26th highest stopwordiness. The word is found 1,802 times within the first 10,000 documents of the 108,501 documents in the total corpus. It is found only 7 more times in the batch of 40,000 documents, and just 4 more times in the full batch of 108,501 documents.

Wikipedia is an encyclopedia, and it has a certain structure we need to understand. Since I’m crawling documents sorted alphabetically by title, I’ll get all the titles starting with special characters first, then all the numbers, and then the titles starting with a letter.

Another type of structure that appears is the historical reference / article stub near the beginning. They are very short and more or less follow a template. That means you might have words like “dead”, “born” and “year” ending up in your list of stopwords.

For non-English languages: for different reasons, there will quite possibly be some English words in your stopword list. Many of the smaller Wikipedias have some English text, and most reference the titles of English movies, music and literature. These need to be weeded out.

Difference between the batches

The 100- and 1,000-document corpora didn’t do that well. Stranger was that the 10,000-document corpus also had a lot of noise.

The 10,000-document corpus (to the right) still has a lot of strange stuff going on.
The difference between the 40,000- and 108,501-document corpora is minimal.

Comparison (GitHub diffs)

The winner in terms of effort versus result is the 10,000-selected-documents corpus.

Conclusion and a pleasant surprise

The conclusion is that it’s a moving target. The bigger the Wikipedia site, the more “main-belt asteroid”-type issues you will have. But for Norwegian, which had a little less than 500,000 articles at the time of crawling, 40,000 documents was more than enough. The batch with 108,501 documents did marginally better, but maybe not enough better to justify all the extra crawling and processing. You’ll always need some manual control and editing at the end.

What came as a surprise was how well the 10,000-selected corpus did. It consists of the documents in the range 10,001–20,000, skipping all the asteroids and historic events. Only 4 words out of 100 differ if you compare it to the 108,501-document corpus. That’s a lot of saved time and server requests!

Next time I’ll set the starting point a bit into the document corpus, skipping all articles starting with special characters and numbers.
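
In code, that is just a matter of offsetting the slice of the crawled corpus (a sketch under the assumption that the articles are kept in alphabetical order):

// Sketch: skip the first chunk of alphabetically ordered articles (special
// characters, numbers, asteroid stubs) and take the next batch for training.
function selectTrainingBatch(allDocs, start = 10000, size = 10000) {
  return allDocs.slice(start, start + size); // e.g. documents 10001-20000
}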

The near perfect solution?

You could regularly create your own list of stopwords based on the actual text you are analysing, instead of using Wikipedia. So if you were to make a search engine, you could generate a stopword list based on a subset of the actual content you add to your search engine.
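
A sketch of what that could look like, reusing the rankByStopwordiness() function sketched earlier (the helper and its parameters here are made up for illustration, not an existing stopword-trainer or search-engine API):

// Sketch (hypothetical helper): regenerate the stopword list from a sample
// of the documents already added to the search engine, using the
// rankByStopwordiness() function sketched earlier.
function regenerateStopwords(indexedDocs, sampleSize = 10000, listSize = 100) {
  const sample = indexedDocs.slice(0, sampleSize); // or draw a random sample
  return rankByStopwordiness(sample)
    .slice(0, listSize)
    .map(([term]) => term);
}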

Another way of avoiding defining the “main-belt asteroid” word as a stopword could be to add it to a red list (a blacklist for the blacklist), so it would automatically be prevented from being added to the list of stopwords.
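
Such a red list could be as simple as one extra filter applied to the candidates right before the final stopword list is produced (again a sketch, not an existing stopword-trainer option):

// Sketch: a "red list" of words that must never become stopwords,
// applied as a final filter on the generated candidates.
const redList = new Set(['hovedbelteasteroide']);

function applyRedList(candidates) {
  return candidates.filter(word => !redList.has(word));
}

console.log(applyRedList(['og', 'hovedbelteasteroide', 'det'])); // ['og', 'det']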

In any case, every noun should be removed from the list. The Norwegian word for “main-belt asteroid” is a good first query in a search scenario, as long as the search solution has some way of filtering the initial search result.

If the tool were only used on Wikipedia, the algorithm for calculating stopwordiness could be tweaked to put more emphasis on the spread across documents relative to the total document corpus. But it’s a generic tool, so I would probably just end up with some other unforeseen effect.
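
For illustration only, here is one way such a tweak could look (a sketch, not something stopword-trainer does): raise the document-spread part of the score to a power greater than one, so a word has to be spread across many documents, and not just be frequent in a cluster of similar articles, to score high.

// Sketch of one possible tweak (not part of stopword-trainer): weight the
// document-spread factor more heavily by raising it to a power above 1.
const spreadWeight = 2; // 1 keeps the original behaviour, >1 favours spread

function tweakedStopwordiness(termInCorpus, termInDocs, totDocs) {
  const frequency = termInCorpus / totDocs;
  const spread = 1 / Math.log(totDocs / (termInDocs - 1));
  return frequency * Math.pow(spread, spreadWeight);
}

// Toy comparison: a word spread across most documents vs. a word that is
// frequent but clustered in a small share of them.
console.log(tweakedStopwordiness(9000, 8500, 10000)); // widely spread word
console.log(tweakedStopwordiness(1800, 300, 10000));  // frequent but clustered

Whether an exponent like that helps outside Wikipedia is exactly the kind of unforeseen effect mentioned above, so it would need testing on other corpora.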
