Why is removing stop words not always a good idea

Wilame
3 min readJan 22, 2019

--

In a lot of tutorials about Machine Learning applied to text, you may read that removing stop words is a necessary pre-processing step. Apparently, removing stop words is not only necessary, but is also a must do. But this is not always true. Let’s see why.

What are stop words

The definition of what’s a stop word may vary. You may consider a stop word a word that has high frequency on a corpus. Or you can consider every word that’s empty of true meaning given a context.

Words such as articles and some verbs are usually considered stop words because they don’t help us to find the context or the true meaning of a sentence. These are words that can be removed without any negative consequences to the final model that you are training.

But here’s the catch: there’s no universal stop words list because a word can be empty of meaning depending on the corpus you are using or on the problem you are analysing.

This means that any word can be a stop word depending on what you are trying to do. That’s why there’s a lot of debate about the need of removing stop words.

A matter of performance

Reducing the data set size is without any doubt a way of increasing performance. Training models takes time and if you have less tokens to be trained, the training time should decrease.

In addition, techniques such as TF-IDF give more value to rare words than to very repetitive tokens. So, let’s say that you are working on a problem for a bank about one of their products. The name of this product could be a word with high frequency on your corpus, therefore, you could consider it a stop word.

But now, let’s say that you are working on a problem about public transportation and suddenly, you find the name of a very rare disease on your corpus. Since the frequency of this word is very low, we can consider it a rare token with a high weight.

What’s the main point here?

Word importance may vary depending on the dataset. But it may also change depending on the goal you are trying to achieve. Problems like sentiment analysis are much more sensitive to stop words removal than document classification.

An example could be the following sentence: “I told you that she was not happy”. Let’s remove the stop words with the Aruana library:

The result would be [‘told’, ‘happy’].

For sentiment analysis purposes, the overall meaning of the resulting sentence is positive, which is not at all the reality. Another problem of removing stop words from the model is that it’s crucial to have these tokens when our goal is to generate text or to work with search engines.

So, when should I remove stop words?

You should remove these tokens only if they don’t add any new information for your problem. Classification problems normally don’t need stop words because it’s possible to talk about the general idea of a text even if you remove stop words from it.

Observe the following output:

[‘sociology’, ‘scientific’, ‘study’, ‘society’, ‘patterns’, ‘social’, ‘relationships’, ‘social’, ‘interaction’, ‘culture’, ‘everyday’, ‘life’, ‘social’, ‘science’, ‘uses’, ‘various’, ‘methods’, ‘empirical’, ‘investigation’, ‘critical’, ‘analysis’, ‘develop’, ‘body’, ‘knowledge’, ‘social’, ‘order’, ‘acceptance’, ‘change’, ‘social’, ‘evolution’, ‘sociologists’, ‘conduct’, ‘research’, ‘may’, ‘applied’, ‘directly’, ‘social’, ‘policy’, ‘welfare’, ‘others’, ‘focus’, ‘primarily’, ‘refining’, ‘theoretical’, ‘understanding’, ‘social’, ‘processes’, ‘subject’, ‘matter’, ‘ranges’, ‘micro’, ‘sociology’, ‘level’, ‘individual’, ‘agency’, ‘interaction’, ‘macro’, ‘level’, ‘systems’, ‘social’, ‘structure’]

Are you able to say what’s the main theme of this text? I bet you said it’s something related to social or sociology. No stop words are required to tell you this. Here’s the code with the original text after pre-processing:

So, for theme classification, stop words are useless. In any other case, it’s better to keep these words and do some tests with and without them so see how it affects the model. Anyway, you should never remove stop words without thinking about the impact of these words on the problem you are trying to solve.

--

--