Why you should avoid removing STOPWORDS

Does removing stopwords really improve model performance?

Gagandeep Singh
Jun 24

It is very common to remove stop words from text during preprocessing. I agree it is often useful, but you should be careful about what kind of stopwords you are removing.

The most common way to remove stop words is to use NLTK’s stopwords list.

Let’s look at the list of stop words from nltk.
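A quick way to print the full list (a minimal sketch; the stopwords corpus needs to be downloaded once):

```python
import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus (only needed once)
nltk.download('stopwords')

# NLTK's built-in English stopword list
stop_words = stopwords.words('english')
print(len(stop_words))  # roughly 180 words, depending on the NLTK version
print(stop_words)
```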

Now, look at the negation words in this list: no, not, nor, don’t, didn’t, doesn’t and so on.

So, the question is: what is wrong with them?

Let’s imagine you are asked to create a model that does sentiment analysis of product reviews. The dataset is small enough that you label it yourself. Consider a few reviews from the dataset.

1. The product is really very good. — POSITIVE

2. The products seems to be good. — POSITIVE

3. Good product. I really liked it. — POSITIVE

4. I didn’t like the product. — NEGATIVE

5. The product is not good. — NEGATIVE

You performed preprocessing on the data and removed all stopwords.
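A typical removal step looks something like this (a minimal sketch using NLTK’s list; punctuation handling is deliberately simplified):

```python
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Strip punctuation from each token, then drop anything in the stopword list
    tokens = (word.strip(string.punctuation) for word in text.split())
    return ' '.join(w for w in tokens if w and w.lower() not in stop_words)

print(remove_stopwords("The product is not good."))   # -> product good
print(remove_stopwords("I didn't like the product.")) # -> like product
```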

Now, let us look at what happens to the samples we selected above.

1. product really good. — POSITIVE

2. products seems good. — POSITIVE

3. Good product. really liked. — POSITIVE

4. like product. — NEGATIVE

5. product good. — NEGATIVE

Look at the negative reviews.

Scary, right?

The positive reviews don’t seem to be affected, but the negative ones have completely changed their meaning. If we train our model on this data, it is surely going to underperform.

This happens very often: after removing stopwords, the whole meaning of a sentence changes.

If you are working with basic NLP techniques like bag-of-words (BOW), CountVectorizer or TF-IDF (term frequency and inverse document frequency), then removing stopwords is a good idea because stopwords act like noise for these methods. If you are working with LSTMs or other models that capture semantic meaning, where the meaning of a word depends on the context of the preceding text, then it becomes important not to remove stopwords.
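For these bag-of-words-style methods, stopword removal is often built in. A minimal scikit-learn sketch for illustration (note that scikit-learn’s built-in English list also contains negations, so the same caveat applies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The product is really very good.",
    "The product is not good.",
]

# stop_words='english' applies scikit-learn's built-in stopword list,
# which (like NLTK's) also contains negations such as 'not'
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))  # 'not' and 'very' are gone
```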

Now, coming back to my original question: does removing stopwords really improve model performance?

Like I said earlier, it depends on what kind of stopwords you are removing. The problem is that if you do not remove stopwords at all, noise in the dataset increases because of words like I, my, me, etc.

So, what’s the solution? Creating a new, curated list of stopwords that are safe to remove; the remaining problem is reusing it across different projects.
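One way to build such a list is to start from NLTK’s and keep the negations, along these lines (a minimal sketch; the exact set of negations to keep is a judgment call):

```python
from nltk.corpus import stopwords

# Negation words that carry sentiment; keep them in the text.
# This particular set is chosen for illustration.
negations = {'no', 'not', 'nor', 'don', 'didn', 'doesn', 'isn', 'wasn',
             'weren', 'hasn', 'haven', 'hadn', 'won', 'wouldn', 'shouldn',
             'couldn', 'mightn', 'mustn', 'needn', 'shan', 'aren', 'ain'}

# Everything else in NLTK's list is treated as safe to remove
safe_stop_words = {w for w in stopwords.words('english')
                   if w not in negations and not w.endswith("n't")}

sample = "the product is not good"
print([w for w in sample.split() if w not in safe_stop_words])
# -> ['product', 'not', 'good']
```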

This is why I’ve created a Python package, nlppreprocess, which removes only those stopwords that are not necessary. It also has some additional functionality that can make cleaning text faster.

The best way to utilize its functionality is by connecting it with pandas as below:
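Something like this (a minimal sketch; the NLP class and its process method follow the package’s documented usage, so check the docs for the exact API):

```python
import pandas as pd
from nlppreprocess import NLP

# NLP() and process() follow nlppreprocess's documented usage;
# verify the exact API against the package docs.
nlp = NLP()

df = pd.DataFrame({'reviews': [
    "The product is really very good.",
    "I didn't like the product.",
]})

# Clean every review with the package's pipeline
df['reviews'] = df['reviews'].apply(nlp.process)
print(df)
```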

You can check its complete documentation on the package’s page.

Now, if we use this package to preprocess the above samples, we’ll get something like this:

1. product really very good. — POSITIVE

2. products seems good. — POSITIVE

3. Good product. really liked. — POSITIVE

4. not like product. — NEGATIVE

5. product not good. — NEGATIVE

Now, it seems reasonable to use this package for stopword removal and other preprocessing.

Let me know your opinion on this in the comments section.

Thank You!
