To Use or Lose: Stop Words in NLP

Moirangthem Gelson Singh
4 min read · Aug 29, 2023



In Natural Language Processing (NLP), the term “stopword” holds a significant place. It was coined by Hans Peter Luhn in 1958, a pioneer in the field of Information Retrieval (IR). In this blog, we’ll explore stopwords: their types, the benefits of removing them, and the scenarios in which their presence or absence can impact NLP outcomes.

What are Stopwords?

Stopwords are words that appear frequently in almost every document, contributing little semantic value. Examples include “The,” “is,” and “am.” These words may seem trivial, but they play a crucial role in shaping the efficiency and accuracy of NLP tasks. Let’s look further and try to understand their significance.
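
NLTK ships a ready-made list of such words that we can inspect directly (the exact size and ordering depend on your NLTK version):

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

english_stopwords = stopwords.words('english')
print(len(english_stopwords))    # size of the built-in English list
print(english_stopwords[:5])     # e.g. ['i', 'me', 'my', 'myself', 'we']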

Types of Stopwords

Stopwords can be categorized into two main types:

  1. Generic Stopwords: These are language-specific and ubiquitous words that are commonly found in various contexts. For the English language, examples include ‘a,’ ‘and,’ ‘the,’ ‘all,’ ‘do,’ and ‘so.’ They hold little significance on their own.
  2. Domain-Specific Stopwords: These are words specific to a particular domain, such as education, health, sports, or politics. For instance, in the education domain, words like ‘paper,’ ‘class,’ ‘study,’ and ‘book’ might be considered domain-specific stopwords (a sketch of combining both types follows after this list).
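
In practice, the two types are often combined: start from a generic list and extend it with terms that carry little signal in your corpus. A minimal sketch, assuming an education-domain corpus (the domain words are just the illustrative ones from above):

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# generic, language-specific stopwords from NLTK
generic_stopwords = set(stopwords.words('english'))

# hypothetical domain-specific stopwords for an education corpus
domain_stopwords = {'paper', 'class', 'study', 'book'}

# the combined set is what the pre-processing step filters against
all_stopwords = generic_stopwords | domain_stopwords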

Why Remove Stopwords

Stopwords, while they look simple, can pose challenges for NLP tasks such as indexing, topic modeling, and information retrieval. To enhance the signal-to-noise ratio in raw data, it’s essential to eliminate these redundant terms during the pre-processing step. The benefits are as follows:

  1. Reduced Text Count: Removing stopwords can decrease the word count in a document corpus by roughly 35–45%, depending on the corpus (a sketch for measuring this on your own data follows after this list). The simplified text focuses on the meaningful content, improving overall comprehension.
  2. Dataset Size Reduction: By discarding stopwords, the dataset’s size becomes smaller, retaining only the most informative components. This is particularly useful when working with memory-intensive tasks.
  3. Improved Performance: Information retrieval (IR) and text classification (TC) systems can experience a performance boost when stopwords are eliminated. The processed data results in better discrimination between documents.
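
The reduction figure varies by corpus, and you can measure it directly on your own data rather than taking the 35–45% range on faith. A minimal sketch, assuming a toy two-sentence corpus:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

# toy corpus; swap in your own documents
corpus = [
    "The world of NLP is fascinating and full of challenges.",
    "It is the removal of stopwords that shrinks the corpus.",
]

english_stopwords = set(stopwords.words('english'))
tokens = [t for doc in corpus for t in nltk.word_tokenize(doc)]
kept = [t for t in tokens if t.lower() not in english_stopwords]

# fraction of tokens dropped by stopword removal
reduction = 1 - len(kept) / len(tokens)
print(f"Token count reduced by {reduction:.0%}")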

When to Remove Stopwords

Whether to remove stopwords varies from task to task. Here are some scenarios in which their removal is beneficial:

  1. Information Retrieval: Tasks like IR, where documents are retrieved based on relevance, benefit from the removal of stopwords, since matches on ubiquitous function words would otherwise inflate relevance scores.
  2. Auto-Tag Generation and Text Classification: Removing stopwords helps in generating accurate tags and improving the classification of texts, contributing to more effective automated processes (see the vectorizer sketch after this list).
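
In many text-classification pipelines, stopword removal is a single switch in the feature extractor. A minimal sketch using scikit-learn’s CountVectorizer (scikit-learn is an assumption here, not something the rest of this post uses, but the idea carries over):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "A dog was in the yard."]

# vocabulary with stopwords kept: function words become features too
vec_all = CountVectorizer()
vec_all.fit(docs)
print(sorted(vec_all.vocabulary_))

# vocabulary with scikit-learn's built-in English stopword list removed
vec_filtered = CountVectorizer(stop_words='english')
vec_filtered.fit(docs)
print(sorted(vec_filtered.vocabulary_))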

When to Keep Stopwords

Conversely, there are situations where preserving stopwords is essential:

  1. Machine Translation: In tasks involving translation, stopwords can provide necessary context, ensuring the translation’s accuracy and fluency.
  2. Language Modeling and Text Summarization: Stopwords contribute to the natural flow of sentences and help models learn realistic language patterns.
  3. Sentiment Analysis: Sentiment tasks rely heavily on retaining stopwords (negations like ‘not,’ for instance) to capture the subtleties and emotional tone of the text.

Now, let us look at a simple example implemented in Python using the NLTK library to analyze stopwords.

Example 1:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

# sample text
text = "The world of NLP is fascinating and full of challenges."

# tokenization
tokens = nltk.word_tokenize(text)

# remove stopwords (a set makes the membership test fast)
english_stopwords = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in english_stopwords]

print("Original tokens:", tokens)
print("Filtered tokens:", filtered_tokens)

Output:
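
Original tokens: ['The', 'world', 'of', 'NLP', 'is', 'fascinating', 'and', 'full', 'of', 'challenges', '.']
Filtered tokens: ['world', 'NLP', 'fascinating', 'full', 'challenges', '.']

Notice that the trailing period survives: NLTK’s stopword list contains only words, so punctuation has to be filtered separately if you don’t want it.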

Let’s see another case where removing stopwords can go wrong.

Example 2:

import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')  # lexicon used by VADER for sentiment scoring

# sample text
text_with_stopwords = "I do not like mangoes."

# tokenization
tokens_with_stopwords = nltk.word_tokenize(text_with_stopwords)

# store English stopwords in a set
english_stopwords = set(stopwords.words('english'))

# filter tokens to remove stopwords
filtered_tokens_with_stopwords = [word for word in tokens_with_stopwords if word.lower() not in english_stopwords]

# convert filtered tokens back to a sentence
filtered_text = ' '.join(filtered_tokens_with_stopwords)

# perform sentiment analysis on the original and the filtered text
sia = SentimentIntensityAnalyzer()
sentiment_scores_original = sia.polarity_scores(text_with_stopwords)
sentiment_scores_filtered = sia.polarity_scores(filtered_text)

# display sentiment analysis results
print("Original Text:", text_with_stopwords)
print("Filtered Text:", filtered_text)
print("Sentiment Scores (Original):", sentiment_scores_original)
print("Sentiment Scores (Filtered):", sentiment_scores_filtered)

Output:
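
The exact scores depend on your NLTK/VADER version, but the pattern is the same: the filtered text becomes ‘like mangoes .’ because ‘I,’ ‘do,’ and ‘not’ are all on NLTK’s stopword list, so the original sentence receives a negative compound score while the filtered one scores positive.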

In the above example, removing the stopwords produces a false positive: stripping ‘not’ discards the negation, altering the structure and context of the sentence and flipping the sentiment analysis result from negative to positive.

Conclusion

We have now seen what stopwords are, the role they play in a document, and when to remove or retain them. As we progress in our NLP journey, one thing remains certain: the artful management of stopwords continues to be a skill shaped by experience and by an understanding of the ever-evolving landscape of natural language processing.
