Text-PreProcessing — Stopwords Removal In NLP

TejasH MistrY
4 min read · Apr 5, 2024


What are Stopwords?

Stopwords are common words in a language that are often filtered out during natural language processing tasks because they typically don’t carry significant meaning on their own. These words include articles (e.g., “a,” “an,” “the”), conjunctions (e.g., “and,” “but,” “or”), prepositions (e.g., “in,” “on,” “at”), and other frequently occurring words (e.g., “is,” “are,” “to”).

In natural language processing (NLP), stopwords are often removed from text data before analysis or processing to improve the efficiency and accuracy of algorithms. By removing stopwords, we focus on the important words that convey the main meaning of the text. For example, after stopword removal, the sentence “the cat sat on the mat” reduces to the content words “cat,” “sat,” and “mat.”

Stopwords in NLTK

NLTK (Natural Language Toolkit) provides a predefined list of stopwords for various languages, which can be used to filter out these common words from text data. This filtering process is often performed as a preprocessing step before tasks like text classification, sentiment analysis, or information retrieval.

Before removing stopwords from text data, you need to make sure the stopwords corpus has been downloaded.

How to Download Stopwords?

To download stopwords for natural language processing tasks, you typically need to use a library that provides access to pre-defined lists of stopwords.

One of the most commonly used libraries for this purpose is NLTK (Natural Language Toolkit) in Python. NLTK provides a collection of stopwords for various languages, including English.

To download stopwords, follow these steps:

1. First, ensure that you have NLTK installed. You can install it with pip if you haven’t already:

pip install nltk

2. Once NLTK is installed, you can download the stopwords data for a specific language. In this example, we’ll download stopwords for the English language:

import nltk

nltk.download('stopwords')

3. After running the nltk.download('stopwords') command, NLTK will download the stopwords data for the English language and store it locally on your system. Once the download is complete, you can access the stopwords list in your Python code.

import nltk

# Download stopwords for English language
nltk.download('stopwords')
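
Note that word_tokenize(), which we use in the example later in this article, relies on NLTK’s Punkt tokenizer models, which are a separate download (in recent NLTK releases the resource is named punkt_tab):

# Download the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')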

NLTK comes with a stopwords corpus that contains word lists for many languages. You can see the complete list of languages using the fileids() method.

from nltk.corpus import stopwords

stopwords.fileids()

['arabic',
'azerbaijani',
'basque',
'bengali',
'catalan',
'chinese',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hebrew',
'hinglish',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']

For those working with languages other than English, NLTK provides stop word lists for several other languages, such as German, Indonesian, Portuguese, and Spanish.

from nltk.corpus import stopwords

# Pre-defined lists of stopwords for different languages
stopwords.words("english")
stopwords.words("arabic")
stopwords.words("french")

Now let’s see a practical example of how to remove stopwords from text data using NLTK in Python.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords and the Punkt tokenizer models (uncomment on first run)
# nltk.download('stopwords')
# nltk.download('punkt')

# Sample text
text = "Natural language processing (NLP) is a fascinating field. It deals with how computers understand and interact with human language. Sentence tokenization is one of the basic tasks in NLP."

# Tokenize the text into words
words = word_tokenize(text)

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from the tokenized words
filtered_words = [word for word in words if word.lower() not in stop_words]

# Print the filtered words
print(filtered_words)
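
Running this script prints the remaining tokens. On a typical NLTK installation the output looks roughly like this (note that punctuation tokens survive, because they are not in the stopword list):

['Natural', 'language', 'processing', '(', 'NLP', ')', 'fascinating', 'field', '.', 'deals', 'computers', 'understand', 'interact', 'human', 'language', '.', 'Sentence', 'tokenization', 'one', 'basic', 'tasks', 'NLP', '.']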

Explanation of the Provided Code

1. Importing NLTK and Required Modules

  • The code begins by importing the necessary modules from NLTK, including stopwords and word_tokenize, which are used for stopwords handling and word tokenization, respectively.

2. Downloading Stopwords

  • The code includes commented-out download lines (nltk.download('stopwords') and nltk.download('punkt')), which you would run the first time you execute the script to fetch the stopwords corpus and the tokenizer models. Since they are commented out, it indicates that these resources have been downloaded previously.

3. Tokenizing the Text:

  • The word_tokenize() function from NLTK is used to tokenize the sample text into words. Tokenization is the process of splitting text into individual words or tokens.

4. Accessing English Stopwords:

  • The stopwords.words('english') function returns a list of English stopwords from NLTK's stopwords corpus. These stopwords include common words like "the," "is," "and," etc.

5. Filtering Stopwords:

  • The list comprehension [word for word in words if word.lower() not in stop_words] iterates over each word in the tokenized text (words). For each word, it checks if the lowercase version of the word is not in the set of English stopwords (stop_words). If the word is not a stopword, it is added to the filtered_words list.

6. Printing Filtered Words:

  • The filtered words, which exclude stopwords, are printed using the print() function.
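
As the output above shows, punctuation tokens such as “(” and “.” are not stopwords, so they survive the filter. If you want to drop them as well, a simple variant is to keep only alphabetic tokens (a sketch using Python’s built-in str.isalpha()):

# Keep only alphabetic tokens that are not stopwords
filtered_words = [
    word for word in words
    if word.isalpha() and word.lower() not in stop_words
]

print(filtered_words)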


TejasH MistrY

Machine learning enthusiast breaking down complex ML/AI concepts and exploring their real-world impact.