Stop Words in NLP
All about stop words in Natural language processing along with hands-on examples.
In this article, we will learn all about stop words for natural language processing.
In computing, stop words are words that are filtered out before or after natural language data (text) is processed. Although “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all NLP tools.
In this article, we will cover the following topics:
- What are stop words
- When to remove stop words
- Pros and Cons
- How to remove stop words in Python using:
* NLTK Library
* SpaCy Library
* Gensim Library
* Custom stop words
What are stop words?
Stop words are words in any language that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For some search engines, these are some of the most common, short function words, such as “the”, “is”, “at”, “which”, and “on”. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.
When to remove stop words?
If our task is text classification or sentiment analysis, then we should remove stop words, as they do not provide any information to our model; removing them keeps unwanted words out of our corpus. But if our task is language translation, stop words are useful, as they have to be translated along with the other words.
There is no hard and fast rule on when to remove stop words. But I would suggest removing them if the task at hand is Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment Analysis, or anything else related to text classification.
On the other hand, if our task is Machine Translation, Question Answering, Text Summarization, or Language Modeling, it is better not to remove the stop words, as they are a crucial part of these applications.
Pros and Cons:
One of the first things that we ask ourselves is what are the pros and cons of any task we perform. Let’s look at some of the pros and cons of stop word removal in NLP.
* Stop words are often removed from the text before training deep learning and machine learning models since stop words occur in abundance, hence providing little to no unique information that can be used for classification or clustering.
* On removing stopwords, dataset size decreases, and the time to train the model also decreases without a huge impact on the accuracy of the model.
* Stop word removal can potentially help improve performance, as fewer and only significant tokens are left. Thus, the classification accuracy could be improved.
Improper selection and removal of stop words can change the meaning of our text. So we have to be careful in choosing our stop words.
Ex: “This movie is not good.”
If we remove “not” in the pre-processing step, the sentence becomes “this movie is good”, which would be wrongly interpreted as positive.
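To make the risk concrete, here is a minimal sketch using a hand-picked stop word list (not any library's default, though many default lists do include negations such as “not”):

```python
# A hand-picked stop word list that, like many default lists, contains "not".
stop_words = {"this", "is", "not"}

sentence = "This movie is not good."
# Strip trailing punctuation from each word before comparing.
filtered = [w for w in sentence.split() if w.lower().strip(".") not in stop_words]

print(" ".join(filtered))  # "movie good." -- the negation is lost
```

Once “not” is dropped, any downstream sentiment model only sees “movie good.” and will likely label the review positive.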
How to remove stop words in Python using:
Removing stop words using python libraries is pretty easy and can be done in many ways. Let’s go through one by one.
Using NLTK library:
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language. It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.
Let’s see how we can remove stop words using the NLTK python library.
We can observe that words like ‘this’, ‘is’, ‘will’, ‘do’, ‘more’, and ‘such’ are removed from the tokenized vector, as they are part of NLTK’s stop word set. We can have a look at all the English stop words by printing stopwords.words('english').
Using SpaCy Library:
spaCy is an open-source software library for advanced natural language processing. spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems or to pre-process text for deep learning.
Before moving on make sure you install spaCy and its English language model. You can use the below commands to do that.
$ pip install -U spacy
$ python -m spacy download en_core_web_sm
Let’s look at how we can remove stop words using this library.
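A minimal sketch with spaCy follows; the sentence is my own, and a blank English pipeline is enough here, since the is_stop flag comes from the vocabulary rather than the trained en_core_web_sm model:

```python
import spacy

# A blank English pipeline suffices for stop word filtering;
# en_core_web_sm is only needed for tagging, parsing, etc.
nlp = spacy.blank("en")

text = "This is a good movie and I will watch more such movies."
doc = nlp(text)

filtered_tokens = [token.text for token in doc if not token.is_stop]
print(filtered_tokens)

# The size of spaCy's default English stop word list:
print(len(nlp.Defaults.stop_words))
```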
The NLTK and spaCy tokenized vectors without stop words are the same. However, spaCy has a larger set of stop words (326) than NLTK (179).
Using Gensim Library:
Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing. For more details checkout Gensim documentation.
Using Gensim, we can directly call remove_stopwords(), which is a method of gensim.parsing.preprocessing. We pass the sentence from which we want to remove stop words to the remove_stopwords() method, which returns the text string without them. We can then tokenize the returned string.
Let’s look at how we can remove stop words using the Gensim library.
We can observe that the output of NLTK, spaCy, and Gensim is the same even though each has a different default set of stop words. Let's look at Gensim's 337 stop words.
Custom stop words:
If you feel that the default stop words in any Python NLP tool are too many and cause loss of information, or too few to remove all the unnecessary words in your corpus, then you can opt for a custom stop word list.
For this, you can simply copy the default stop words into a list and append or delete words from it as required.
If we want only a few stop words, we can define our own list of stop words and use it to remove those words from our corpus.
my_stopword_list = ['the', 'is', 'as', 'a', 'are', 'in', 'this', 'that']
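Filtering with the custom list is then plain Python, with no library needed; the sample sentence is my own:

```python
# Our own small custom stop word list.
my_stopword_list = ['the', 'is', 'as', 'a', 'are', 'in', 'this', 'that']

text = "This is a sample sentence showing off the stop words filtration"
tokens = text.split()

# Compare lowercased tokens so capitalized stop words are caught too.
filtered_tokens = [w for w in tokens if w.lower() not in my_stopword_list]
print(filtered_tokens)  # ['sample', 'sentence', 'showing', 'off', 'stop', 'words', 'filtration']
```

To start from a library's defaults instead, copy them into a list (e.g. list(stopwords.words('english')) in NLTK) and append or remove entries before filtering.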
In this article, we have learned what stop words are and the pros and cons of removing them. We have also seen various libraries that can be used to remove stop words from a Python string, and how to add or remove words from the default stop word lists that different libraries provide in order to build custom stop word lists.
Full code as a Jupyter notebook is available in my GitHub.