Remove/Add Stop Words

Tushar Srivastava
7 min read · Aug 3, 2019

In this article we will discuss what Stop Words are and why they matter in data pre-processing, and we will stage a spaCy vs NLTK fight to see which library suits your needs best.

The most commonly used words in a language are called Stop Words.

For example, ‘a’, ‘the’, ‘is’ and ‘are’ are considered Stop Words.

Check Wikipedia for more info.

But what is the relevance of these Stop Words?

We humans talk a lot :). To make conversation smoother, our vocabulary contains words that carry little meaning on their own but help convey the intended meaning when combined with other words.

Confused? Okay, let's start with an example.

We will go to movie after the dinner

In the above sentence, some important words define the context of the sentence, such as “movie” and “dinner”.

Other words are there only to help us understand the full meaning of the sentence, such as “to”.

What if I say “We will go movie after the dinner”?

Yes, from an English grammar point of view the sentence doesn’t seem correct, but you still get the meaning, right?

From the above example we can deduce that some words can be removed from a sentence without hampering its meaning.

These words are added to a Stop Words list. There is currently no universal list of Stop Words, but there are libraries that ship with ready-made lists and can remove Stop Words from your input sentence. Isn’t this cool? Just add a library and all the unimportant words are removed.
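Before reaching for a library, the idea itself fits in a few lines of plain Python. This is only a sketch: the tiny `TOY_STOP_WORDS` set and the `remove_stop_words` helper are made up for illustration and do not come from any library.

```python
# Tiny hand-written stop word set, for illustration only (not from any library).
TOY_STOP_WORDS = {"we", "will", "to", "the", "after", "a", "is", "are"}


def remove_stop_words(sentence, stop_words):
    """Return the words of `sentence` that are not in `stop_words`.

    Words are split on whitespace and compared in lowercase.
    """
    return [word for word in sentence.split() if word.lower() not in stop_words]


sentence = "We will go to movie after the dinner"
content_words = remove_stop_words(sentence, TOY_STOP_WORDS)
print(content_words)
```

Which words survive depends entirely on what is in the list, and that is exactly where the two libraries differ.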

We will look at 2 such libraries and their lists of Stop Words, and compare their before and after results for the same input.

Before adding any library we should always ask: do I really need this library?

Let's take a deep dive into why Stop Words are irrelevant.

You must have heard about cleaning the input before you provide it to your model. Why waste precious resources (time and computation power) on input that contributes little or nothing to the result?

When working in Natural Language Processing we should follow the same practice. In general, the cleaner the input, the better the results will be.

By removing Stop Words we clean the input so that we can get more relevant output, faster and more accurately.

Let us try spaCy's Stop Words first.

To install it, run the command below in Anaconda Prompt.

pip install spacy==2.1.3

Open Jupyter Notebook and copy-paste the code below.
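The original notebook cell is not reproduced here, so the following is a minimal sketch of what it might look like. It assumes spaCy is installed and reads the English list from `spacy.lang.en.stop_words.STOP_WORDS`, which is a plain Python set.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a set, so sort it to get a stable, readable listing.
print(sorted(STOP_WORDS))

# The exact count depends on the installed spaCy version.
print("Number of Stop Words:", len(STOP_WORDS))
```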

Below is the output.

As we can see, spaCy currently has 312 Stop Words.

Now let us remove the Stop Words from my previous input sentence and see how it looks after that.

Open Jupyter Notebook and copy-paste the code below.
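Here is a hedged sketch of such a cell, not necessarily the code from the original notebook. It assumes the `en_core_web_sm` model and, as my own addition, falls back to a blank English pipeline when the model is missing, since the blank pipeline can still tokenize and flag Stop Words.

```python
import spacy

try:
    # Small English model, as referenced in the article.
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: if the model is not installed, a blank English pipeline
    # is enough here because is_stop is a purely lexical attribute.
    nlp = spacy.blank("en")

doc = nlp("We will go to movie after the dinner")

# Keep only the tokens that spaCy does not flag as Stop Words.
filtered = [token.text for token in doc if not token.is_stop]
print(filtered)
```

With a recent spaCy version this should leave only “movie” and “dinner”, which matches the result discussed later in the article.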

Note: If this code gives you an error, run the command below in Anaconda Prompt.

python -m spacy download en_core_web_sm

What if we want to add our own custom Stop Word?

If you want to add a single Stop Word, use the code below.
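Since the spaCy list is an ordinary Python set, one way to do this (a sketch, assuming the `STOP_WORDS` import path shown here) is the set's own `add()` method:

```python
from spacy.lang.en.stop_words import STOP_WORDS

count_before = len(STOP_WORDS)

# "Test" is the custom word used in the article.
STOP_WORDS.add("Test")

count_after = len(STOP_WORDS)
print(count_before, "->", count_after)
print("Test" in STOP_WORDS)
```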

Here I have added “Test” as a Stop Word. As you can see, the count has increased to 313 and a new entry has been added to the Stop Words list.

If you want to remove the Stop Word added above, use the code below.
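A possible version of that cell, again treating the list as a plain set. The word is added first only so that the snippet runs on its own.

```python
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS.add("Test")      # make sure the custom word is present
count_with_word = len(STOP_WORDS)

# remove() raises KeyError if the word is missing;
# discard() would silently do nothing in that case.
STOP_WORDS.remove("Test")

print(count_with_word, "->", len(STOP_WORDS))
```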

If you want to add a list of words then… you have guessed it right, we will provide a list to add.

Use the code below for the same.
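A sketch of adding several words at once with the set union operator. The second word, “Test2”, is a hypothetical name chosen for this example.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical custom words; only "Test" appears in the article.
new_words = {"Test", "Test2"}

count_before = len(STOP_WORDS)

# In-place union: only words that are not already in the set are added.
STOP_WORDS |= new_words
count_after = len(STOP_WORDS)

# Applying the same union again changes nothing.
STOP_WORDS |= new_words
count_after_repeat = len(STOP_WORDS)

print(count_before, count_after, count_after_repeat)
```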

Here we have added 2 Stop Words and the count has increased to 314.

We use the “|” symbol to add these 2 Stop Words because in Python | acts as the set union operator. This means the 2 words are added to the Stop Words list only if they are not already present; otherwise they are discarded.

If you want to remove these 2 Stop Words, use the code below.
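The mirror image of the union is the set difference. A sketch, with the same hypothetical words added first so that the snippet is self-contained:

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical custom words, added here so there is something to remove.
custom_words = {"Test", "Test2"}
STOP_WORDS |= custom_words
count_with_custom = len(STOP_WORDS)

# In-place difference: drops every word that appears in custom_words.
STOP_WORDS -= custom_words

print(count_with_custom, "->", len(STOP_WORDS))
```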

Here the count comes back to 312 and both Stop Words are removed.

I hope the library above has helped you understand the impact of Stop Words. So let us now go ahead with the famous NLTK library and try the same examples.

To install NLTK (Natural Language Toolkit), run the command below in Anaconda Prompt.

pip install nltk==3.4.4

If you are using NLTK for the first time, it will show the NLTK downloader screen; download all packages to continue.

If you try to print the Stop Words before the download has finished, NLTK shows an error message asking you to download the missing resource.

Let us first see how many Stop Words are already present in the NLTK library.

Copy-paste the code below into Jupyter Notebook.

As you can see in the output, we get 179 Stop Words from NLTK, whereas spaCy gave us 312.

Now let us try removing the Stop Words from our previous sentence and see how the output comes out.

Copy-paste the code below into Jupyter Notebook.

Let us take a few seconds to analyse what we got.

“movie” and “dinner” were also present in the output after spaCy removed the Stop Words, but here “We” and “go” are kept as well.

One reason is that spaCy's list contains more Stop Words than NLTK's, so its output is more trimmed: “go” is a Stop Word in spaCy but not in NLTK. “We” is a different case: NLTK's list does contain the lowercase “we”, so the capitalized “We” most likely survives because the comparison is case-sensitive.

We should choose our library carefully. Depending on the scenario we may need some Stop Words in our input, so we can switch between spaCy and NLTK as required.

At the time of writing, NLTK supports 11 languages and spaCy supports 8 languages.

NLTK is a very good library if you are working as a researcher, but if you are a developer then spaCy will give you all the latest tools and support.

Note: In case you face any error related to the “punkt” resource, run the code below in Jupyter Notebook. It takes about 2 minutes to execute and downloads everything needed for the tokenization used above.

nltk.download('punkt')

As of now NLTK has no dedicated method to add or remove Stop Words, but we can do the same explicitly with Python, because the Stop Words come back as a plain Python list. We can add either a single word or a list of words to the NLTK Stop Words, and everything else will work like a charm.

Below is the code to add a single word to the NLTK Stop Words list.

As you can see, we have successfully added a word.

But if we load the list again, the total will be 179 again, because each call returns a fresh copy of the original list.

STOP_WORDS = nltk.corpus.stopwords.words('english')

We can delete the previously added Stop Word from the list with the list's remove() method.

Below is the code.

If you want to add a list of words, use the code below.

In the above example we have passed a new list and extended our old Stop Words list.

You can also remove a list of words from the Stop Words list. Below is the code for the same.

I hope by now you have a decent idea of what Stop Words are and how we can manipulate them to get the desired results.

In case you want to check out the Jupyter Notebook, here is the link to my GitHub repo.

Tushar Srivastava

¿ʞuıɥʇ ǝuıɥɔɐɯ uɐƆ 3+ years of experience in applied Machine Learning, Deep Learning, Public Speaking, Programming Outreach.