A Few Words on Stop Words

Arald Jean-Charles · Published in CodeX · Aug 12, 2021

If you have ever used or heard of Siri, Alexa, Google Assistant, Cortana, or any other personal voice assistant, you have witnessed the emergence of some of the first exciting technologies to come out of Natural Language Processing (NLP). What is paradoxical is that, for this progress to come about, a split took place in the late 1980s to early 1990s between hardcore linguists on one hand and computer scientists, statisticians, and the like on the other (AI Magazine, Volume 18, Number 4, 1997; Elizabeth D. Liddy, "Natural Language Processing", Syracuse University; E. Kumar, Natural Language Processing). The latter school was making major breakthroughs in computational natural language processing, but it did so at a price: it brushed aside most of the classical linguistic approach in favor of a more "statistical/numerical" one. The more progress they made, the further they pushed away the acquisitions of "classical computational linguistics".

One quirk of the statistical approach is the concept of the highest-frequency words in languages such as English and French, the so-called "stop words". Stop words create a web (network) within which the "other words" reside. Per the statistical approach, these words carry very little meaning in themselves, and if included in an analysis they will drown out the other words that carry more "meaning". So, up to the era of Deep Learning (DL), transformers, and Bidirectional Encoder Representations from Transformers (BERT), the basic approach has been to indiscriminately clean out these "stop words".

As an aside, state-of-the-art work on the relationship between language and mind suggests that stop words are very stable in the languages where they exist (think of the last time you heard someone coin a new preposition or definite article; the same does not hold for non-stop words). In other disciplines, at the intersection of neuroscience, psychology, linguistics, and the study of the mind, stop words re-emerge under the name of closed-class words. There are specific parts of the brain involved in their production, and with very specific, localized brain damage you may lose the ability to use them. Also, languages that do not use many stop words make up for it with heavy declension and morphology, where one word is a whole complex sentence (see the lecture series "Language and the Mind" by Spencer Kelly, 2020). Let us repeat an example from Kelly here. Consider the word "Näïkìmlyìïà" in the Bantu language Kivunjo; it roughly translates to "he is eating it for her" (I have no idea what that means, nor do I understand the cultural context). "Näïkìmlyìïà" contains 8 morphemes. The verb stem is "lyì", which means to eat, and the word includes morphemes for tense, focus, mood, and gender agreement. There are 16 genders in Kivunjo (they are not necessarily related to biology; they seem to be a way of grouping things). The human brain is saying something fundamental here about how it constructs languages: either it uses stop words to the max, or it uses morphemes to the max. No human language escapes that kind of grouping. Since NLP is about getting machines to extract patterns out of human language, why would we abstract away something so fundamental to the way human beings think and construct languages? (By the way, can you imagine doing NLP in Kivunjo? How does one tokenize?)

Now that we have established the importance of stop words for the human brain, mind, and languages, let us do a thought experiment on sentiment orientation: take a subset of the English stop words from NLTK (a library that I love dearly) and run some basic utterances through it. The most primal use of language probably had to do with conveying sentiments, territoriality, and certain basic biological needs.

Here follows a subset from the NLTK stop words module:

['after', 'against', 'ain', 'am', 'aren', "aren't", 'but', 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", "don't", 'few', 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'i', 'isn', 'it', 'mightn', "mightn't", 'mustn', "mustn't", 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'out', "shan't", "shouldn't", 'under', 'up', 'very', 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", "wouldn't", "you're"]
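The subset above can be dropped into a minimal filter. Here is a sketch in Python; whitespace tokenization and lowercasing are simplifications (a real pipeline would use a proper tokenizer such as NLTK's `word_tokenize`), and the list is lowercased here because the filter lowercases its input:

```python
# A minimal sketch of classical stop-word removal, using the subset above.
STOP_WORDS = {
    'after', 'against', 'ain', 'am', 'aren', "aren't", 'but', 'couldn',
    "couldn't", 'didn', "didn't", 'doesn', "doesn't", "don't", 'few',
    'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'i', 'isn',
    'it', 'mightn', "mightn't", 'mustn', "mustn't", 'myself', 'needn',
    "needn't", 'no', 'nor', 'not', 'now', 'out', "shan't", "shouldn't",
    'under', 'up', 'very', 'wasn', "wasn't", 'weren', "weren't", 'won',
    "won't", "wouldn't", "you're",
}

def remove_stop_words(utterance: str) -> list[str]:
    """Lowercase, split on whitespace, and drop any stop words."""
    return [tok for tok in utterance.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("I am happy"))  # ['happy']
```

The full list itself ships with NLTK as `nltk.corpus.stopwords.words('english')`, after a one-time `nltk.download('stopwords')`.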

Most of the words shown here have to do with some form of negation, so the concept of "no" is central (can you imagine the "no-phase" of childhood development without the word "no"? There goes human intelligence! A child cannot create an identity boundary without it).

To continue with our experiment, I utter the phrase "I am happy". By the time you finish cleaning my sentence with a classical NLP pipeline and the stop words above, I am left with just the word "happy": "I" is a stop word and "am" is a stop word, so they are both gone. We can all agree that "happy" has a positive orientation; never mind that I no longer have a clue about who is "happy": me, my cat, my son, or my dog? (I always know when my cat is not happy with me; he either scratches me or bites me. I have to respond quickly, lest he achieve alpha dominance.)

Let us move on with our thought experiment and utter "I am not happy". For any native speaker of English (which I happen not to be), this sentence conveys a negative sentiment; even my 13-year-old son (born here in the USA) would agree on that point. Now let us do the usual cleaning with the NLTK stop words above: "I" goes away, "am" goes away, "not" goes away, and I am left with "happy". Native speaker or not, this is not the sentiment I am trying to convey. The arbitrary removal of stop words (including the negation) took me from an unhappy state to a happy one (so much for Valium and other happy pills)! If you are "not happy", just run your mind through a stop word grinder and you will always end up "happy". As a matter of fact, you might not even know who is "happy".
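The two utterances collapse to the same output, which a self-contained sketch makes plain (the small set below stands in for the NLTK subset quoted earlier):

```python
# Indiscriminate stop-word removal erases negation entirely.
STOP_WORDS = {'i', 'am', 'not'}

def clean(utterance: str) -> list[str]:
    return [t for t in utterance.lower().split() if t not in STOP_WORDS]

print(clean("I am happy"))      # ['happy']
print(clean("I am not happy"))  # ['happy'] -- opposite sentiments, identical output
```

Any downstream sentiment model sees identical input for the two opposite statements, so no amount of modeling can recover the difference.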

If you ask me, this is nothing I can be "happy" about. So please keep your stop words away from me. For a short while, I would like to contemplate my "not happy"-ness and be certain that I, and no one else, am the one doing the "not happy".

Let us do another experiment. My son is doing well in school and brings back an A+ in math on his report card. So, being "very happy" (no stop words, please!), I yell out to my son, "You're the man!". Holy cow! I forgot that I had started my precocious and nerdy son on an NLP course, and he has just learned about stop words. He cleans up my sentence with the NLTK stop word filter, and all he hears is "man"; even the exclamation point did not survive! As a matter of fact, when he learns about the frequency filter that removes low-frequency words, the word "man", which I uttered only once, will be gone too. I end up with silence. Anyway, you get the idea.
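A minimum-frequency filter of the kind mentioned above can be sketched with `collections.Counter` (assuming "You're" and "the" have already been removed as stop words and punctuation has been stripped, leaving only `['man']`):

```python
from collections import Counter

def frequency_filter(tokens: list[str], min_count: int = 2) -> list[str]:
    """Drop tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# "You're the man!" reduced to ['man'] by stop-word removal; the word
# occurs only once, so the frequency filter silences the sentence entirely.
print(frequency_filter(['man']))         # []
print(frequency_filter(['man', 'man']))  # ['man', 'man']
```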

So where does that leave us? Should we be filtering out stop words or not? The answer is... yes, but judiciously, guided by a semantic light. But that is another story for another time.
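To give one hypothetical flavor of "judiciously": exempt sentiment-bearing function words, such as negators and intensifiers, from removal. The `KEEP` set below is an illustrative choice of mine, not an NLTK feature:

```python
# A hypothetical "judicious" filter: stop words are removed, except for
# negators and intensifiers, which carry sentiment and survive.
STOP_WORDS = {'i', 'am', 'the', 'it', 'not', 'no', 'nor', 'very'}
KEEP = {'not', 'no', 'nor', 'very'}  # illustrative exemptions

def judicious_clean(utterance: str) -> list[str]:
    return [t for t in utterance.lower().split()
            if t not in STOP_WORDS or t in KEEP]

print(judicious_clean("I am not happy"))   # ['not', 'happy']
print(judicious_clean("I am very happy"))  # ['very', 'happy']
```

Now "I am happy" and "I am not happy" no longer collapse to the same tokens, and the negation survives to inform any downstream sentiment analysis.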


I am Manager for Data Enablement Services and IT Finances at KPMG. I am interested in NLP, and I speak Kreyol, French, English, and Spanish.