What Are “Stopwords” Precisely…?
An interesting point about the relationship between stop words, word frequency, document frequency, and high/low document-frequency cut-offs comes from the parameter descriptions in sklearn for the two vectorizer classes, TfidfVectorizer & CountVectorizer (more details below).
First: max_df & min_df: Document frequency vs. word frequency
or
“What does ‘df’ mean in these parameters?”
Document frequency is how many documents a word occurs in, not how many times it occurs across the corpus-of-all-texts as one massive block. E.g. if “coffee” occurs in almost every document (high document frequency), and “koffffeeee!” occurs in only one document (low document frequency), both have extreme document frequencies and can (therefore) be ignored. Ignored…like “stop words”?
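To make the distinction concrete, here is a minimal sketch in plain Python (the four-line toy corpus is invented for illustration): word frequency counts total occurrences across the whole corpus, while document frequency counts how many documents contain the word at least once.

```python
from collections import Counter

# Toy corpus invented for illustration: four short "documents".
docs = [
    "I love coffee in the morning",
    "coffee coffee coffee all day",
    "the morning is for coffee",
    "koffffeeee! is how I typed it once",
]

tokenized = [doc.lower().split() for doc in docs]

# Word (term) frequency: total occurrences across the whole corpus.
word_freq = Counter(word for doc in tokenized for word in doc)

# Document frequency: number of documents containing the word at least once.
doc_freq = Counter(word for doc in tokenized for word in set(doc))

print(word_freq["coffee"])        # 5 -> high word frequency
print(doc_freq["coffee"])         # 3 -> appears in 3 of 4 documents (high df)
print(doc_freq["koffffeeee!"])    # 1 -> appears in only 1 document (low df)
```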
“Stop Words”
At first glance, cutting off the top and bottom extremes of document frequency may seem like just ‘another way to approach or define or think about stop words,’ and loosely speaking this may be the case, but the details prove more interesting.
For example: if extreme-document-frequency removal were literally another way to do “stop word removal,” then the recommended or documented cut-offs would be described as inclusive of stop words.
But, oddly, these parameters (max_df & min_df, both document-frequency thresholds) are described as something set apart from ‘stop words’ (in other words, they are defined by not including stop words).
On the one hand this may seem counter-intuitive: if “the” is the model stop word, how could the percentage cut-off be set higher than the most common words? Would that not result in zero percent being removed?
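The resolution, at least per the sklearn parameter description quoted at the end of this section, is that a float max_df is a proportion of documents, not a share of the total word count. So if “the” appears in essentially every document, any max_df below 1.0 will catch it. A minimal sketch (toy corpus invented for illustration) of that behavior:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus invented for illustration; "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird watched the dog",
    "a fish ignored the bird",
]

# max_df=0.9: ignore terms whose document frequency is strictly above 90% of documents.
# "the" appears in 4 of 4 documents (100%), so it is dropped from the vocabulary.
vec = CountVectorizer(max_df=0.9)
vec.fit(docs)

print("the" in vec.vocabulary_)   # False -> filtered out as a "corpus-specific stop word"
print(sorted(vec.vocabulary_))    # the remaining, less ubiquitous terms
```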
So what might the specific differences be between “stop words” and “extreme document frequency words”?
Possible Differences:
1. The parameter description refers to “corpus-specific stop words,” but that phrase is not clearly inclusive or exclusive of general stop words. So the set of “corpus-specific stop words” may be the whole set of stop words for that corpus-of-all-texts (including words such as “the”), or it may be only those terms specific to that corpus (excluding words such as “the”).
2. Stop-word lists are more inclusive than we might at first think, and include some less common but ‘not-contributing-to-meaning’ words, as in the full spaCy list. “The” may be the great vanilla example, but rarer terms like “henceforth” are also on the list.
3. Another factor is TF-IDF’s emphasis on punishing high document frequency through the IDF term…but then the same parameters appear in the plain word-count class as well, which does not down-weight high document frequency at all (see the comparison sketch after this list).
4. Sweeping up the High Frequency Crumbs
It may be that these parameters are specifically meant to supplement, not replace, the other steps (lest someone skip stop-word removal thinking the min/max df removal was an alternative). If there are still stray high-frequency words not covered by the stop-word list, this filter will remove them. (But it still raises the question: are these really not stop words by definition, even if it took a subsequent step to remove them? Didn’t we use frequency to supplement the list of “stop words” previously?)
5. How Low Can You Go
The low-frequency filter may be a clearer difference, as ‘stop words’ tend not to be exceedingly infrequent. This also touches on the choice of not running a spell checker and changing the document, but rather sweeping up rare, unique, misspelled words (though in the case of text written on cell phones…might spelling errors not be very common?). A small min_df sketch follows the parameter descriptions below.
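On point 3 above: here is a minimal sketch (toy corpus invented for illustration) of how TfidfVectorizer already down-weights ubiquitous terms through its fitted idf_ values, while CountVectorizer leaves them as the largest raw counts, which is part of why the shared max_df/min_df filters read differently in the two classes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for illustration; "the" is in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird watched the dog",
]

# TfidfVectorizer already penalizes high document frequency through the IDF term:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1 with the default smooth_idf=True.
tfidf = TfidfVectorizer()
tfidf.fit(docs)
for term, idx in sorted(tfidf.vocabulary_.items()):
    print(f"{term:8s} idf = {tfidf.idf_[idx]:.3f}")   # "the" gets the lowest idf

# CountVectorizer applies no such down-weighting: raw counts only,
# so max_df / min_df are its only frequency-based filters.
counts = CountVectorizer()
X = counts.fit_transform(docs)
print(X.toarray().sum(axis=0))   # "the" simply has the largest total count
```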
Here is the text of the parameter descriptions; as far as I can tell they are identical in both classes:
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.