NLP: How to Find the Most Distinctive Words When Comparing Two Different Classes of Text?

Dmitriy Fisch
6 min read · May 15, 2022


What follows is just a little NLP data science exercise, but it seems to be both instructive and entertaining. Here’s a simple problem:

We want to analyze some positive and negative product reviews on Amazon (why not?) and find the most important distinctive keywords that set those reviews apart. No doubt, bad reviews should contain words like "terrible", "horrible", and "disappointing", while good ones will have "great", "wonderful", and "perfect". All we have to do is identify such words in all our reviews and list them for each sentiment class. What should our approach be?

A common method is to identify the most frequent words for each class of text. After counting how many times each word is used in each class, we can hope that the most common "positive" words will be very different from the most common "negative" ones.

To prepare for this exercise, the word frequency counts have already been created. Both distributions (one per class) are stored in two dictionaries. All we have to do is select the most common words from each dictionary and see if that works.
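To make the setup concrete, here is a minimal sketch of how such frequency dictionaries could be built with `collections.Counter`. The toy token lists below are stand-ins of my own for the real preprocessed review data:

```python
from collections import Counter

# Toy tokens standing in for the preprocessed Amazon review words.
positive_tokens = ["great", "taste", "love", "coffee", "good", "taste", "great"]
negative_tokens = ["taste", "coffee", "bad", "taste", "stale"]

# One frequency dictionary per sentiment class.
positive_counts = Counter(positive_tokens)
negative_counts = Counter(negative_tokens)

print(positive_counts.most_common(3))  # most frequent "positive" words
print(negative_counts.most_common(3))  # most frequent "negative" words
```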

Data:

  • Words are extracted from 17,000 positive and 3,000 negative Amazon reviews (these are Amazon Food reviews, by the way).
  • All words are already converted to a base/dictionary form. For example: “thought” changed to “think”, “helped” changed to “help”, etc.
  • All words are at least 3 letters long
  • No common words were removed

First Attempt:

The words do not seem to differ much between positive and negative reviews. Let's create a function for better visualizations:
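The original visualization function is not reproduced here; as a rough stand-in (the function name and layout are my own), here is a sketch that lists the top words of both classes side by side:

```python
from collections import Counter

def top_words_table(pos_counts, neg_counts, n=10):
    """Print the top-n words of each class side by side and return the rows."""
    rows = list(zip(pos_counts.most_common(n), neg_counts.most_common(n)))
    print(f"{'POSITIVE':<22}NEGATIVE")
    for (pw, pc), (nw, nc) in rows:
        print(f"{pw} ({pc})".ljust(22) + f"{nw} ({nc})")
    return rows

# Hypothetical counts for illustration
pos = Counter({"good": 5, "taste": 4, "love": 3})
neg = Counter({"taste": 6, "bad": 2, "stale": 1})
top_words_table(pos, neg, n=3)
```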

We cannot say for sure which class is which. There are too many common words in both groups. Let's check how many words both classes share:
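A quick way to measure the overlap, assuming the two Counter-based frequency dictionaries from before (the helper name and toy data are my own):

```python
from collections import Counter

def shared_top_words(pos_counts, neg_counts, n=20):
    """Return the words that appear in the top-n of both classes."""
    top_pos = {w for w, _ in pos_counts.most_common(n)}
    top_neg = {w for w, _ in neg_counts.most_common(n)}
    return top_pos & top_neg

pos = Counter({"good": 5, "taste": 4, "coffee": 3})
neg = Counter({"taste": 6, "coffee": 2, "stale": 1})
print(shared_top_words(pos, neg, n=3))  # words common to both top lists
```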

16 out of 20 words are the same!

Second Attempt:

Let's see if dropping some of the common words ("stop words") helps us get better results.
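One way to do this, sketched with a small illustrative stop word set (in practice a full list, e.g. NLTK's, would be used):

```python
from collections import Counter

# A tiny illustrative stop word set; a real one would be much longer.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "have", "was"}

def drop_stop_words(counts):
    """Remove stop words from a frequency dictionary."""
    return Counter({w: c for w, c in counts.items() if w not in STOP_WORDS})

counts = Counter({"the": 50, "good": 5, "with": 20, "taste": 4})
print(drop_stop_words(counts).most_common(2))  # [('good', 5), ('taste', 4)]
```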

That’s what we’ve got:

After dropping stop words, the positive keywords start looking better ("good", "great", "love" are noticeable). The issue is with the negative reviews. A glance at the negative review word cloud immediately shows that those reviews are related to food, but their sentiment is still not obvious. Most negative-review words, with a few exceptions, look rather neutral.

Third Attempt:

Let's try something else: since short words are more common, let's see what we get if we keep only the longer words, filtering out the shorter ones.
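A sketch of such a filter over the frequency dictionaries (the helper name is my own):

```python
from collections import Counter

def filter_by_length(counts, min_len):
    """Keep only words of at least min_len letters."""
    return Counter({w: c for w, c in counts.items() if len(w) >= min_len})

counts = Counter({"good": 10, "perfect": 4, "disappointing": 2})
print(filter_by_length(counts, 6).most_common())
# [('perfect', 4), ('disappointing', 2)]
```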

Here is the visualization for words of 6 letters or longer:

And this is for words of 8 letters or longer:

Only with 8+ letter words do we finally start getting some hints that a few keywords in the negative word cloud might originate from a not-so-satisfied customer. But even now those words are not among the most frequently used. We see another set of dominating words ("purchase", "different", "chocolate", "ingredient", etc.) that is shared by both classes.

We could keep removing more and more words that appear frequently in both bad and good reviews until we get better results. But we would lose many words in the process: as soon as one set of common words is removed, the next common set emerges. The process is manual and not convincing. A modified strategy is to identify and remove words that appear in more than a certain percentage of all the reviews. However, there is a risk that truly positive or negative words would be removed because they are too popular. We also saw that 70–80% of the most common words for each class match, so once again we might have to get rid of too many words.

We can conclude that simply comparing the dominant words in each class does not work that well, and we have to find something else.

Different approach:

The strategy I figured out while working on one of my NLP projects is more elegant and rather simple:

To start with, we do not remove any words at all, not even the stop words. Instead, we subtract each word's frequency in one class (say, negative) from its frequency in the other (say, positive). Differences greater than 0 are assigned to the positive review class; absolute values of less-than-0 differences are assigned to the negative review class. For example, if the word "coffee" is used 100 times in all the bad reviews and 150 times in the good reviews, the difference is assigned to the positive class: positive_dictionary['coffee'] = 50. As a result, "coffee" is removed from the negative class.
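The subtraction step can be sketched as follows (the function name is my own):

```python
def dictionary_differences(pos_counts, neg_counts):
    """Split raw count differences into positive- and negative-class dictionaries."""
    pos_diff, neg_diff = {}, {}
    for word in set(pos_counts) | set(neg_counts):
        diff = pos_counts.get(word, 0) - neg_counts.get(word, 0)
        if diff > 0:
            pos_diff[word] = diff
        elif diff < 0:
            neg_diff[word] = -diff  # store the absolute value
    return pos_diff, neg_diff

# The "coffee" example from the text: 150 uses in good reviews, 100 in bad ones.
pos_diff, neg_diff = dictionary_differences({"coffee": 150}, {"coffee": 100, "stale": 7})
print(pos_diff)  # {'coffee': 50}
print(neg_diff)  # {'stale': 7}
```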

There is a catch, though. As is, this method would work great if we had the same number of good and bad reviews. But 85 percent of our reviews are positive (17,000) and only 15 percent (3,000) are negative. The counts of even not-so-popular words in the prevailing positive class might still be greater than the counts of the most frequent words in the smaller "bad reviews" class.

The key is to "normalize" the frequency counts. Rather than counting the word "coffee" 150 and 100 times respectively, we count it as 150/(sum of all good review words) for positive reviews and 100/(sum of all bad review words) for negative ones. And then we can try our "differences" approach. As a result, words that are similarly common in both classes will come close to canceling each other out. If a word is extremely popular in one class and only moderately popular in the other, it will stay only in the class where it is really important. Let's try.
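Adding the normalization step, the sketch becomes (function name and toy numbers are my own):

```python
def normalized_differences(pos_counts, neg_counts):
    """Difference of relative (normalized) frequencies between the two classes."""
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    pos_diff, neg_diff = {}, {}
    for word in set(pos_counts) | set(neg_counts):
        diff = pos_counts.get(word, 0) / pos_total - neg_counts.get(word, 0) / neg_total
        if diff > 0:
            pos_diff[word] = diff
        elif diff < 0:
            neg_diff[word] = -diff
    return pos_diff, neg_diff

# "coffee" is 150 of 200 positive words (0.75) but 100 of 100 negative words (1.0),
# so after normalization it lands in the negative class.
pos_diff, neg_diff = normalized_differences({"coffee": 150, "great": 50}, {"coffee": 100})
```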

Final Attempt:

Let’s check out our visualization:

Right away, without removing or filtering any words, we have much better results. Each class received a unique set of words, and the key content of the negative reviews is clear. Positive reviews are full of praise, while the key sentiment for negatives is "not", with problems related to products, orders, and ingredients, and keywords like "bad" and "disappointed".

Based on our code to generate dictionary differences, let's create a function. We'll also add a length parameter to select the minimum length of a word. This way we can play with various lengths using the dictionary of differences and see how it works:
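Such a function might look like this (the name, and the choice to normalize over all words before filtering by length, are my own assumptions):

```python
def difference_dictionaries(pos_counts, neg_counts, length=1):
    """Normalized frequency differences, keeping only words of at least `length` letters."""
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    pos_diff, neg_diff = {}, {}
    for word in set(pos_counts) | set(neg_counts):
        if len(word) < length:
            continue  # skip words shorter than the requested minimum
        diff = pos_counts.get(word, 0) / pos_total - neg_counts.get(word, 0) / neg_total
        if diff > 0:
            pos_diff[word] = diff
        elif diff < 0:
            neg_diff[word] = -diff
    return pos_diff, neg_diff

pos_diff, neg_diff = difference_dictionaries(
    {"coffee": 150, "wonderful": 40, "good": 10},
    {"coffee": 100, "stale": 30}, length=6)
```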

Let's see the visualization of 6+ letter words using dictionary differences:

And now, the 8+ letter words visualization using dictionary differences:

Regardless of the word length, the results are convincing. We can actually understand a lot from those visualizations. To wrap up, here is a prettier visualization of the same dictionaries:

I might create another post looking at ways to use images for word cloud visualizations.
