Sentiment Analysis — vaderSentiment Library

Baban Deep Singh · Published in The Startup · Jul 8, 2020

With the advancement of technology, we are (rather slowly) moving from the industrial sector to the service sector. This makes it an essential part of business to continuously review what we offer our customers and how they perceive our products. Moreover, everyone with access to the internet is free to praise or criticize a service by voicing their opinion.

To build a better experience for customers or clients, it becomes inevitable for an organization to analyze this feedback. Reading a surplus of reviews, testimonials, etc. by hand is arduous. How is the service doing? How are products performing in the market? How are customers reacting to a particular product? What is consumer sentiment across products? Many more questions like these can be answered using sentiment analysis.

In this article we will cover:

WordClouds

Sentiment Analysis

About the dataset:

# Getting the data
# importing All the necessary libraries for working with the data
import numpy as np # For numerical processing
import pandas as pd # working with the dataframes
import matplotlib.pyplot as plt # nice looking plots
%matplotlib inline
# Read the data
df = pd.read_csv('Reviews.csv')
df.info()
Output of df.info()
  • I am using the Amazon Fine Food Reviews dataset, which is publicly available on Kaggle.
  • The attributes we have here are ProductId, UserId, Score (which loosely translates into a star rating), Summary, Text (the review body), etc.
  • We observe that there are missing values in the ProfileName and Summary attributes.
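Counting missing values per column is a quick way to verify that observation. Here is a minimal sketch on a toy frame with the same column names; on the real data you would call df.isnull().sum() directly:

```python
# Count missing values per column; a toy frame with the same column
# names stands in for the real Reviews.csv here.
import pandas as pd

toy = pd.DataFrame({
    'ProfileName': ['alice', None, 'carol'],
    'Summary': ['great', 'ok', None],
    'Text': ['loved it', 'fine', 'meh'],
})
print(toy.isnull().sum())  # one missing ProfileName, one missing Summary
```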

The natural language processing starts here, where we load the required libraries and work with the text.

# Import libraries
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob, Word

# nltk.download('stopwords')  # run once if the corpus is not yet installed

Stop words

Stop words are a set of commonly used words in any language, not just English. They are critical to many applications because, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. Examples of minimal stop word lists that you can use:

Determiners — Determiners mark nouns; a determiner is usually followed by a noun
examples: the, a, an, another

Coordinating conjunctions — Coordinating conjunctions connect words, phrases, and clauses
examples: for, and, nor, but, or, yet, so

Prepositions — Prepositions express temporal or spatial relations
examples: in, under, towards, before
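To make the idea concrete, here is a minimal sketch of stop word removal using a tiny hand-written list (the code later in the article uses NLTK's full English list instead):

```python
# A tiny hand-written stop word list; a stand-in for nltk's English list
stop_words = {'the', 'a', 'an', 'for', 'and', 'nor', 'but', 'or', 'in', 'under'}

sentence = "the coffee under the tree was great but the tea was not"
filtered = [w for w in sentence.split() if w not in stop_words]
print(" ".join(filtered))  # coffee tree was great tea was not
```

Removing the determiners, conjunctions, and prepositions leaves only the words that carry meaning for analysis.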

TextBlob

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

After importing the libraries, we remove stop words and punctuation, because they can heavily influence the data and distort the results. Additionally, we convert all the words to the same case, i.e. lower case.

# Lower casing
df['Text'] = df['Text'].str.lower()
# Removing punctuation (regex=True is needed in recent pandas versions)
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)
# Removing stop words
stop = set(stopwords.words('english'))
df['Text'] = df['Text'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
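To see what these three steps do, here is a tiny worked example on a two-row toy frame (the column name Text matches the article; the stop word set is a small stand-in for NLTK's list):

```python
# The same three preprocessing steps applied to a two-row toy frame
import pandas as pd

demo = pd.DataFrame({'Text': ["This COFFEE is GREAT!!!", "Not the best, but OK."]})
stop = {'is', 'the', 'but', 'not'}  # stand-in for nltk's English list

demo['Text'] = demo['Text'].str.lower()                             # lower casing
demo['Text'] = demo['Text'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
demo['Text'] = demo['Text'].apply(
    lambda s: " ".join(w for w in s.split() if w not in stop))       # drop stop words
print(demo['Text'].tolist())  # ['this coffee great', 'best ok']
```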

Exploratory Data Analysis

# Create a new data frame "reviews" to perform exploratory data analysis on
reviews = df.copy()  # .copy() so that the operations below do not mutate df
# Dropping null values
reviews.dropna(inplace=True)
reviews.Score.hist(bins=5, grid=False)
plt.show()
print(reviews.groupby('Score').count().Id)

We can observe that the data is highly unbalanced towards the higher ratings.

Does this mean that the products are performing well? Does it mean people like the service provided by Amazon's fine foods? Or does it mean people simply give 5 stars to whatever they receive?

So why sentiment analysis? Right?

One potential reason for the discrepancy between explicit ratings and scores extracted from open-ended comments may be that people tend to use more neutral language when expressing their opinions in natural language. If that is the case, then to be compatible with star ratings, sentiment analysis techniques need to be sensitive to the subtleties of natural language expression. This, of course, is a significant challenge.

Also, since the data is imbalanced, let's sample each class down to the minimum count among the star ratings.

# To balance the data, we sample each score down to the lowest count from
# above (i.e. 29743 reviews with Score 2); random_state keeps it reproducible
score_1 = reviews[reviews['Score'] == 1].sample(n=29743, random_state=42)
score_2 = reviews[reviews['Score'] == 2].sample(n=29743, random_state=42)
score_3 = reviews[reviews['Score'] == 3].sample(n=29743, random_state=42)
score_4 = reviews[reviews['Score'] == 4].sample(n=29743, random_state=42)
score_5 = reviews[reviews['Score'] == 5].sample(n=29743, random_state=42)
# Here we recreate a 'balanced' dataset.
reviews_sample = pd.concat([score_1,score_2,score_3,score_4,score_5],axis=0)
reviews_sample.reset_index(drop=True,inplace=True)
# Printing count by 'Score' to check dataset is now balanced.
print(reviews_sample.groupby('Score').count().Id)

WordClouds

A tag cloud (word cloud or wordle or weighted list in visual design) is a novelty visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color.

# Let's build a word cloud looking at the 'Summary' text
from wordcloud import WordCloud
from wordcloud import STOPWORDS
# Wordcloud function's input needs to be a single string of text.
# Here I'm concatenating all Summaries into a single string.
# You can build one for the Text column in the same way
reviews_str = reviews_sample.Summary.str.cat()
wordcloud = WordCloud(background_color='white').generate(reviews_str)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()

Wow! It looks like a promising word cloud. People seem to like the fine foods: words like good, taste, flavor and great stand out. These summary words are a great starting point to dig deeper and check which words correspond to positive and which to negative reviews.

# Now let's split the data into Negative (Score 1 or 2) and Positive (Score 4 or 5) reviews.
negative_reviews = reviews_sample[reviews_sample['Score'].isin([1,2]) ]
positive_reviews = reviews_sample[reviews_sample['Score'].isin([4,5]) ]
# Transform to single string
negative_reviews_str = negative_reviews.Summary.str.cat()
positive_reviews_str = positive_reviews.Summary.str.cat()
# Create wordclouds
wordcloud_negative = WordCloud(background_color='white').generate(negative_reviews_str)
wordcloud_positive = WordCloud(background_color='white').generate(positive_reviews_str)
# Plot
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(211)
ax1.imshow(wordcloud_negative,interpolation='bilinear')
ax1.axis("off")
ax1.set_title('Reviews with Negative Scores',fontsize=20)
ax2 = fig.add_subplot(212)
ax2.imshow(wordcloud_positive,interpolation='bilinear')
ax2.axis("off")
ax2.set_title('Reviews with Positive Scores',fontsize=20)
plt.show()

We can see that many words appear in both the positive and the negative word clouds.

Sentiment Analysis

# Importing the additional libraries for sentiment analysis
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

plt.style.use('fivethirtyeight')
cp = sns.color_palette()
# VADER's sentiment intensity analyzer
analyzer = SentimentIntensityAnalyzer()

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews. This is because VADER not only tells about the Positivity and Negativity score but also tells us about how positive or negative a sentiment is.

VADER has a lot of advantages over traditional methods of Sentiment Analysis, including:

  • It works exceedingly well on social media type text, yet readily generalizes to multiple domains
  • It doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon
  • It is fast enough to be used online with streaming data, and
  • It does not severely suffer from a speed-performance tradeoff.
# Generating a sentiment score for every review in the dataset
emptyline = []
for row in df['Text']:
    vs = analyzer.polarity_scores(row)
    emptyline.append(vs)


# Creating new dataframe with sentiments
df_sentiments=pd.DataFrame(emptyline)
df_sentiments.head(5)

The pos, neg and neu scores represent the proportion of the text that falls into each category.

The compound score sums all the lexicon ratings and normalizes the sum to between -1 (most extreme negative) and +1 (most extreme positive). This means the first review is highly positive (0.94), while the second entry leans slightly negative.
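Under the hood, VADER squashes the raw summed valence into [-1, 1] with a normalization of the form below; alpha=15 is the constant used in the vaderSentiment source. A pure-Python sketch:

```python
# VADER-style normalization: maps an unbounded summed valence score
# into the interval [-1, 1] (alpha=15 as in the vaderSentiment source)
import math

def normalize(score, alpha=15):
    return score / math.sqrt(score * score + alpha)

print(round(normalize(4.0), 4))   # strongly positive raw sum -> 0.7184
print(normalize(-1.0))            # mildly negative raw sum -> -0.25
```

Large raw sums approach but never reach ±1, which is why even very enthusiastic reviews top out around 0.9-something.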

# Merging the sentiments back to reviews dataframe
df_c = pd.concat([df.reset_index(drop=True), df_sentiments], axis=1)
# Convert scores into positive and negative sentiments using some threshold
df_c['Sentiment'] = np.where(df_c['compound'] >= 0 , 'Positive','Negative')
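The snippet above uses a single cutoff at 0; the VADER authors' documentation suggests a three-way split at ±0.05 instead, which can be sketched like this (the scores frame is a hypothetical stand-in for df_c):

```python
# Three-way sentiment labels using the +/-0.05 convention from the
# VADER documentation; 'scores' is a toy stand-in for df_c
import numpy as np
import pandas as pd

scores = pd.DataFrame({'compound': [0.94, -0.02, -0.40]})
scores['Sentiment'] = np.select(
    [scores['compound'] >= 0.05, scores['compound'] <= -0.05],
    ['Positive', 'Negative'],
    default='Neutral',
)
print(scores['Sentiment'].tolist())  # ['Positive', 'Neutral', 'Negative']
```

This keeps borderline reviews out of both camps instead of calling a compound of exactly 0 "Positive".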

Finally, let's check the overall sentiment distribution that VADER produces.

result = df_c['Sentiment'].value_counts()
result.plot(kind='bar')
plt.show()

Conclusion

We can now be fairly confident that the fine foods are doing well: many people like the products, and comparatively few dislike the products and service offered by Amazon's Fine Foods.

One can also mine the word cloud of the negative reviews to understand which words (service- or product-related) signal distress among customers, and work on those areas to increase customer engagement.

For instance, the words taste, flavor, coffee and dog (interesting!) merit a closer look. Fascinatingly, the word "dog" appears twice, which suggests the dog food section needs attention to keep its four-legged customers happy.

