Sentiment Analysis with VADER- Label the Unlabelled Data

Sandeep Panchal

Published in Analytics Vidhya

Mar 6, 2020

Source link: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.kdnuggets.com%2F2018%2F03%2F5-things-sentiment-analysis-classification.html&psig=AOvVaw3UYgjYoiej2XtJwUQML0sT&ust=1583602610275000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCODUhs6xhugCFQAAAAAdAAAAABAk

My warm welcome to all the readers!

1) Short Note on Sentiment Analysis:

Before diving into VADER, let us first understand what ‘Sentiment Analysis’ is. Sentiment Analysis, also known as ‘Opinion Mining’, is the analysis of a text (a word, a sentence, or a document) to classify it as ‘positive’, ‘negative’, or ‘neutral’. Some analysts might not take the label ‘neutral’ into consideration, as this depends on the business requirements. Sentiment analysis is used to score the sentiment of movie reviews, food reviews, speech reviews, etc.

(Note: In some cases, movie reviews, food reviews, etc., might not have direct positive and negative labels. They might instead have ratings like 1, 2, 3, 4, 5…, that define the level of positivity or negativity of a text review. Based on the business requirements and for easier analysis, the analyst might choose which range of ratings to label as positive and which as negative. For example: if we have ratings 1, 2, 3, 4, 5, where 5 is the best, the analyst might label a review as negative if it has a rating of 1, 2, or 3, and positive otherwise.)
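The rating-to-label rule described in the note can be sketched in a few lines of Python. This is only an illustration: the function name and the choice of 4 and 5 as the positive ratings are assumptions (higher ratings are taken to mean a better review), and the set should be changed to match your own business rule.

```python
def label_from_rating(rating, positive_ratings=(4, 5)):
    """Map a numeric rating to a 'positive' or 'negative' label.

    `positive_ratings` is an assumption made for illustration; pass the
    set of ratings that your own business rule treats as positive.
    """
    return "positive" if rating in positive_ratings else "negative"


ratings = [1, 2, 3, 4, 5]
labels = [label_from_rating(r) for r in ratings]

# Show each rating next to the label it receives under this rule.
for rating, label in zip(ratings, labels):
    print(rating, label)
```

Passing a different set, for example `positive_ratings=(3, 4, 5)`, moves the boundary without touching the rest of the code.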

2) VADER:

VADER stands for ‘Valence Aware Dictionary and sEntiment Reasoner’. (Note: in the spelling ‘sEntiment’, the first letter ‘s’ is lowercase and the second letter ‘E’ is capital; this is intentional.) VADER is a lexicon and rule-based sentiment analysis tool used to analyze the sentiment of a text. A lexicon is a list of lexical features (words) that are labeled as positive or negative according to their semantic meaning. Even unlabelled text data can be labeled with the VADER sentiment analyzer.
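To make the idea of a lexicon concrete, here is a toy sketch of a lexicon lookup. It is not VADER's actual algorithm: the words and scores in `TOY_LEXICON` are made up for illustration, and the real tool also applies rules (for negation, punctuation, capitalization, and intensifiers, among others) on top of its much larger lexicon.

```python
# Hypothetical mini-lexicon: these words and scores are invented for
# illustration and are NOT taken from the real VADER lexicon.
TOY_LEXICON = {"amazing": 3.0, "good": 2.0, "well": 1.0, "bad": -2.5}


def toy_valence(text):
    """Sum the lexicon scores of the words in `text` (unknown words count as 0)."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


print(toy_valence("It was amazing!"))    # 3.0
print(toy_valence("2nd half was bad."))  # -2.5
```

A positive total suggests positive text and a negative total suggests negative text; VADER refines this basic idea with its rules and a normalised final score.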

3) Sentiment Analysis with VADER:

3.1) Installation of VADER Sentiment Analyzer:

Open the command prompt or Anaconda prompt and type ‘pip install vaderSentiment’.

Or, in a local Jupyter notebook or any other notebook you are using, type ‘!pip install vaderSentiment’ and run the cell. (Note: if the package is already installed, pip reports ‘Requirement already satisfied….’) Since the coding part imports VADER through nltk, nltk must be installed as well (‘pip install nltk’).

3.2) Coding part:

Example — 1

We first need to import nltk (Natural Language Toolkit) and then download the ‘vader_lexicon’ resource.

Next, we import the VADER sentiment analyzer from nltk and create an instance of it.

Now let us check how the VADER sentiment analyzer works with a few examples. I am analyzing the text ‘I went to the movie, yesterday. It was amazing! Everyone acted well.’

The ‘polarity_scores(text)’ method is used to analyze the text data. For the above text, it returns:

{'neg': 0.0, 'neu': 0.592, 'pos': 0.408, 'compound': 0.7345}
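The steps described above (downloading the lexicon, creating the analyzer, and calling polarity_scores) can be sketched as follows. The helper names `build_analyzer` and `score_text` are my own and not part of nltk; the nltk calls themselves (`nltk.download`, `SentimentIntensityAnalyzer`, `polarity_scores`) are the ones named in the text. The imports are kept inside the function so that the snippet can be pasted into a module before nltk is installed.

```python
def build_analyzer():
    """Download the VADER lexicon (if needed) and return an analyzer instance."""
    # Local imports: defining this function does not require nltk yet.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # One-time download; later calls find the lexicon already present.
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score_text(text, analyzer=None):
    """Return the score dictionary that polarity_scores() produces for `text`."""
    if analyzer is None:
        analyzer = build_analyzer()
    return analyzer.polarity_scores(text)


# Usage (needs nltk installed and network access for the first download):
# text = "I went to the movie, yesterday. It was amazing! Everyone acted well."
# print(score_text(text))
```

When labeling many texts, it is cheaper to call `build_analyzer()` once and pass the same instance to `score_text` each time.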

From the above output, we can see that the sentiment scores are returned as a dictionary in which each key is a label and each value is the corresponding sentiment score. Here, neg is negative, neu is neutral, and pos is positive. The fourth key, ‘compound’, holds a single overall score for the text. In most cases, the compound score is compared against a threshold value to label the text data. We can even change the threshold value to suit our needs. This depends on business requirements and domain knowledge.

Below are the standard thresholds followed by most analysts.
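As a small illustration of how this dictionary can be read in code, the sketch below takes the output shown above and separates the three proportion scores from the compound score. As far as I know, neg, neu, and pos are proportions that add up to roughly 1, and the compound score is normalised to lie between -1 and +1; the variable names are my own.

```python
# Score dictionary copied from the example output above.
scores = {"neg": 0.0, "neu": 0.592, "pos": 0.408, "compound": 0.7345}

# The three proportion scores describe how much of the text falls into
# each category, so together they should add up to roughly 1.
proportions = {key: scores[key] for key in ("neg", "neu", "pos")}
total = sum(proportions.values())

# The compound score is the single summary value used for labeling.
compound = scores["compound"]

print("proportions:", proportions)
print("sum of proportions:", round(total, 3))
print("compound:", compound)
```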

Positive sentiment: compound score >= 0.05
Neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
Negative sentiment: compound score <= -0.05
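These three rules translate directly into a small function. This is a minimal sketch: the function name is hypothetical, and the compound values passed in below are sample inputs (0.7345 is the score from the first example; 0.0 and -0.3 are arbitrary).

```python
def label_from_compound(compound, threshold=0.05):
    """Apply the standard thresholds to a compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(0.7345))  # positive
print(label_from_compound(0.0))     # neutral
print(label_from_compound(-0.3))    # negative
```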

Coming back to our above example, the sentiment scores of the text ‘I went to the movie, yesterday. It was amazing! Everyone acted well’ are {‘neg’: 0.0, ‘neu’: 0.592, ‘pos’: 0.408, ‘compound’: 0.7345}. The compound score of 0.7345 is well above 0.05, so the text is labeled positive. Indeed, the movie review is positive. Hurray! The VADER sentiment analyzer worked perfectly.

Example — 2

Now, let us make a few changes in the text data and see how the sentiment scores are reflected. I have replaced the text ‘Everyone acted well’ with ‘1st half was good. 2nd half was bad’. The entire text data reads as ‘I went to the movie, yesterday. It was amazing! 1st half was good. 2nd half was bad.’ If we manually analyze it, we can say that the text review has both positive and negative meanings. It can be considered a somewhat neutral review. The scores come out as follows:

{'neg': 0.201, 'neu': 0.632, 'pos': 0.167, 'compound': -0.1531}

Based on the compound score of -0.1531 and the standard thresholds (compound score <= -0.05), the text data ‘I went to the movie, yesterday. It was amazing! 1st half was good. 2nd half was bad’ is labeled as negative.
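When a review mixes positive and negative statements like this one, it can help to look at each sentence separately. The sketch below is one possible way to do that; the helper name and the simple regex-based sentence split are my own assumptions, and `analyzer` can be any object with a polarity_scores(text) method, such as nltk's SentimentIntensityAnalyzer.

```python
import re

review = ("I went to the movie, yesterday. It was amazing! "
          "1st half was good. 2nd half was bad.")


def score_sentences(text, analyzer):
    """Return (sentence, compound score) pairs for each sentence in `text`."""
    # Naive split after '.', '!' or '?'; adequate for short reviews only.
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [part.strip() for part in parts if part.strip()]
    return [(s, analyzer.polarity_scores(s)["compound"]) for s in sentences]


# Usage (needs nltk and the vader_lexicon resource):
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# for sentence, compound in score_sentences(review, SentimentIntensityAnalyzer()):
#     print(f"{compound:+.4f}  {sentence}")
```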

4) Pros and Cons:

Pros:

Unlabelled text data can be labeled with VADER.
We can change the threshold on the compound score to label the data. This depends on business requirements and domain knowledge.
Reduces the manual effort.
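The points above can be combined into one helper that labels a whole list of unlabelled texts with an adjustable threshold. This is a sketch under assumptions: the function name and the returned dictionary layout are my own, and `analyzer` is expected to be any object with a polarity_scores(text) method (for example nltk's SentimentIntensityAnalyzer).

```python
def label_texts(texts, analyzer, threshold=0.05):
    """Label each text as positive, negative, or neutral by compound score.

    `threshold` is the adjustable cut-off; 0.05 is the standard value.
    """
    labeled = []
    for text in texts:
        compound = analyzer.polarity_scores(text)["compound"]
        if compound >= threshold:
            label = "positive"
        elif compound <= -threshold:
            label = "negative"
        else:
            label = "neutral"
        labeled.append({"text": text, "compound": compound, "label": label})
    return labeled


# Usage (needs nltk and the vader_lexicon resource):
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# rows = label_texts(["It was amazing!", "2nd half was bad."],
#                    SentimentIntensityAnalyzer(), threshold=0.05)
```

With a wider threshold such as 0.2 (an arbitrary value chosen for illustration), the compound score of -0.1531 from the second example would fall in the neutral band instead of being labeled negative.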

Cons:

The accuracy of the analysis is not always great.
VADER cannot analyze sarcastic text.

For example: if a person sarcastically says ‘Oh! The movie was great!’, humans can tell from the person's tone that the review is sarcastic and will label it as a negative review. But the machine will label it as positive.

5) Other Sentiment Analysis Library:

Other than VADER, there is another library, ‘TextBlob’ (built on top of nltk), that can be used for sentiment analysis of text. TextBlob is not limited to sentiment analysis. It can also be used for parts-of-speech tagging, text translation, word counts, etc. The list below shows the attributes and methods available on a TextBlob object.

['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cmpkey', '_compare', '_create_sentence_objects', '_strkey', 'analyzer', 'classify', 'correct', 'detect_language', 'ends_with', 'endswith', 'find', 'format', 'index', 'join', 'json', 'lower', 'ngrams', 'noun_phrases', 'np_counts', 'np_extractor', 'parse', 'parser', 'polarity', 'pos_tagger', 'pos_tags', 'raw_sentences', 'replace', 'rfind', 'rindex', 'sentences', 'sentiment', 'sentiment_assessments', 'serialized', 'split', 'starts_with', 'startswith', 'strip', 'subjectivity', 'tags', 'title', 'to_json', 'tokenize', 'tokenizer', 'tokens', 'translate', 'translator', 'upper', 'word_counts', 'words']
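A listing like the one above can be produced with Python's built-in dir(), and the ‘sentiment’ entry in it is the one relevant to this blog. The sketch below assumes TextBlob is installed (‘pip install textblob’); the helper names are my own. As far as I know, `sentiment` returns a polarity between -1 and +1 and a subjectivity between 0 and 1.

```python
def public_names(obj):
    """Names reported by dir(obj), without the underscore-prefixed internals."""
    return [name for name in dir(obj) if not name.startswith("_")]


def textblob_sentiment(text):
    """Return (polarity, subjectivity) for `text` using TextBlob."""
    # Local import: defining this function does not require textblob yet.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


# Usage (needs textblob installed):
# from textblob import TextBlob
# print(public_names(TextBlob("It was amazing!")))
# print(textblob_sentiment("It was amazing!"))
```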

I am not covering the details of TextBlob in this blog. I plan to write a separate blog on ‘TextBlob’.

Connect with me:

LinkedIn: https://www.linkedin.com/in/sandeep-panchal-682734111/
GitHub: https://github.com/Sandeep-Panchal

Thank you all for reading this blog. Your suggestions are very much appreciated!