Sentiment Analysis: Lexicon Models vs. Machine Learning

Abhinandan Roul · Nerd For Tech · Jan 4, 2021 · 7 min read

Why are we even interested in sentiment analysis? Well, here’s a situation to help you understand.

Photo by Kilyan Sockalingum on Unsplash

Emily runs a movie theatre in a small town and wants to know which movies she should show. She finds out that feel-good movies that promote positivity among their audiences are the most sought after in her town. So she collects data about past screenings of different movies in the theatre and finds that certain genres, such as drama, comedy and animated movies, drive sales much higher than horror, thriller and biographical movies. She therefore predicts that the next movie in the theatre should belong to one of the former genres.

Let’s take another example.

TripAdvisor Review with positive sentiment
TripAdvisor Review with negative sentiment

We see that the first review appreciates the food quality and conveys the positive sentiment 😃 of the customer, whereas the second review shows that the customer had a bad experience with the hotel, conveying negative sentiment. 😞

We as humans can easily classify a handful of reviews, but how do we classify a large number of them? This is where sentiment analysis comes in. It automatically classifies each review into positive, negative or neutral categories, enabling the restaurant to take a data-driven approach to improving its service and increasing sales.

Basic Terminology

Now that we know the importance of sentiment analysis, let’s dive deeper into its basic terminology.

Sentiment analysis refers to methods for extracting subjectivity and polarity from text, while semantic orientation refers to the polarity and strength of words, phrases and texts.

A sentence is said to be subjective if it contains non-factual information such as personal opinions, predictions and judgements. E.g., “COVID-19 vaccines are dangerous and it’s risky to get it at early stages of development. The side effects are deadly.”

A sentence is objective if it contains facts rather than opinions. E.g., “The sun rises in the east.” Clearly, objective sentences are of little use for sentiment analysis.

The polarity of a text is given by a decimal (float) value in the range [-1, 1]. It denotes how positive the tone of the given sentence is.

Negative Sentiment: Polarity < 0

Neutral Sentiment: Polarity =0

Positive Sentiment: Polarity >0
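
As a quick illustration, polarity can be computed with the TextBlob library (a minimal sketch, assuming the textblob package is installed; the thresholds mirror the rules above):

from textblob import TextBlob

def classify(text):
    # TextBlob returns a polarity score in [-1, 1]
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    return "neutral"

print(classify("The food was absolutely wonderful!"))           # positive
print(classify("The room was dirty and the staff was rude."))   # negative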

Before we analyse the data further, we need to clean it. That means pre-processing and normalizing the text we want to analyse:

· Removal of punctuation marks and special characters.

· Removal of stop words like a, an, and, for, etc.

· Expansion of word contractions. E.g., “they’ll” as “they will”; “haven’t” as “have not”, etc.

· Tokenization: splitting sentences into words (tokens).

· Maintaining case uniformity: converting all words to either upper or lower case.

· Stemming and Lemmatization: converting a word into its base form. E.g., apples -> apple; running -> run.

In most cases we prefer Lemmatization over stemming because the former conducts a morphological analysis of words, thus resulting in more accurate word roots.
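
Here is a minimal pre-processing sketch using NLTK that covers several of the steps above (assuming nltk is installed and its stopwords, punkt and wordnet resources are downloaded):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time: nltk.download('stopwords'), nltk.download('punkt'), nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                                   # case uniformity
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation and special characters
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stop word removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # lemmatization

print(preprocess("The apples were great, and the service was excellent!"))
# ['apple', 'great', 'service', 'excellent']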

Precision, Recall and F1 Score:

True Positives are events which actually happened and were correctly predicted by our model to have happened. True Negatives are events which did NOT happen and were correctly predicted as NOT having happened.

False Positives are events which did NOT happen but were predicted to have happened by our model. False Negatives are events which happened in reality but were predicted NOT to have happened.

Confusion Matrix

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
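
All three metrics, and the confusion matrix, are available in scikit-learn (a minimal sketch with made-up labels):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP) -> 0.75
print(recall_score(y_true, y_pred))       # TP / (TP + FN) -> 0.75
print(f1_score(y_true, y_pred))           # 0.75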

Sentiment Analysis using Lexicon Based Models

A lexicon is the vocabulary of a person, language or branch of knowledge. In lexicon-based sentiment analysis, we already have a dictionary of words, each labelled with positive, negative or neutral sentiment along with polarity, part of speech, subjectivity, mood, modality and the like. A sentence is tokenized, and each token is matched against the words available in the lexicon to find its context and sentiment (if any). A combining function, such as sum or average, is then applied to make the final prediction for the text as a whole.

The IMDB movie reviews dataset is used to make the predictions that follow.

AFINN Lexicon:

AFINN is one of the simplest and most popular lexicons for sentiment analysis. The current version is AFINN-en-165.txt, which contains 3382 words along with their polarity scores. Head over to the official repository to know more.
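
A minimal sketch using the afinn Python package (the lexicon file itself can also be parsed directly):

from afinn import Afinn

afinn = Afinn()
# score() sums the AFINN polarity scores of all matched words
score = afinn.score("This movie was wonderful, I loved it!")
print("positive" if score > 0 else "negative" if score < 0 else "neutral", score)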

AFINN Lexicon model performance

SentiWordNet:

SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity.
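
SentiWordNet is accessible through NLTK’s corpus readers (a minimal sketch; requires the wordnet and sentiwordnet downloads):

from nltk.corpus import sentiwordnet as swn

# One-time: nltk.download('wordnet'), nltk.download('sentiwordnet')
good = list(swn.senti_synsets('good', 'a'))[0]   # first adjective synset of "good"
print(good.pos_score(), good.neg_score(), good.obj_score())
# e.g. 0.75 0.0 0.25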

Fig. The graphical representation adopted by Senti-WordNet for representing the opinion related properties of a term sense.
SentiWordNet Model Performance

VADER:

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media. It is fully open source, is available in the NLTK package, and can be applied directly to unlabelled text data. VADER detects both the polarity and the intensity of emotion.

Here’s an example-

sentences = ["VADER is smart, handsome, and funny.",       # positive sentence example
             "VADER is smart, handsome, and funny!",       # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",               # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!"]  # booster words & punctuation make this close to ceiling for score
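
To actually score these sentences, VADER’s SentimentIntensityAnalyzer from NLTK can be used (a minimal sketch; requires nltk.download('vader_lexicon')):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    # polarity_scores returns neg/neu/pos proportions and a normalized compound score in [-1, 1]
    print(sentence, analyzer.polarity_scores(sentence))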
VADER Model Performance

Classification of Sentiment with Supervised Learning

With supervised learning, each piece of textual data comes with a label for its polarity, subjectivity or objectivity. Here we need to build a machine learning model that learns from this labelled data and predicts the sentiment category of future inputs.

Text pre-processing and data normalization

This is the most important step before training the model. We must have a balanced class distribution so that the model is not biased towards any class. Furthermore, we have to remove punctuation, HTML tags and numbers, convert accented characters to ASCII, and lowercase all text.
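
A minimal normalization sketch for these steps (the normalize helper is my own naming, not from the article’s notebook):

import re
import unicodedata

def normalize(text):
    # Convert accented characters to their closest ASCII equivalents
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r'<[^>]+>', ' ', text)       # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)   # drop punctuation and numbers
    return re.sub(r'\s+', ' ', text).lower().strip()

print(normalize("<br />Café was great!!! 10/10"))  # "cafe was great"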

Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. — Dr. Jason Brownlee

Model Training, Prediction and Evaluation

This involves training various ML models on the data and evaluating them to see which is most appropriate for our use case. We must also do hyperparameter tuning to achieve more accurate results.
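
As a sketch, hyperparameter tuning can be done with scikit-learn’s GridSearchCV (the parameter grid below is illustrative; X_train and y_train are assumed to be your prepared features and labels):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10]}  # regularization strengths to try
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)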

Before training the data with SVM and Logistic Regression, we need to convert the text into feature vectors. We use scikit-learn’s CountVectorizer and TfidfVectorizer for this.

Bag of Words Model- (BOW)

A Bag of Words model is a method of extracting features from text for ML modelling. In BOW, the words in a text are extracted and a list of all the words and their frequencies is made. In other words, a vocabulary of all the words contained in the text is created. It is known as a Bag of Words because the order of the words and their meaning in context are discarded, as the sketch below shows.
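
A minimal BOW sketch with scikit-learn’s CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse matrix of word counts
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # each row is a document's word-count vector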

Model Performance Report for BOW features:

The left image shows the Logistic Regression model on BOW features and the right image shows the SVM model on BOW features.


Term Frequency-Inverse Document Frequency (TF-IDF)

The central idea behind TF-IDF is to give more weight to words that occur frequently in a given document but rarely across the whole corpus. Here, Term Frequency refers to how often a term appears in a document:

TF = (frequency of the word in the document) / (total words in the document)

Inverse Document Frequency is calculated by:

IDF = ln(total number of documents / number of documents containing the word)

In practical cases, we will use the TfidfVectorizer from sklearn.feature_extraction.text.
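
A minimal end-to-end sketch on TF-IDF features (the toy training data is mine, not the IMDB dataset):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["what a wonderful film", "absolutely terrible plot",
               "loved every minute", "boring and slow"]
train_labels = ["positive", "negative", "positive", "negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # TF-IDF feature vectors

model = LogisticRegression()
model.fit(X_train, train_labels)

X_test = vectorizer.transform(["a wonderful plot, loved it"])
print(model.predict(X_test))  # likely ['positive'] on this toy data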

Model Performance Report for TF-IDF features:

The left image shows the Logistic Regression model on TF-IDF features and the right image shows the SVM model on TF-IDF features.

Which one is better?

We find that the AFINN model has the best accuracy (72%) among the lexicon models, with the other lexicon models performing close to AFINN on the given data.

Among the supervised learning models, the Logistic Regression model on Bag of Words features is the best, with an accuracy of 89.94%.

Check out the notebook here.

