Sentiment Analysis On IMDb

Dereckjos
4 min read · Jun 24, 2020


Introduction

Given the large volume of online review data now available (Amazon, IMDb, etc.), sentiment analysis is becoming increasingly important. In this project, we build a sentiment classifier that labels the polarity of a piece of text as either positive or negative.

The problem at hand

The problem at hand is sentiment analysis, also called opinion mining: we want to analyze textual documents and predict the sentiment or opinion they express based on their content.

Sentiment analysis is perhaps one of the most popular applications of natural language processing and text analytics, with a vast number of websites, books, and tutorials on the subject. It typically works best on subjective text, where people express opinions, feelings, and moods. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews of movies, places, commodities, and much more. The idea is to understand how people react to a specific entity and to take informed action based on their sentiment.

Getting the Dataset

The “Large Movie Review Dataset”(*) is used for this project. The dataset consists of 50,000 reviews from IMDb, with no more than 30 reviews per movie. The numbers of positive and negative reviews are equal. Negative reviews have a score of 4 or less out of 10, while positive reviews have a score of 7 or more out of 10. Neutral reviews are not included. The 50,000 reviews are divided evenly into a training set and a test set.

The training dataset is stored in the zipped aclImdb tar archive, which can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

The test dataset is stored in the folder named ‘test’.

The dataset is also available on Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
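For the Stanford archive, the reviews can be read into memory once the file is extracted. The sketch below assumes the extracted layout has `pos` and `neg` subfolders of `.txt` files under both `train` and `test`. The function name `load_reviews` and the `aclImdb` path in the usage comment are illustrative and not taken from the original project.

```python
from pathlib import Path


def load_reviews(split_dir):
    """Read every review under split_dir/pos and split_dir/neg.

    Returns (texts, labels), where label 1 means positive and 0 negative.
    """
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        # sorted() keeps the file order stable between runs
        for path in sorted(Path(split_dir, label_name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Hypothetical usage, assuming the archive was extracted to ./aclImdb:
# train_texts, train_labels = load_reviews("aclImdb/train")
# test_texts, test_labels = load_reviews("aclImdb/test")
```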

Text Preprocessing

The following steps were carried out for text preprocessing:

1) Remove punctuation

We remove punctuation so that we don’t end up with different forms of the same word. For example:

Input: “Hey, did you know that the summer break is coming? Amazing right!! It’s only 5 more days!!”
Output: “Hey did you know that the summer break is coming Amazing right Its only 5 more days”
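A minimal sketch of this step using only the standard library is shown below. It assumes that "punctuation" means the ASCII characters in `string.punctuation`; the function name `remove_punctuation` is illustrative.

```python
import string


def remove_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    # The third argument of str.maketrans lists characters to delete
    return text.translate(str.maketrans("", "", string.punctuation))


sentence = ("Hey, did you know that the summer break is coming? "
            "Amazing right!! It's only 5 more days!!")
print(remove_punctuation(sentence))
# Hey did you know that the summer break is coming Amazing right Its only 5 more days
```

Note that `string.punctuation` covers ASCII only, so typographic characters such as curly quotes would need to be added to the deletion set separately.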

2) Tokenize sentence

Tokenization refers to splitting a larger body of text into smaller units such as lines or words; for some non-English languages it also involves deciding where the word boundaries are. Various tokenization functions are built into the nltk module.
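To illustrate word-level tokenization without any downloads, the sketch below uses a regular expression from the standard library rather than nltk (whose `word_tokenize` needs its tokenizer models downloaded first). Lowercasing the text and keeping apostrophes inside tokens are assumptions made for this example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("Hey did you know that the summer break is coming"))
# ['hey', 'did', 'you', 'know', 'that', 'the', 'summer', 'break', 'is', 'coming']
```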

3) Remove stopwords

Stopwords are English words that do not add much meaning to a sentence, such as “the”, “he”, and “have”. They can usually be ignored without sacrificing the meaning of the sentence.
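Stopword removal amounts to filtering tokens against a fixed set. The sketch below uses a tiny hand-written set purely for illustration; a real list, such as the one shipped with nltk, has well over a hundred entries. The names `STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# A tiny illustrative stopword set, not a complete list
STOPWORDS = {"the", "he", "have", "is", "that", "did", "you", "a", "an", "of"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = ["hey", "did", "you", "know", "that", "the",
          "summer", "break", "is", "coming"]
print(remove_stopwords(tokens))
# ['hey', 'know', 'summer', 'break', 'coming']
```

One caveat for sentiment work: standard stopword lists such as nltk's include negations like "not" and "no", which can carry polarity, so it may be worth keeping them.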

4) Lemmatize words

Lemmatization reduces a word to its dictionary form (its lemma), for example “went” to “go”. Because it takes the word's part of speech (verb, noun) into account, the same surface form can map to different lemmas, which helps with disambiguation. The trade-off is that it demands more computational power.
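The role of the part of speech can be shown with a toy lookup table. This is only a sketch: the table entries and the `lemmatize` helper are illustrative stand-ins for a real dictionary such as WordNet. In nltk, `WordNetLemmatizer.lemmatize` similarly accepts a `pos` argument, but it needs the WordNet corpus downloaded first.

```python
# Toy lemma table keyed by (word, part of speech); the entries are
# illustrative and stand in for a real dictionary such as WordNet.
LEMMAS = {
    ("went", "verb"): "go",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
    ("better", "adj"): "good",
}


def lemmatize(word, pos):
    """Return the dictionary form of word for the given part of speech.

    Words missing from the table are returned unchanged.
    """
    return LEMMAS.get((word.lower(), pos), word)


print(lemmatize("went", "verb"))     # go
print(lemmatize("meeting", "verb"))  # meet
print(lemmatize("meeting", "noun"))  # meeting
```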

5) Calculate TF-IDF

TF-IDF scores are word frequency scores that try to highlight the more interesting words, i.e. those that are frequent within a document but not across documents. The TfidfVectorizer tokenizes documents, learns the vocabulary and inverse document frequency weightings, and lets you encode new documents.
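A short sketch of that workflow with scikit-learn's `TfidfVectorizer` follows. The three-document corpus is invented for illustration, and the vectorizer is left at its defaults, which may differ from the settings used in the original project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and great story",
]

vectorizer = TfidfVectorizer()      # default tokenization and lowercasing
X = vectorizer.fit_transform(docs)  # learns vocabulary and IDF, returns a sparse matrix

vocab = vectorizer.vocabulary_      # maps each term to its column index
idf = vectorizer.idf_               # one IDF weight per column

print(X.shape)                      # (number of documents, vocabulary size)

# "terrible" occurs in one document and "great" in two,
# so "terrible" receives the higher IDF weight
print(idf[vocab["terrible"]] > idf[vocab["great"]])

# Encode an unseen document with the vocabulary learned above
new_vec = vectorizer.transform(["a great movie"])
print(new_vec.shape)
```

With the default token pattern, single-character tokens such as "a" are dropped, so the new document only activates the columns for "great" and "movie".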

6) Train ML models

We used both supervised and unsupervised models.

o Supervised Base Models

1. Logistic Regression

2. Stochastic Gradient Descent

3. Random Forest Classifier

4. AdaBoost Classifier

o Unsupervised Lexicon-Based Models

1. AFINN Lexicon

2. VADER Lexicon
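Lexicon-based models need no training: they look up a sentiment score for each word and aggregate the scores. AFINN, for instance, assigns words integer scores between -5 and +5. The sketch below uses a small made-up lexicon in that style (the scores are not the real AFINN entries), and the decision rule with `threshold=0` is an assumption.

```python
# Made-up valence scores in AFINN's -5..+5 style; NOT the real AFINN entries
TOY_LEXICON = {
    "great": 3, "amazing": 4, "good": 2, "love": 3,
    "bad": -2, "boring": -2, "terrible": -3, "awful": -3,
}


def lexicon_score(tokens, lexicon=TOY_LEXICON):
    """Sum the valence of every token; unknown words count as 0."""
    return sum(lexicon.get(tok.lower(), 0) for tok in tokens)


def predict_sentiment(tokens, threshold=0):
    """Label a review 'positive' if its total score exceeds the threshold."""
    return "positive" if lexicon_score(tokens) > threshold else "negative"


review = "an amazing story but boring and terrible acting".split()
print(lexicon_score(review))      # 4 - 2 - 3 = -1
print(predict_sentiment(review))  # negative
```

VADER builds on the same lookup idea but adds rules for things like negation and intensifiers, and reports a normalized compound score.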

Comparing the models:

1. Logistic Regression

Supervised() is the function used to print the accuracy score, precision, recall, F1 score, classification report, and confusion matrix.

2. Stochastic Gradient Descent

3. Random Forest Classifier

4. AdaBoost Classifier

5. AFINN Lexicon

display_model_performance_metrics() is the function used to print the accuracy score, precision, recall, F1 score, classification report, and confusion matrix.

6. VADER Lexicon
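One way to produce the metrics listed above for the supervised models is with `sklearn.metrics`. The sketch below is not the project's own helper code, which is not shown here. It trains the four classifiers on a tiny invented dataset, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions, so the printed numbers say nothing about performance on IMDb.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.pipeline import make_pipeline

# Tiny invented dataset (1 = positive, 0 = negative)
train_texts = [
    "a great and moving film", "wonderful acting and a great story",
    "i loved every minute", "brilliant and funny",
    "a boring and terrible film", "awful acting and a dull story",
    "i hated every minute", "painfully slow and bad",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great story and brilliant acting", "loved it",
              "terrible and dull", "bad and boring"]
test_labels = [1, 1, 0, 0]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Stochastic Gradient Descent": SGDClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

results = {}
for name, clf in models.items():
    # Each model gets its own TF-IDF step so nothing leaks between runs
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    predicted = pipeline.predict(test_texts)

    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predicted, average="binary", zero_division=0)
    results[name] = {
        "accuracy": accuracy_score(test_labels, predicted),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": confusion_matrix(test_labels, predicted, labels=[0, 1]),
    }

    print(name)
    print(classification_report(test_labels, predicted, zero_division=0))
    print(results[name]["confusion"])
```

In the confusion matrix, rows are the true classes and columns the predicted classes, in the order given by `labels`.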

Conclusion:

We can observe that Logistic Regression and SGDClassifier both perform well compared to the other classifiers.
