Sentiment Analysis — IMDB 50K Dataset (Kaggle)

Dr. Amna Ikram
May 27, 2024 · 7 min read


As I began my data science journey on Kaggle a few months ago, I wondered how machine learning algorithms for text capture word semantics and syntax, and what is happening under the hood. Capturing meaningful relationships and semantic similarities from raw text falls under the big umbrella of natural language processing, so I decided to explore it.

I started working on the IMDB Dataset of 50K Movie Reviews and uploaded my source code to [GitHub].

Exploratory Data Analysis:

I loaded the CSV file from the dataset into my Jupyter Notebook. There are 50,000 data points in this well-compiled dataset, with zero null values. It has two columns: review and sentiment. The review column holds the reviews written by people; each review has a variable number of English words, some punctuation marks, and Unicode characters. The sentiment column contains one of two strings, positive or negative. The dataset is balanced, with equal counts of positive and negative reviews.
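
A minimal sketch of these initial checks (the file name is my assumption, taken from the Kaggle download):

import pandas as pd

# Load the dataset (file name assumed from the Kaggle download)
df = pd.read_csv('IMDB Dataset.csv')

# Basic checks: shape, missing values, and class balance
print(df.shape)                        # (50000, 2)
print(df.isnull().sum())               # no null values in either column
print(df['sentiment'].value_counts())  # 25000 positive / 25000 negative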

Text Cleaning:

For text cleaning, I imported Python's re library, well known for regular expression tasks. I removed unwanted characters, keeping only letters and whitespace, by defining a Python function, remove_special_characters.

import re

def remove_special_characters(text):
    # Remove URLs first, before their punctuation is stripped away
    text = re.sub(r'http\S+', '', text)
    # Remove special characters, keeping whitespace and alphanumeric characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove digits
    text = re.sub(r'\d', ' ', text)
    # Remove the leftover word 'br' (from HTML line breaks, e.g., <br />)
    text = re.sub(r'\bbr\b', ' ', text)
    # List of very frequent, uninformative words to remove
    common_words = ['film', 'movie']
    # Construct a pattern matching any of the common words as whole words
    pattern = r'\b(?:' + '|'.join(map(re.escape, common_words)) + r')\b'
    # Remove common words from the text
    text = re.sub(pattern, ' ', text)
    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
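
Applying the function to the review column is then a one-liner (a sketch; the column name follows the dataset):

# Clean every review in place
df['review'] = df['review'].apply(remove_special_characters)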

Tokenization:

The NLTK library provides two tokenization methods, depending on the task at hand: word_tokenize and sent_tokenize.

I mostly applied word_tokenize and only used sent_tokenize with the sentence transformers to create embeddings.

# Tokenize the review column
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer data required by word_tokenize

df['tokens'] = df['review'].apply(word_tokenize)
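
For intuition, here is a small illustration (my own example sentence) of how the two tokenizers differ:

from nltk import sent_tokenize

sample = "The plot was thin. The acting was superb."
print(sent_tokenize(sample))
# ['The plot was thin.', 'The acting was superb.']
print(word_tokenize(sample))
# ['The', 'plot', 'was', 'thin', '.', 'The', 'acting', 'was', 'superb', '.']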

Label Encoding:

The sentiments are stored as strings (object dtype), so I imported LabelEncoder from scikit-learn's preprocessing module and encoded positive as 1 and negative as 0. It is now a binary classification task.

from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
label_encoder = LabelEncoder()

# Label encoding of sentiment column
df['encoded_sentiment'] = label_encoder.fit_transform(df['sentiment'])

# Assign the encoded sentiments to the y variable
y = df['encoded_sentiment']

After all this preprocessing, my data was ready for vectorization. As a beginner, I needed more intuition about which algorithm would work best for sentiment analysis, so I applied every word vectorization algorithm I could and compared the outcomes.
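
The snippets below use X_train and X_valid, which come from a train/validation split. A minimal sketch of such a split (the 80/20 ratio and random_state are my assumptions, not from the original notebook):

from sklearn.model_selection import train_test_split

# Split the cleaned reviews and encoded labels into training and validation sets
# (the 80/20 ratio and random_state are assumptions)
X_train, X_valid, y_train, y_valid = train_test_split(
    df['review'], y, test_size=0.2, random_state=42, stratify=y
)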

Top Algorithms for Word Vectors and Embeddings

1. CountVectorizer

CountVectorizer is a straightforward algorithm that converts textual data into numerical vectors by counting the occurrence of each word in the text.

from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)
# Transform the validation data
X_valid_vec = vectorizer.transform(X_valid)

When I trained an MLPClassifier on the numerical vectors generated by the CountVectorizer, I achieved a fantastic accuracy of 89%.

With the XGBClassifier and the logistic regression model, the accuracy was 85% and 89%, respectively.
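
As a sketch of the training and evaluation step (logistic regression shown here; the other classifiers are swapped in the same way, and max_iter is my choice to help convergence on the sparse features):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a classifier on the count vectors and score it on the validation set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
preds = clf.predict(X_valid_vec)
print(f"Validation accuracy: {accuracy_score(y_valid, preds):.2f}")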

2. Term Frequency — Inverse Document Frequency:

This technique is a statistical measure of how important a word is relative to a collection of documents (corpus). The goal is to scale down the impact of tokens that occur very frequently across the corpus and are therefore empirically less informative than tokens that appear in only a small fraction of the documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vec = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=3,
    max_df=0.9,
    strip_accents='unicode',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,
    binary=True,
    stop_words='english'
)

# Fit and transform the training data
X_train_tfidf = vec.fit_transform(X_train)

# Transform the validation data
X_valid_tfidf = vec.transform(X_valid)

I have always found this technique to perform well in producing word vectors.

On my dataset, this vectorization technique gave me an astounding accuracy of 90% with the MLPClassifier and 89% with the logistic regression model.

It also performed well with the XGBClassifier, achieving an accuracy score of 85%.

3. Word2vec

Google researchers introduced it in 2013. It is a simple yet effective algorithm for vectorizing words. It has two models:

A. Continuous Bag of Words(CBOW)

B. Skip-gram Model.

I trained the model on this dataset and generated word embeddings (unlike the transformer models later on, Word2vec embeddings are static rather than contextualized). I padded the arrays of word embeddings to a max_sequence_length of 100. Consequently, I obtained a 3D array comprising 25,000 data points, each consisting of a sequence of 100 word vectors with 100 features each.

import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Train the skip-gram model (sg=1) on the tokenized reviews
model = Word2Vec(sentences, min_count=1, vector_size=100, window=5, sg=1)

# Apply the model to generate embeddings for each tokenized sentence
X_train_model = [generate_embeddings(sentence, model) for sentence in sentences]

# Convert each list of word vectors into a NumPy array
X_train_model_flat = [np.array(embeddings) for embeddings in X_train_model]

# Pad or truncate sequences to ensure a fixed number of word vectors per row
X_train_model_padded = pad_sequences(X_train_model_flat, maxlen=max_sequence_length,
                                     dtype='float32', padding='post',
                                     truncating='post', value=0.0)
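
The post does not show generate_embeddings; a plausible minimal version (my assumption: it simply looks up the Word2vec vector of every token the model knows) would be:

def generate_embeddings(sentence, model):
    # Return the word vectors for all tokens present in the Word2vec vocabulary
    return [model.wv[word] for word in sentence if word in model.wv]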

I implemented the multi-layer perceptron classifier on the word embeddings with the following hyperparameters: a hidden layer size of 100, the identity activation function, and a learning rate of 0.001.
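
A sketch of that setup. MLPClassifier expects 2D input, so the padded 3D array has to be flattened per sample first; that reshape is my assumption about how the embeddings were fed in:

from sklearn.neural_network import MLPClassifier

# Flatten each (100, 100) sequence of word vectors into a single feature row
n_samples = X_train_model_padded.shape[0]
X_train_2d = X_train_model_padded.reshape(n_samples, -1)

# One hidden layer of 100 neurons, identity activation, learning rate 0.001
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation='identity',
                    learning_rate_init=0.001)
mlp.fit(X_train_2d, y_train)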

The results were promising with the MLPClassifier (79%), and satisfactory with logistic regression (73%) and the XGBClassifier (71%).

4. Sentence Transformers

Sentence transformers are a class of models built on the BERT (Bidirectional Encoder Representations from Transformers) architecture and are used to generate contextualized embeddings for sentences, paragraphs, or other pieces of text. These embeddings capture the semantic meaning of the text, making them useful for tasks such as semantic search, text classification, machine translation, summarization, and more.

I leveraged this framework, tried several pre-trained sentence transformer models, and obtained impressive results on my dataset.

  1. all-MiniLM-L6-v2

It is an all-round model tuned for many use cases. It was trained on a large and diverse dataset of over 1 billion training pairs and has six transformer layers. It is fast but still produces good-quality embeddings. It takes a max_sequence_length of 256 and yields embeddings of 384 dimensions.

from sentence_transformers import SentenceTransformer

# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Join each list of tokens back into a plain string (encode expects raw text)
sentences_text = [' '.join(words) if words else '' for words in sentences_list]

# Encode the sentences using the model
X_train = model.encode(sentences_text)

2. bert-base-nli-mean-tokens

This transformer model is based on BERT and is trained to generate sentence embeddings by averaging the embeddings of all tokens in the input sentence. It is commonly used for sentence similarity, semantic search, and text classification tasks.

# Load the SentenceTransformer model
model = SentenceTransformer("bert-base-nli-mean-tokens")

# Join each list of tokens back into a plain string (encode expects raw text)
sentences_text = [' '.join(words) if words else '' for words in sentences_list]

# Encode the sentences using the model
X_train = model.encode(sentences_text)

3. multi-qa-mpnet-base-dot-v1

This model is widely used for semantic search: given a query or question, it finds relevant passages. It was trained on a large, diverse set of question-answer pairs and leverages the MPNet architecture to capture nuanced semantic information across various contexts. It takes a maximum sequence length of 512 and produces embeddings of 768 dimensions.

# Load the SentenceTransformer model
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Join each list of tokens back into a plain string (encode expects raw text)
sentences_text = [' '.join(words) if words else '' for words in sentences_list]

# Encode the sentences using the model
X_train = model.encode(sentences_text)

4. roberta-large-nli-stsb-mean-tokens

roberta-large-nli-stsb-mean-tokens is a robust sentence transformer model based on the RoBERTa architecture, fine-tuned on Natural Language Inference (NLI) and the Semantic Textual Similarity Benchmark (STSB). This model produces highly accurate sentence embeddings using mean pooling of token embeddings.

# Load the SentenceTransformer model
model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

# Join each list of tokens back into a plain string (encode expects raw text)
sentences_text = [' '.join(words) if words else '' for words in sentences_list]

# Encode the sentences using the model
X_train = model.encode(sentences_text)

Comparison of the performance of various sentence transformers with a recurrent neural network
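
The post does not show the recurrent neural network itself. A rough sketch of how such a pairing could look, under several assumptions of mine: each raw review (before punctuation removal) is split into sentences with sent_tokenize, each sentence is encoded with multi-qa-mpnet-base-dot-v1, the sequence is capped at 10 sentences, and a small Keras LSTM classifies the padded sequence of sentence embeddings:

import pandas as pd
from nltk import sent_tokenize
from sentence_transformers import SentenceTransformer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

encoder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Use the raw reviews so sentence boundaries are still intact (assumption)
raw_reviews = pd.read_csv('IMDB Dataset.csv')['review']
labels = df['encoded_sentiment'].values

# Encode each review as a sequence of 768-dimensional sentence embeddings
sequences = [encoder.encode(sent_tokenize(review)) for review in raw_reviews]
X_seq = pad_sequences(sequences, maxlen=10, dtype='float32',
                      padding='post', truncating='post')

# A small LSTM classifier over the sentence-embedding sequences
rnn = Sequential([
    LSTM(64, input_shape=(10, 768)),
    Dense(1, activation='sigmoid')
])
rnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
rnn.fit(X_seq, labels, epochs=5, batch_size=64, validation_split=0.2)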

Conclusion:

So, after extensive experimentation with different word vectorization and embedding algorithms, and after training several kinds of machine learning models on their outputs, I found these three combinations best for sentiment analysis on my dataset:

1. TF-IDF with Logistic Regression

2. TF-IDF with MLPClassifier

3. Sentence Transformer Model (multi-qa-mpnet-base-dot-v1) with Recurrent Neural Network
