Getting started with Sentiment Analysis

6 min read · Mar 22, 2023

A banner showing different facial expressions: sadness to happiness.

Sentiment analysis is a valuable tool for businesses, brands, and policymakers to understand public perception and consumer behavior. It involves extracting emotions, opinions, and attitudes from text data using natural language processing (NLP) to determine whether a text expresses a positive, negative, or neutral sentiment towards a topic. This is done by analyzing linguistic features, such as words, tone, and context.

In various industries, sentiment analysis is applied in different ways:

Marketing — Helps businesses to analyze customer feedback, monitor brand reputation, and understand customer needs and preferences.
Customer service — Helps identify dissatisfied customers and resolve their issues before they escalate.
Politics — Helps analyze public opinion on political issues and predict election outcomes.

By the end of this article, we’ll detect the sentiment associated with a statement as negative or positive using Machine Learning. The notebook used for this exercise is linked at the bottom of this post.

Data Extraction

This article introduces sentiment analysis using Sentiment140, a collection of tweets available as a dataset on Kaggle.

First, we'll import all the required libraries. Then, we'll load the dataset into a Pandas DataFrame and view the first 5 rows.

import pandas as pd
import nltk
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Dataset Source: https://www.kaggle.com/datasets/kazanova/sentiment140
column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']

# The CSV has no header row, so header=None keeps the first tweet as data
df = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv', header=None, names=column_names, encoding='latin')

# Observe the first 5 rows
df.head()
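When column names are passed explicitly, the `header` argument decides whether the first line of the file is treated as data. The sketch below shows the difference on a made-up in-memory CSV (the rows and the shortened column list are illustrative, not the real dataset); it is worth checking against your own copy of the file.

```python
import io
import pandas as pd

# Made-up three-row CSV with no header line (mimics the layout, not the real data)
raw = '0,1,user_a,first tweet\n4,2,user_b,second tweet\n0,3,user_c,third tweet\n'
cols = ['target', 'ids', 'user', 'text']

# header=0 treats the first line as a header and replaces it with `names`
with_header = pd.read_csv(io.StringIO(raw), header=0, names=cols)

# header=None treats every line as data
without_header = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(with_header), len(without_header))
```

If the file has no header line, the first option silently loses one row.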

Data Cleaning

Before we dive into sentiment analysis, it's essential to clean the data. Data cleaning is an important step in any analysis: it improves the accuracy, consistency, and completeness of the data, which saves time and resources later on. For this dataset, it means checking for missing values, dropping what we don't need, and confirming that the classes are balanced.

1. Learn about the dataset, i.e., check for null values

df.info()
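`df.info()` reports the non-null count per column. As a complementary check, a sketch along these lines counts missing values and duplicate rows directly. The tiny DataFrame is made up for illustration and stands in for the tweets.

```python
import pandas as pd

# Made-up stand-in for the tweets DataFrame (not the real dataset)
toy = pd.DataFrame({
    'target': [0, 4, 4, 0],
    'text': ['so sad today', 'love this!', None, 'so sad today'],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Number of rows that exactly repeat an earlier row
duplicate_rows = toy.duplicated().sum()
print('Duplicate rows:', duplicate_rows)
```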

2. Drop what I don't require: rows where the target is 2 (i.e., neutral), and the ids, date, flag, and user columns

neutral_index = df[df['target'] == 2].index
df.drop(neutral_index, inplace=True)
df.drop(columns=['ids', 'date', 'flag', 'user'], inplace=True)

3. Check if the dataset is balanced, i.e., has a similar number of positive and negative tweets

df['target'].value_counts().plot(kind='pie')
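A pie chart gives a quick visual impression; the same check can be read off as numbers. A minimal sketch, using made-up labels in the dataset's encoding (0 = negative, 4 = positive):

```python
import pandas as pd

# Made-up labels in the Sentiment140 encoding (0 = negative, 4 = positive)
labels = pd.Series([0, 4, 4, 0, 4, 0, 0, 4], name='target')

# Share of each class; values near 0.5 indicate a balanced binary dataset
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```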

Data Preprocessing

Data preprocessing can significantly impact the accuracy of your model. It transforms the raw text into a format the model can analyze and removes noise, so that the model can focus on the informative words.

We use several techniques to preprocess our data: lowercasing, tokenizing, removing stop words, and removing non-alphanumeric tokens. Removing stop words drops common words that add little value to the analysis, such as “the” and “and.” Removing non-alphanumeric tokens, such as punctuation marks, keeps the model focused on the words themselves.

from nltk.corpus import stopwords

# Download the tokenizer and stopword data once
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')

# Obtain the English stopwords (named stop_words so the imported module is not shadowed)
stop_words = set(stopwords.words('english'))

# Pre-process the text (removal of non-alphanumeric tokens and stopwords)
def preprocess_text(text):
    """
    1. Convert the words to lower case
    2. Tokenize the words
    3. Keep only alphanumeric tokens
    4. Remove all stopwords
    """
    text = text.lower()
    words = nltk.word_tokenize(text)
    words = [word for word in words if word.isalnum()]
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

df['text'] = df['text'].apply(preprocess_text)
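To see what these steps do to a single tweet, here is a simplified sketch that runs without any NLTK downloads. It is an approximation, not the function above: `preprocess_demo`, the regex tokenizer, and the tiny `DEMO_STOPWORDS` set are stand-ins introduced for illustration, so its output can differ from `nltk.word_tokenize` on trickier input.

```python
import re

# Tiny hand-picked stopword list (assumption: stands in for NLTK's English list)
DEMO_STOPWORDS = {'the', 'and', 'is', 'a', 'to', 'i', 'this'}

def preprocess_demo(text):
    """Lowercase, split into tokens, keep alphanumeric tokens, drop stopwords."""
    text = text.lower()
    # Simplified tokenizer: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tokens = [t for t in tokens if t.isalnum()]
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]
    return ' '.join(tokens)

cleaned = preprocess_demo('I LOVE the new phone, and the camera is great!')
print(cleaned)  # love new phone camera great
```

One thing to keep in mind: NLTK's English stopword list includes negations such as "not", so a phrase like "not good" is reduced to "good" after this step.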

Data Modeling

Data modeling involves creating a predictive model that can make accurate predictions on new data. Various types of models are used in sentiment analysis. Comparing classification algorithms is beyond the scope of this article, but you can learn more about them from the resources below.

7 Types of Classification Algorithms in Machine Learning (www.projectpro.io)

Classification Algorithms; Classification In Machine Learning | Serokell (serokell.io)

In this article we’ll be using Logistic Regression.

1. Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], test_size=0.25, random_state=42)
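The split above shuffles the rows at random, which is usually fine for a balanced dataset. If you want the class proportions to be preserved in both splits, `train_test_split` also accepts a `stratify` argument. A minimal sketch on made-up data (the texts and labels are placeholders):

```python
from sklearn.model_selection import train_test_split

# Made-up texts and labels (0 = negative, 4 = positive)
texts = [f'tweet {i}' for i in range(16)]
labels = [0] * 8 + [4] * 8

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```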

2. Convert the text data into numerical vectors using TfidfVectorizer

TfidfVectorizer considers how often a word appears in a document and also how informative that word is: words that occur in many documents across the corpus are weighted down, while rarer words are weighted up. Count Vectorizer, by comparison, only records how often each word appears. You can learn more about how the two compare from the article below.

Count Vectorizer vs TFIDF Vectorizer | Natural Language Processing (www.linkedin.com)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
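To make the difference between counts and TF-IDF weights concrete, here is a small sketch on a made-up three-document corpus. The word "movie" appears in every document and "awful" in only one, so with scikit-learn's default settings "awful" should receive the larger inverse document frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus: 'movie' appears in every document, 'awful' in one
corpus = ['great movie', 'awful movie', 'great great movie']

counts = CountVectorizer().fit_transform(corpus).toarray()

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus).toarray()

vocab = tfidf.vocabulary_   # maps each word to its column index
idf = tfidf.idf_            # learned inverse document frequencies

print(counts)               # raw word counts per document
print(weights.round(2))     # TF-IDF weights per document
print('idf(movie) =', round(idf[vocab['movie']], 2))
print('idf(awful) =', round(idf[vocab['awful']], 2))
```

Note that the article's code calls `fit_transform` on the training split only and `transform` on the test split, so the vocabulary and document frequencies are learned from training data alone.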

3. Train a logistic regression model on the training data

model = LogisticRegression(solver='sag', max_iter=2500, class_weight='balanced')
model.fit(X_train, y_train)

4. Make predictions on the testing data and evaluate the performance of the model

Currently the accuracy of my model is at 0.8 (80%).

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
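`classification_report` summarizes precision, recall, and F1 per class. Accuracy and the confusion matrix behind those numbers can also be computed directly. The sketch below uses made-up true and predicted labels, not the results of the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (0 = negative, 4 = positive)
y_true = [0, 0, 0, 0, 4, 4, 4, 4, 4, 4]
y_hat = [0, 0, 0, 4, 4, 4, 4, 4, 0, 0]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order [0, 4]
cm = confusion_matrix(y_true, y_hat, labels=[0, 4])

print('Accuracy:', acc)
print(cm)
```

Here 7 of 10 predictions are correct; the confusion matrix shows one negative tweet predicted as positive and two positive tweets predicted as negative.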

Predict the sentiment of a new text

We can now predict the sentiment of a given text with our newly trained model. In this dataset, a target of 0 means negative and 4 means positive. For the example below, the model predicts 4 (positive), which is correct.

new_text = 'This is a great movie!'
new_text = preprocess_text(new_text)
new_text_vectorized = vectorizer.transform([new_text])
sentiment = model.predict(new_text_vectorized)
print('Sentiment:', sentiment[0])
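The numeric label can be mapped to a readable name, and `predict_proba` shows how confident the model is. The sketch below is a self-contained toy version, assuming a handful of made-up training sentences and scikit-learn's `make_pipeline` to chain the vectorizer and classifier; it is not the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data in the Sentiment140 encoding (0 = negative, 4 = positive)
train_texts = ['great movie', 'love it', 'great fun',
               'awful movie', 'hate it', 'awful mess']
train_labels = [4, 4, 4, 0, 0, 0]

# Chain the vectorizer and classifier so raw text goes in and a label comes out
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(train_texts, train_labels)

label_names = {0: 'negative', 4: 'positive'}

prediction = int(toy_model.predict(['great fun movie'])[0])
probabilities = toy_model.predict_proba(['great fun movie'])[0]

print('Sentiment:', label_names[prediction])
# Columns of predict_proba follow the order of toy_model.classes_
print(dict(zip(toy_model.classes_, probabilities.round(2))))
```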

Saving the model

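To save your trained model to disk for later use, you can use Pickle. You can also use Pickle to load the saved model again. To predict on new text later, the fitted vectorizer needs to be saved as well, since new text must pass through the same transformation.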

# Save the model to disk using Pickle
filename = 'sentiment_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# In case you want to retrieve it (load the model from disk)
# with open(filename, 'rb') as f:
#     loaded_model = pickle.load(f)
# result = loaded_model.score(X_test, y_test)
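One possible way to keep everything needed for prediction together is to pickle the vectorizer and the model in a single file. The sketch below shows a round trip on a toy model; the training sentences, the `sentiment_bundle.pkl` file name, and the dictionary keys are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data (0 = negative, 4 = positive)
texts = ['great movie', 'love it', 'great fun',
         'awful movie', 'hate it', 'awful mess']
labels = [4, 4, 4, 0, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(texts), labels)

# Save the fitted vectorizer and model together in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sentiment_bundle.pkl')
with open(path, 'wb') as f:
    pickle.dump({'vectorizer': vec, 'model': clf}, f)

# Load both back and predict on new text
with open(path, 'rb') as f:
    bundle = pickle.load(f)

restored = bundle['model'].predict(bundle['vectorizer'].transform(['great fun']))
print('Sentiment:', restored[0])
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.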

Notebook on Kaggle

The link to the notebook at Kaggle is shown below.

sentimental-analysis-on-twitter-dataset (www.kaggle.com)

Conclusion

In this article, we went through data extraction, data cleaning, data preprocessing, and data modeling to achieve sentiment analysis, where we can detect the sentiment associated with a statement as negative or positive using Machine Learning.

I hope you found this article informative and helpful. If you have any feedback or suggestions on how I can improve my approach or model, please feel free to share your thoughts in the comments below. I would love to hear your feedback and engage with you on this topic!

You can also check out the article below for more information.

Sentiment Analysis — Intro and Implementation | by Farzad Mahmoodinobar | Towards Data Science