Sentiment Analysis: Building from the Ground Up

Prakhar Dixit · Published in Analytics Vidhya · 10 min read · May 9, 2020

Sentiment analysis is the interpretation and classification of emotions (positive, negative, and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback.

Well, that all sounds great. But how exactly do we intend to do it? The answer lies in something that has caught our attention for quite some time: Machine Learning.

To be honest, as a full-stack developer, I had always found a reason to not start learning ML. Convincing myself with assurances like, “How am I ever gonna find time to work on it?”, or “I know software engineering the best, let’s stick to it.”, I had kept this gorgeous mistress at bay.

But honestly, those were just a bunch of excuses to keep myself from facing the truth: the fact that I was intimidated by it. Yet I decided to tackle it once and for all, and to my surprise, ML is as intriguing as it is attractive.

So I decided to write this blog, mapping out my technical journey into ML while building a Sentiment Analysis model. I will map out all the steps, code snippets, and data operations required. So let’s get started.

Overview

We aim to build a sentiment analysis model that, once ready, would predict the sentiment of statements given by users. I work in a domain where customers constantly flood our pages with reviews and feedback, so my target is to build a model that maps out the sentiments in those inputs. Having recently started with ML myself, I would consider that a big achievement.

The subfield of ML and linguistics that deals with classifying and analyzing text is called Natural Language Processing. But more on that later.

Problem Statement

Build a sentiment analysis model that classifies input text as either Positive or Negative. The input would be user reviews or feedback, but it can very well be extended to other kinds of text, like social media posts or tweets. This is to be achieved by training our model on a dataset of already-labeled Amazon product reviews.

Metric

The validity of the model will be assessed at the end of the article based on metrics including Precision, Recall, F1-Score, and Accuracy. We will also see what exactly these mean.

Where it all begins

We start by looking at what happens to be the bedrock of every ML project: the data. We are going to be using a dataset of Amazon product reviews, lifted from Xiang Zhang’s Google Drive directory. It contains around 3 million reviews that are already rated from 1 to 5; our model will learn from these. The dataset is in CSV format.

Data loading and cleaning

I am using Python for this project. An awesome library called Pandas makes it super easy to load and read the data.

import pandas as pd

df = pd.read_csv('../input/amazon/train.csv')
df_test = pd.read_csv('../input/amazontest/test.csv')

df is a Pandas DataFrame. And just like that, we have read in our CSV.

We should always familiarize ourselves with the kind of data we are working with. And by that, I mean the structure of the dataset. So let’s check the number of rows and columns.
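A quick way to check is the DataFrame’s shape attribute; a one-line sketch:

# (rows, columns) for both the train and test DataFrames
print(df.shape)
print(df_test.shape)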

There are just shy of 3 million rows and 3 columns.

As you can see, the names of the columns are kind of screwed up. Let’s fix that. Also, let’s initially train our model on 60,000 records instead of 3M, so we need to sample the data.

# Taking 60000 random samples from the data
df_sam = df.sample(n=60000, random_state=1)
df_sam.columns = ['rating', 'title', 'text']

df_test_sam = df_test.sample(n=12000, random_state=1)
df_test_sam.columns = ['rating', 'title', 'text']

And just like that, we have sampled 60k rows and renamed the columns to ‘rating’, ‘title’, and ‘text’. ‘rating’ contains the rating given to the respective review and ‘text’ contains the actual review.

Data cleaning is something that is of great importance if you want to improve the accuracy of the model. You wouldn’t want your model to learn from wrong data now, would you?

We first check if there are any null values in the dataset.

print(df_sam.isnull().sum())

There is one record where the title is null. We have multiple options for dealing with null values, like dropping that row or replacing the nulls with the mode or mean. But since there are no null records in the rating and text columns, let’s consider the data clean; we only need those two columns.
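For completeness, here is a sketch of how those two options could look, had the null title mattered:

# Option 1: drop the rows where 'title' is null
df_dropped = df_sam.dropna(subset=['title'])

# Option 2: fill the nulls, e.g. with an empty string
df_filled = df_sam.copy()
df_filled['title'] = df_filled['title'].fillna('')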

Let us take a look at what our data looks like.

print(df_sam.head())

Before we get to extracting features from the dataset, let us talk about NLP for a bit. Computers don’t interpret text as humans do, so we need to find a way to convert text into numbers, or vectors, that the model can understand.

Now that we have clean data, let’s talk about the classes. Currently, we have rating values ranging from 1 to 5. It might be difficult for the model to predict sentiment at such granularity, hence we relabel the data ourselves: 1–2 is considered Negative and 4–5 is considered Positive. 3 is neutral, so we will drop all rows rated 3 for now. It could be used for a Neutral class, but let’s keep it simple.

def convertToLabel(rating):
    if rating > 3:
        return 'Positive'
    else:
        return 'Negative'

# Converting ratings to positive and negative
df_new = df_sam.drop(df_sam[df_sam.rating == 3].index, axis=0)
df_new.rating = df_new.rating.apply(convertToLabel)
df_new_test = df_test_sam.drop(df_test_sam[df_test_sam.rating == 3].index, axis=0)
df_new_test.rating = df_new_test.rating.apply(convertToLabel)

Another thing that we need to look out for is bias in our data. Biased data can lead to skewed predictions, hence we want an almost equal number of Negatives and Positives. Let’s check that out with a graphical representation.

df_new.rating.value_counts().reset_index().plot(kind='bar', x='index', y='rating', figsize=(8, 8))

As we can see, the data is not biased, hence there is no need to perform any drop operations.
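If you prefer raw numbers over a chart, the same check can be done with a one-liner:

# Class counts; we want Positive and Negative to be roughly equal
print(df_new.rating.value_counts())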

Now, we need to perform operations on the text to get rid of tokens that carry no meaning for our task. These include URLs, HTML tags, and even digits in some cases. We also need to remove words that do not change the meaning of a sentence much, such as articles like ‘a’ and ‘the’ (but not restricted to these). These are called stop words.

Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. NLTK is an amazing library that helps us achieve that and more.

import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

# tokenizer function; will be passed to CountVectorizer at a later stage
def tokenize(text):
    text = text.lower()
    # Replace everything except letters and digits with spaces
    text_normalized = re.sub(r"[^a-zA-Z0-9]", " ", text)
    tokens = word_tokenize(text_normalized)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(w).strip() for w in tokens
                    if w not in stopwords.words('english')]
    return clean_tokens

# Create X and y
X_train = df_new['text']
y_train = df_new['rating']
X_test = df_new_test['text']
y_test = df_new_test['rating']

We could have used train_test_split, an awesome function provided by sklearn, but I have loaded an entirely separate test dataset instead.
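For reference, had we only had a single file, a split could have looked something like this sketch:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; random_state keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df_new['text'], df_new['rating'], test_size=0.2, random_state=1)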

Feature Extraction & ML Pipeline

As mentioned before, computers do not understand text as we do. Hence we need to convert it into a matrix of token counts: a matrix that holds the count of occurrences of each token in each document. This can be achieved with CountVectorizer, provided by sklearn.
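To make this concrete, here is a tiny sketch on a toy corpus (not our review data):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["great product great price", "terrible product"]
vect = CountVectorizer()
counts = vect.fit_transform(corpus)

# Vocabulary, alphabetical: ['great', 'price', 'product', 'terrible']
print(sorted(vect.vocabulary_))
print(counts.toarray())
# [[2 1 1 0]
#  [0 0 1 1]]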

I am using a Pipeline here, which facilitates chaining the steps required to train the model. I will not go into the details of the Pipeline in this article.

Just keeping a count of word occurrences is not always enough. Hence we use another transformer, Term Frequency-Inverse Document Frequency (TF-IDF), which also considers how common a word is throughout the corpus when building its importance, or weightage: words that appear in nearly every document get down-weighted.
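Continuing the toy example, here is a sketch of that down-weighting in action (using sklearn’s defaults: smoothed idf, L2 normalization):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["great product great price", "terrible product"]
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)

# 'product' appears in both documents, so its weight drops relative to its
# raw count; rounded, the rows come out roughly as:
# [[0.85 0.43 0.3  0.  ]
#  [0.   0.   0.58 0.81]]
print(tfidf.toarray().round(2))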

Finally, we will select a classifier for the model. We have multiple choices here: Multinomial Naive Bayes, Support Vector Machines, Random Forest, and others. They implement different algorithms to train the model and make predictions, and we achieve different accuracies with different models. I have run my code with MultinomialNB, SVC, and RandomForestClassifier.

Let us take a look at the code now.

import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=100, n_jobs=-1))
])

start_time = time.time()
pipeline.fit(X_train, y_train)
print("time taken to fit {}s".format(time.time() - start_time))

Here, we simply convert our text into tokens, then into the TF-IDF matrix, and pass it to the RandomForestClassifier model. We fit the model by calling the ‘fit’ function of the model (or the pipeline, in this case).

Once the model is trained, we call the predict function on the test data. Consider this as setting a benchmark for our model, since we already have the correct outcomes. We then compare the predicted outcome with the expected outcome, y_test. This tells us how well our model performs on unseen data. Please note that accuracy is not actually the best measure of a classifier’s performance, but that is something I am still learning about.

y_pred = pipeline.predict(X_test)
print("Accuracy is {}".format((y_test.values == y_pred).mean()))

We got an accuracy of 81.74% with RandomForestClassifier. We will be comparing it with other classifiers soon enough.
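Swapping classifiers is just a matter of changing the last step of the pipeline. A sketch for Multinomial NB, reusing the imports already in place:

pipeline_nb = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])
pipeline_nb.fit(X_train, y_train)
print("Accuracy is {}".format((y_test.values == pipeline_nb.predict(X_test)).mean()))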

Other parameters that define the performance of a model are:

Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives. True positives are the outcomes that the model predicted as positive and that were positive in the expected outcome as well. False positives are the outcomes that the model predicted as positive but were in fact negative.

Recall is defined as the number of true positives divided by the number of true positives plus the number of false negatives.

F1 Score is the harmonic mean of Precision and Recall.
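To make the definitions concrete, here is a tiny sketch with hypothetical labels (all three metrics happen to come out as 2/3 here):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ['Positive', 'Positive', 'Negative', 'Negative', 'Positive']
y_hat  = ['Positive', 'Negative', 'Negative', 'Positive', 'Positive']

# TP = 2, FP = 1, FN = 1 for the 'Positive' class
print(precision_score(y_true, y_hat, pos_label='Positive'))  # 2 / (2 + 1) ≈ 0.67
print(recall_score(y_true, y_hat, pos_label='Positive'))     # 2 / (2 + 1) ≈ 0.67
print(f1_score(y_true, y_hat, pos_label='Positive'))         # harmonic mean ≈ 0.67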

Let’s see how our model did in these categories.

from sklearn.metrics import classification_report

res = classification_report(y_test, y_pred)
print(res)

Time to put our model to a test of our own. I am gonna pass it two reviews given by two users on a product I’m involved with. I’ll leave the assessment to you.
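In code, that boils down to a single predict call; the two review texts below are hypothetical stand-ins for the actual user reviews:

reviews = [
    "The product stopped working within a week. Very disappointed.",
    "Absolutely love it, works exactly as described!"
]
print(pipeline.predict(reviews))  # expected: ['Negative' 'Positive']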

This is predicted as Negative
This is predicted as Positive

Following is the classification report for the Multinomial Naive Bayes classifier.

The accuracy achieved with Multinomial NB is 56%, which is considerably lower than that of RandomForestClassifier.

Increasing the number of classes

Let us add one more class to the dataset: a rating of 3 will now be labeled Neutral. Let’s see how this affects our metrics.

Following is the revised code for the function convertToLabel.

def convertToLabel(rating):
    if rating == 3:
        return 'Neutral'
    elif rating < 3:
        return 'Negative'
    else:
        return 'Positive'

As can be noticed, increasing the number of classes has reduced the metric values across the board. Intuitively, this makes sense: the model now has to separate three classes instead of two, and the Neutral class sits ambiguously between the other two.

I also tweaked the distribution of the Positives and the Negatives to make them imbalanced. The following are the values of parameters I got for my model.

The fun part

I have also built a website around this model that takes input from the user and classifies that as Positive or Negative. Unfortunately, it is not deployed anywhere, but feel free to clone the project from my GitHub repository and run it locally.

Please read the README file of the project on GitHub for detailed information about running the project.

The project in action

Conclusion

The following key conclusions can be drawn from the experiment we ran:

  1. Much of what we obtain downstream depends on how well the data has been cleaned, hence cleaning is considered a crucial part of any ML project. When I first started off with an uneven distribution of classes in my data, I received an accuracy close to 76%. Once the data had been cleaned, the accuracy of my model bumped up to 82%.
  2. Proper tokenization of the data is very crucial. Getting rid of stop words reduces the size of the vocabulary that the term-frequency matrix is built on.
  3. Choosing between just using the term-frequency matrix and going further to build the TF-IDF matrix is crucial, since certain words may have a high term frequency in a document yet occur so often across the corpus that they carry little discriminative weight.
  4. Choosing the right classifier can certainly increase the performance metric values of the model.

Improvements

This model is far from perfect. There are a lot of things I could have done differently to increase its performance further.

  1. Tweaking the hyperparameters of the classifier (see the GridSearchCV sketch after the pipeline code below).
  2. Following is the code of a Transformer that I added to my pipeline; it computes the length of each text and uses that as an additional feature.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class TextCountExtractor(BaseEstimator, TransformerMixin):
    # Custom transformer: uses the character length of each text as a feature
    def getlength(self, text):
        return len(text)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.getlength)
        return pd.DataFrame(X_tagged)

Following is how the pipeline would look, including this Transformer.

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
            ('tfidf', TfidfTransformer())
        ])),
        ('text-length', TextCountExtractor())
    ])),
    ('clf', RandomForestClassifier())
])
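And here is the hyperparameter-tuning sketch promised above; the grid values are illustrative assumptions, not tuned recommendations:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__min_samples_split': [2, 4],
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)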

I’ll leave you by paraphrasing a sentiment from N. H. Kleinbaum: “There have been data scientists before you; if you listen real close, you can hear them whisper their legacy to you. Carpe diem. Seize the day, people, make your data-centric lives extraordinary.”
