Sentiment Analysis classification using 2 different methods.

6 min read · May 13, 2018

As a follow-up to my previous post about sentiment analysis, I felt it was time to build a classifier hands-on. The previous post explained a few details about sentiment analysis and how it can be used to generate insight from reviews available on the web.

In this implementation, we are going into full detail, and I will try my best to explain the steps for building a simple sentiment classifier, with sample code. We are going to use the IMDB dataset and compare the results of 2 different algorithms; the process would be similar if you wanted to use other algorithms. So let's get started, shall we?

Consider the figure below. The process is similar to any data mining process, and we are going to refer back to it later on.

Data

Well, we are doing data science, so we gotta talk about the data we want to use, right? In most NLP tasks, getting the data for your job isn’t rosy, but it isn’t rocket science either. Mostly, it means crawling text data from digital platforms, since natural language data usually lives on the web. Luckily, we aren’t building a gigantic project, so we don’t need to scrape the web for our task: there is available data we can use. Glory to open data!!

We will be using the IMDB dataset. It’s a movie review dataset with 50,000 movie reviews, split equally between training and testing. You can download the dataset from the official site here.
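Once the archive is unpacked, the reviews need to be read into Python lists. The sketch below assumes the layout of the official download, where each split has `pos` and `neg` folders containing one text file per review. The function name `load_reviews` and the label encoding (1 for positive, 0 for negative) are my own choices for illustration.

```python
from pathlib import Path


def load_reviews(root, split):
    """Read every review under root/split/{neg,pos} into parallel lists.

    Assumes one .txt file per review; labels are 0 (negative) and 1 (positive).
    """
    texts, labels = [], []
    for folder, label in (("neg", 0), ("pos", 1)):
        for path in sorted(Path(root, split, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Example usage, assuming the archive was unpacked into a folder named aclImdb:
# X_train, y_train = load_reviews("aclImdb", "train")
# X_test, y_test = load_reviews("aclImdb", "test")
```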

Data Cleaning

Just like they say, “garbage in, garbage out”. We need to supply our machine learning algorithm with clean data so we can get a good result afterwards. Alright, if you take a glimpse at the data you’ve just downloaded, you will see something like what I have in the snapshot below. Hell yeah! It’s messy, and it’s our job as data scientists to clean it and make it ready for whatever form of analysis we need.

We are going to make use of 2 common pre-processing methods: stemming and stop word removal. Stemming is a natural language processing technique that reduces different variations of a token to a common stem. For example, the word ‘dance’ has variations like ‘dancing’ and ‘danced’, and stemming maps them all to the same form. So in most cases, we need to stem the tokens (words). The other step is stop word removal. This means that words that occur very commonly in a dataset (corpus) should be removed. In most cases, these are words like articles (e.g. the) and pronouns (e.g. I). Stop word removal is a common data cleansing practice in NLP tasks, because such words usually carry little information for classification, information retrieval, word clustering and similar analyses.

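# Assumption: ps and stopWords come from NLTK (run nltk.download('stopwords') once)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stopWords = set(stopwords.words('english'))

def process(sentence):
    # Remove stop words, then stem each remaining word
    sentence_split = sentence.split()
    sentence_words = [ps.stem(word) for word in sentence_split if word not in stopWords]
    return ' '.join(sentence_words)  # join with spaces so the tokens stay separate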
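The raw IMDB reviews contain HTML line breaks, punctuation and mixed case, so it may help to normalize each review before stop word removal and stemming. Here is a minimal sketch of that idea; the helper name `normalize` is hypothetical and not part of the code above.

```python
import re


def normalize(text):
    """Lowercase a review and strip HTML tags and non-letter characters."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep only letters and whitespace
    return " ".join(text.split())                   # collapse repeated whitespace


print(normalize("Great movie!<br /><br />Loved it."))
```

Note that this simple rule also splits contractions (for example, "don't" becomes "don t"), which may or may not be what you want.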

Feature Extraction

Now that we have our cleaned data, note that it is still in natural language form; it has to be transformed into vectors, which is a requirement for ML algorithms. The crucial bottleneck of sentiment classification is engineering an effective set of features for a feature-based supervised statistical classifier. Examples of feature engineering techniques in this case include TF-IDF, part-of-speech tagging and opinion words/phrases. We will be using TF-IDF. TF-IDF (term frequency-inverse document frequency) is a statistical weight that measures how important a word is to a document in a corpus. Fortunately, scikit-learn has an implementation for us to use. The code below transforms our cleaned data into a bag of words (BOW) vector form and then applies the TF-IDF weighting.

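from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# X_train is the list of cleaned review strings
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X_train)

# Store the result under a new name: the pipelines below expect the raw text in X_train
tf_idf_transformer = TfidfTransformer()
X_train_tfidf = tf_idf_transformer.fit_transform(X_counts)
X_train_tfidf.shape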

If you check the shape of the result, you will find something like (50000, 36687). Now we are ready to build our ML model for sentiment classification.
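To get a feel for what the TF-IDF weight measures, here is a small hand-rolled sketch using the textbook definition (term frequency times the log of the inverse document frequency). The function `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights


docs = ["good movie", "bad movie", "good plot movie"]
weights = tf_idf(docs)
print(weights)
```

A word like "movie" that appears in every document gets a weight of zero, while a rarer word like "bad" gets a higher weight. scikit-learn's TfidfTransformer uses a smoothed idf and L2 normalization by default, so its numbers will differ from this sketch.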

Build Model

Okay, it seems we’re good to start training our model. Most times we are faced with a question like “What machine learning algorithm should I use for this task?” Well, there’s a long discussion on that, but it’s often good practice to start small. There are various algorithms we could use for this classification task, and yeah, we are just going to use 2 of them: the task here is to implement a sentiment classification model using a Multinomial Naive Bayes classifier (MNB) and a Support Vector Machine (SVM). MNB scales well because the number of parameters it requires is linear in the number of features. I am not assuming you already know these two algorithms, so you may want to check other tutorials about them.

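from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# The pipeline vectorizes the text itself, so X_train must hold the cleaned review strings
model = Pipeline([('count_vect', CountVectorizer()),
                  ('tf_idf_transformer', TfidfTransformer()),
                  ('classifier', MultinomialNB())])

model.fit(X_train, y_train)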

import numpy as np

# Predict on the held-out test reviews and compute the accuracy
predicted = model.predict(X_test)
np.mean(predicted == y_test)

Note that the code above uses a pipeline. The reason for using a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. We repeat the same for the SVM classifier in the code below.

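from sklearn.linear_model import SGDClassifier

# loss='hinge' makes SGDClassifier train a linear SVM.
# max_iter replaces the n_iter argument, which newer scikit-learn releases removed;
# tol=None keeps the fixed 5 passes over the data.
model_svm = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                                max_iter=5, tol=None, random_state=42))])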

model_svm.fit(X_train, y_train)
predicted_svm = model_svm.predict(X_test)
np.mean(predicted_svm == y_test)
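If you want to try both pipelines before downloading the full dataset, the sketch below runs them end to end on a handful of made-up reviews. The toy texts, labels and step names are assumptions for illustration only, and the SVM here uses scikit-learn's default stopping settings rather than the ones above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy reviews, only for illustration (1 = positive, 0 = negative)
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful acting boring story"]
train_labels = [1, 1, 0, 0]
test_texts = ["great story", "boring film"]

classifiers = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42),
}

predictions = {}
for name, classifier in classifiers.items():
    # Same three steps as above: counts, TF-IDF weighting, then the classifier
    pipeline = Pipeline([("vect", CountVectorizer()),
                         ("tfidf", TfidfTransformer()),
                         ("clf", classifier)])
    pipeline.fit(train_texts, train_labels)
    predictions[name] = pipeline.predict(test_texts)
    print(name, predictions[name])
```

Because each pipeline receives raw strings, swapping in another classifier only means changing the last step.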

Well, that is it. We’re done. Err, not yet. It’s important that we know how our model fares with new instances, so we need a report of our model’s classification performance. scikit-learn provides a classification report, and we can get it in 2 lines of code. Yass!!

from sklearn.metrics import classification_report

print(classification_report(y_test, predicted_svm))

It generates a table-like report like the one shown below.

Precision is the accuracy of the positive predictions, and recall is the sensitivity or true positive rate, that is, the ratio of positive instances that are correctly classified by the classifier. The f1-score combines precision and recall into a single metric: it is their harmonic mean, which gives more weight to low values. So f1 is high only if both precision and recall are high. Another report we could look at is the confusion matrix of the classifier, which counts the number of times instances of one class are classified as another class. In our case, it counts the number of times positive sentiments are classified as negative sentiments and vice versa. We can compute it with the code below.

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predicted_svm)
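To make these metrics concrete, here is a small sketch that computes them by hand for a binary problem. The function `binary_report` and the toy labels are made up for illustration; the matrix layout (rows are true classes, columns are predicted classes) follows the one scikit-learn uses.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute the confusion matrix, precision, recall and f1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"confusion_matrix": [[tn, fp], [fn, tp]],
            "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
report = binary_report(y_true, y_pred)
print(report)
```

Here precision is 0.75 and recall is 0.6, and f1 comes out at about 0.667, slightly below their arithmetic mean of 0.675. That is the harmonic mean pulling the score towards the lower value.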

We should note that we could use cross-validation to evaluate our model and, before that, use grid search to select the best possible parameters. An example of using grid search to select the best parameters for the SVM model is given below.

from sklearn.model_selection import GridSearchCV

# Parameter names are prefixed with the name of the pipeline step they belong to
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_model_svm = GridSearchCV(model_svm, parameters_svm, n_jobs=1)
gs_model_svm = gs_model_svm.fit(X_train, y_train)
# To find out the best parameters for the model
gs_model_svm.best_params_
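Cross-validation on its own can be sketched with scikit-learn's `cross_val_score`, which fits the whole pipeline on each training fold and scores it on the held-out fold. The toy reviews, the 2-fold setting and the step names below are assumptions chosen to keep the example tiny; on the real data you would pass your own texts and labels and likely use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up toy reviews, only for illustration (1 = positive, 0 = negative)
texts = ["great film loved it", "wonderful acting great story",
         "loved the story", "great acting wonderful film",
         "terrible film hated it", "awful acting boring story",
         "hated the story", "boring acting terrible film"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", SGDClassifier(loss="hinge", random_state=42))])

# One accuracy score per fold; the vectorizer is refit on each training fold
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores, scores.mean())
```

Putting the vectorizer inside the pipeline matters here: it keeps vocabulary and idf statistics from the held-out fold out of training.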

With the code above, you can successfully build a sentiment classifier model.

If you have any concerns about it, leave a comment in the comment box and I will reply.

Good luck building your model; I can’t wait for you to implement yours. If you have done it, you could share the GitHub link of your implementation. I would like to see it.

Written by Jerry Fadugba