Sentiment Analysis using SVM

Vasista Reddy
Published in ScrapeHero
6 min read · Nov 12, 2018

Sentiment analysis is an NLP technique that is applied to text to determine whether the author’s attitude toward a particular topic, product, etc. is positive, negative, or neutral.

Sentiment analysis helps data scientists analyze many kinds of data, e.g., business, politics, and social media. If you want to collect data for your research or data science needs, ScrapeHero is a great choice.

Sentiment analysis is a task within NLP (natural language processing), the subfield of artificial intelligence that helps machines deal with human languages. Dealing with roughly 6,500 human languages is not easy. Read about NLP here.

NLTK (Natural Language Toolkit), TextBlob, and spaCy are popular Python modules for NLP tasks.

What is SVM?

SVM (Support Vector Machine) is a supervised machine learning algorithm, meaning it learns from labeled examples, that can be used for both classification and regression. Classification predicts a label/group, and regression predicts a continuous value. SVM performs classification by finding the hyperplane that best separates the classes when the data is plotted in n-dimensional space.

optimal separating hyperplane between two classes
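To make the hyperplane idea concrete, here is a minimal sketch on made-up 2-D points (the data and the `C` value are assumptions for illustration, not part of the movie dataset). For a linear kernel, scikit-learn exposes the hyperplane as a weight vector `coef_` and an offset `intercept_`, and the predicted class depends on which side of the hyperplane a point falls.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data (made up for illustration): three points per class.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array(['neg', 'neg', 'neg', 'pos', 'pos', 'pos'])

# A large C (an assumed value) asks for a hard-margin-like fit on separable data.
clf = svm.SVC(kernel='linear', C=1000.0)
clf.fit(X, y)

# The separating hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Score new points by hand: a positive score means classes_[1] ('pos').
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
scores = new_points @ w + b
manual_labels = np.where(scores > 0, clf.classes_[1], clf.classes_[0])

# The library's own prediction should agree with the manual rule.
predicted = clf.predict(new_points)
print('w =', w, 'b =', b)
print(predicted)
```

The manual score is the same quantity that `decision_function` returns, so its sign alone decides the label in the two-class case.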

SVM can find that hyperplane after implicitly transforming our data with mathematical functions called “kernels”. Common kernels are linear, polynomial, RBF (radial basis function), and sigmoid.

The kernel is a tuning parameter. “RBF” handles non-linear problems and is also a general-purpose kernel used when there is no prior knowledge about the data. “linear” is for linearly separable problems. High-dimensional, sparse text features such as tf-idf vectors are often close to linearly separable, so for our two-class problem (positive vs. negative) we will go for a linear SVM.
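The difference between the two kernels shows up on data that no straight line can split. The sketch below uses the classic XOR layout (four made-up points; the `C` and `gamma` values are assumptions chosen for illustration) and compares training accuracy for a linear and an RBF kernel.

```python
import numpy as np
from sklearn import svm

# XOR layout: opposite corners share a label, so no single line separates them.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# Same C for both models; gamma is set explicitly so the result does not
# depend on the library's default for it.
linear_clf = svm.SVC(kernel='linear', C=10.0).fit(X, y)
rbf_clf = svm.SVC(kernel='rbf', C=10.0, gamma=2.0).fit(X, y)

# Accuracy on the training points themselves.
linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)

print('linear kernel accuracy:', linear_acc)
print('RBF kernel accuracy:   ', rbf_acc)
```

The linear kernel cannot get all four points right, while the RBF kernel can. Text classification with tf-idf features is usually the opposite situation, which is why a linear kernel is a reasonable first choice here.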

Steps needed to build a model

  • Gathering good data for training and testing. The ScrapeHero blogs and scrapers can help with this.
  • Vectorizing the data
  • Creating a Linear SVM Model to train and then predict
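The three steps above can be sketched end to end on a tiny, made-up corpus. This is only an illustration under stated assumptions: the four sentences and their labels are invented, and `make_pipeline` is used here as a convenience that the rest of this article does not rely on.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Step 1: gather labeled data (four invented examples, two per class).
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful acting terrible plot",
]
train_labels = ['pos', 'pos', 'neg', 'neg']

# Steps 2 and 3: vectorize with tf-idf, then train a linear SVM.
# The pipeline applies the same vectorizer at training and prediction time.
model = make_pipeline(TfidfVectorizer(), svm.SVC(kernel='linear'))
model.fit(train_texts, train_labels)

# Predict labels for unseen sentences.
predictions = model.predict([
    "a great and wonderful film",
    "a terrible and awful film",
])
print(predictions)
```

With only four training sentences this is a toy, but the flow is the same one used on the movie reviews in the sections that follow.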

Gathering Data

I chose the data from the sentiment polarity dataset v2.0, a properly labeled movie-review dataset, and converted it to CSV for easy use.

import pandas as pd

# train Data
trainData = pd.read_csv("https://raw.githubusercontent.com/Vasistareddy/sentiment_analysis/master/data/train.csv")
# test Data
testData = pd.read_csv("https://raw.githubusercontent.com/Vasistareddy/sentiment_analysis/master/data/test.csv")

Let's look at the sample data

trainData.sample(frac=1).head(5) # shuffle the df and pick first 5

      Content                                             Label
56    jarvis cocker of pulp once said that he wrote ...   pos
1467  david spade has a snide , sarcastic sense of h...   neg
392   upon arriving at the theater during the openin...   pos
104   every once in a while , a film sneaks up on me...   pos
1035  susan granger's review of " american outlaws "...   neg

Vectorizing the data

“Torture the data, and it will confess to anything.” — Ronald Coase

See the tutorial “Preparing the text data with scikit-learn” to find out why we chose tf-idf for vectorizing our data.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create feature vectors
vectorizer = TfidfVectorizer(min_df = 5,
                             max_df = 0.8,
                             sublinear_tf = True,
                             use_idf = True)
train_vectors = vectorizer.fit_transform(trainData['Content'])
test_vectors = vectorizer.transform(testData['Content'])

Read about the parameters in the documentation here.
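As a rough illustration of what these parameters do, the sketch below runs the same vectorizer settings on three made-up sentences. The corpus is an assumption for demonstration, and `min_df` is lowered to 1 because a three-document corpus has no term that appears in five documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents; "the" and "was" appear in every one of them.
docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was thin",
]

toy_vectorizer = TfidfVectorizer(min_df=1,          # keep terms seen in at least 1 document
                                 max_df=0.8,        # drop terms seen in more than 80% of documents
                                 sublinear_tf=True, # use 1 + log(tf) instead of the raw count
                                 use_idf=True)      # down-weight terms that occur in many documents
toy_vectors = toy_vectorizer.fit_transform(docs)

# "the" and "was" occur in 100% of the documents, so max_df removes them.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_vectors.shape)  # (number of documents, number of kept terms)

# Words that were not in the fitted vocabulary are ignored at transform time.
new_vector = toy_vectorizer.transform(["good unseen"])
print(new_vector.nnz)     # count of non-zero features in the new document
```

This is also why the test data is passed through `transform` rather than `fit_transform`: the vocabulary and idf weights must come from the training data only.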

Creating a Linear SVM Model

import time
from sklearn import svm
from sklearn.metrics import classification_report
# Perform classification with SVM, kernel=linear
classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, trainData['Label'])
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1
# results
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
report = classification_report(testData['Label'], prediction_linear, output_dict=True)
print('positive: ', report['pos'])
print('negative: ', report['neg'])
--------------------------------------------------------------------
Training time: 10.460406s; Prediction time: 1.003383s
positive: {'precision': 0.9191919191919192, 'recall': 0.91, 'f1-score': 0.9145728643216081, 'support': 100}
negative: {'precision': 0.9108910891089109, 'recall': 0.92, 'f1-score': 0.9154228855721394, 'support': 100}

The f1-score, which is the harmonic mean of precision and recall, is about 91% for both classes. Read more about precision and recall here.

f1-score = 2 * ((precision * recall)/(precision + recall))
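As a quick check, the formula can be applied to the precision and recall values printed in the report above. The helper name `f1_score_from` is hypothetical and only used for this illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (helper name is hypothetical)."""
    return 2 * (precision * recall) / (precision + recall)

# Precision and recall as printed in the classification report above.
f1_pos = f1_score_from(0.9191919191919192, 0.91)
f1_neg = f1_score_from(0.9108910891089109, 0.92)

print('positive f1: %.4f' % f1_pos)
print('negative f1: %.4f' % f1_neg)
```

Both values round to the f1-scores in the report, roughly 0.9146 for the positive class and 0.9154 for the negative class.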

Test the SVM classifier on Amazon reviews

review = """SUPERB, I AM IN LOVE IN THIS PHONE"""
review_vector = vectorizer.transform([review]) # vectorizing
print(classifier_linear.predict(review_vector))

--------------------------------------------------------------------
['pos']

review = """Do not purchase this product. My cell phone blast when I switched the charger"""
review_vector = vectorizer.transform([review]) # vectorizing
print(classifier_linear.predict(review_vector))

--------------------------------------------------------------------
['neg']

review = """I received defective piece display is not working properly"""
review_vector = vectorizer.transform([review]) # vectorizing
print(classifier_linear.predict(review_vector))

--------------------------------------------------------------------
['neg']

review = """It's not even 5 days since i purchased this product.I would say this a specially blended worst Phone in all formats.ISSUE 1:
Have you ever heard of phone which gets drained even in standby mode during night?
Kindly please see the screenshot if you want to believe my statement.
My phone was in full charge at night 10:07 PM . I took this screenshot and went to sleep.
Then I woke up at morning and 6:35 AM and battery got drained by 56% in just standby condition.
If this is the case consider how many hours it will work, during day time.
It's not even 5 hours the battery is able to withstand.
ISSUE 2:Apart from the battery, the next issue is the heating issue .I purchased a iron box recently from Bajaj in this sale.
But I realized this phone acts a very good Iron box than the Bajaj Iron box. I am using only my headphones to get connected in the call. I am not sure when this phone is will get busted due to this heating issue. It is definitely a challenge to hold this phone for even 1 minute. The heat that the phone is causing will definitely burn your hands and for man if you keep this phone in your pant pocket easily this will lead to infertility for you. Kindly please be aware about that.
Issue 3:Even some unknown brands has a better touch sensitivity. The touch sensitivity is pathetic, if perform some operation it will easily take 1-2 minutes for the phone to response.
For your kind information my system has 73% of Memory free and the RAM is also 56% free.
Kindly please make this Review famous and lets make everyone aware of this issue with this phone.
Let's save people from buying this phone. There are people who don't even know what to do if this issue happens after 10 days from the date of purchase. So I feel at least this review will help people from purchasing this product in mere future."""
review_vector = vectorizer.transform([review]) # vectorizing
print(classifier_linear.predict(review_vector))
--------------------------------------------------------------------
['neg']

The complete code of SVM linear classification is here.

Pickling the Model

To reuse the model, we can dump it to disk and load it whenever we want. The fitted vectorizer, which holds the vocabulary, is also needed to vectorize new documents before predicting their labels.

import pickle

# pickling the vectorizer (the with-block closes the file after writing)
with open('vectorizer.sav', 'wb') as f:
    pickle.dump(vectorizer, f)
# pickling the model
with open('classifier.sav', 'wb') as f:
    pickle.dump(classifier_linear, f)

Load the vectorizer and the model and use them in a Flask app. Check the git code here.
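The loading side might look like the sketch below. It is a self-contained round trip under assumptions: it trains a tiny stand-in model on four invented sentences, writes both pickles to a temporary directory, and reads them back. The `predict_sentiment` helper is hypothetical, and the Flask wiring from the linked repository is left out.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in training data (invented) so the example runs on its own.
texts = ["great movie loved it", "wonderful acting great plot",
         "terrible movie hated it", "awful acting terrible plot"]
labels = ['pos', 'pos', 'neg', 'neg']

vec = TfidfVectorizer()
clf = svm.SVC(kernel='linear').fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, 'vectorizer.sav')
    clf_path = os.path.join(tmp_dir, 'classifier.sav')

    # Dump both objects, as in the pickling step above.
    with open(vec_path, 'wb') as f:
        pickle.dump(vec, f)
    with open(clf_path, 'wb') as f:
        pickle.dump(clf, f)

    # Load them back, as a separate service would do at startup.
    with open(vec_path, 'rb') as f:
        loaded_vectorizer = pickle.load(f)
    with open(clf_path, 'rb') as f:
        loaded_classifier = pickle.load(f)


def predict_sentiment(text):
    """Vectorize one document with the loaded vectorizer and predict its label."""
    return loaded_classifier.predict(loaded_vectorizer.transform([text]))[0]


result = predict_sentiment("a great and wonderful film")
print(result)
```

Pickle files can execute arbitrary code when loaded, so only load files you created yourself or otherwise trust.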

The dataset we trained on here is just 1,800 movie documents, and the accuracy is about 91%. For better accuracy, we can add more documents to the dataset. ScrapeHero is a good choice if you want to collect datasets for training models.

Thanks for reading! If you like the concept, please don’t forget to endorse my skills on LinkedIn.

