Sentiment Analysis on Amazon Reviews using TF-IDF Approach.

Praveen Sujanmulk
Published in Analytics Vidhya
5 min read · Feb 4, 2021

Feature Extraction: TF-IDF (term frequency-inverse document frequency)
Classification: SVM, Logistic Regression

As the digital era evolves, online shopping has seen tremendous growth. Every business owner wants to understand what customers are saying about their products. Reviews and star ratings accompany each product and reflect customer engagement. The process of analyzing customer feelings is called sentiment analysis.

In this article, sentiment analysis is performed on the Amazon Jewelry dataset.

Dataset link: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Jewelry_v1_00.tsv.gz

Import the Required Packages

import pandas as pd
import numpy as np
import nltk
import re

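Read the dataset using pandas

# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df = pd.read_csv('data.tsv', sep='\t', header=0, on_bad_lines='skip')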
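To illustrate what skipping bad lines does, here is a minimal sketch on a made-up in-memory sample (the rows and the name `sample_tsv` are hypothetical, and `on_bad_lines` assumes pandas 1.3 or newer). A row with too many fields is dropped instead of raising an error:

```python
import io
import pandas as pd

# Hypothetical three-column sample; the last data row has two extra fields.
sample_tsv = (
    "review_id\tstar_rating\treview_body\n"
    "R1\t5\tLovely ring\n"
    "R2\t1\tBroke in a week\n"
    "R3\t4\tNice\textra\tfield\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=0, on_bad_lines="skip")
print(len(sample_df))  # 2
```

For the real file, pandas infers gzip compression from the `.gz` extension by default, so the downloaded archive can usually be passed to `read_csv` without unpacking it first.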

Preview the dataset

df.head(3)

We need only the review_body and star_rating columns, which hold the review text and the star rating of each review, respectively.

df = df[['review_body', 'star_rating']]

Remove null or missing values and reset the index

df=df.dropna()
df = df.reset_index(drop=True)
df

Labelling Reviews:
Now we have 1,766,748 reviews. Reviews with a star rating of 4 or 5 are labelled as positive, and reviews with a star rating of 1 or 2 are labelled as negative. Reviews with a star rating of 3 are removed, as they are considered neutral.

df['star_rating'] = df['star_rating'].astype(int)  # convert the star_rating column to int
df = df[df['star_rating'] != 3]  # drop neutral reviews
df['label'] = np.where(df['star_rating'] >= 4, 1, 0)  # 1-Positive, 0-Negative
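The labelling rule can be checked on a tiny made-up frame before running it on the full dataset. This is only a sketch; the `toy` frame is hypothetical, and the `.copy()` call is an addition here to avoid pandas' chained-assignment warning when a new column is written to a filtered frame:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings stored as strings, as they may appear after reading a TSV.
toy = pd.DataFrame({"star_rating": ["5", "3", "1", "4", "2"]})

toy["star_rating"] = toy["star_rating"].astype(int)
toy = toy[toy["star_rating"] != 3].copy()                 # drop neutral reviews
toy["label"] = np.where(toy["star_rating"] >= 4, 1, 0)    # 1 = positive, 0 = negative

print(toy["label"].tolist())  # [1, 0, 1, 0]
```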

Number of reviews grouped by star rating

df['star_rating'].value_counts()

Now we create the model using 100,000 reviews: 50,000 positive and 50,000 negative.

I shuffle the reviews in order to take a random 100,000 from the remaining 1,607,094. You can skip this step if you don't want to shuffle.

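df = df.sample(frac=1).reset_index(drop=True)  # shuffle
data = df[df['label'] == 0][:50000]  # first 50,000 negative reviews
data = pd.concat([data, df[df['label'] == 1][:50000]])  # DataFrame.append was removed in pandas 2.0
data = data.reset_index(drop=True)
display(data['label'].value_counts())  # display() is available in Jupyter notebooks
data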
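As an alternative sketch, the shuffle and the per-class selection can be combined with `groupby(...).sample(...)`, assuming pandas 1.1 or newer. The helper name `balanced_sample`, its `seed` parameter, and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd

def balanced_sample(frame, label_col, n_per_class, seed=0):
    """Return n_per_class randomly chosen rows for each label value, shuffled."""
    sampled = frame.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle once more so the classes are interleaved rather than grouped.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up, imbalanced toy data: three negative and five positive rows.
toy = pd.DataFrame({"review_body": list("abcdefgh"), "label": [0, 0, 0, 1, 1, 1, 1, 1]})
balanced = balanced_sample(toy, "label", n_per_class=2)
print(balanced["label"].value_counts().to_dict())
```

Fixing the seed makes the sample reproducible, which the plain `df.sample(frac=1)` shuffle is not unless `random_state` is passed.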

Pre-Processing

The first step is to convert all reviews to lowercase.

data['pre_process'] = data['review_body'].apply(lambda x: " ".join(w.lower() for w in str(x).split()))

Remove the HTML tags and URLs from the reviews.

from bs4 import BeautifulSoup
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
data['pre_process'] = data['pre_process'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
data['pre_process'] = data['pre_process'].apply(lambda x: re.sub(r"http\S+", "", x))

Expand the contractions in the reviews.
Example: "it won't be" is converted to "it will not be"

def contractions(s):
    # Irregular forms first, then the generic suffix patterns
    s = re.sub(r"won't", "will not", s)
    s = re.sub(r"wouldn't", "would not", s)
    s = re.sub(r"couldn't", "could not", s)
    s = re.sub(r"'d", " would", s)
    s = re.sub(r"can't", "can not", s)
    s = re.sub(r"n't", " not", s)
    s = re.sub(r"'re", " are", s)
    s = re.sub(r"'s", " is", s)
    s = re.sub(r"'ll", " will", s)
    s = re.sub(r"'t", " not", s)
    s = re.sub(r"'ve", " have", s)
    s = re.sub(r"'m", " am", s)
    return s

data['pre_process'] = data['pre_process'].apply(contractions)

Remove non-alpha characters

# nltk.word_tokenize needs the NLTK tokenizer data to be downloaded first
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([re.sub('[^A-Za-z]+', '', w) for w in nltk.word_tokenize(x)]))

Remove the extra spaces between the words

data['pre_process'] = data['pre_process'].apply(lambda x: re.sub(' +', ' ', x))

Remove the stop words using the NLTK package

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes the membership test faster
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([w for w in x.split() if w not in stop]))

Perform lemmatization using the WordNet lemmatizer

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))

The final pre-processed reviews look like this:
Original: It looks better in picture. In reality, printing quality is not so good, and I don’t feel any coating.
Preprocessed: look better picture reality printing quality good feel coating.
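The steps above can be chained into a single function. The following is a rough, self-contained sketch rather than the notebook's exact pipeline: it uses a regular expression instead of NLTK tokenization, a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, hypothetical) instead of NLTK's English list, and it skips HTML stripping and lemmatization, so "looks" is not reduced to "look":

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much larger.
TOY_STOP_WORDS = {"it", "in", "is", "not", "so", "and", "i", "do", "any", "the", "a"}

def simple_preprocess(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)          # drop URLs
    text = text.replace("\u2019", "'")           # normalize curly apostrophes
    text = re.sub(r"won't", "will not", text)    # irregular contraction first
    text = re.sub(r"n't", " not", text)          # generic "n't" contractions
    words = re.findall(r"[a-z]+", text)          # keep alphabetic runs only
    return " ".join(w for w in words if w not in TOY_STOP_WORDS)

cleaned = simple_preprocess(
    "It looks better in picture. In reality, printing quality is not so good, "
    "and I don't feel any coating."
)
print(cleaned)  # looks better picture reality printing quality good feel coating
```

Note that "not" is a stop word here, just as it is in NLTK's English list, so negations such as "not so good" are reduced to "good".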

data

Feature Extraction

TF-IDF: a method of extracting features from text data. TF stands for Term Frequency and IDF stands for Inverse Document Frequency.

Term Frequency: the number of times a word occurs in a review. For example, consider two reviews, where w1, w2, ... represent the words appearing in them and a table lists the frequency of each word in each review.
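A term-frequency table of this kind can be built by hand. The sketch below uses two made-up reviews (they are not taken from the dataset) and counts each word per review:

```python
from collections import Counter

reviews = ["good ring good price", "bad ring"]   # made-up toy reviews

# Term frequency: raw count of each word within each review.
term_counts = [Counter(review.split()) for review in reviews]
vocabulary = sorted(set(word for counts in term_counts for word in counts))

# One row per review, one column per vocabulary word.
tf_table = [[counts[word] for word in vocabulary] for counts in term_counts]

print(vocabulary)   # ['bad', 'good', 'price', 'ring']
for row in tf_table:
    print(row)      # [0, 2, 1, 1] then [1, 0, 0, 1]
```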

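IDF is computed as

idf(t) = log[ n / df(t) ] + 1
= log[ number of documents / number of documents containing the term ] + 1

If smooth_idf=True (the scikit-learn default), one is added to both the numerator and the denominator, as if one extra document contained every term, which prevents division by zero:
idf(t) = log[ (1 + n) / (1 + df(t)) ] + 1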
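As a quick check, the smoothed form documented by scikit-learn, log[(1 + n) / (1 + df(t))] + 1 with a natural logarithm, can be computed by hand and compared with the `idf_` attribute of a fitted `TfidfVectorizer`. The corpus and the helper `idf_smooth` below are made up for illustration:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good ring good price", "bad ring", "good price"]   # made-up toy corpus
n = len(docs)

def idf_smooth(df_t):
    # Smoothed form: natural log of (1 + n) / (1 + df(t)), plus 1.
    return math.log((1 + n) / (1 + df_t)) + 1

vectorizer = TfidfVectorizer(smooth_idf=True)
vectorizer.fit(docs)

# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in d.split() for d in docs) for term in vectorizer.vocabulary_}

for term, column in sorted(vectorizer.vocabulary_.items()):
    # Hand-computed value next to the value stored by scikit-learn.
    print(term, round(idf_smooth(doc_freq[term]), 4), round(float(vectorizer.idf_[column]), 4))
```

Rare terms such as "bad" (one document) receive a higher IDF than terms that appear in two of the three documents.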

TF-IDF is implemented using sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Split the data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data['pre_process'], data['label'], test_size=0.25, random_state=30)
print("Train:", X_train.shape, Y_train.shape, "Test:", X_test.shape, Y_test.shape)

Using the TF-IDF Vectorizer

print("TFIDF Vectorizer...")
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF from the training set only
tf_x_test = vectorizer.transform(X_test)  # reuse the fitted vocabulary on the test set
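The vectorizer is fitted on the training text only and then reused on the test text, so no information from the test set leaks into the vocabulary or the IDF weights. One way to keep that discipline is scikit-learn's `make_pipeline`, sketched here on made-up, already preprocessed reviews (the texts and labels are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up, already preprocessed toy reviews (1 = positive, 0 = negative).
train_texts = [
    "love ring beautiful",
    "great quality love",
    "broke week terrible",
    "cheap terrible waste",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer on the training text only,
# then reuses that vocabulary whenever predict() is called.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

predictions = model.predict(["beautiful ring love", "terrible waste"])
print(predictions)
```

A pipeline also means new raw text can be passed straight to `predict`, provided it has gone through the same preprocessing steps as the training reviews.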

SVM

Implementing SVM with sklearn for classification

from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0)

Fitting the training data to the model

clf.fit(tf_x_train,Y_train)

Predicting the Test data

y_test_pred=clf.predict(tf_x_test)

Analyzing the results

from sklearn.metrics import classification_report
report = classification_report(Y_test, y_test_pred, output_dict=True)
print(report['accuracy'])  # overall accuracy on the test set

Using the SVM classifier, we got an accuracy of 91.55%.
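Accuracy alone hides which kind of mistake the classifier makes. As a small sketch with made-up labels (these are not results from the model above), `accuracy_score` and `confusion_matrix` from scikit-learn show the overall score and the split between the two error types:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; not results from the article's model.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(round(accuracy, 4))   # 0.6667
print(matrix.tolist())      # [[2, 1], [1, 2]]
```

Here one negative review is predicted as positive and one positive review is predicted as negative, which gives 4 correct predictions out of 6.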

Logistic Regression

Logistic regression is implemented using sklearn

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, solver='saga')

Fit the Training data to the model

clf.fit(tf_x_train,Y_train)

Predicting the test data

y_test_pred=clf.predict(tf_x_test)

Analyzing the report

from sklearn.metrics import classification_report
report = classification_report(Y_test, y_test_pred, output_dict=True)
print(report['accuracy'])  # overall accuracy on the test set

Using the LR classifier, we got an accuracy of 91.80%.

Thus, the same approach can be used to implement sentiment analysis on other review data.

Thanks and Regards
Praveen Sujanmulk
