Sentiment Analysis on Amazon Reviews using TF-IDF Approach.
Feature Extraction: TF-IDF (term frequency-inverse document frequency)
Classification: SVM, Logistic Regression
As the digital era evolves, online shopping has seen tremendous growth. Every business wants to know what customers are saying about its products. Reviews and star ratings accompany each product and reflect how customers engage with it. The process of analyzing how customers feel, based on what they write, is called Sentiment Analysis.
In this article, Sentiment Analysis is performed on the Amazon Jewelry dataset.
Dataset link: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Jewelry_v1_00.tsv.gz
Import the Required Packages
import pandas as pd
import numpy as np
import nltk
import re
Read the dataset using pandas
df = pd.read_csv('data.tsv', sep='\t', header=0, on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines in pandas 1.3+
Preview the dataset
df.head(3)
We require only the review_body and star_rating columns, which hold the review text and the star rating of each review respectively.
df = df[['review_body', 'star_rating']]
Remove rows with null or missing values and reset the index
df=df.dropna()
df = df.reset_index(drop=True)
df
Labelling Reviews:
Now we have 1,766,748 reviews. Reviews with a star rating of 4 or 5 are labelled as positive and reviews with a star rating of 1 or 2 are labelled as negative. Reviews with a star rating of 3 are removed, as they are considered neutral.
df['star_rating'] = df['star_rating'].astype(int)  # convert the star_rating column to int
df = df[df['star_rating'] != 3].copy()  # drop neutral reviews
df['label'] = np.where(df['star_rating'] >= 4, 1, 0)  # 1-Positive, 0-Negative
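As a small illustration of the labelling rule, the sketch below applies it to a toy DataFrame. The six ratings and review texts are made up for the example; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical ratings, not taken from the Amazon dataset).
toy = pd.DataFrame({
    "review_body": ["love it", "broke fast", "okay", "great", "bad", "fine"],
    "star_rating": [5, 1, 3, 4, 2, 3],
})

# Drop neutral 3-star reviews, then map 4-5 stars to 1 and 1-2 stars to 0.
toy = toy[toy["star_rating"] != 3].copy()
toy["label"] = np.where(toy["star_rating"] >= 4, 1, 0)

print(toy[["star_rating", "label"]])
```

The two 3-star rows disappear and the remaining four rows receive a 0/1 label.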
Count the number of reviews for each star rating
df['star_rating'].value_counts()
We now build the model on 100,000 reviews: 50,000 positive and 50,000 negative.
The reviews are shuffled first, so that the 100,000 reviews are drawn at random from the 1,607,094 that remain. You can skip this step if you do not want to shuffle.
df = df.sample(frac=1).reset_index(drop=True)  # shuffle
data = df[df['label'] == 0][:50000]  # first 50,000 negative reviews
data = pd.concat([data, df[df['label'] == 1][:50000]])  # DataFrame.append was removed in pandas 2.0
data = data.reset_index(drop=True)
display(data['label'].value_counts())
data
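Another way to draw a balanced subset, sketched here under the assumption of pandas 1.1 or later, is to sample a fixed number of rows per label with groupby. The toy frame and the per-class size of 2 are stand-ins for the real data and 50,000.

```python
import pandas as pd

# Toy stand-in for the labelled reviews (hypothetical values).
toy = pd.DataFrame({
    "review_body": [f"review {i}" for i in range(10)],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
})

n_per_class = 2  # the article uses 50,000 per class

# Draw the same number of rows from each label, then shuffle the result.
balanced = (
    toy.groupby("label")
       .sample(n=n_per_class, random_state=0)
       .sample(frac=1, random_state=0)
       .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```

Fixing random_state makes the draw repeatable, which helps when comparing classifiers on the same subset.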
Pre-Processing
The first step is to convert all reviews to lower case.
data['pre_process'] = data['review_body'].apply(lambda x: " ".join(w.lower() for w in str(x).split()))
Remove the HTML tags and URLs from the reviews.
from bs4 import BeautifulSoup
data['pre_process'] = data['pre_process'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())  # strip HTML tags
data['pre_process'] = data['pre_process'].apply(lambda x: re.sub(r"http\S+", "", x))  # strip URLs
Expand the contractions in the reviews.
Example: "it won't be" is converted to "it will not be".
def contractions(s):
    # Irregular forms come first; the generic suffix rules follow.
    s = re.sub(r"won't", "will not", s)
    s = re.sub(r"wouldn't", "would not", s)
    s = re.sub(r"couldn't", "could not", s)
    s = re.sub(r"'d", " would", s)
    s = re.sub(r"can't", "can not", s)
    s = re.sub(r"n't", " not", s)
    s = re.sub(r"'re", " are", s)
    s = re.sub(r"'s", " is", s)
    s = re.sub(r"'ll", " will", s)
    s = re.sub(r"'t", " not", s)
    s = re.sub(r"'ve", " have", s)
    s = re.sub(r"'m", " am", s)
    return s
data['pre_process'] = data['pre_process'].apply(lambda x: contractions(x))
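The order of the rules matters: irregular forms such as "won't" have to be handled before the generic "n't" rule. The sketch below shows this with a subset of the rules kept in an ordered list. The names RULES and expand, and the step that normalises curly apostrophes, are assumptions added for illustration and are not part of the function above.

```python
import re

# Ordered rules: irregular forms first, generic suffixes afterwards.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand(text):
    # Normalise curly apostrophes so the patterns match both styles.
    text = text.replace("\u2019", "'")
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(expand("it won't fit and i can't return it"))
print(expand("they don\u2019t shine and we're sad"))
```

If the generic "n't" rule ran first, "won't" would become "wo not" instead of "will not".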
Remove non-alphabetic characters
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([re.sub('[^A-Za-z]+', '', w) for w in nltk.word_tokenize(x)]))  # needs the NLTK tokenizer data, e.g. nltk.download('punkt')
Remove the extra spaces between the words
data['pre_process'] = data['pre_process'].apply(lambda x: re.sub(' +', ' ', x))
Remove the stop words using the NLTK package
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([w for w in x.split() if w not in stop]))
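NLTK's English stop word list includes negations such as "not" and "no", so this step also removes them. The sketch below shows the effect and one optional variation that keeps negations. It uses a small hard-coded stop set as a stand-in for stopwords.words('english'), and the negations set and the remove_stop_words helper are hypothetical names, not something the article uses.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer.
stop = {"the", "is", "so", "and", "i", "not", "no", "any", "in"}

# Hypothetical set of negation words to keep for sentiment analysis.
negations = {"not", "no", "nor"}
stop_keep_negation = stop - negations

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

review = "printing quality is not so good"
print(remove_stop_words(review, stop))                # printing quality good
print(remove_stop_words(review, stop_keep_negation))  # printing quality not good
```

Whether keeping negations helps is an empirical question for the dataset at hand; the results reported in this article use the unmodified stop list.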
Perform lemmatization using the WordNet lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
data['pre_process'] = data['pre_process'].apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))
A final pre-processed review looks like this:
Original: It looks better in picture. In reality, printing quality is not so good, and I don’t feel any coating.
Preprocessed: look better picture reality printing quality good feel coating.
data
Feature Extraction
TF-IDF: a method of extracting numerical features from text data. TF stands for Term Frequency and IDF stands for Inverse Document Frequency.
Term Frequency: the number of times a word occurs in a review. For example, consider two reviews, where w1, w2, ... are the words appearing in them; a table of term frequencies lists how often each word occurs in each review.
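A term-frequency table of this kind can be sketched with the standard library. The two reviews below are made up, with w1 to w4 standing for arbitrary words.

```python
from collections import Counter

# Two made-up reviews; w1..w4 stand for arbitrary words.
review_1 = "w1 w2 w1 w3"
review_2 = "w2 w4 w4 w4"

# Count how often each word occurs in each review.
tf_1 = Counter(review_1.split())
tf_2 = Counter(review_2.split())

# Print one row per word in the combined vocabulary.
vocabulary = sorted(set(tf_1) | set(tf_2))
print("word  review_1  review_2")
for word in vocabulary:
    print(f"{word:<5} {tf_1[word]:<9} {tf_2[word]}")
```

A word that does not occur in a review simply gets a count of 0 in that column.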
IDF is computed as
idf(t) = log[ n / df(t) ] + 1
       = log[ number of documents / number of documents containing the term ] + 1
If smooth_idf=True (the sklearn default), one is added to both the numerator and the denominator:
idf(t) = log[ (1 + n) / (1 + df(t)) ] + 1
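To make the two variants concrete, the sketch below evaluates them as documented for scikit-learn, log(n / df) + 1 and log((1 + n) / (1 + df)) + 1 with the natural logarithm, on a hypothetical corpus of 4 reviews. The helper name idf is chosen for illustration.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """IDF as documented for scikit-learn's TfidfTransformer (natural log)."""
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n = 4  # hypothetical corpus of 4 reviews
print("df  plain   smooth")
for df_t in (1, 2, 4):
    print(df_t, round(idf(n, df_t, smooth=False), 4), round(idf(n, df_t), 4))
```

A term that occurs in every review gets an IDF of exactly 1 under both variants, while rarer terms receive larger weights. Smoothing also avoids a division by zero for a term with a document frequency of 0.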
TF-IDF is implemented using sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Split the Data into Training and Testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data['pre_process'], data['label'], test_size=0.25, random_state=30)
print("Train: ", X_train.shape, Y_train.shape, "Test: ", (X_test.shape, Y_test.shape))
Using the TF-IDF Vectorizer
print("TFIDF Vectorizer...")
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)
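The vectorizer is fitted on the training reviews only and then reused on the test reviews. The sketch below shows what that means on toy text: the vocabulary comes from the training documents, and a word seen only at test time is ignored. The documents are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical text, standing in for X_train and X_test).
train_docs = ["good ring", "bad ring", "good necklace"]
test_docs = ["good bracelet"]  # "bracelet" never appears in training

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have one column per training-vocabulary word, so the classifier sees the same feature space at training and prediction time.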
SVM
Implementing SVM with sklearn for classification
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0)
Fit the model on the training data
clf.fit(tf_x_train,Y_train)
Predicting the Test data
y_test_pred=clf.predict(tf_x_test)
Analyzing the results
from sklearn.metrics import classification_report
report=classification_report(Y_test, y_test_pred,output_dict=True)
Using the SVM classifier, we obtained an accuracy of 91.55% on the test set.
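Because the report is requested with output_dict=True, it is a dictionary and is not printed automatically. A minimal sketch of reading the accuracy from it follows; the label lists are hypothetical stand-ins for Y_test and y_test_pred.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels, standing in for Y_test and y_test_pred.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# With output_dict=True the overall accuracy is stored under the "accuracy" key,
# and per-class metrics are stored under the label converted to a string.
print(f"accuracy: {report['accuracy']:.2%}")
print(f"positive-class F1: {report['1']['f1-score']:.3f}")
```

The same value is returned by accuracy_score(y_true, y_pred), which is a shorter option when only the accuracy is needed.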
Logistic Regression
Logistic regression is implemented using sklearn
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, solver='saga')
Fit the Training data to the model
clf.fit(tf_x_train,Y_train)
Predicting the test data
y_test_pred=clf.predict(tf_x_test)
Analyzing the Report
from sklearn.metrics import classification_report
report=classification_report(Y_test, y_test_pred,output_dict=True)
Using the Logistic Regression classifier, we obtained an accuracy of 91.80% on the test set.
The same steps can be used to implement sentiment analysis on other review datasets.
Thanks and Regards
Praveen Sujanmulk