Yelp Reviews Classification

Nischitha Sadananda
4 min read · Nov 30, 2021


Introduction

CSE 5334: ASSIGNMENT-3

Goal:

The goal of this assignment is to learn about the Naive Bayes Classifier (NBC).

Problem Statement:

  • In this project, Natural Language Processing (NLP) techniques are used to analyze Yelp review data.
  • Divide the dataset into train, development, and test sets.
  • Calculate the class priors and conditional probabilities.
  • Compare the effect of Laplace smoothing.
  • Calculate the final accuracy.

Import Libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.feature_extraction.text import CountVectorizer
import string
from collections import Counter, defaultdict
from bs4 import BeautifulSoup
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import re
from wordcloud import WordCloud
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

Import Dataset

yelp_df = pd.read_csv('yelp_labelled.txt',
                      delimiter='\t',
                      header=None,
                      names=['Review', 'sentiment'])
yelp_df.head(10)
yelp_df.info()

Visualise Dataset

# Let's get the length of each review
yelp_df['length'] = yelp_df['Review'].apply(len)
yelp_df.head()
yelp_df['length'].plot(bins=100, kind='hist')        # distribution of review lengths
yelp_df.length.describe()
yelp_df[yelp_df['length'] == 149]['Review'].iloc[0]  # longest review
yelp_df[yelp_df['length'] == 11]['Review'].iloc[0]   # shortest review
yelp_df[yelp_df['length'] == 58]['Review'].iloc[0]   # a mid-length review
yelp_df_1 = yelp_df[yelp_df['sentiment'] == 1]   # positive reviews
yelp_df_0 = yelp_df[yelp_df['sentiment'] == 0]   # negative reviews
yelp_df_0_1 = pd.concat([yelp_df_0, yelp_df_1])  # recombined, grouped by class
plt_sentiment = sns.countplot(x='sentiment', data=yelp_df)
plt_sentiment.set_title("Sentiment distribution")
plt_sentiment.set_xticklabels(['Negative', 'Positive'])
plt.xlabel("");

Removing Punctuation and Stop Words

def message_cleaning(message):
    # Drop punctuation characters, then remove English stop words
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split()
                                    if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

class Tokenizer:
    def clean(self, text):
        # Strip HTML tags, keep only letters, and collapse whitespace
        no_html = BeautifulSoup(text, 'html.parser').get_text()
        clean = re.sub(r"[^a-z\s]+", " ", no_html, flags=re.IGNORECASE)
        return re.sub(r"(\s+)", " ", clean)

    def tokenize(self, message):
        # Lowercase the cleaned text, split into words, and drop stop words
        clean = self.clean(message).lower()
        return [word for word in clean.split()
                if word not in stopwords.words('english')]

yelp_df_clean = yelp_df_0_1['Review'].apply(message_cleaning)
yelp_df_clean[0]                   # cleaned version
print(yelp_df_0_1['Review'][0])    # show the original version

Training the Model

from sklearn.model_selection import train_test_split

# Bag-of-words features, using the custom cleaning function as the analyzer
vectorizer = CountVectorizer(analyzer=message_cleaning)
yelp_countvectorizer = vectorizer.fit_transform(yelp_df_0_1['Review'])

X = yelp_countvectorizer
y = yelp_df_0_1['sentiment'].values
text = " ".join(review for review in yelp_df.Review)  # all reviews in one string (e.g., for a word cloud)

Multinomial Naive Bayes

Naïve Bayes classifiers are a family of probabilistic classifiers based on Bayes' Theorem with a strong (naive) assumption of independence between the features. They are fast, reliable, and among the simplest classifiers in machine learning, and despite this simplicity they give accurate predictions in text classification problems. Given a text, a Naive Bayes classifier uses Bayes' theorem to compute the conditional probability of each label and outputs the label with the highest probability. Over the last few years they have been widely used for text classification.

Multinomial Naive Bayes classifiers have been used widely in NLP problems, compared to other machine learning algorithms such as SVMs and neural networks, because of their fast training and simple design. In text classification they achieve high accuracy despite the strong naive assumption.
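
As a point of comparison, scikit-learn ships a ready-made MultinomialNB. Below is a minimal sketch of the same idea, assuming yelp_df is loaded as above; this is not the from-scratch classifier built later in this post, and the variable names are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Bag-of-words features over the raw review text
vec = CountVectorizer(stop_words='english')
X_bow = vec.fit_transform(yelp_df['Review'])
y_all = yelp_df['sentiment'].values

X_tr, X_te, y_tr, y_te = train_test_split(X_bow, y_all, test_size=0.15, random_state=0)

model = MultinomialNB(alpha=1.0)   # alpha=1.0 corresponds to Laplace smoothing
model.fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))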

Bayes’ Theorem

Bayes' Theorem is a simple mathematical formula used to calculate the conditional probability of each target label given the data. Conditional probability measures the probability of an event occurring given that another related event has already occurred.
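
In symbols, for a class label y and a feature vector X (here, the words of a review):

P(y|X) = P(X|y) · P(y) / P(X)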

  • P(y|X) is the posterior probability of class (target) given predictor (attribute).
  • P(y) is the prior probability of class.
  • P(X|y) is the likelihood which is the probability of predictor given class.
  • P(X) is the prior probability of predictor.
class naiveBayes:
    def __init__(self, classes, tokenizer):
        self.tokenizer = tokenizer
        self.classes = classes

    def group_by_class(self, X, y):
        # Split the documents by class label
        data = dict()
        for c in self.classes:
            data[c] = X[np.where(y == c)]
        return data

    def fit(self, X, y):
        self.n_class_items = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()
        n = len(X)

        grouped_data = self.group_by_class(X, y)

        for c, data in grouped_data.items():
            # Prior: fraction of training documents in class c (kept in log space)
            self.n_class_items[c] = len(data)
            self.log_class_priors[c] = math.log(self.n_class_items[c] / n)
            # Count how often each word occurs in class c
            self.word_counts[c] = defaultdict(lambda: 0)
            for text in data:
                counts = Counter(self.tokenizer.tokenize(text))
                for word, count in counts.items():
                    if word not in self.vocab:
                        self.vocab.add(word)
                    self.word_counts[c][word] += count

        return self

    def laplace_smoothing(self, word, text_class):
        # Add-one smoothing so unseen words never get zero probability
        num = self.word_counts[text_class][word] + 1
        denom = self.n_class_items[text_class] + len(self.vocab)
        return math.log(num / denom)

    def predict(self, X):
        result = []
        for text in X:
            # Start from the log prior, then add the log likelihood of each word
            class_scores = {c: self.log_class_priors[c] for c in self.classes}
            words = set(self.tokenizer.tokenize(text))
            for word in words:
                if word not in self.vocab:
                    continue
                for c in self.classes:
                    class_scores[c] += self.laplace_smoothing(word, c)

            result.append(max(class_scores, key=class_scores.get))
        return result
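
A note on laplace_smoothing above: add-one smoothing estimates the log likelihood of a word w in class c as log((count(w, c) + 1) / (N_c + |V|)), where |V| is the vocabulary size and N_c is the class size (this implementation uses the number of documents in class c; classic multinomial NB uses the total word count of the class). The point of the +1 is that a word never seen in class c contributes log(1 / (N_c + |V|)) instead of an undefined log(0), so a single unseen word cannot zero out an entire class score.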

Let's train the model using the above classifier.

X = yelp_df['Review'].values
y = yelp_df['sentiment'].values

# Hold out 15% for the test set, then 25% of the remainder for development
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

classifier = naiveBayes(
    classes=np.unique(y),
    tokenizer=Tokenizer()
).fit(X_train, y_train)

# Predicting test results
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))

Accuracy achieved: 79%

Confusion Matrix

conf_matrix = confusion_matrix(y_test, y_pred)
class_names = ["negative", "positive"]
sns.heatmap(pd.DataFrame(conf_matrix), annot=True, xticklabels=class_names, yticklabels=class_names)
plt.ylabel('Actual sentiment')
plt.xlabel('Predicted sentiment');

Github: https://github.com/nischita44/Sentimentalanalysis/blob/main/Assignment3_DataMining.ipynb

Conclusion

This is how a Multinomial Naive Bayes classifier can be built from scratch and applied to text classification.

