Sentiment Analysis of Yelp Data — Text classification

Published in Analytics Vidhya · Jan 10, 2021


Customer feedback in the form of reviews has become a very important part of online business, as it helps companies understand user sentiment towards their products. Online services such as Amazon, Google My Business, Yelp, and Citysearch let customers share their opinions both as textual reviews and as numerical ratings out of 5. Feedback from thousands of customers acts as the “online word-of-mouth”. These reviews become selection criteria for new and existing customers, since reviews over time reflect the current quality of the services being offered. Several recent studies have shown that such feedback can have both positive and negative effects on a business, depending on the opinions customers share online.

Problem Definition: The objective of this project is to classify customer reviews into positive and negative sentiments using natural language processing and machine learning techniques. This is a useful problem to solve because it saves individuals time when deciding on a product: they can rely on the ratings and predicted sentiment instead of reading through the textual reviews.


The dataset used in our project is provided by Yelp, a company that produces crowd-sourced reviews for a wide variety of businesses. The dataset is massive in volume. It includes five JSON files covering different aspects of the business, namely: tips, check-ins, reviews, users, and businesses. The user dataset contains 1,968,703 records with 22 features; the business dataset contains 209,393 records with 14 features; the review dataset contains 8,021,122 records with 9 features; the tip dataset contains 1,320,761 records with 5 features; and the check-in dataset contains 175,187 records.
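Each of these tables is distributed as a line-delimited JSON file (one record per line), so it can be loaded with pandas using `lines=True`. A minimal sketch, using a small inline sample in place of the real `yelp_academic_dataset_review.json` file:

```python
import io
import pandas as pd

# The Yelp dataset ships each table as line-delimited JSON (one record
# per line), which pandas reads with lines=True. A tiny inline sample
# stands in for the real review file here.
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great food!"}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 2, "text": "Too salty."}\n'
)
reviews = pd.read_json(sample, lines=True)
print(reviews.shape)  # (2, 4)
```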

The business JSON file has attributes such as business ID, business name, location, star rating, review count, and whether the business is open. It also has a categories feature, which describes the different types of businesses present in the dataset.

The review JSON file has attribute columns such as review ID (the ID of the review), user ID (the ID of the user giving the review), and business ID (the ID of the business being reviewed). It also has a star rating column and a text column that stores each individual’s review.

Reviews Data-frame

The main data used for building the machine learning model in our project comes from the business and review JSON files. The Yelp dataset adds a lot of value because data from any business category can be used to understand how individuals feel about that category. Building a model on such data can therefore help a business remain competitive by providing accurate recommendations to users based on their sentiments. For our project we use the reviews in the restaurant category, which has 5,056,227 records. Due to computing constraints, we use only 10,000 records to build our model. The target label is binary: ratings above 3 stars are labelled as positive reviews and all other ratings as negative reviews.
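The binary labelling rule above can be sketched in one line with pandas (the small DataFrame here is a stand-in for the real review data):

```python
import pandas as pd

# Binary target as described: ratings above 3 stars are positive (1),
# everything else is negative (0). Stand-in data, not the Yelp sample.
reviews = pd.DataFrame({"stars": [5, 4, 3, 2, 1],
                        "text": ["great", "good", "ok", "meh", "bad"]})
reviews["sentiment"] = (reviews["stars"] > 3).astype(int)
print(reviews["sentiment"].tolist())  # [1, 1, 0, 0, 0]
```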

Data Preprocessing

Data clean-up and pre-processing is one of the main tasks in any machine learning project, ensuring that models are built on quality data. Relevant and important information is extracted from the data after pre-processing it.

In this project we performed multiple steps to clean and process the data. First, we dropped all columns that were not required for building the models. We then applied the following pre-processing techniques to the reviews:

Removing unnecessary columns

Expanding contractions in the reviews

Removing digits, special characters, and emoticons from the reviews

Removing stop words from the reviews

Converting all words to lowercase

Performing lemmatization on the reviews

Stop words were removed from the reviews using the NLTK Python library. The most common stop words, such as “I” and “am”, were dropped because they tend to skew the results, but some stop words, such as “not” and “non”, were retained. In sentiment classification, word importance is crucial: removing the word “not” from “not good” would turn a negative sentence into a positive one. Therefore, some NLTK stop words were kept during pre-processing to preserve the true sentiment of the reviews. We also performed lemmatization on the reviews using NLTK. Lemmatization transforms words into their lemma form; for example, “walking” becomes “walk”. This helps in the proper analysis of the words present in the reviews. Finally, the target label was converted into binary form (0: negative review, 1: positive review).
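The cleaning steps above can be sketched as a single function. The real pipeline uses NLTK’s stop word list and its WordNet lemmatizer; in this sketch a tiny illustrative contraction map and stop word subset (with negations like “not” deliberately kept) stand in, so it runs without any corpus downloads, and the lemmatization step is omitted:

```python
import re

# Illustrative stand-ins for NLTK's resources (not the full lists).
CONTRACTIONS = {"don't": "do not", "isn't": "is not", "it's": "it is"}
STOP_WORDS = {"i", "am", "the", "a", "an", "is", "was", "this"}  # "not" kept

def clean_review(text):
    text = text.lower()                            # lowercase first
    for short, full in CONTRACTIONS.items():       # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)          # drop digits, punctuation, emoticons
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_review("I don't like it!!! This place isn't good :("))
# do not like it place not good
```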

Processed Data-Frame

Data Visualization

Like the other components of machine learning, data visualization is an important part of the workflow, as it provides visual insight into the kind of data the model is dealing with. More precise and reliable insights can be drawn from a model whose input data is well understood. We therefore created various visualizations of the input data to our model.

Sentiment Polarity plot of the reviews

The above figure shows the sentiment polarity of positive and negative reviews. The plot provides a clear differentiation between the two groups: negative reviews skew towards negative polarity, while positive reviews skew towards positive polarity.

Text length of the reviews

The above figure shows the text length distribution for both positive and negative reviews. The two distributions are almost the same, which suggests that users tend to write positive and negative reviews of roughly equal length.

Unigram word frequencies of the reviews

The above figure shows the top 20 most frequent unigrams in positive and negative reviews. In negative reviews, words such as “not” and “bad” occur most often, while in positive reviews words such as “great” and “best” are most frequent. This gives a clear picture of the unigram vocabulary used in the dataset.

Bi-gram word frequencies of the reviews

The above plot shows the top 20 bigrams appearing in positive and negative reviews. The bigrams are clearly more discriminative: “not good” occurs almost exclusively in negative reviews, whereas “good food” appears mostly in positive reviews. We can therefore expect bigram features to be among the most important when building the models.

Feature Engineering

Feature engineering is carried out to improve the overall predictive performance of a model by transforming its feature space. It is a central step in the process, as the engineered features are the source from which machine learning models predict the outcome.

Textual features: After splitting the dataset into train and test sets, two important feature engineering techniques were used to convert text into vectors.

  1. Term frequency–inverse document frequency (TF-IDF): a statistical measure of how important a word is to a document within a collection or corpus. We used TF-IDF to convert text into a numerical vector representation. Specifically, we used the bigram variant of TF-IDF, since the bigram frequency plot suggests that these features are likely to be among the most predictive.
  2. SpaCy: the second textual feature engineering technique used for predicting positive and negative reviews is spaCy word embeddings, which convert the reviews into vectors. Using spaCy’s “en_core_web_lg” model, each user’s review is tokenized into words and a vector is obtained for each token. These vectors are then averaged to produce a 300-dimensional numerical feature vector, which is fed to the downstream machine learning algorithm.
Reviews converted in to vectors using SpaCy library
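Both featurizations can be sketched briefly. The TF-IDF part uses scikit-learn directly; for the embedding part, the real pipeline averages 300-dimensional `en_core_web_lg` token vectors, so a toy 3-dimensional embedding table stands in here to keep the sketch runnable offline:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["not good food", "great food and great service"]

# 1) Bigram TF-IDF, as used in the article (stand-in reviews).
tfidf = TfidfVectorizer(ngram_range=(2, 2))
X_tfidf = tfidf.fit_transform(reviews)
print(X_tfidf.shape)  # (2 documents, 6 distinct bigrams)

# 2) Averaged word embeddings, mimicking the spaCy step with a toy
#    3-d table (the real model uses 300-d vectors per token).
toy_vectors = {"not": [1.0, 0.0, 0.0],
               "good": [0.0, 1.0, 0.0],
               "food": [0.0, 0.0, 1.0]}

def doc_vector(text, table, dim=3):
    # Average the vectors of all tokens found in the table.
    vecs = [table[t] for t in text.split() if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_vector("not good food", toy_vectors))
```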

Numerical feature: a text-length feature was computed from the reviews given by the customers.
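This feature is simply the character length of each review, e.g. (with stand-in data):

```python
import pandas as pd

# Character count of each review as a numerical feature.
df = pd.DataFrame({"text": ["not good", "great food and great service"]})
df["text_length"] = df["text"].str.len()
print(df["text_length"].tolist())  # [8, 28]
```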


Multinomial Naive Bayes: The first baseline model was multinomial naive Bayes with TF-IDF and text-length features, with all algorithm parameters left at their defaults.
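A minimal sketch of this baseline, assuming scikit-learn’s defaults throughout; the tiny labelled set is illustrative, and the text-length column is omitted here for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# TF-IDF into MultinomialNB with default parameters, on stand-in data.
texts = ["not good", "bad service", "great food",
         "best place", "not great", "good food"]
labels = [0, 0, 1, 1, 0, 1]  # 0 = negative, 1 = positive
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
preds = model.predict(["great food", "not good"])
print(preds)
```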

The results from the baseline model show that it performed better on the positive class than on the negative class; the model appears to be biased towards one class.

Support Vector Machine: This algorithm was also used with the TF-IDF and text-length features, along with its default parameters.


The predictions of the second model were quite similar to the first: this model was also biased towards the positive class.

To overcome this problem and build more accurate models, we oversampled the training data using the imblearn library so that the size of the minority class matched the majority class.
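The project uses imblearn’s random oversampler for this; the same idea can be sketched in plain scikit-learn by resampling the minority class with replacement until it matches the majority class size (toy data for illustration):

```python
import numpy as np
from sklearn.utils import resample

# Imbalanced toy training set: 5 positive samples, 1 negative.
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]])
y = np.array([1, 1, 1, 1, 1, 0])

maj, mino = X[y == 1], X[y == 0]
# Upsample the minority class (with replacement) to the majority size.
mino_up = resample(mino, replace=True, n_samples=len(maj), random_state=42)
X_bal = np.vstack([maj, mino_up])
y_bal = np.array([1] * len(maj) + [0] * len(mino_up))
print(np.bincount(y_bal))  # [5 5]
```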

Random Forest: This algorithm was trained on the oversampled training dataset with the TF-IDF and text-length features. Hyperparameter tuning was performed with grid search to find the best parameters for the model.
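A sketch of this tuning step with scikit-learn’s `GridSearchCV`; the grid values and the tiny separable dataset here are illustrative, not the article’s actual search space or data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the real search space is not specified in the article.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="f1")

# Toy balanced data standing in for the oversampled feature matrix.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
search.fit(X, y)
print(search.best_params_)
```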


As we can see, there is a significant improvement in the random forest results: the model performs well on both classes of the target label.

Support Vector Machine: This algorithm was trained on the oversampled dataset with the spaCy vectors and the text-length feature. Grid search was used to obtain the best parameters for this model.
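A sketch of this final setup; the toy 2-dimensional features stand in for the 300-dimensional averaged spaCy vectors plus text length, and the grid values are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid over the usual SVC knobs.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1")

# Toy features: [sentiment-ish score, text length] per review.
X = [[0.0, 5], [0.1, 8], [0.2, 7], [0.3, 6], [0.15, 6], [0.25, 5],
     [0.7, 9], [0.8, 5], [0.9, 8], [1.0, 7], [0.85, 9], [0.75, 8]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
search.fit(X, y)
print(search.best_params_)
```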


The results from this model were satisfying, as it performed well on both classes of the target variable. In fact, the support vector machine was the best-performing model, with an F1 score of 90 percent.


The goal of this project was to build a predictive model that classifies user reviews as positive or negative. We built models under two scenarios. The first used the algorithms with their default parameters and performed poorly because of the class imbalance. The second was built by training various machine learning algorithms on a class-balanced training set with different features. Among these, the support vector machine with the spaCy and text-length features achieved the best result. We conclude that properly curating the Yelp review data and performing proper feature engineering are both needed to obtain good results.

Code Link:

References: