A Real-time “Star Prediction” Application for Yelp Reviews Using the Google Natural Language API

Deen Aariff
10 min read · Aug 7, 2017


A little while ago, two classmates, Spencer Chin and Philip Pavlov, and I finished a project that centered around using machine learning to perform sentiment analysis of Yelp Reviews. It was a fantastic experience. By the end of the project, we had not only applied data science concepts to a practical problem at hand but had implemented a business use case for our trained models.

In our Data Science class, we were tasked with comparing and contrasting the performance of two different machine learning algorithms used to train a sentiment analysis classifier. We decided to go beyond the project specifications by building a distributed application, using Node.js, Python, and Docker, that predicted the number of stars of a Yelp review as a user typed it in real time. This application used the sentiment analysis classifier of the Google Natural Language API but also displayed results from the models we trained in our project.

Screen Capture of Our Application

Using machine learning to train our models and develop our application was a blast, and I wanted to share my experiences with developers and students who are just starting out or are experienced in machine learning. In this blog post, I’ll cover the approach that we took for performing Exploratory Data Analysis, benchmarking our machine learning algorithms (Logistic Regression and Naive Bayes), utilizing an iterative approach to determine an optimal preprocessing pipeline, and developing our predictive analysis application.

Introduction to Sentiment Analysis

For readers unfamiliar with sentiment analysis, it is a category of Natural Language Processing that utilizes data science techniques to determine the underlying sentiment of a document or a given piece of text.

For example, “Jar Jar Binks is the epitome of a well developed and fleshed out character.” would most likely be interpreted as having positive sentiment, although figurative language, including sarcasm, might result in a negative interpretation. Knowing the Star Wars fandom, it’s probably the latter.

The rise of powerful algorithms that can perform effective sentiment analysis has immediate and powerful implications. An example would be corporations augmenting their market segmentation strategies by observing shifting sentiments among various age groups on Twitter following a product launch. However, there are many applications of this technology which have yet to be explored.

To train our own models to perform sentiment analysis, we relied on supervised learning algorithms. The dataset of Yelp reviews, taken from the UCI Machine Learning Repository, labeled each document as having either positive (1) or negative (0) sentiment. Our code was primarily written in Python and made use of the machine learning algorithm implementations in the scikit-learn library and preprocessing modules from the NLTK library.
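The UCI Sentiment Labelled Sentences files store one review and its 0/1 label per line, separated by a tab (the exact layout here is our reading of that dataset's format). A minimal loader might look like this:

```python
# Minimal loader for tab-separated "review<TAB>label" lines,
# as in the UCI Sentiment Labelled Sentences collection.
def load_reviews(lines):
    reviews, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text, label = line.rsplit("\t", 1)
        reviews.append(text)
        labels.append(int(label))
    return reviews, labels

sample = ["Wow... Loved this place.\t1", "Crust is not good.\t0"]
reviews, labels = load_reviews(sample)
```

From here, each review string can be tokenized and fed into the preprocessing pipeline described below.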

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the initial process of attempting to tease out trends and clues in the data that may help us train a better predictive model.

Our dataset contained a total of 1,000 entries, each consisting of a Yelp review and its binary sentiment label. The following figure shows what our dataset looks like after each review has been tokenized (split into individual words).

We examined the distribution of lengths of positive and negative reviews, since an imbalance here has the potential to skew our models during training. For example, an excess of negative examples may make our models more likely to predict a negative classification, resulting in a poor fit on real-world data.

Thankfully, our data had an equal number of positive and negative labels, with similar distributions of review length.
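The balance check above can be sketched in a few lines (a simplified pure-Python stand-in, not our actual notebook code):

```python
def review_length_stats(reviews, labels):
    # Group word counts by sentiment label, then report the number
    # of reviews and the mean review length for each class.
    lengths = {0: [], 1: []}
    for text, label in zip(reviews, labels):
        lengths[label].append(len(text.split()))
    return {
        label: {"count": len(vals), "mean_length": sum(vals) / len(vals)}
        for label, vals in lengths.items()
    }

stats = review_length_stats(
    ["Crust is not good.", "Wow... Loved this place."],
    [0, 1],
)
```

Comparing the per-class counts and mean lengths is exactly the sanity check that told us our dataset was balanced.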

Naive Bayes

The Naive Bayes algorithm was our first choice to perform Sentiment Analysis. In class, we had discussed its applications as a text-classifier and felt that it would be a great benchmark for our secondary algorithm. In this section, I’ll provide a brief introduction to Naive Bayes, but if you’re looking for a great explanation of Naive Bayes with actual examples, you can find it here.

Naive Bayes is a generative machine learning algorithm that uses learned conditional probabilities via Bayes Rule (the algorithm’s namesake) to generate the probability that a particular feature set belongs to a class.

A generative model is one that is capable of creating new pieces of data from an underlying probability distribution. In Naive Bayes, the conditional probabilities we learn from our training data serve as a substitute for this distribution. Therefore, we can predict the probability that a piece of data belongs to any given class.

Naive Bayes works by storing the probability of the classifications we observe in our training data, as well as the conditional probabilities of all features we observe given that classification. Therefore, when attempting to predict the probability of a piece of test data given its features, we can use Bayes Rule to determine the probability of that piece of data belonging to a specific classification. This process is demonstrated in the figure below.

Once we’ve calculated the probability of a piece of data belonging to each classification, the class with the highest probability given the document’s features becomes the predicted class. For this reason, it is natural to model the tokens of a document using a Naive Bayes classifier. In addition, the algorithm is quite efficient, which is worth noting if your system has performance constraints.
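To make those mechanics concrete, here is a toy from-scratch Naive Bayes classifier with Laplace smoothing, working in log space to avoid underflow. It is purely illustrative; our project used scikit-learn's implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Count class priors and per-class token frequencies.
    priors = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        for tok in doc.split():
            token_counts[label][tok] += 1
            vocab.add(tok)
    return priors, token_counts, vocab

def predict_nb(doc, priors, token_counts, vocab):
    # Pick the class maximizing log P(class) + sum log P(token | class),
    # with add-one (Laplace) smoothing for unseen tokens.
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in doc.split():
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

The "naive" part is the independence assumption baked into the per-token product: each token's probability is conditioned only on the class, never on neighboring tokens.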

The simplicity of Naive Bayes turns out to be one of its greatest strengths. We therefore wanted to benchmark it against a second algorithm that someone new to machine learning might also choose for classification.

Logistic Regression

Logistic Regression is an algorithm that can be trained as a classifier and uses a linear decision boundary as a means of classifying a set of features.

Because Logistic Regression is a discriminative rather than a generative supervised learning algorithm, we decided it was a great choice to contrast with Naive Bayes. A discriminative model cannot generate new data from a distribution; instead, it predicts the value of a dependent variable from one or more independent variables.

Logistic Regression is derived from Linear Regression and has a very similar cost function. Additionally, Logistic Regression is a great first discriminative classification algorithm to learn and can be used to classify multivariate data as well as fit polynomial terms to find a decision boundary.

A great way to visualize Logistic Regression is to draw a decision boundary that acts as a binary classifier. In the figure below, a linear classifier has been trained to find the best decision boundary separating data belonging to two different classes. Assuming we name our classifications “red” and “blue”, we would classify any new plotted point of data that falls above the line as “red”, and anything below the line as “blue”.

Obviously this line is not a perfect classifier for our dataset (perfect separation is often undesirable anyway, since it can signal overfitting), but it does, for the most part, provide a boundary for predicting the class of new pieces of data.
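A minimal sketch of how a trained Logistic Regression model turns that linear boundary into a binary prediction; the weights here are hand-picked for illustration, whereas in our project they were learned by scikit-learn:

```python
import math

def sigmoid(z):
    # Squash the linear score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias):
    # The decision boundary is the line w·x + b = 0: points on one
    # side get probability >= 0.5 and are classified as 1 ("red"),
    # points on the other side as 0 ("blue").
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```

Training consists of adjusting the weights and bias to minimize the logistic cost over the labeled data; prediction is just this thresholded dot product.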

Iterative Approach to Preprocessing

One of our team’s goals was to perform preprocessing on our dataset to improve the accuracy of our models. The techniques we considered included Parts of Speech Tagging, Stopword Removal, and Lancaster, Porter, and Snowball Stemming (Parts of Speech Tagging, Stopword Removal, and Stemming are all explained in the following paragraphs). Instead of testing these combinations manually, we developed an iterative way of benchmarking their performance.

Stopword removal is a preprocessing technique that throws out tokens considered to carry little significance for the classification at hand, in this case sentiment analysis. For example, conjunctions such as “and” and “but” may be of no relevance to the ultimate sentiment score and can be discarded.
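A minimal sketch of the filtering step; the tiny stopword set here is illustrative, whereas our project drew on NLTK's full English stopword list:

```python
# Illustrative stopword set; in practice we used NLTK's
# English list (nltk.corpus.stopwords.words("english")).
STOPWORDS = {"and", "but", "the", "a", "is", "of"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword set
    # (case-insensitive comparison).
    return [t for t in tokens if t.lower() not in STOPWORDS]
```

Applied before vectorization, this shrinks the feature space and removes tokens that appear in nearly every review regardless of sentiment.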

Oftentimes, some parts of speech affect sentiment more than others. For example, adjectives may be especially likely to contribute to the overall sentiment of a piece of text. To let our algorithms incorporate part-of-speech relationships into their models, we can use a technique called Parts of Speech Tagging to assign each token in each piece of data a specific part of speech before we train our model.
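One common way to fold POS tags into a bag-of-words model is to fuse each token with its tag, so the same word in different grammatical roles becomes a distinct feature. This sketch assumes the (word, tag) pairs have already been produced, for instance by NLTK's pos_tag:

```python
def tag_features(tagged_tokens):
    # Turn (word, tag) pairs into combined "word_TAG" features,
    # so e.g. "great" as an adjective (JJ) is distinct from other uses.
    return [f"{word}_{tag}" for word, tag in tagged_tokens]

features = tag_features([("great", "JJ"), ("service", "NN")])
```

Whether this particular encoding matches our exact implementation is beside the point; the idea is that tags become part of the feature vocabulary the model learns from.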

Stemming algorithms are another technique we utilized in our preprocessing step. They operate by grouping tokens that derive from the same root word. For example, “play” and “playing” may be replaced by the same stem token before training, as their contribution to sentiment can be considered extremely similar.
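As an illustration of the idea only (a drastically simplified suffix-stripper, not the Porter, Lancaster, or Snowball algorithms we actually used via NLTK):

```python
# Toy stemmer: strip a few common suffixes, keeping at least
# three characters of stem. Real stemmers apply ordered rule sets.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```

The payoff is that inflected variants collapse onto one feature, so the model's evidence for a root word is not diluted across its forms.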

To find the optimal combination of preprocessing and machine learning algorithm, we used an iterative branching approach that applied different modifications to the features at each step. This process can be visualized as a tree, shown in the figure below. We ended up with 16 different combinations of preprocessing algorithms applied to our dataset. On the resulting 16 datasets, we ran our Naive Bayes and Logistic Regression algorithms, for a total of 32 combinations of preprocessing and algorithm.
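One way to arrive at those counts is to treat POS tagging and stopword removal as on/off choices and the stemmer as one of three options or none; this 2 × 2 × 4 decomposition is our assumption about how the 16 pipelines break down, but the enumeration pattern is the same either way:

```python
from itertools import product

# Each branch of the tree is one assignment of preprocessing choices.
pos_options = [True, False]            # POS tagging on/off
stopword_options = [True, False]       # stopword removal on/off
stemmer_options = [None, "porter", "lancaster", "snowball"]
algorithms = ["naive_bayes", "logistic_regression"]

# 2 * 2 * 4 = 16 preprocessing pipelines, times 2 algorithms = 32 runs.
combinations = list(
    product(algorithms, pos_options, stopword_options, stemmer_options)
)
```

Iterating over this list and scoring each run is what replaced the manual trial-and-error we wanted to avoid.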

Results

To evaluate the performance of our 32 preprocessing and algorithm combinations, we used 5-fold cross validation. In this evaluation technique, we first break our dataset into 5 equal portions. We then test the algorithm five times, once on each portion, training it each time on the remaining four portions.
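The fold construction is just index bookkeeping, sketched below; in practice scikit-learn's KFold utilities do the same job:

```python
def k_fold_indices(n_samples, k=5):
    # Split [0, n_samples) into k near-equal folds; each fold serves
    # once as the test set while the others form the training set.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, test))
    return splits
```

Averaging the accuracy over the five test folds gives a more stable estimate than a single train/test split, which matters when ranking 32 closely competing combinations.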

In our experiment using 5-fold cross validation, we found that Logistic Regression consistently outperformed Naive Bayes in accuracy. These results can be visualized in the graph below.

In order to see if these results were unique to the Yelp data set, we also decided to test our approach on datasets of IMDB film and Amazon product reviews. These reviews differed largely in the distribution of the length of product reviews.

Despite this variability in review length, we found that Logistic Regression still noticeably outperformed Naive Bayes across each dataset.

Finally, we wanted to know which preprocessing and machine learning algorithm combination had the highest accuracy. The figure below shows, for each dataset, the most effective preprocessing algorithm for Naive Bayes and for Logistic Regression.

Star Score Prediction Application

After we finished the main objective of our project, we wanted to create a real business use case that could apply the machine learning models that we created.

We built a tool to predict the number of stars that a Yelp review would receive as a user types it, in real time. This would aid users by helping them see how their reviews were coming across and giving them a suggested star rating to select if they concurred with it.

We decided to use an industry-standard implementation of sentiment analysis as our “gold standard”; prior to this project, we had already built an application that interfaced with the Google Natural Language API via its Node.js client. Given a piece of text, this API returns a sentiment score on a scale of -1.0 to 1.0.

To predict the number of stars, we mapped this result to a 0-to-5 scale. We then took the expected value of our top two algorithms for a given review. The result was a score between 0 and 1.0, calculated using the expected value formula: score = 0*P(0) + 1*P(1). This score was also mapped to a scale of 0 to 5.
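The two mappings can be sketched as follows (function names are ours, for illustration; the linear rescaling is the only assumption beyond the formulas above):

```python
def google_score_to_stars(score):
    # Map the API's [-1.0, 1.0] sentiment score linearly onto [0, 5].
    return (score + 1.0) / 2.0 * 5.0

def model_score_to_stars(p_positive):
    # Expected value of the binary label: 0*P(0) + 1*P(1) = P(1),
    # then mapped onto the same 0-to-5 star scale.
    expected = 0 * (1 - p_positive) + 1 * p_positive
    return expected * 5.0
```

Because the expected value of a 0/1 label collapses to P(1), the model's star estimate is simply its positive-class probability stretched over the five-star range.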

To enable our web application to interact with our predictive algorithms, we designed an API-gateway Node.js web app that both served our Angular.js dashboard and exposed interfaces to the Google Natural Language API results and to the implementations of our algorithms. To give this web app access to our implementations, we used Flask, a lightweight Python web framework, to wrap the prediction functions for our top two algorithms with a RESTful interface.

Finally, we Dockerized each of our applications and used docker-swarm to provide an easy way to deploy them locally and share them with anyone who wanted to do the same.

Our resulting application lets users type in their reviews and see, in real time, how many stars the application would predict via the Google Natural Language API. We also provided a visualization of the results of our top two algorithm and preprocessing combinations by mapping the expected value of a review to a percentage between 0 and 100%.

Conclusion

In aggregate, the project helped us, as budding data scientists, explore the steps involved in a typical data science project. The opportunity to contrast two different machine learning algorithms also lent insight into the complexity of determining the right algorithm for a project, given the number of algorithms that now permeate the data science world. I’ve come away with a greater respect for the field and hope to continue to explore projects in it.

The chance to build an application using our trained models helped us connect this work with the skill sets we already had. Across many engineering teams, data scientists and traditional software and DevOps engineers are working closely together to bring machine learning solutions to market. This project has made me hopeful that this trend will result in an educationally and professionally enriching experience for all involved.

Recognition

Thank you to Professor Sukanya Manna at Santa Clara University for her help and guidance on this project!

Links:

Star Prediction Application: https://github.com/deenaariff/Yelp_Star_Prediction_Application/tree/master/prediction-dashboard

Initial Readme and Jupyter Notebooks: https://github.com/deenaariff/Yelp_Star_Prediction_Application
