What are your customers saying? Natural Language Processing (NLP) with Yelp Review Data

Photo by Eaters Collective on Unsplash


Natural Language Processing (NLP) is a very hot topic in the world of machine learning. I recommend reading more about it by checking out the wiki link. As for this blog, follow along and you will be able to complete your very own text analysis. I will be using Python on Windows to do this, but all the tools I’m using are also available on Mac. I have a MacBook Pro and a PC, and both have Python and Anaconda installed. What I am about to show you can be applied to any text analysis you might do in the future: movie reviews, app reviews, and so on. As for me, I work on a product at Microsoft, where I will be applying these skills to analyze app reviews from the Google Play Store and the Microsoft App Store.

This blog will be more of a step-by-step guide to doing your first text analysis than an explanation of all the concepts. You will need a basic understanding of Python and the command line to begin this tutorial. If any of this looks familiar, it’s because I drew inspiration from several blogs and repositories on Kaggle, Medium and Udemy (see resources at end of blog). I hope you find this tutorial useful. Let’s get started!


We’re going to first start by preparing your workstation with the right tools and software to do this analysis. If you run into any issues, please don’t hesitate to leave a comment.

1.1 DATA

First, we’re going to need data to work with. I will be using data from Kaggle. If you’re learning Data Science, this site is a must. Go save it in your bookmarks now.

  • You can also use the yelp_review data set here


You don’t necessarily need these environments to do the analysis, but the bottom two are my go-to setup for my data science lab.


A good analysis starts with a question or an objective. It helps you stay focused on the problem at hand. As a data analyst/scientist working for a company, one thing you always want to remember is to answer questions that are actionable. For this blog, I’m only going to be answering the questions below.

  1. What are the customers saying?
  2. What do customers who leave negative reviews say?
  3. What do customers who leave positive reviews say?

Photo by Dlanor S on Unsplash


Enough blabbering. Let’s get coding! Before you launch your Jupyter Notebook, make sure to install the following packages to Anaconda:


These are the packages I will be using in this walk-through. Each of them could fill an article or a book on its own. If you want to learn more, I recommend checking out their documentation.

To get these packages into Anaconda, type the following into your Anaconda prompt:

conda install pandas textblob nltk wordcloud seaborn


Next, you will launch Jupyter Notebook from the prompt. Once you run the command below, it should open your browser at http://localhost:8888/.

jupyter notebook

It should look something like this:


Now that you’re in Jupyter notebook, import the following packages and dependencies:

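import collections  # Counter and deque, used by the n-gram functions
import numpy as np
import pandas as pd
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt  # used for the histograms and the word cloud
import seaborn as sns
import re, string
import sys
import time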
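
One setup step the imports do not cover: NLTK ships its stop-word lists as a separate download, and `stopwords.words('english')` raises a `LookupError` until that corpus is present on your machine. A minimal sketch, assuming you have an internet connection; the helper name `ensure_stopwords` is my own and not part of NLTK:

```python
import nltk


def ensure_stopwords():
    """Download the NLTK stop-word corpus if it is not already installed."""
    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')
```

Calling `ensure_stopwords()` once per machine before running the tokenizer should be enough; after that the corpus is read from disk.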

2.4 DATA

Now that all of the packages are imported and your environment is ready to go, let’s import your data into the notebook. We will use the pandas package to read the CSV. Make sure you know which directory your notebook is running in, because a relative path is resolved from there.

yelpreview = pd.read_csv('yelp.csv')  # Replace 'yelp.csv' with the path to your own file

A great way to find out what directory your notebook opened in is to run the pwd command in a notebook cell.
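
If you would rather check from Python, the standard library can report the same thing. A small sketch; the file name `yelp.csv` is just the placeholder used in the `read_csv` call, so swap in your own:

```python
from pathlib import Path

# Directory the notebook (or script) is currently running in
current_dir = Path.cwd()
print(current_dir)

# Full path pandas would look at for a relative file name
csv_path = current_dir / 'yelp.csv'
print(csv_path, '(exists)' if csv_path.exists() else '(not found)')
```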


Once the data is imported, I like to inspect it with the head(), info() and describe() methods. Run each one in its own cell so that every result is displayed.
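
As a rough sketch of what those three calls look like, here is a tiny made-up DataFrame standing in for the Yelp data. The rows are invented for illustration; only the `stars` and `text` column names come from the dataset used in this post.

```python
import pandas as pd

# Toy stand-in for the Yelp reviews; the values are invented
sample = pd.DataFrame({
    'stars': [5, 1, 4],
    'text': ['Great tacos', 'Waited 30 minutes', 'Friendly staff'],
})

print(sample.head())      # first rows, to eyeball the columns
sample.info()             # column types and non-null counts (prints directly)
print(sample.describe())  # summary statistics for the numeric columns
```

On the real file, info() is also the quickest way to confirm the row count and spot columns with missing values.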

5,261,668 rows!

We’re going to first check the length of the reviews to understand whether or not people are leaving meaningful reviews.

The following line of code will create an extra column called “review length”. This will count the length of the review by checking how many characters there are in the review.

yelpreview['review length'] = yelpreview['text'].apply(len)

We will now create 5 histograms of review length, one per star rating. I’m using the seaborn and matplotlib packages here. Please check out the documentation for these packages to get a better understanding.

%matplotlib inline
g = sns.FacetGrid(yelpreview,col='stars')
g.map(plt.hist,'review length')

In general, review length looks similar across ratings, except that people leaving 4 and 5 stars tend to write shorter reviews.
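
Histograms can be hard to compare by eye, so it may help to back the impression up with numbers. A minimal sketch on invented data, assuming the same `stars`, `text` and `review length` column names as in the notebook:

```python
import pandas as pd

# Invented reviews; in the notebook you would use yelpreview instead
toy = pd.DataFrame({
    'stars': [1, 1, 3, 5, 5],
    'text': ['Terrible service and cold food', 'Never again',
             'It was fine', 'Loved it', 'Great'],
})
toy['review length'] = toy['text'].str.len()

# Median review length for each star rating
median_length = toy.groupby('stars')['review length'].median()
print(median_length)
```

The median is used here rather than the mean because a handful of very long reviews can otherwise dominate the comparison.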

I’m going to separate the reviews into “bad” and “good” reviews based on the stars. We’ll categorize “bad” as 1 and 2 stars and “good” as 4 and 5 stars. This will help us answer the questions we posed earlier.

yelpbadreviews = yelpreview[yelpreview.stars <= 2]
yelpgoodreviews = yelpreview[yelpreview.stars >= 4]

Then we will drop every column except “text”, so we can begin our text analysis.

badreviews = yelpbadreviews.text
goodreviews = yelpgoodreviews.text

If you remember checking out the data, there were a lot of rows! Over 5.2 million rows. I want to make sure this analysis doesn’t take too long, so let’s take a random sample from each of the dataframes. Let’s take 0.1% from each. I would not recommend this step if you’re making business decisions, but this will help us speed up this analysis. However, if you have a pretty powerful computer, feel free to skip this step or take a bigger sample.

badreviews = badreviews.sample(frac = .001, replace = True )
goodreviews = goodreviews.sample(frac= .001, replace = True)
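
One thing to be aware of: `replace=True` samples with replacement, so the same review can be drawn more than once, which can inflate phrase counts. If that matters for your analysis, a hedged alternative is to sample without replacement and fix the seed so the results are repeatable. The seed value 42 and the toy Series below are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for badreviews / goodreviews
reviews = pd.Series([f'review {i}' for i in range(1000)])

# 1% sample without replacement; random_state makes the draw repeatable
subset = reviews.sample(frac=0.01, replace=False, random_state=42)

print(len(subset))       # number of sampled reviews
print(subset.is_unique)  # whether any review was drawn twice
```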

The next lines of code define functions that count and print the most frequent n-grams in a set of lines. They are based on the GitHub repository linked below. (I love GitHub. Don’t reinvent the wheel!) However, I recommend that everyone learn what the code is doing. (Again, feel free to ask me any questions below.)

Print most frequent N-grams in given file

Now, execute the following lines of code:

# Build the stop-word set once; a set lookup per word is far faster
# than calling stopwords.words('english') for every word
STOP_WORDS = set(stopwords.words('english'))


def tokenize(s):
    """Convert string to lowercase and split into words (ignoring
    punctuation), returning list of words with stop words removed.
    """
    word_list = re.findall(r'\w+', s.lower())
    filtered_words = [word for word in word_list if word not in STOP_WORDS]
    return filtered_words


def count_ngrams(lines, min_length=2, max_length=4):
    """Iterate through given lines iterator (file object or list of
    lines) and return n-gram frequencies. The return value is a dict
    mapping the length of the n-gram to a collections.Counter
    object of n-gram tuple and number of times that n-gram occurred.
    Returned dict includes n-grams of length min_length to max_length.
    """
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams


def print_most_frequent(ngrams, num=10):
    """Print num most common n-grams of each length in n-grams dict."""
    for n in sorted(ngrams):
        print('----- {} most common {}-word phrase -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))


def print_word_cloud(ngrams, num=5):
    """Print word cloud image plot """
    words = []
    for n in sorted(ngrams):
        for gram, count in ngrams[n].most_common(num):
            s = ' '.join(gram)
            words.append(s)

    cloud = WordCloud(width=1440, height=1080, max_words=200).generate(' '.join(words))
    plt.figure(figsize=(20, 15))
    # Render the cloud as an image and hide the axes
    plt.imshow(cloud)
    plt.axis('off')
    plt.show()

We created four functions above. The count_ngrams function already calls tokenize internally, so we will only call the other three directly:

  1. count_ngrams
  2. print_most_frequent
  3. print_word_cloud

Let’s start with the bad reviews.

most_frequent_badreviews = count_ngrams(badreviews,max_length=3)
print_word_cloud(most_frequent_badreviews, 10)

Spend about 30 seconds looking at the word cloud. It will give you a general idea of what the customers are saying. With the next line of code, you can see the most frequent 2-word and 3-word phrases. The formal term is n-gram: a contiguous sequence of n items (here, words) in a text. Read more about it here.
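
To make the idea concrete, here is a small standard-library sketch that pulls the 2-grams and 3-grams out of one sentence. It is a simplified illustration (no stop-word removal, and the helper name `ngrams_of` is made up), not the counting function used in this post.

```python
import re
from collections import Counter


def ngrams_of(text, n):
    """Return the list of n-word tuples that appear contiguously in text."""
    words = re.findall(r'\w+', text.lower())
    # zip over n staggered copies of the word list to form sliding windows
    return list(zip(*(words[i:] for i in range(n))))


sentence = 'The customer service was slow and the customer service was rude'
bigram_counts = Counter(ngrams_of(sentence, 2))
trigram_counts = Counter(ngrams_of(sentence, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```

Repeated phrases such as "customer service" rise to the top of the counts, which is exactly the signal we are looking for in the reviews.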

print_most_frequent(most_frequent_badreviews, num= 10)

From the 2-word phrases, you can see phrases like “customer service”, “10 minutes” and “30 minutes”, which help you understand why a customer left a bad review. Other phrases like “go back”, “last time”, “tasted like”, and “come back” are probably tied to bad experiences, but the reason is less obvious. That is where the 3-word phrases help by adding context: “poor customer service”, “worst experience ever”, “asked speak manager”, and so on. (A phrase like “asked speak manager” reads oddly because stop words such as “to” were removed during tokenizing.) It’s obvious these businesses need a strategy or a plan to improve their customer service. This is an actionable insight. We don’t usually go above 3-word phrases because it’s rare to find frequent 4-word phrases.

Now that we’ve seen the bad reviews, let’s look at the good ones.

most_frequent_goodreviews = count_ngrams(goodreviews,max_length=3)
print_word_cloud(most_frequent_goodreviews, 10)
print_most_frequent(most_frequent_goodreviews, num= 10)

What does this tell us? Customers who left good reviews seem more likely to recommend the business to others and to come back as returning customers. Analyzing good reviews will help you find out where you are doing well and see what a good product or service can do for the business.


We can see how beneficial text analysis can be when it comes to reviews. In big companies and businesses, you’ll find an overwhelming number of reviews left by your customers. When you have tens or hundreds of thousands of reviews (even millions), it’s practically impossible to read through them all. That’s when you can apply text analysis to save time and effectively summarize what your customers are saying. If you work for a company or a business where customers can provide feedback or reviews, I recommend applying these techniques to your own data.

If you have any questions or even feedback (I love feedback), please feel free to leave your question or comment below.

Otherwise, I will be posting more interesting tutorials and blogs every week about data science, machine learning, etc… If you like my content, please don’t forget to click follow.

Twitter: https://twitter.com/kennyk1m

Medium: https://medium.com/@kennykim.90

Website: https://kennykim.github.io/

Inspiration and Resources:

  1. Python for Data Science and Machine Learning
  2. Natural Language Processing for Beginners using TextBlob — Analytics Vidhya
  3. Ultimate guide to deal with Text Data (using Python) — for Data Scientists & Engineers — Analytics Vidhya