Analyzing 4 Million Yelp Reviews with Python on AWS

15 min read · Mar 9, 2017


This was originally written/published by Gareth Dwyer on the DevelopIntelligence blog.

Yelp runs a data challenge every year in which it invites people to explore its real-world datasets for unique insights. In this post, we’ll show how to load the dataset into a Jupyter Notebook running on a powerful but cheap AWS spot instance, and produce some initial explorations and visualizations.

This post is aimed at people who:

  • Have some existing Python knowledge
  • Are interested in learning more about how to process and visualise large-scale data with Python

If you are interested in taking part in the Yelp challenge, this tutorial will leave you in a good place to start more interesting analyses.

Overview

In this post, we’ll be looking at the Yelp data from the Yelp Dataset Challenge. This is an annual competition that Yelp runs where it asks participants to come up with new insights from its real-world data. We will:

  • Launch an AWS EC2 Spot instance with enough power to process the dataset (4 million reviews) quickly.
  • Configure the EC2 instance and install Jupyter Notebook as well as some data processing libraries.
  • Display some basic analysis of the data, along with visualisations using Matplotlib.

If you have a high end desktop or laptop (with at least 32GB RAM), you can probably run most of the analyses locally. However, learning how to process data in the cloud is a useful skill, so I’d recommend following along with the entire tutorial. Even if the Yelp data is small enough for your local machine, you may well want to process larger datasets in the future. And considering that AWS offer instances with up to 2TB of RAM, the method described here will work for even larger datasets.

Creating an AWS EC2 Spot Instance

Amazon Web Services (AWS) offer Elastic Compute Cloud (EC2) instances. These are on-demand servers that you can rent by the hour. They tend to be fairly expensive, especially for the more beefy machines, but luckily AWS also offer so-called ‘spot’ instances. These are instances that they currently have in excess supply, and they auction them off temporarily to the highest bidder, normally at much lower prices than their regular instances. This is very useful for short-term needs (such as data analysis) because the chance of someone else outbidding you while you still need the machine is comparatively low. To fire up a spot EC2 instance, follow these steps:

  • Visit aws.amazon.com and sign up for an account (assuming that you don’t have an account with them already). It’s a somewhat complicated signup process, and it requires a credit card, even for their free trial, so this step might take some time. You can instead use Microsoft Azure, Google Compute Engine, Rackspace, Linode, DigitalOcean, or any of a number of other cloud providers for this step if you want, but all require a credit card to sign up, and they don’t all offer the same variety of instances or the same discount pricing structure as AWS.
  • Visit the AWS Console. Pick a region using the dropdown in the top right-hand corner. For latency reasons, it’s nice to pick a region close to you, but some regions have more instances available and have cheaper spot instances. For example, even though I’m in The Netherlands, I chose the Oregon region (us-west-2) while making this post, as there were low-priced spot instances available there. (If you really need to save every cent you can, this Mozilla Python script can help you find the cheapest instance currently available worldwide.)
  • Click Services in the top left-hand corner, and choose EC2 from the list. You’ll now be taken to the main EC2 page.
  • In the left-hand column, select Spot Requests and then click Request Spot Instances

There are many options we can modify when creating a launch request for an instance. Luckily, we can leave almost all the defaults as they are. The ones we will change are:

  • The AMI: this is the “Amazon Machine Image.” It defaults to Amazon Linux, but we’ll be using Ubuntu Server instead. Choose Ubuntu Server 16.04 LTS (HVM) from the AMI dropdown.
  • The instance type: we’ll want an instance with lots of RAM (at least 30GB, but preferably 60+ GB) and at least some SSD space. Click the x next to the default selected instance to remove it, then click the Select button to choose a new one. You’ll see a popup listing the available instance types, and you can use the column headings to sort by a specific column. To choose an instance, I sorted by price and then found the first instance with 30GB of RAM and some SSD space, which was an m3.2xlarge. The m instances aim to balance CPU, RAM, and disk; the r instances focus on RAM and are also good for data analysis if you find a cheap one.

At the bottom of the screen, click Next to get to the second (and last) page of settings for your instance.

  • Under Set Keypair and Role, click Create a new key pair. This will open a new tab and take you to the EC2 key management page. Choose to create a new key pair again, give it a name, and download the private key when prompted. Save it in your home directory as ec2key.pem.
  • Under Manage firewall rules select default. This will create an inbound firewall rule that allows the instance to accept SSH connections (we’ll be connecting to the instance via SSH).

Now click Review at the bottom of the page, check that everything looks as expected, and click Launch. This creates a spot instance “request,” and you might have to wait a bit for it to be “fulfilled” (meaning an instance became available that matched your request). You can see the state of the request under the Spot Requests tab (the one you used to create the request). When the request has been labeled fulfilled (and given a green icon), you’ll see the instance under the Instances section (you sometimes need to reload the Instances section to see the new instance).

Note: Be aware that spot prices can change unpredictably. By default, your maximum price is capped at the on-demand price (the price you would normally pay), so you won’t be hit with a nasty billing surprise, but if the spot price rises above your maximum you can lose your instance (and your work) suddenly.
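If you would rather script the spot request than click through the console, the boto3 library can create the same kind of request. The snippet below is only a minimal sketch and is not part of the original walkthrough: the AMI ID, key pair name, security group, and maximum price are placeholders that you would replace with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.request_spot_instances(
    SpotPrice="0.20",  # placeholder: the maximum hourly price you are willing to pay
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-XXXXXXXX",  # placeholder: an Ubuntu Server 16.04 AMI ID for your region
        "InstanceType": "m3.2xlarge",
        "KeyName": "ec2key",        # placeholder: the key pair created above
        "SecurityGroups": ["default"],
    },
)
print("Spot request created:", response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])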

Scroll to the right in the Instances window to find the Public DNS of your instance, and copy this to your clipboard.

Connecting to Our EC2 Spot Instance

Now open up a terminal or command prompt on your local machine. If you’re using Windows, you won’t be able to use SSH by default. Most people use PuTTY to SSH from Windows, but if you have a modern version of Windows it’s easier to enable WSL (Windows Subsystem for Linux). Once you’ve set that up, you can use SSH exactly as described in this post. As an alternative, you could install Git for Windows; in the last step of the installation process, you’ll be asked whether you want to Use Git and optional Unix tools from the Windows Command Prompt. Select yes, and you’ll be able to use SSH from the Windows CMD prompt.

Before we can connect to the instance, we need to change the permissions for the .pem key file that you downloaded earlier. Assuming that your key was saved in your home directory as ec2key.pem, run the following command:

chmod 600 ~/ec2key.pem

Now you can use it to connect to the instance. Make sure you still have the Public DNS name for your instance in your clipboard, and run the following command:

ssh -i ~/ec2key.pem -L 8888:localhost:8888 \
    ubuntu@ec2-52-33-47-198.us-west-2.compute.amazonaws.com

This connects to your instance, allowing you to run commands on it via SSH. The -i flag points to your key file, which proves that you’re the owner of the instance, and the -L flag sets up port forwarding. Here we specify that port 8888 on our local machine should be forwarded to port 8888 on the remote instance. We’ll need this shortly so that we can view a Jupyter notebook locally and have it execute on our instance.

Configuring Our EC2 Instance

To set up our instance, we only need to configure pip and install some Python libraries for data processing. Run the following commands on the instance:

export LC_ALL="en_US.UTF-8"
sudo apt update
sudo apt install python3-pip
pip3 install --user --upgrade pip matplotlib jupyter

The first command sets the LC_ALL environment variable, which specifies the locale. By default, Ubuntu Server often does not specify this, and pip needs locale information to function correctly. The next two commands install pip using Ubuntu’s apt package manager. We then use pip to upgrade itself (the apt versions of software sometimes lag behind the current versions), and install jupyter, which provides the notebook we’ll use, and matplotlib for plotting.

If you chose an instance with an SSD, you’ll have to mount that. Run:

lsblk

to see the available disks. You’ll probably see the SSD listed as /dev/xvdb (though it might be called something else). Run the following commands to mount the SSD, substituting the device name if necessary:

sudo mkdir /mnt/ssd
sudo mount /dev/xvdb /mnt/ssd
sudo chown -R ubuntu /mnt/ssd

If you picked a machine with about 30GB of RAM, you can still run into some issues while loading and manipulating some of the Yelp data. I created another 20GB of swap space (virtual RAM on the hard drive) just in case (this step takes a while to run):

sudo dd if=/dev/zero of=/mnt/ssd/swapfile bs=1G count=20
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile

Getting the Yelp Data onto Our Machine

Currently, Yelp requires that you fill out an online form to get a link to access the data. This link is then tied to the machine where you filled out the form. There may be a workaround, but I had to download the data locally and then transfer it across to AWS, which took quite a while with my slow uplink connection. Fill out the form and obtain the download link here: https://www.yelp.com/dataset_challenge/dataset.

Once you’ve downloaded the tar file (approximately 1.8GB), you can scp it to your instance with the following command. This assumes that you saved the tar file to your Downloads folder; if not, adjust the path accordingly. You’ll also need to substitute your own instance’s Public DNS. Note that this command needs to be run on your local machine, not on the EC2 instance.

scp -i ~/ec2key.pem ~/Downloads/yelp_dataset_challenge_round9.tar \
    ubuntu@ec2-52-33-47-198.us-west-2.compute.amazonaws.com:/mnt/ssd

Now, on the instance, you can untar the data with the following commands:

cd /mnt/ssd
tar -zxvf yelp_dataset_challenge_round9.tar

This should create a bunch of large .json files. We’ll be opening these directly in Python, so our command line work is nearly done.

Starting and Accessing the Jupyter Notebook

Now start the Jupyter notebook server on the instance by running:

jupyter-notebook

You should see output saying that no web browser was detected, along with a URL containing an access token.

Copy the URL to your clipboard and paste it into a browser on your local machine. You’ll see the default Jupyter Notebook page. Create a new Python 3 notebook by selecting New in the top right-hand corner and then choosing Python 3.

If you’ve never used Jupyter Notebooks before, take a few moments to get acquainted with how things work. You can insert cells, delete cells, or run specific cells. Cells are useful because you can rerun a specific block of code after making a change without having to rerun all the code above it. Each cell shares a namespace with previously run cells, so you can always access your variables and imports from new cells. The most useful keyboard shortcut is Ctrl + Enter, which runs the code in the currently selected cell and displays the output.
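For example (a trivial illustration, not from the original post), a variable defined in one cell remains available in any cell you run afterwards:

# Cell 1: define a variable, then run the cell with Ctrl + Enter
greeting = "hello from cell 1"

# Cell 2: the variable is still available, because cells share a namespace
print(greeting)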

The default working directory is the directory from which you launched Jupyter. If you followed the commands as laid out above, this would have been /mnt/ssd/ on your instance, so the JSON Yelp data files should be in the current working directory. To check, you can type !ls into the first cell and run it. This will output all the filenames in the current directory.

Starting our Data Analysis

Now it’s finally time to load the data into Python and play around with it. The Yelp data is in a bit of a strange format: although they provide JSON files, each file contains a separate JSON object on every line, instead of one single JSON object.
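To make the format concrete, here is a small sketch, not part of the original post, that parses the file one line at a time. It is an alternative to the two-step approach used below and is a little gentler on memory, because it never holds the raw text and the parsed objects at the same time:

import json

reviews = []
with open("yelp_academic_dataset_review.json", encoding="utf8") as f:
    for line in f:  # each line is a complete JSON object
        line = line.strip()
        if line:  # skip any blank lines
            reviews.append(json.loads(line))
print(len(reviews))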

In the first cell, we’ll import the libraries that we’ll be using. We’ll also set matplotlib to work in notebook mode, which makes our plots interactive (mousing over them will show the X-Y coordinates). Put the following code into the first cell of the notebook, and run it.

%matplotlib notebook
from matplotlib import pyplot as plt
import json
from collections import Counter
from datetime import datetime

Now we’ll read in the entire review file and split it into a list of individual (string) reviews. This takes about half a minute, even on a powerful machine.

t1 = datetime.now()
with open("yelp_academic_dataset_review.json", encoding="utf8") as f:
    reviews = f.read().strip().split("\n")
print(datetime.now() - t1)
print(len(reviews))

Note that you can use tab completion for the filename which is super useful to prevent typos and speed up your coding in general (e.g., type yelp_a and press tab instead of typing out the whole name).

You should see a printout showing that the code took about 20–30 seconds to run, and that there are a little over 4 million reviews in the dataset. (All lines of script output in this post will be prefixed with >>>, but you won’t see the prefix in the actual notebook).

>>> 0:00:21.302640
>>> 4153150

In the next cell, let’s parse each of these strings into a Python dictionary with the json module. This takes a bit longer than reading them in from the file (about 45 seconds on the machine I was using).

reviews = [json.loads(review) for review in reviews]

And it’s always nice to have one review printed in full so that we have an easy reference on how to access pieces of each review. Add the following code to a new cell and run it.

print(reviews[0])

We can get a basic distribution of the star ratings that users usually leave by using a Python Counter:

stars = Counter([review['stars'] for review in reviews])
print(stars)

>>> Counter({5: 1704200, 4: 1032654, 1: 540377, 3: 517369, 2: 358550})

We can see that there are more 5-star reviews than any other rating, but a visualisation would make the distribution much clearer. Let’s create a basic bar graph of these numbers. Note that we normalize by the total number of reviews, so the Y-axis shows the proportion of total reviews in each category. This post was partly inspired by one that used an Amazon review dataset, which you can find here: http://minimaxir.com/2017/01/amazon-spark/. It’s interesting that those reviews followed a similar star distribution.

Xs = sorted(list(stars.keys()))
Ys = [stars[key]/len(reviews) for key in Xs]
plt.bar(Xs, Ys)
plt.show()

This produces the following graphic:

In notebook mode, Jupyter will keep track of whether or not a specific plot is “active.” This is useful as it allows you to plot different points onto the same image. However, it can also get in the way if you’re trying to create a new plot and the output keeps going to the previous one. After creating each plot, you’ll see it has an interactive header with a row of controls.

Press the blue button in the top right after creating each plot to deactivate it. New calls to plt.plot, etc will then be sent to new graphs, instead of being added to the previous one.
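If you prefer to do this from code rather than clicking the button, calling plt.figure() before plotting starts a fresh figure in the same way (a small aside, not from the original post; the values here are just dummy data for illustration):

plt.figure()  # start a new, separate figure
plt.bar([1, 2, 3], [10, 20, 15])  # plotting calls now go to the new figure
plt.show()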

Finding the Most Prolific Reviewers

Let’s find the users who have left the most reviews. We’ll create a Counter object keyed by user ID (note that Yelp has anonymized the user IDs in this dataset, so they all look a bit strange). Add the following code to a new cell and run it.

users = Counter([review['user_id'] for review in reviews])
print(users.most_common(10))

>>> [('CxDOIDnH8gp9KXzpBHJYXw', 3327), ('bLbSNkLggFnqwNNzzq-Ijw', 1795), ('PKEzKWv_FktMm2mGPjwd0Q', 1509), ('QJI9OSEn6ujRCtrX06vs1w', 1316), ('DK57YibC5ShBmqQl97CKog', 1266), ('d_TBs6J3twMy9GChqUEXkg', 1091), ('UYcmGbelzRa0Q6JqzLoguw', 1074), ('ELcQDlf69kb-ihJfxZyL0A', 1055), ('U4INQZOPSUaj8hMjLlZ3KA', 1028), ('hWDybu_KvYLSdEFzGrniTw', 988)]

We can see that one user has left an impressive 3327 Yelp reviews. Let’s name this user Mx Prolific and create a collection of only their reviews.

mx_prolific = [review for review in reviews if review['user_id'] == "CxDOIDnH8gp9KXzpBHJYXw"]
mp_stars = Counter([review['stars'] for review in mx_prolific])
print(mp_stars)

>>> Counter({3: 1801, 4: 1036, 2: 390, 1: 53, 5: 47})

Note that Mx Prolific’s ratings diverge strongly from the overall distribution we saw before. While 5-star reviews are the most common overall, Mx Prolific has awarded only 47 5-star reviews (perhaps these are establishments that are worth checking out!), and nearly 2,000 3-star reviews.

Now we can create a second-level Counter to count the frequencies of the number of reviews left by each individual user. We summarized our original data by combining all the reviews left by the same user into a single record. Now we want to summarize further and combine, for example, all the users who have left exactly 12 reviews into a single record.
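As a toy illustration of this two-level counting (the user names here are made up, not from the dataset):

toy_users = Counter(["anna", "anna", "anna", "ben", "ben", "carla"])
# toy_users is Counter({'anna': 3, 'ben': 2, 'carla': 1})
toy_counts = Counter([count for user, count in toy_users.most_common()])
print(toy_counts)  # Counter({3: 1, 2: 1, 1: 1}): one user left 3 reviews, one left 2, one left 1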

num_reviews_left = Counter([x[1] for x in users.most_common()])

This allows us to visualise how many reviews are left by most users. Because nearly all users have left only very few reviews, we’ll visualise the drop off only up to 20. (Change line 3 below to plt.bar(Xs, Ys) to plot all the records and see how plotting more data can sometimes produce a less informative result).

Xs = [x[0] for x in num_reviews_left.most_common()]
Ys = [x[1] for x in num_reviews_left.most_common()]
plt.bar(Xs[:20], Ys[:20])
plt.xticks(range(1,21))
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Users")
plt.show()

This produces the following output. We can see that a huge number of users leave just one review, and that the dropoff over 2, 3, and 4 reviews is pretty steep.

We can do the same to see how many reviews are typically received by a single business.

businesses = Counter([review['business_id'] for review in reviews])
num_reviews_by_business = Counter([x[1] for x in businesses.most_common()])
print(num_reviews_by_business.most_common(10))
print(len(num_reviews_by_business))

Xs = [x[0] for x in num_reviews_by_business.most_common()]
Ys = [x[1] for x in num_reviews_by_business.most_common()]
Xs = Xs[:18]
Ys = Ys[:18]
plt.bar(Xs, Ys)
plt.xticks(range(3,21))
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Businesses")
plt.show()

This produces the following output and image. By far the most common case is a business with exactly three reviews: 21,908 businesses have exactly three, which is also the minimum, since Yelp only included businesses with at least three reviews in this dataset. The drop-off as the review count grows is slower than the one we saw for reviews left by users, but a low number of reviews is still much more common.

>>> [(3, 21908), (4, 15473), (5, 11498), (6, 9012), (7, 7475), (8, 6061), (9, 5208), (10, 4308), (11, 3857), (12, 3508)]
>>> 947
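If you want the businesses with the most reviews, rather than the most common review counts, the businesses Counter we already built answers that directly (a quick aside, not in the original post):

# The five business IDs with the most reviews, along with their review counts
print(businesses.most_common(5))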

Lastly, we’ll determine whether good or bad reviews tend to have more words. The next post will focus on natural language processing, and we’ll be covering far more sophisticated text processing techniques, but for now we’ll simply use Python’s split() function to split each review into words, and then look at averages by number of stars.

import statistics

review_lengths_by_star = [[], [], [], [], []]
for review in reviews:
    length = len(review['text'].split())
    idx = review['stars'] - 1
    review_lengths_by_star[idx].append(length)
print([statistics.mean(review_lengths) for review_lengths in review_lengths_by_star])

>>> [146.6973890450556, 146.11345697950077, 135.06209687863014, 118.64652826600197, 93.93472010327426]

We can see that negative reviews tend to be a bit longer, with 1-star reviews having an average of 147 words, while 5-star reviews have a lower average of 94 words. We’ll make a final plot to visualise this:

plt.bar([1, 2, 3, 4, 5], [statistics.mean(rlength) for rlength in review_lengths_by_star])
plt.xlabel("Stars")
plt.ylabel("Average number of words")
plt.show()
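Means can be pulled around by a handful of very long reviews. As a quick robustness check, which was not part of the original analysis, you could compare the medians as well:

# Median word counts per star rating; medians are less sensitive to a few very long reviews
print([statistics.median(review_lengths) for review_lengths in review_lengths_by_star])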

Let’s recap what we did. We set up a powerful data processing environment, and took a cursory look at some of the Yelp data–but we’ve only just scratched the surface in terms of insights we can draw from these data. In a later post, we’ll be using the same dataset to introduce some machine learning and natural language processing concepts.
