Sebastien Couvidat
Published in Locally Optimal
2 min read · Mar 16, 2021


Yelp Open Dataset: A New Version Is Available!

Yelp Open Dataset

The Yelp Open Dataset is an ideal resource for students, teachers, academics, and discerning data sleuths who want to play with a treasure trove of real-world big data.

We just released a new version of the dataset, a collaborative effort between Yelp’s data scientists and engineers. The latest iteration includes a main dataset and a second dataset dedicated to photos.

The main dataset includes more than 8.6 million reviews written by nearly 2.2 million passionate users for more than 160,000 great businesses in eight metropolitan areas. You will notice that we selected new cities this year: Atlanta, GA; Austin, TX; Boston, MA; Boulder, CO; Columbus, OH; Orlando, FL; Portland, OR; and Vancouver, Canada. You can also find this dataset on Kaggle, where it has been downloaded more than 70,000 times!

Our second dataset contains 200,000 photos from the same businesses and a file providing the photo classification. Because each dataset is quite large, you can download them separately.

There are many things you can do with our datasets: train a convolutional neural network on the photos, test your favorite NLP algorithm on the review text, develop a recommender system, perform sentiment analysis… your imagination is the limit!

Lastly, here is a Python code snippet that reads any JSON file in the dataset. We are including it in response to questions we’ve received about past datasets:

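import json

import pandas as pd

# Each line of the file is a separate JSON object, so parse it line by line.
data = []
with open("yelp_academic_dataset_checkin.json", encoding="utf-8") as data_file:
    for line in data_file:
        data.append(json.loads(line))

# Build a DataFrame with one row per JSON record.
checkin_df = pd.DataFrame(data)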
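The snippet above holds every record in memory at once, which may be uncomfortable for the larger files. As a rough sketch, assuming the other files follow the same one-JSON-object-per-line layout, you could stream the records instead. The helper names `iter_json_lines` and `count_field_values` are hypothetical and not part of the dataset or of any library:

```python
import json
from collections import Counter


def iter_json_lines(path):
    """Yield one parsed record per non-empty line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_field_values(path, field):
    """Tally the values of one field while reading the file record by record."""
    # Records that lack the field are counted under None.
    return Counter(record.get(field) for record in iter_json_lines(path))
```

For example, `count_field_values(path, "stars")` would tally ratings for a file whose records carry a `stars` field; that field name is an assumption here, so check it against the dataset documentation before relying on it.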

—Zachary Metz and Pravinth Vethanayagam helped create the dataset
