[Week 2 — YelpGuesser]

YelpGuesser
Published in bbm406f16
Nov 27, 2016

Getting Our Hands Dirty With the Yelp Dataset

(Yelp logo, taken from Yelp.com)

As we mentioned in our first post, we are using the Yelp Dataset for this project. Since we haven’t elaborated on it enough, today we’ll take a closer look at what Yelp is, why we chose it, and of course what we have done with it so far. Check it out!

What is Yelp?

Yelp is an American public company, known mainly for its site (Yelp.com), which publishes user reviews of local businesses in categories such as food, restaurants, shopping, nightlife, and automotive. Besides writing reviews, Yelp users also rate the businesses and services on a scale of 1 to 5.
In this way, Yelp has built an ecosystem between its users and the businesses they review: other users can draw on the reviews and ratings to make a choice, and business owners can see how customers react to their services and products.

What is the Yelp dataset? Why did we choose it?

Since 2014, Yelp has run a dataset challenge: it makes its data freely available to everyone and challenges undergraduate and graduate students to use the data in innovative ways for research purposes. There have been seven rounds of the challenge so far, each with its own winners. At the time of this project, the Yelp Dataset Challenge is in its eighth round (1 September 2016–31 December 2016).

As stated on the Yelp website, the Yelp challenge dataset consists of:

  • 2.7M reviews and 649K tips by 687K users for 86K businesses
  • 566K business attributes, e.g., hours, parking availability, ambience.
  • Social network of 687K users for a total of 4.2M social edges.
  • Aggregated check-ins over time for each of the 86K businesses
  • 200,000 pictures from the included businesses

Each file in the dataset is composed of a single object type, with one JSON object per line. The object types are business, review, user, check-in, tip, and photos. The photos are available separately; since our project is based on sentiment analysis, we only downloaded the part of the dataset that consists of text.
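To make the "one JSON object per line" layout concrete, here is a minimal Python 3 sketch of how such a file can be read record by record without loading it all into memory. It is an illustration, not the Yelp example script; the function name `iter_json_lines` and the file name in the usage note are placeholders we made up.

```python
import json
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

For example, `next(iter_json_lines("review.json"))` would return the first review as a dictionary, assuming a file with that (hypothetical) name exists.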

We decided to use the Yelp data because:

  • It is free of charge
  • It is large and representative enough for our experiment (it has both ratings and reviews)
  • Several related works are already based on the Yelp dataset, so it will be easier to work with
  • The data is already represented as objects, so we didn’t need to scrape it first and could go straight to preprocessing

Preprocessing the Data (to be continued)

We use a Python parser that Yelp provides as an example (https://github.com/Yelp/dataset-examples) to parse the JSON files and convert the data into .csv files. The issue we encountered while parsing was the compatibility of the script: since it is written for Python 2, we had to make several changes to make it work with Python 3. Because we didn’t want to risk losing any data in the conversion, we decided to parse the files with both Python 2 and Python 3, so we now have two versions of the data and will decide later which one to use.
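For readers who want to see what such a conversion involves, below is a small Python 3 sketch of a JSON-lines to .csv converter. It is not the Yelp example script and it makes simplifying assumptions: nested values (lists and dictionaries) are stored as JSON strings in a single column instead of being flattened, and the function name `json_lines_to_csv` is our own. It returns the number of rows written, which can be compared against the number of lines in the input as a basic check that nothing was lost.

```python
import csv
import json


def json_lines_to_csv(json_path: str, csv_path: str) -> int:
    """Convert a JSON-lines file to CSV and return the number of rows written."""
    # First pass: collect every top-level key (in first-seen order)
    # so that no column is silently dropped.
    columns = []
    seen = set()
    with open(json_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    if key not in seen:
                        seen.add(key)
                        columns.append(key)

    # Second pass: write one CSV row per JSON object.
    rows = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns, restval="")
        writer.writeheader()
        for line in src:
            if not line.strip():
                continue
            record = json.loads(line)
            # Assumption: nested values are kept as JSON text in one cell.
            row = {
                key: json.dumps(value) if isinstance(value, (dict, list)) else value
                for key, value in record.items()
            }
            writer.writerow(row)
            rows += 1
    return rows
```

Opening the files with an explicit `encoding="utf-8"` and `newline=""` is the kind of change that typically matters when moving CSV code from Python 2 to Python 3.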

The .csv files we extracted are really big (the review .csv file alone is larger than 2 GB), so we need to split the data into several parts to make it easier to process. After that, since this project focuses on food, we won’t use the reviews of all businesses and services; we will only use the data that belongs to the restaurant category (or food-related businesses).
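The filtering and splitting described above could look roughly like the following Python 3 sketch. It rests on several assumptions that should be checked against the actual files: that each business record has a `business_id` and a `categories` field, that the review .csv has a `business_id` column, and that the category names "Restaurants" and "Food" are the ones of interest. The function names, the output naming scheme, and the default part size are hypothetical.

```python
import csv
import json


def restaurant_ids(business_json_path, keywords=("Restaurants", "Food")):
    """Collect the IDs of businesses whose categories match any keyword."""
    ids = set()
    with open(business_json_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            business = json.loads(line)
            categories = business.get("categories") or []
            # Assumption: categories is a list of strings; if it is a single
            # comma-separated string, split it into a list first.
            if isinstance(categories, str):
                categories = [c.strip() for c in categories.split(",")]
            if any(keyword in categories for keyword in keywords):
                ids.add(business["business_id"])
    return ids


def split_filtered_csv(csv_path, out_prefix, keep_ids, rows_per_part=100_000):
    """Write rows whose business_id is in keep_ids into numbered part files."""
    paths = []
    out = None
    writer = None
    count = 0
    try:
        with open(csv_path, encoding="utf-8", newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                if row["business_id"] not in keep_ids:
                    continue
                # Start a new part file at the beginning and whenever
                # the current part is full.
                if writer is None or count == rows_per_part:
                    if out is not None:
                        out.close()
                    path = f"{out_prefix}_{len(paths) + 1:03d}.csv"
                    out = open(path, "w", encoding="utf-8", newline="")
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    paths.append(path)
                    count = 0
                writer.writerow(row)
                count += 1
    finally:
        if out is not None:
            out.close()
    return paths
```

Both functions stream their input row by row, so a review file larger than 2 GB never has to fit in memory at once.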

In the meantime, we are still in the preprocessing stage of this project, and we will post an update as soon as possible. See you!
