The DAP Journey: To eat or not to eat?

A bot that instantly tells you whether a restaurant in your area is worth the buck.

Frans Corfiyanto
SMUBIA
4 min read · May 23, 2019


In this Medium series, BIA extracts the introspection of our Data Associates as they recall their academic exploration. This post features an analytics project on food reviews, directed by Frans & Kelvin, supervised by Ha.

The common denominator: FOOD

Meet the Team

Kelvin, SIS Year 1

“I wanted to join DAP to learn more about analytics and how I can work together with like-minded peers to take on an actual, self-directed project.”

Frans, SIS Year 1

“I was excited to join DAP because I wanted to gain experience doing various analytics and machine learning projects with the help of mentors and friends. This programme allows all the Data Associates to share their knowledge and past experience in analytics.”

Ha Nguyen, SIS Final year

“I joined DAP as a Senior Associate because I like to exchange ideas and experiences with other DAs in BIA. I also like my Team Indon; their project was very interesting and ambitious.”

Introducing, Team Indon

To eat or not to eat?

Hey there! Our project was designed around a common love for food and good deals. Given the overwhelming amount of information and food reviews online, finding the best deal can sometimes become a bit of a hassle.

We came up with the idea of creating a Telegram bot that predicts a restaurant’s average price based on three factors: location, category (Italian, Chinese, etc.) and reviews. Through this, we hope to help users evaluate their spending at a specific restaurant and decide whether it’s a good deal or not.

Project Plan

1. Sourcing for datasets

Given that we were looking for specific variables, we had to find websites that exposed fields like Location and Food Category. At first, we attempted to find APIs from multiple food review websites, such as Zomato, HungryGoWhere and Burpple. Unfortunately, these websites did not provide an API or access to their datasets, so we decided to compile our own dataset instead.

2. Web scraping

None of us on the team had much experience with web crawling or scraping. We raised this question in one of our DAP sharing sessions, and one Data Associate suggested using Selenium to scrape websites. With that, we headed to Burpple and collected more than 100,000 reviews from 1,133 restaurants, a substantial 88 reviews per restaurant. We also visited Zomato’s website and collected reviews and ratings for about 8,600 restaurants.
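Our scraper itself isn’t shown in this post, but a minimal Selenium loop looks something like the sketch below. The URL and CSS selector are hypothetical stand-ins, not Burpple’s real markup, and running it requires the selenium package plus a browser driver on your PATH.

```python
def scrape_reviews(url, selector=".review-body"):
    """Collect raw review texts from a listing page.

    The selector is a placeholder; the real site's class names differ.
    Selenium is imported lazily so the function can be defined without
    the package installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        # Grab the visible text of every element matching the selector.
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, selector)]
    finally:
        driver.quit()
```

In practice we ran a loop like this over every restaurant page we had collected, appending the results to one big dataset.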

Loading a site on Burpple

3. Data Cleaning

One problem we faced was that the reviews we obtained were not clean: they contained emojis, ‘\n’ characters (line breaks), and other textual noise. To clean the reviews and reduce them to plain strings, we used string manipulation, regex and an emoji-cleaning library.

4. Modelling & Interface

To predict restaurant prices based on the factors we identified earlier, we used logistic regression. For the interface, we decided to use Telegram due to its widespread use.
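To give a feel for the modelling step, here is a minimal pipeline that one-hot encodes location and cuisine and fits a linear regression (one of the model families listed in our takeaways). The data frame is entirely made up for illustration; our real dataset came from the scraped reviews.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: four restaurants with location, cuisine and price.
df = pd.DataFrame({
    "location": ["Bugis", "Orchard", "Bugis", "Orchard"],
    "cuisine":  ["Italian", "Chinese", "Chinese", "Italian"],
    "price":    [18.0, 12.0, 10.0, 25.0],
})

# One-hot encode the two categorical factors, then regress on price.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(), ["location", "cuisine"])])),
    ("regress", LinearRegression()),
])
model.fit(df[["location", "cuisine"]], df["price"])

pred = model.predict(pd.DataFrame({"location": ["Bugis"],
                                   "cuisine": ["Italian"]}))
```

The same pipeline shape works for the other models we tried (Random Forest, SVM): only the final step changes.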

To use the bot, users only need to input the restaurant’s location and the type of cuisine they want to try. The price the bot returns is the average price for that cuisine in that area; any price above the bot’s reply is considered a bad deal.
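The core of each bot reply is just a lookup plus a formatted message. The sketch below uses an invented price table in place of the model’s prediction, and leaves out the Telegram wiring itself:

```python
# Hypothetical price table; the real bot asks the trained model instead.
AVG_PRICE = {
    ("Bugis", "Italian"): 19.0,
    ("Orchard", "Chinese"): 13.0,
}

def reply_for(location: str, cuisine: str) -> str:
    """Build the message the bot sends for a (location, cuisine) query."""
    price = AVG_PRICE.get((location, cuisine))
    if price is None:
        return "Sorry, I don't have data for that combination yet."
    return (f"The average price for {cuisine} food around {location} "
            f"is about ${price:.2f}. Anything above that is a bad deal!")

print(reply_for("Bugis", "Italian"))
```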

Below is a snapshot from our Telegram Bot:

How users get to interact with the bot and obtain the predicted price.

Project roadblocks

1. Inconsistent Web Structure

Zomato’s web structure was not consistent: each page is laid out differently, which made scraping very difficult because there was no single pattern to extract the data.

For example, searching for “reviews” on one page retrieved actual, typewritten reviews, but on another page it retrieved the number of votes or even the location, which corrupted the integrity of the entire dataset.

Hence, we decided to scrape for reviews from Burpple instead of Zomato.

2. Websites with pop-up ads, no infinite scrolling

While we were scraping, some web pages displayed advertisements that popped up at random times, so we had to close them manually for the scraping to continue.

Moreover, to load more restaurants on Burpple, we had to manually click “load more”. This meant that if we stopped scraping at the 500th restaurant, we would need to start over from the 1st restaurant to scrape the 501st and beyond.

These two problems made the scraping very time-consuming.

3. Missing values

There were missing values for some of the restaurants as well. Thankfully, we were able to exclude these records so that they did not affect our prediction model.
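Excluding rows with missing values is a one-liner in pandas; the toy frame below is invented to illustrate the idea:

```python
import pandas as pd

# Toy frame with one missing price, mirroring the gaps in our scraped data.
df = pd.DataFrame({
    "restaurant": ["A", "B", "C"],
    "price": [12.0, None, 20.0],
})

# Keep only rows where the price is present.
clean = df.dropna(subset=["price"])
```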

Takeaway, please!

Throughout the DAP experience, we’re proud to say that we have acquired the following skills:

  • Web Scraping
  • Data Cleaning
  • Feature Engineering
  • One-Hot Encoding
  • Modelling (Linear Regression, Random Forest, SVM)

We hope that our project has demonstrated how analytics can be used simply, in a way that can improve your meal times in Singapore. Till next time, ciao!
