This is my second post about my experience as a mentee in the ChiPy Mentorship Program, following the first one (https://medium.com/@juneyi1/kitchen-nightmare-with-data-science-and-a-web-app-part-1-6cf3be7540b8). As mentioned in the previous post, my goal for this program is to build a web app that takes information about a restaurant and returns my model’s prediction of whether that restaurant will fail. The model still needs some tuning, but I will describe it in this post so that I can go straight into the web development part in the next one.
The data is all from Yelp. Since I wanted to build a classification model that predicts whether a restaurant in Chicago is closed or open, I needed data on both closed and open restaurants. I soon found out that the Yelp page of a closed restaurant is not returned when searching Yelp by name or by “closed restaurant” for a given location (in the “Near” search bar); however, the page still exists on Yelp.com in the form of “https://www.yelp.com/biz/restaurant_name-chicago”. Therefore, if the name of a closed restaurant is known, there is a chance of obtaining its info from Yelp. After some searching online, I came across three links that list the names of restaurants that closed in Chicago in the past few years [ref 1–3]. Once I had the data for the closed restaurants in hand, collected via web scraping with the Requests library and XPath queries in Python, I collected data for open restaurants centered around the closed ones in the three ways listed below. In the end, there were 303 closed restaurants and 849 open restaurants in the data set.
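To make the URL pattern above concrete, here is a minimal sketch of turning a restaurant name into a candidate Yelp URL. The slug rules (lowercase, drop apostrophes and accents, join words with hyphens) and the helper names `guess_yelp_slug` and `guess_yelp_url` are my assumptions for illustration, not Yelp’s documented behavior, so a guessed URL may simply not exist.

```python
import re
import unicodedata

YELP_BIZ_URL = "https://www.yelp.com/biz/{slug}"


def guess_yelp_slug(name: str, city: str = "chicago") -> str:
    """Guess the slug part of a Yelp business URL (hypothetical rules)."""
    # Strip accents and any other non-ASCII characters (e.g. curly apostrophes).
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop straight apostrophes so "Joe's" becomes "joes" rather than "joe-s".
    ascii_name = ascii_name.replace("'", "")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return f"{slug}-{city}"


def guess_yelp_url(name: str, city: str = "chicago") -> str:
    """Build a candidate URL; the page still has to be fetched to confirm it."""
    return YELP_BIZ_URL.format(slug=guess_yelp_slug(name, city))
```

For example, `guess_yelp_url("Joe's Pizza & Grill")` gives `https://www.yelp.com/biz/joes-pizza-grill-chicago`, which can then be requested and parsed if the page exists.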
While digging out the data, I realized that I had to make some assumptions. Here is the first one: the dates of a restaurant’s first and last reviews were used to approximate when the restaurant opened and closed, respectively. This is because the actual opening and closing dates of a restaurant are not recorded on Yelp, and I had no other good way to find them. As a consequence, restaurants that have never been reviewed were not included in the dataset.
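The assumption above can be sketched in a few lines. The function name `approximate_lifespan` is hypothetical, and the input is assumed to be the list of review dates already scraped for one restaurant.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def approximate_lifespan(
    review_dates: Sequence[date],
) -> Optional[Tuple[date, date]]:
    """Approximate (opened, closed) by the first and last review dates.

    Returns None for a restaurant with no reviews, which is then
    left out of the dataset.
    """
    if not review_dates:
        return None
    return min(review_dates), max(review_dates)
```

For an open restaurant, the second value is only the date of its most recent review, not a closing date.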
To have a fair comparison between closed and open restaurants, I wanted them to have experienced roughly the same macroeconomic situation, time-wise. For that, I set a time window of 2012 to 2017. That is to say, among all the closed restaurants I collected, only those that permanently closed between 2012 and 2017 were included in the dataset.
As for the open restaurants, an open restaurant can be included in the data set only if it has stayed open from 2010 through 2017. This ensures that the open and closed restaurants have some overlap in time.
Only certain neighborhoods were considered when including open and permanently closed restaurants in the data set. This is because I also wanted them to have experienced roughly the same macroeconomic situation, location-wise. Therefore, open restaurants were sampled from the distribution of the neighborhoods where the permanently closed restaurants were located. Such sampled (open) restaurants also need to meet the time constraint mentioned above to be included in the dataset.
3. Exclusion of top 100 chain restaurants
If a restaurant belongs to a chain, I assumed that its closure could be due to the brand’s strategy, and not necessarily to how its own business was doing. Therefore, I excluded the top 100 chain restaurants reported here.
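The three selection rules above could be sketched roughly as follows. This is not the code from my scraper: the `Restaurant` record, the exact cutoff dates, and reading “open since 2010” as “first review no later than the end of 2010” are assumptions made for illustration.

```python
import random
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Set

WINDOW_START = date(2012, 1, 1)   # closed restaurants: last review in 2012-2017
WINDOW_END = date(2017, 12, 31)
OPEN_BY = date(2010, 12, 31)      # open restaurants: first review by end of 2010


@dataclass
class Restaurant:
    name: str
    neighborhood: str
    first_review: date
    last_review: date
    is_closed: bool


def is_chain(r: Restaurant, chain_names: Set[str]) -> bool:
    # chain_names is assumed to hold lowercase names from the top-100 list.
    return r.name.strip().lower() in chain_names


def keep_closed(r: Restaurant, chain_names: Set[str]) -> bool:
    in_window = WINDOW_START <= r.last_review <= WINDOW_END
    return r.is_closed and in_window and not is_chain(r, chain_names)


def keep_open(r: Restaurant, chain_names: Set[str]) -> bool:
    open_long_enough = r.first_review <= OPEN_BY
    return (not r.is_closed) and open_long_enough and not is_chain(r, chain_names)


def sample_neighborhoods(
    closed: Sequence[Restaurant], k: int, rng: random.Random
) -> List[str]:
    """Draw k neighborhoods, weighted by how many closed restaurants each has."""
    counts = Counter(r.neighborhood for r in closed)
    names = list(counts)
    return rng.choices(names, weights=[counts[n] for n in names], k=k)
```

Each sampled neighborhood is then searched for open restaurants, and a candidate is kept only if `keep_open` accepts it.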
For each restaurant, the following features were treated as binary categorical predictors: whether it is claimed by the owner, has a website, accepts credit cards, is good for groups, is good for kids, takes reservations, has outdoor seating, does take-out, does delivery, and has a TV. The following features were treated as multi-class categorical predictors: category, neighborhood, price range, attire (dressy, formal, casual), parking (garage, street, valet, validated, private lot), alcohol (no, full bar, beer & wine only), noise level (loud, very loud, quiet, average), and Wi-Fi (no, paid, free).
For a given restaurant, the date, star rating, and text of each review were collected from Yelp as well. VADER sentiment analysis was performed on the text of each review to obtain the compound score, and TextBlob was used to obtain the subjectivity score.
If I want to predict whether a restaurant will close within the next 3 months, I have to do that without the reviews and ratings it will receive between now and 3 months from now. To make the model genuinely predictive, I mimicked this situation for the restaurants in the data set: for a given restaurant, the reviews from its last 13, 26, or 52 weeks were excluded, and the models trained on what remained were used to predict whether a restaurant will survive the next 13, 26, or 52 weeks, respectively, as shown in Figure 1. The remaining reviews were used to calculate the average number of reviews per year, the average rating, and the ratings for 5-, 4-, 3-, 2-, and 1-star reviews as numerical predictors.
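Here is a minimal sketch of the blocking step described above. It assumes the block is measured back from the restaurant’s most recent review, that a history shorter than one year counts as one year, and that the per-star predictors are plain counts; the `Review` record and the function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Sequence


@dataclass
class Review:
    day: date
    stars: int  # 1 to 5


def block_recent_reviews(reviews: Sequence[Review], block_weeks: int) -> List[Review]:
    """Drop reviews from the last `block_weeks` weeks before the latest review."""
    if not reviews:
        return []
    cutoff = max(r.day for r in reviews) - timedelta(weeks=block_weeks)
    return [r for r in reviews if r.day <= cutoff]


def review_features(reviews: Sequence[Review]) -> Dict[str, float]:
    """Numerical predictors computed from the reviews that were not blocked."""
    if not reviews:
        return {}
    days = [r.day for r in reviews]
    # Assumption: treat anything shorter than a year as one year of history.
    span_years = max((max(days) - min(days)).days / 365.25, 1.0)
    star_counts = Counter(r.stars for r in reviews)
    features = {
        "reviews_per_year": len(reviews) / span_years,
        "avg_rating": sum(r.stars for r in reviews) / len(reviews),
    }
    for star in range(1, 6):
        features[f"stars_{star}"] = float(star_counts.get(star, 0))
    return features
```

Calling `review_features(block_recent_reviews(reviews, 13))` gives the predictors for the 13-week model; 26 and 52 work the same way.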
For each blocked time period, I used a voting classifier that combines logistic regression, a random forest, and a gradient boosted classifier. GridSearchCV was used to explore the parameter space with cross-validation. The accuracy scores and best parameters were recorded, and accuracy was used to measure the success of a model. While the blind-guess accuracy (“baseline”) on the test set is 0.729, the accuracy scores for the 13, 26, and 52 “block weeks” models are 0.863, 0.863, and 0.860, respectively.
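The setup above can be sketched with scikit-learn as follows. This runs on synthetic data from `make_classification`, not on the Yelp data set, and the soft voting, the parameter grid, and the train/test split are assumptions chosen to keep the example small, so the numbers it produces say nothing about the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the restaurant features (roughly 3 open : 1 closed).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.74, 0.26], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three models vote; "soft" averages their predicted probabilities.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Parameters of a sub-model are addressed as "<estimator name>__<parameter>".
param_grid = {
    "lr__C": [0.1, 1.0],
    "rf__n_estimators": [50, 100],
    "gb__learning_rate": [0.05, 0.1],
}
search = GridSearchCV(voter, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Blind-guess baseline: always predict the majority class of the test set.
positive_rate = y_test.mean()
baseline = max(positive_rate, 1 - positive_rate)
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("best parameters:", search.best_params_)
print(f"baseline: {baseline:.3f}, test accuracy: {test_accuracy:.3f}")
```

The same search would be repeated once per block period (13, 26, and 52 weeks), each time with the features computed from the reviews left after blocking.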
While I tried to put as much thought as possible into this model, it is by no means a perfect or a rigorous one. When I talked to people about it, I realized that there are countless ways to improve it. The model was built out of my interest in food/restaurants and data science. That being said, all the relevant code can be found on my GitHub. And I had to spend a post on it because it is the model that I want to build a web app for by struggling through Django (to be continued … in the next blog!).