What I learned from my first data science competition

Nick Mills
6 min read · Jul 27, 2015

About me

First, a bit about me. I am a research analyst at a nonprofit in the San Francisco Bay Area. I recently enrolled in SlideRule’s Foundations of Data Science Workshop. For my Capstone assignment, my mentor, Soups Ranjan, suggested participating in the Keeping it Fresh competition, hosted by DrivenData.

About the competition

The goal of the Keeping it Fresh competition was to predict the number of health code violations that restaurants received during health inspections, using data provided by Yelp and the city of Boston’s health department. I only participated in phase I of the competition, in which predictions were submitted for a set of historical restaurant violations.

Data

The data provided by Yelp included a variety of business attributes: location, price level, average star rating, type of food served, ambience, and noise level. Yelp also provided the date and text of all reviews associated with a business, information about the users who wrote the reviews, and the businesses at which users checked in. The file that the city of Boston provided contained the number of violations restaurants received during a health inspection on a given date, in three categories: minor, major, and severe.

Scoring predictions

Submitted predictions were scored by calculating the model’s weighted root mean squared logarithmic error (RMSLE). Errors for major infractions were weighted more heavily than those for minor infractions, and errors for severe infractions were weighted most heavily of all.
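For a sense of what that metric looks like, here is a rough R sketch of a weighted RMSLE. The weights below are illustrative placeholders, not the competition’s official values.

    # Weighted RMSLE, a minimal sketch
    weighted_rmsle <- function(pred, actual, weights) {
      # pred, actual: n x 3 matrices of violation counts (minor, major, severe)
      # weights: length-3 vector; heavier values for more serious categories
      sq_log_err <- (log1p(pred) - log1p(actual))^2
      w <- matrix(weights, nrow(pred), 3, byrow = TRUE)
      sqrt(sum(w * sq_log_err) / sum(w))
    }

    # Example call; the weights here are assumed, not the official ones
    weighted_rmsle(pred    = matrix(c(2, 1, 0), nrow = 1),
                   actual  = matrix(c(3, 0, 0), nrow = 1),
                   weights = c(1, 2, 5))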

Features used

As a novice, I took full advantage of the starter Python script that DrivenData provided to process the Yelp reviews data. The script loaded the data, linked the restaurant identifiers in the city’s file to Yelp business IDs, and created features from the loaded data. Features were generated using the term frequency-inverse document frequency (TF-IDF) method.

TF-IDF Script

This technique counts how frequently words appear in the data, but down-weights words that appear in a large share of reviews. The idea is that words appearing in many reviews are less predictive of infractions, because they are likely to show up in reviews of restaurants with excellent health records and restaurants with poor health records alike.
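The starter script did this in Python; the same idea sketched in R with the tm package, on a toy corpus, looks roughly like this:

    # TF-IDF on a toy set of reviews (the starter code used Python;
    # this is just the same idea in R)
    library(tm)

    reviews <- c("great food but dirty floors",
                 "dirty kitchen and slow food service",
                 "great service and great food")
    corpus <- VCorpus(VectorSource(reviews))
    dtm <- DocumentTermMatrix(corpus,
                              control = list(weighting = weightTfIdf))
    inspect(dtm)  # "food" appears in every review, so its weight drops to zero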

For processing the other data, as well as fitting models, I used R. I used the jsonlite package to read in the Yelp file that contained information about business characteristics. I generated features from business category, star rating, price level, zip code, ambience, and noise level.
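Assuming the Yelp business file is newline-delimited JSON (one business per line), reading it might look like this; the filename here is a placeholder.

    library(jsonlite)

    # stream_in reads newline-delimited JSON, one record per line
    businesses <- stream_in(file("yelp_boston_businesses.json"))
    str(businesses$attributes)  # nested fields such as ambience and noise level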

The models I tried

I set a goal for myself to try to beat the score of the linear regression benchmark. For this model, the number of restaurant infractions in each category was predicted using ordinary least squares regression, where the features were the 1,500 most frequently occurring single words in Yelp reviews, after adjusting for document frequency.

Unfortunately I did not meet this goal, but I got somewhat close. In the following section I describe the various features I experimented with. I then discuss the different algorithms I tried.

Review terms
I used the TF-IDF method to extract terms from the reviews.
One of my hypotheses was that terms occurring frequently in the weeks immediately before an inspection might be more predictive than terms drawn from all prior reviews. This guess did not turn out to be true. Using the 250 most frequent terms from all past reviews outperformed using the 250 most frequent terms from reviews written only 30 or 90 days before the inspection date. Even when I increased the number of recent-review terms to 1,500, the 250 terms from all past reviews still performed better.

I also tried principal component analysis to reduce the number of features. I found that using components derived from the TF-IDF terms in lieu of the terms themselves did not improve the fit of the model.
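A sketch of that dimensionality-reduction step with base R’s prcomp, assuming the TF-IDF matrix is held in a documents-by-terms matrix called tfidf:

    # tfidf is assumed to be a numeric matrix of documents x terms
    pca <- prcomp(tfidf, center = TRUE, scale. = FALSE)

    # Keep enough components to explain, say, 90% of the variance
    var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    k <- which(var_explained >= 0.90)[1]
    components <- pca$x[, 1:k]  # used in place of the raw TF-IDF terms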

Business characteristics
To explore the business characteristics dataset provided by Yelp, I estimated several logistic regression models in which the dependent variable was whether a restaurant had received any infraction in a given category in the training data, and the features were different business attributes. I noticed that price level and the “Chinese” category were frequently significant, so I used these two features in most of my model submissions.
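One of those exploratory models might be sketched like this; the column names are hypothetical, not the actual variable names in the competition data.

    # Did a restaurant receive any severe infraction?
    train$any_severe <- as.integer(train$severe_violations > 0)
    fit <- glm(any_severe ~ price_level + is_chinese + noise_level,
               data = train, family = binomial)
    summary(fit)  # look for coefficients that are consistently significant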

I also tried submitting predictions that included other business attributes, such as ambience, noise level, zip code, and other restaurant categories; however, these models performed worse than models containing only price level and the Chinese category.

Algorithms
I experimented with three algorithms in my competition submissions: regression trees, linear regression, and Poisson regression. I achieved the best results with the regression tree algorithm implemented in R’s rpart package. I selected the algorithm’s complexity parameter using the cross-validation tools in R’s caret package.
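That tuning step looks roughly like the sketch below; the data frame and formula are hypothetical.

    library(rpart)
    library(caret)

    # Ten-fold cross-validation over a grid of complexity parameter values
    ctrl  <- trainControl(method = "cv", number = 10)
    tuned <- train(minor_violations ~ ., data = train,
                   method = "rpart", trControl = ctrl, tuneLength = 20)
    tuned$bestTune            # the cp value chosen by cross-validation
    tree <- tuned$finalModel  # the rpart tree refit at that cp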

I also experimented with linear regression, which did not perform as well. Because linear regression sometimes yielded negative predicted values, which were nonsensical given that the number of violations could not be negative, I also tried Poisson regression using R’s glm function. It sometimes performed better than linear regression, but always worse than the regression tree.
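A minimal sketch of the Poisson alternative, with hypothetical column names again:

    # The log link guarantees non-negative predicted counts
    pois  <- glm(minor_violations ~ price_level + is_chinese,
                 data = train, family = poisson)
    preds <- predict(pois, newdata = test, type = "response")  # counts >= 0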

My best submission
My best submission came from fitting a regression tree with 250 TF-IDF features from the review data, price range, and an indicator for the Chinese restaurant category. I was surprised that the business attribute data was not more predictive. Then again, many of the words in the review text overlapped with the information in the business categories file; for example, neighborhood names and restaurant types often appeared among the features. The R and Python scripts for this submission are located here.

What I learned from participating in the competition

Finally, I will close with some things I learned while participating in the competition, sometimes the hard way.

Borrow from others. I drew heavily on the starter code provided by DrivenData. Because my computer could not handle large datasets, I also set up a virtual machine on Amazon Web Services, and there is no way I could have figured that out without Louis Aslett’s Amazon Machine Image for RStudio. As a beginner, not having to reinvent the wheel in these two situations saved me a substantial amount of time.

Keep track of everything you try. When developing models it is important to have a good record of what you tried and what worked.

Develop a validation system. DrivenData only allows three submissions a day. In hindsight, I should have written code that calculated the root mean squared logarithmic error on a holdout portion of the training set while I was developing models, so that I would have had some idea of how a model might fare on the test set without using up a submission.
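For example, a simple holdout check in R might have looked like this; the column names are hypothetical, and the competition’s per-category weighting is omitted for brevity.

    set.seed(42)
    idx    <- sample(nrow(train), size = floor(0.8 * nrow(train)))
    fit    <- lm(minor_violations ~ ., data = train[idx, ])
    preds  <- pmax(predict(fit, newdata = train[-idx, ]), 0)  # clip negatives
    actual <- train$minor_violations[-idx]
    sqrt(mean((log1p(preds) - log1p(actual))^2))  # local RMSLE estimate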

Debug your program on a small dataset. When you are still working on getting code up and running, it saves a great deal of time. Sometimes I forgot this step because I was so keen on having a set of results to submit.
