ChiPy Data Science Mentorship, pt. 1
This post is the first in a three-part series about my experience as a mentee in the Data Science Track of the ChiPy Mentorship Program. ChiPy, short for the Chicago Python User Group, is a community of talented and enthusiastic individuals brought together by their appreciation for the programming language Python. The warmth and excitement of this group are a microcosm of the greater Python community and part of what I find so appealing about Python.
I first encountered ChiPy as a student in General Assembly’s Data Science Immersive program, where we used Python for most of our coding. At my first ChiPy meetup, I learned about the mentorship program and eagerly applied for the opportunity to learn more about Python and immerse myself in the community. Given that I am still relatively new to coding, I feel grateful for the chance to keep learning in a community setting.
I could not have been more excited when I learned that Kevin Goetsch, currently a senior data scientist with GrubHub, was assigned to me as a mentor. I had already met Kevin, since he and my GA instructor were formerly colleagues at Braintree, and I had watched his PyData talk on pipelining with Scikit-Learn. I really appreciate Kevin’s ethos when it comes to programming: he is tool agnostic, believes that a good, working model is better than a perfect one that never gets used, and focuses on how a product can be cleanly productionalized. Effectively, the Tim Gunn “Make-It-Work” approach to programming.
The guiding question for my project is: Can we predict what a beer will be rated on Untappd (an influential social media platform devoted to beer) before it goes into production? The second question is — and this cuts to the core of data science — why do we care?
We have to answer the latter question first in order to make a plan of attack for the former. If the answer to why is “because I want to know which IPAs are the most popular,” then we can perform some data analysis and keep our data science toolbox in the shed. If the answer is, “because I want the beers I’m making to be rated 4 stars or higher after 10 ratings” then it probably doesn’t make sense for us to include beers with 100,000 ratings in our model.
For the sake of this project, we are going to assume the hypothetical why is as follows:
We want to be able to advise different breweries about their upcoming beers and make recommendations as to how they might garner higher ratings on Untappd.
In order to do that, we need to build a model that can accurately predict a beer’s rating. This problem can be broken down into four distinct parts:
- Data acquisition from Untappd of both the target variable (a beer’s rating) and potential features
- Data cleaning and exploratory data analysis
- Iterative model testing and continued feature engineering
- Embedding the trained model in a webapp to serve up predictions on demand
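To give a flavor of the first step, here is a minimal sketch of parsing a beer’s rating out of a page. Everything here is hypothetical: the HTML snippet, class names, and field names are invented for illustration, since Untappd’s real markup differs, and the real project would fetch live pages with Requests rather than parse a hardcoded string.

```python
from bs4 import BeautifulSoup

# Stand-in for what requests.get(beer_url).text would return in the
# real project; the structure and class names here are invented.
html = """
<div class="beer">
  <h1 class="name">Hypothetical IPA</h1>
  <span class="rating">4.12</span>
  <span class="raters">1,204</span>
</div>
"""

def parse_beer(page_html):
    """Pull the target variable (the rating) plus a couple of fields."""
    soup = BeautifulSoup(page_html, "html.parser")
    return {
        "name": soup.select_one(".name").get_text(strip=True),
        "rating": float(soup.select_one(".rating").get_text()),
        "num_ratings": int(soup.select_one(".raters").get_text().replace(",", "")),
    }

beer = parse_beer(html)
```

The same `parse_beer` function would be mapped over every beer page collected, building up the raw dataset for the cleaning and EDA stage.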
By the conclusion of this mentorship program, I’d like to have done the following:
- Build my Python fluency! Gain greater familiarity with different constructs and write more pythonically
- Acquire data via web scraping with Requests and Beautiful Soup
- Visualize my results with libraries I haven’t used before
- Successfully implement cutting-edge machine learning techniques like neural networks and extreme gradient boosting and compare their efficacy to industry standards
- Productionalize my model by building it into a Scikit-Learn pipeline object and pickling it
- Dip my toes into web development by building a Flask app to serve up predictions
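The pipeline-and-pickle goal above can be sketched in miniature. The feature names (ABV, IBU) and toy numbers below are placeholders I made up, not the project’s real inputs; the point is only the shape of the workflow: fit a Pipeline, serialize it, reload it, predict.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data: two made-up features (say, ABV and IBU) and a
# fake Untappd rating as the target.
X = np.array([[5.0, 40], [6.5, 60], [8.0, 90], [4.5, 25]])
y = np.array([3.6, 3.9, 4.2, 3.4])

# Bundling preprocessing and the estimator into one Pipeline keeps
# the whole thing picklable as a single object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# Serialize and reload -- this is what lets a web app serve
# predictions later without retraining.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
prediction = restored.predict([[7.0, 70]])[0]
```

Because scaling lives inside the pipeline, the reloaded object applies the exact same preprocessing at prediction time that it learned during training, which is the main production benefit of the approach.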
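And for the Flask goal, a toy app might expose a single prediction endpoint. The route, query parameter, and `score` function here are all hypothetical stand-ins; the real app would unpickle the trained Scikit-Learn pipeline at startup and call its `predict` method instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score(abv):
    # Stand-in for pipeline.predict(); invented formula for illustration.
    return round(3.0 + 0.1 * abv, 2)

@app.route("/predict")
def predict():
    abv = float(request.args.get("abv", "6.5"))
    return jsonify({"predicted_rating": score(abv)})

# Flask's test client exercises the route without running a server:
client = app.test_client()
resp = client.get("/predict?abv=7.0")
result = resp.get_json()
```

Serving predictions this way means the model is trained once, offline, and the web layer stays a thin wrapper around a single function call.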
I recently moved to Chicago from Washington, DC. Before that, I got a degree in Comparative Literature at Davidson College in North Carolina. I got interested in data science as a means of discovery and a mode of storytelling (I’m sure there’s a good median joke in there somewhere). I was born and raised in Denver and remain surprised about how cool people think it is now. And, unsurprisingly, I like beer.
Stay tuned for Part 2!