The Wild and Wonderful World of Web Scraping — Chipy Mentorship, pt. 2
In my previous post, I outlined the steps I planned to take to build a machine learning model capable of predicting a beer’s rating on Untappd, a social media platform for beer. Data acquisition was the first step in the process and is the focus of this post.
So, how do we get data from the web? As with all things, context is paramount. If I worked with Untappd, I would likely have access to their database, and data acquisition would be a very different (though still not totally simple) process. Since I do not work for Untappd, I followed a different path:
- Request access to Untappd’s API
- If denied access to API, request a data dump
- If denied a data dump, scrape the data
APIs in Thirty Seconds
APIs (Application Programming Interfaces) are structured, rule-based ways to interact with another program or, in this case, a website. They come in many different flavors, but for our purposes I will give the simplest explanation of their function: an API provides the user with a set of rules and procedures for how to make a request of it. These rules are not uniform across all APIs but are instead unique to a given API and defined by the people who built it. The response from the API also follows a predictable structure. If I know that an API will feed me back information in the same structure about a particular beer every time I feed it that beer’s name, I can easily build a framework to make that data accessible to me on demand by making a call to the API as needed.
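To make the request-and-response idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the endpoint `https://api.example.com/v1/beer/search`, the `q` and `key` parameters, and the shape of the JSON response are hypothetical and are not Untappd's actual API.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; a real API documents its own URL and parameters.
API_ROOT = "https://api.example.com/v1/beer/search"


def build_request_url(beer_name, api_key):
    """Follow the API's (assumed) rules: a beer name plus a user-specific key."""
    return API_ROOT + "?" + urlencode({"q": beer_name, "key": api_key})


def parse_beer_response(response_text):
    """Pull the same fields out of every response, relying on its fixed structure."""
    beer = json.loads(response_text)["beer"]
    return {"name": beer["name"], "rating": float(beer["rating"])}


# Placeholder response body standing in for what the server would send back.
SAMPLE_RESPONSE = '{"beer": {"name": "Example Beer", "rating": 3.5}}'

print(build_request_url("Example Beer", "MY_KEY"))
print(parse_beer_response(SAMPLE_RESPONSE))
```

Because the response always has the same shape, `parse_beer_response` can be reused unchanged for every beer.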
Open APIs are publicly available, but private APIs are not and frequently require a user-specific key to be able to make requests of the API. Untappd grants keys to developers working to build apps that take advantage of Untappd’s data. I requested an API key from Untappd but was denied given that my project is academic in nature.
Data Dumps in Five Seconds
A data dump can happen in any number of ways, but at its core it is just the wholesale, one-time transfer of a bunch of data from one place to another. After being denied an API key, I politely asked Untappd for a data dump. They politely declined.
So what’s an enterprising, young data scientist to do with only a pocket full of dreams and algorithms and not a row of data in sight?
In January, around the time of the presidential inauguration, there was an upsurge in interest in webscraping. The concern for many “hacktivists” was that following the change in administration, data collected by certain government agencies were in jeopardy (you can read more about that here). The underlying notion is that even if you don’t have administrative permissions on a web page, so long as you can visit that webpage, you have access to its data. Or, as Hartley Brody writes in this awesome blog post:
Any content that can be viewed on a webpage can be scraped. Period.
By using Google Chrome’s ‘Inspect’ feature, a user can see the raw HTML that is being read to build out the webpage, as illustrated above. Chrome even highlights the field on the page to which a given line of code refers. Digging into the HTML, we see all the relevant data about the beer tucked neatly away. Now, we have but to store that data in a familiar data structure and save it to a file.
There are any number of different tools you could use to scrape a webpage, but I used Requests and Beautiful Soup. I first used Requests to retrieve the HTML and then used Beautiful Soup to wade through it. Each beer was represented by its own dictionary where the keys were the features and the values the data for that particular beer. The script periodically takes all of the beers and turns them into a Pandas dataframe before writing that dataframe to a CSV. You can check out the code for this below:
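The parse-and-save half of that flow can be sketched roughly as follows. This is not the original script. To stay runnable without extra installs, it swaps in the standard library's `html.parser` for Beautiful Soup and `csv` for pandas, and it skips the network request entirely. The class names in `FIELDS` and the sample page are hypothetical placeholders, not Untappd's real markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical class names, one per feature; the real page markup will differ.
FIELDS = ("name", "brewery", "style", "rating")


class BeerPageParser(HTMLParser):
    """Collects the text inside the first element tagged with each class in FIELDS.

    Assumes each field's text sits directly inside its element, not in child tags.
    """

    def __init__(self):
        super().__init__()
        self.beer = {}
        self._current = None  # name of the field being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELDS and cls not in self.beer:
                self._current = cls
                self.beer[cls] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.beer[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def parse_beer(html):
    """Turn one page of HTML into a dictionary of feature -> value."""
    parser = BeerPageParser()
    parser.feed(html)
    return {field: text.strip() for field, text in parser.beer.items()}


def write_beers(beers, out):
    """Write a list of beer dictionaries as CSV rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(beers)


# Placeholder page standing in for the HTML a request would return.
SAMPLE_PAGE = """
<div class="content">
  <h1 class="name">Example Beer</h1>
  <p class="brewery">Example Brewery</p>
  <p class="style">Example Style</p>
  <span class="rating">3.50</span>
</div>
"""

beer = parse_beer(SAMPLE_PAGE)
buffer = io.StringIO()  # stands in for an open CSV file
write_beers([beer], buffer)
print(buffer.getvalue())
```

Keeping one dictionary per beer means the CSV columns fall directly out of the dictionary keys, whichever parsing library does the digging.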
Easy enough, right?
Or Is It?
Web scraping across a large data set, in this instance millions of rows, requires reverse engineering the logic of the website. Take a look at this snippet from the first line of my script:
requests.get('https://untappd.com/b/---/%s' % number)
The stem of every beer page on Untappd was the same, “https://untappd.com/b/”. The next two elements were the name of the beer and its id number, respectively. Fortunately, I discovered that leaving the name filled in with dashes and simply interpolating a number at the end would redirect the request to the appropriate URL. By exploring the site, I realized the numbers began at 1, so for my basic scraping logic, I would loop through all the pages starting at 1 and continuing through the final page. Unfortunately, for reasons unknown to me, many of the numbers also redirected to a different beer’s page. For instance:
This means that when I did scrape beer number 505601, I would now have the same beer twice! In reality, I often had ten or twelve entries for the same beer since many numbers would redirect to the same page. While it would have been easy enough to disallow redirects, I couldn’t because the page needed to redirect to add the name of the beer in place of the dashes even when the number was correct.
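One possible way to catch these repeats at scrape time, sketched here as an assumption rather than as the script I actually ran, is to read the id off the final URL after the redirect and skip any id already seen. The sketch assumes the redirected URL ends in the beer's own id, as in the stem/name/id pattern described above. The `resolve` argument is a hypothetical stand-in for the network call (with Requests, the final URL is available as `response.url`), and the redirect table is made up for the demo.

```python
from urllib.parse import urlparse

BASE = "https://untappd.com/b/---/%s"


def beer_url(number):
    """Build the dashed URL for a given number."""
    return BASE % number


def canonical_id(final_url):
    """Return the trailing id from a URL of the form .../b/<name>/<id>."""
    return int(urlparse(final_url).path.rstrip("/").split("/")[-1])


def scrape_ids(numbers, resolve):
    """Visit each number once, keeping only ids not already reached.

    resolve(url) must return the final URL after any redirects.
    """
    seen = set()
    kept = []
    for number in numbers:
        beer_id = canonical_id(resolve(beer_url(number)))
        if beer_id in seen:
            continue  # this number led to a page that was already scraped
        seen.add(beer_id)
        kept.append(beer_id)
    return kept


# Made-up redirect table: number 2 lands on beer 3's page.
FAKE_REDIRECTS = {2: 3}


def fake_resolve(url):
    """Offline stand-in for a real request that follows redirects."""
    number = canonical_id(url)
    target = FAKE_REDIRECTS.get(number, number)
    return "https://untappd.com/b/example-beer/%s" % target


print(scrape_ids([1, 2, 3], fake_resolve))
```

Here number 2 is resolved first and lands on beer 3, so when the loop reaches number 3 itself the page is skipped instead of being stored a second time.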
No matter though, duplicate data is easily solved. I just dropped the duplicate rows from my dataset and in so doing turned myself into a cautionary tale!
The method I used to drop duplicates requires, unless told otherwise, that every value in two given rows be the same for them to be considered duplicates. In 90% of cases, this worked fine. In the other 10%, it failed. Untappd’s data is live: its database is updated every time someone posts to their account. It took me a couple of weeks to scrape all that data. So if I scraped Weihenstephaner Hefeweissbier on Monday, Elliott Jones quaffed one on Tuesday, and I scraped it again via a redirect on Wednesday, drop_duplicates() would naively perceive those two rows as different.
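The difference between comparing whole rows and comparing a single identifying column can be sketched in plain Python. The rows, the `beer_id` and `checkins` columns, and both helper functions are hypothetical. In pandas, the counterpart of the second helper is passing `subset=` to `drop_duplicates()`.

```python
def drop_full_row_duplicates(rows):
    """Keep a row unless an identical row (every value equal) was already kept."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_duplicates_by(rows, column):
    """Keep only the first row for each distinct value of one column."""
    seen, kept = set(), []
    for row in rows:
        if row[column] not in seen:
            seen.add(row[column])
            kept.append(row)
    return kept


# Hypothetical scrape: beer 1 appears three times, once with a newer check-in count.
rows = [
    {"beer_id": 1, "name": "Example Hefeweizen", "checkins": 100},  # Monday
    {"beer_id": 1, "name": "Example Hefeweizen", "checkins": 100},  # exact repeat
    {"beer_id": 1, "name": "Example Hefeweizen", "checkins": 101},  # Wednesday
    {"beer_id": 2, "name": "Example Stout", "checkins": 7},
]

print(len(drop_full_row_duplicates(rows)))
print(len(drop_duplicates_by(rows, "beer_id")))
```

The full-row comparison removes only the exact repeat, so the Wednesday copy of beer 1 survives. Keying on the id alone leaves one row per beer.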
When I first started learning about data science, my instructor taught us to “break the Excel habit.” You shouldn’t always need to see all your data. When you have a million rows of data it’s practically impossible. Though I am grateful for the advice, I’ve learned that it’s still important to look at a significant chunk of your data. It is only by looking at this chunk of data that I realized I still had multiple entries for certain beers and that my preliminary models were performing better than they should have.
A Brief Introduction to Modeling
Why was I so chagrined that I had duplicate data? This question cuts to the heart of Data Science.
Let’s imagine for a moment that you’ve just been hired to work at a carnival. Your boss wants you to work “The Guessing Game” booth and guess people’s weights. In order to learn how to guess a person’s weight, you shadow one of your seasoned colleagues for a few days to learn how to evaluate weight by sight. The big day comes, and you are finally doing the guessing yourself. Your boss is there to keep track of how well you do. You need to get at least 60% within 10 pounds to pass. Coincidentally, every last person to come to the booth that day is someone who also came to the booth while you were still observing. You get all but a couple right because you remember how much they weigh from the previous few days. Your boss is thrilled, but you feel nervous. Why?
You’re nervous because you don’t know if you can actually guess how much a person weighs by sight. All you know is that you can remember how much a person weighs. This issue is at the core of predictive modeling and the basis of Train/Test Splitting. In order to measure how effective a model is, we need to test it on data it has never seen before. We do so by building our model on one portion of the data and setting aside the rest of the data to test it on. Because we already know the target value for the test data — or in our analogy are about to find out when they step on the scale — we can compare the predictions our model makes to the actual value to measure the accuracy of the model.
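A train/test split and the boss's scoring rule can be sketched in a few lines of plain Python. The function names, the 25% test fraction, and the sample numbers are assumptions chosen for illustration; libraries such as scikit-learn provide their own versions of the split.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def share_within(predictions, actuals, tolerance):
    """Fraction of predictions that land within tolerance of the actual value."""
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predictions, actuals))
    return hits / len(actuals)


people = list(range(100))  # stand-ins for 100 carnival visitors
train, test = split_train_test(people)
print(len(train), len(test))

# Two hypothetical guesses scored against the scale, using the 10-pound rule.
print(share_within([150, 200], [155, 230], tolerance=10))
```

The model only ever learns from `train`. The rows in `test` play the part of strangers stepping up to the booth.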
The chief issue with the duplicate data in my dataset was that it was artificially inflating my model’s scores. My model had no problem guessing the rating for Weihenstephaner Hefeweissbier when it already knew the rating for Weihenstephaner Hefeweissbier.
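How duplicates leak across the split can be shown with a toy example. The beer ids below are made up, and the split simply holds out the last rows so the result is easy to follow by eye. A real split would shuffle first.

```python
def leaked_share(train_ids, test_ids):
    """Fraction of test rows whose beer also appears in the training rows."""
    known = set(train_ids)
    return sum(beer_id in known for beer_id in test_ids) / len(test_ids)


def split_in_order(rows, n_test):
    """Hold out the last n_test rows; a real split would shuffle first."""
    return rows[:-n_test], rows[-n_test:]


# Hypothetical beer ids in scrape order; 1 and 2 were scraped again via redirects.
scraped = [1, 2, 3, 4, 1, 2]
train, test = split_in_order(scraped, n_test=2)
print(leaked_share(train, test))

deduplicated = list(dict.fromkeys(scraped))  # keeps the first copy of each id
train, test = split_in_order(deduplicated, n_test=1)
print(leaked_share(train, test))
```

With the duplicates left in, every test row is a beer the model has already seen, so a model that merely memorizes ratings would look excellent. After deduplication, the test row is genuinely new.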
More to come!