Learning to scrape websites for data is essential to becoming a great data scientist. If the data you want to work with isn’t readily available, you can often collect it yourself. There are several ways to go about this. Some websites offer APIs, open for personal use, that return JSON data. If JSON isn’t your strong suit, however, this method can have a steep learning curve. Additionally, some APIs limit the number of calls you can make, so you may end up with a partial sample instead of the full dataset. There are also built-in browser tools for scraping, but those are often messy and can’t handle more than a single page of information. So how do we gather as much information from a website as possible? Enter browser automation.
It would take hours upon hours of your time to manually click through webpages to extract their data. But what if you were able to build a program that does it for you while you work on other things? Selenium WebDriver is a browser automation tool that lets you do just that, with bindings for several programming languages and drivers for the major browsers. I’m using Python and Chrome for this tutorial, but feel free to use whichever language you’re most comfortable with. Just be sure to read the documentation and install the proper driver for your browser.
Before you start scraping websites, be sure to have the proper packages installed and imported. I’ll be using Selenium for the scraping and Pandas to organize everything into a tidy dataset.
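Before opening a browser, it can help to confirm that both packages are actually installed. The sketch below is one way to check, using only the standard library; the helper name `missing_packages` is made up for this example. Once the check comes back clean, the usual imports are `from selenium import webdriver` and `import pandas as pd`.

```python
import importlib.util

# The two third-party packages this tutorial relies on.
REQUIRED_PACKAGES = ("selenium", "pandas")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("selenium and pandas are both available.")
```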
Now let’s get started! I’ll be scraping reviews on RottenTomatoes.com for the recent feature film ‘Us.’ First, we have to explore the website’s HTML makeup by highlighting an area, right-clicking, and choosing “Inspect.”
If you don’t know much HTML, no need to worry; for our purposes, the inspect feature on Chrome is fairly intuitive and requires only basic knowledge of HTML. We can see that each review sits in an element with the class “review_area,” which is further split into a description and the review itself.
Now that we know the basic makeup of reviews, we can start extracting data. First, we’ll take a list of review description objects and assign them to a variable “results.” Then, we can iterate over each object and scrape the text from each review. The output should give us a list of all the reviews on the page.
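Here is a minimal sketch of that step, assuming Selenium 4 and a working Chrome driver. The class name `the_review` is a guess at what Inspect will show for the review text, and `element_texts` and `scrape_review_texts` are names invented for this example, so adjust them to match the page. Selenium is imported inside the function only so that the small text-extraction helper can be tried without a browser.

```python
REVIEW_CLASS = "the_review"  # assumed class name; confirm it with Inspect


def element_texts(driver, by, value):
    """Return the visible text of every element that matches the locator."""
    results = driver.find_elements(by, value)
    return [result.text.strip() for result in results]


def scrape_review_texts(url):
    """Open `url` in Chrome and return the text of each review on the page."""
    # Imported here so element_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return element_texts(driver, By.CLASS_NAME, REVIEW_CLASS)
    finally:
        driver.quit()  # always close the browser, even if scraping fails
```

Calling `scrape_review_texts` with the URL of the reviews page should return a list of strings, one per matching element.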
The next step before storing all of our scraped data is to create an empty Pandas dataframe with our column names. See the Pandas documentation if you’re unfamiliar or need a refresher.
It’s time to start collecting! I’ve added comments explaining each step below:
Our scores list is a bit different. If you noticed while inspecting the HTML of the scoring area, the score and the link to the full review sit inside the same element. Additionally, some reviews don’t have scores listed. To handle this, we’ll go through the list and slice the score out of each entry, replacing entries that have no score with a ‘no score’ value.
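The cleanup can be sketched in plain Python. This example assumes the scraped text looks like `Full Review | Original Score: 4/5` when a score exists and just `Full Review` when it doesn’t; that format, the `SCORE_MARKER` label, and the `clean_score` helper are all assumptions for illustration. It uses `str.partition` instead of fixed-index slicing so that entries of different lengths are handled the same way.

```python
NO_SCORE = "no score"
SCORE_MARKER = "Original Score:"  # assumed label; check what the page shows


def clean_score(raw):
    """Return the score portion of a scraped string, or NO_SCORE if absent."""
    # partition() splits on the first marker; `marker` is "" when not found.
    _, marker, score = raw.partition(SCORE_MARKER)
    score = score.strip()
    return score if marker and score else NO_SCORE


# Made-up sample strings standing in for the scraped score area.
raw_scores = [
    "Full Review | Original Score: 4/5",
    "Full Review",
    "Full Review | Original Score: B+",
]
scores = [clean_score(raw) for raw in raw_scores]
print(scores)  # ['4/5', 'no score', 'B+']
```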
Once we’ve got all our lists sorted out, we can check that our data matches what the website shows. It should look something like this:
This looks good; let’s set our categories in our dataframe to these lists.
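One thing worth checking at this point is that every list has the same number of entries, since Pandas will refuse to line up columns of different lengths. The sketch below uses hypothetical column names and placeholder values, and it builds the dataframe from a dictionary of lists in one go, which gives the same result as filling in an empty dataframe column by column. `check_same_length` and `to_dataframe` are names invented for this example.

```python
def check_same_length(columns):
    """Raise ValueError unless every list in `columns` has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return columns


def to_dataframe(columns):
    """Build a DataFrame with one column per scraped list."""
    import pandas as pd  # imported here so the length check runs without pandas

    return pd.DataFrame(check_same_length(columns))


# Placeholder values standing in for the scraped lists.
scraped = {
    "review": ["First review text", "Second review text"],
    "score": ["4/5", "no score"],
}
check_same_length(scraped)  # passes silently when the lists line up
```

With Pandas installed, `to_dataframe(scraped)` returns a two-row dataframe with `review` and `score` columns.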
As with any project, you may run into some trouble scraping other types of categorical data. In this instance, the name and website of each reviewer appear to fall under a different division in the HTML.
For these, we can’t just find each element by class because the classes are too dynamic. Instead, we’ll want to find these elements by their XPath.
Here’s how I collected the data for the reviewers’ names and websites. The websites’ class names were, thankfully, less dynamic, so I was able to obtain them using the class ‘subtle.’ You may have to play around with different XPaths until you can process the data properly.
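The lookup described above can be sketched as follows, assuming Selenium 4. The XPath shown is a hypothetical placeholder, not the one used for this page, so copy the real one from Inspect. `scrape_reviewers` and `reviewer_locators` are names made up for this example. Each locator is a `(strategy, value)` pair, so the same helper handles both the XPath and the class lookup.

```python
def scrape_reviewers(driver, name_locator, site_locator):
    """Return (names, websites) as two lists read from the current page."""
    names = [el.text.strip() for el in driver.find_elements(*name_locator)]
    websites = [el.text.strip() for el in driver.find_elements(*site_locator)]
    if len(names) != len(websites):
        # A mismatch usually means one of the locators is too broad or too narrow.
        raise ValueError(
            f"Found {len(names)} names but {len(websites)} websites; "
            "adjust the locators."
        )
    return names, websites


def reviewer_locators():
    """Return the (strategy, value) pairs for reviewer names and websites."""
    from selenium.webdriver.common.by import By  # needs Selenium installed

    # Hypothetical XPath; replace it with the one Inspect gives you.
    name_locator = (By.XPATH, "//div[contains(@class, 'critic_name')]/a[1]")
    site_locator = (By.CLASS_NAME, "subtle")
    return name_locator, site_locator
```

With an open driver, `names, websites = scrape_reviewers(driver, *reviewer_locators())` collects both lists in page order.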
Finally, we can add our names and websites to the dataframe. Because the scraper works through the page in order from top to bottom, these lists will line up with our other categories.
You’ve now scraped your first webpage using automated browsing! Now that you’re comfortable, you can tackle larger sets of data and multiple pages of websites. Handling pagination is just a matter of getting to know the documentation a little better, which shouldn’t be a problem if you’ve made it this far. However, as a disclaimer: be sure you understand the terms and conditions of websites you’re scraping, as some might not allow for data collection. If the data you are trying to obtain is public domain, then scrape away, so long as you apply timed pauses in your automation to avoid being blocked. You should always take caution when scraping, but more importantly, have fun and keep learning!