Fili-Busted

Part 1: A web scraping introduction

Darius Fuller
The Startup
5 min read · Sep 3, 2020


From: Depositphotos | Credit: wowomnom

2020 has been a year, to say the least. One of the reasons I feel that way has to do with politics. I know a lot of folks do not necessarily like to talk about it (fine), but I wanted to do some (political) data science!

I wanted to use machine learning to predict the winners of a given election, but I soon realized there were not many prepackaged datasets with this information available on the internet. This meant that, in order to do what I wanted, I would need to build my own dataset through web scraping. This process ended up being a significant portion of my overall project, so I thought I would share a bit of what I learned along the way.

Scraping Wikipedia

As alluded to earlier, it took longer than expected just to scrape and organize the data in a manner that would be usable for EDA (Exploratory Data Analysis) and/or machine learning. So, hopefully for those reading this, my experiences will save you some time and grief during your own data collection!

What is “Web Scraping”?

This is the definition from the Wikipedia page for the topic:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser…It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Simply put, web scraping is the programmatic process of copying and storing information directly from a webpage(s) on the internet.

Getting Started

Despite sounding like a complex process, getting started is quite simple! If you want to follow along with my project, the only Python packages you need are:

  • Requests
  • Beautiful Soup 4 (bs4)
  • Pandas

Let’s Get Connected!

One of the most crucial parts of the scraping process is establishing a connection with the page you are looking to get information from. This is handled through the requests package and a tiny amount of manual work. As for the manual component, you simply need to use your favorite web browser to navigate to your target page and copy its URL (Quick tip: Ctrl+L, Ctrl+C is an easy way to grab the entire URL).

Here is the code I used to begin my process:
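In outline it comes down to something like this (the URL below is only an example; use whichever page you copied, and name the variables however you like):

    import requests

    # The page to scrape -- an example URL; replace it with your own target
    url = 'https://en.wikipedia.org/wiki/List_of_United_States_Senate_elections'

    # Request the page's HTML over HTTP
    response = requests.get(url)

    # Check the status code -- you want to see 200 before moving on
    print(response.status_code)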

The intuition is simple: you pass your URL to the .get() method from requests, which fetches the page to your computer in the background. The status check at the end is very important, as the only way you will be able to continue appropriately is if your response comes back with a status code of 200.

There are a number of different response codes one can receive; if you get something other than 200, check out this list for guidance on remedying your issue.

Making a Beautiful Soup!

At this point we have finished copying the data to our local computer, but it is in a state that is almost impossible for a human to comprehend. If you would like to see this for yourself, just use the .text attribute on your response object to display it. Here is the example from my code:
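Something along these lines will do it, assuming the response object from the previous step:

    # Dump the raw, unparsed HTML -- it is not easy on the eyes
    print(response.text)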

The output is far too long to show in full here, but it is the raw HTML from the URL indicated above.

This is where the Beautiful Soup package comes in to save the day! It allows you to convert the jumbled HTML you see above into something a bit easier for a human to read. The code I used in my project is below:
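A minimal sketch of that step, again assuming the response object from earlier (the name soup is just my choice):

    from bs4 import BeautifulSoup

    # Parse the raw HTML string into a navigable soup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # prettify() renders the parse tree with indentation, one tag per line
    print(soup.prettify())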

The BeautifulSoup() call is where “all the magic happens”. You pass in response.text followed by a string indicating the type of parser you want Beautiful Soup to use as it reads over the HTML you feed in. Typically I have used the ‘html.parser’ option, but the documentation lists other options such as ‘lxml-xml’ or ‘html5lib’ (this argument can also name the type of markup to be parsed). The documentation recommends naming a specific parser, so that Beautiful Soup gives consistent results across platforms and virtual environments.

Here is what the HTML looks like after becoming “beautiful”:

As you can see, each line represents a portion of the HTML layout.

Now we can start targeting and collecting the information we want from this page using our knowledge of HTML tags. This blog post does a good job of explaining some of the most common tags one may encounter when scraping, from the perspective of someone developing a webpage.

Searching the Soup

So now that we have our soup all nice and tidy, we just need to figure out what information to pull from it. In the case of my project, I was going to be scraping multiple pages stemming from the original Wikipedia URL I entered, so my target was the links that lead to other pages to scrape. The HTML tag associated with my target was ‘a’. Here is an example:

<a href="/wiki/Seventeenth_Amendment_to_the_United_States_Constitution" title="Seventeenth Amendment to the United States Constitution">17th Amendment</a>

This specific tag contains the link to the Wikipedia page covering the 17th Amendment to the U.S. Constitution under its href attribute. It was pulled because its opening and closing characters are ‘<a’ and ‘</a>’. The code I used to retrieve it comes down to a single call on the soup:
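In sketch form, assuming the soup object from above (the name links is mine):

    # Collect every anchor (<a>) tag on the page
    links = soup.findAll('a')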

If you are doing a different project, simply replace the ‘a’ with whatever HTML tag you are looking to target. Additionally, Beautiful Soup allows for different searching methods beyond the .findAll() I used here; this link will take you to those options. Besides .findAll(), some of the ones I use most while exploring webpages are listed below, with a quick sketch after the list:

  • .find()
  • .findNextSibling()
  • .children
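A rough illustration of how those read, again assuming the soup object from above (variable names are arbitrary):

    # .find() returns only the first matching tag, not a list
    first_link = soup.find('a')

    # .findNextSibling() moves to the next tag at the same level of the tree
    sibling = first_link.findNextSibling()

    # .children iterates over a tag's direct children
    for child in first_link.children:
        print(child)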

At this point, if you are following along, you should have a list of links scraped from Wikipedia covering all of the Senate election years available. Here is what mine looks like:

Same information from URL above

Otherwise, if you are focusing on just one page, this should be enough to get you started compiling the information you have with Pandas. I will cover how to organize a dataset from this point in my next post on this project. The entire project is viewable on GitHub if you want to look at it as a whole. Until then, good luck making some tasty soup!

Links

Fili-busted!

  • Part 2 — Web scraping multiple pages
  • Part 3 — Exploring the data
  • Part 4 — Predicting the vote

Link to full project
