In part 1, we worked through the beginning stages of creating a data set directly from information available on Wikipedia. In that post I covered how I used the Requests and Beautiful Soup packages to scrape a list of links from one page that would lead to other pages with the information needed for the data set. In this post, I will continue walking through the process I used to create a custom function that retrieves and stores targeted information directly from web pages.
Where We Left Off…
At this point we have a list of all HTML elements on our root page with the “a” (anchor) tag, meaning they contain a hyperlink (href) to another web page or a different part of the current page.
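That collection of anchor tags can be reproduced with a couple of Beautiful Soup calls. This is only a sketch: in part 1 the HTML came from a Requests call on the real root page, so the inline snippet and its links below are invented stand-ins that let the example run on its own.

```python
from bs4 import BeautifulSoup

# Stand-in for the real root page's HTML (which came from
# requests.get() in part 1); the links here are invented examples.
html = """
<p>See the <a href="/wiki/1976_United_States_Senate_elections">1976
elections</a> or jump to the <a href="#Results">results</a> below.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every "a" (anchor) element on the page -- some hrefs lead to other
# pages, others to a different part of the current page.
start_links = soup.find_all("a")
for link in start_links:
    print(link.get("href"))
```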
In this specific case, I did not need all of the links on this page, just a small section of them. To make sure I caught everything I wanted, I did some manual searching through the captured links and on the web pages themselves using the “Inspect” feature in Chrome.
In order to find the links I needed to target within the start_links list, I used the Inspect tool in Chrome to find the first and last links and matched them to the indices they represent (depending on your needs and the page, this may not be necessary). Eventually I came up with this:
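If you go the index-matching route, the slice itself is a one-liner. The index values below are invented stand-ins for whatever Inspect reveals on your page, and a list of strings stands in for the real list of anchor tags:

```python
# Stand-in for the full list of anchor tags from the root page;
# the real list came from soup.find_all("a").
start_links = [f"link_{i}" for i in range(120)]

# Suppose inspecting the page showed the first election link at
# index 41 and the last at index 98 (invented values; yours will
# differ depending on the page).
FIRST, LAST = 41, 98
start_sen_links = start_links[FIRST:LAST + 1]

print(len(start_sen_links))  # → 58
```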
At this point, I have what is needed to access each subsequent page on the senate elections stored in start_sen_links. My next step is to pull out the information stored in the href attribute and use it to navigate from page to page within a single scraping function.
If you’ve been following along with my code, you may have noticed that each string stored in the href attribute does not contain the full URL to a web page, just what appears to be the tail end of one. That is indeed the case: before scraping with a custom function, we need to determine the base URL that precedes each href and completes the link to the desired web page.
Luckily, this only meant navigating to the web page in Chrome (or your preferred browser) and copying the part of the URL that precedes the information stored in the href attribute. This part will vary by site, so research where you are trying to navigate and adjust your code accordingly. Fortunately, Wikipedia is very consistent, meaning I only needed one base URL for all the targeted pages.
In order for the function to work, I need to be able to combine the two parts into a single URL that can be used by the Requests and Beautiful Soup packages to connect and scrape the data I need.
This can be achieved with Beautiful Soup’s .get() method (which reads an attribute off a tag) and basic Python string concatenation.
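Concretely, the combination might look like this. The anchor tag is a made-up example; the base URL shown is Wikipedia’s real one:

```python
from bs4 import BeautifulSoup

# Wikipedia's hrefs are relative ("/wiki/..."), so the base URL must
# be prepended before Requests can fetch the page.
BASE_URL = "https://en.wikipedia.org"

# A made-up anchor tag standing in for one entry of start_sen_links.
tag = BeautifulSoup(
    '<a href="/wiki/1976_United_States_Senate_elections">1976</a>',
    "html.parser",
).a

# .get() pulls the attribute off the tag; simple concatenation then
# builds the full, requestable URL.
end_url = tag.get("href")
full_url = BASE_URL + end_url
print(full_url)
```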
Currently, I have everything needed to build the function out:
- A list of HTML anchor elements containing the end of the URLs to our targeted web pages
- A confirmed base URL to combine with end URLs for more scraping
- A confirmed process that combines URLs into a usable link
I now need to implement code that will do all of these steps in an iterative fashion, storing the collected data appropriately along the way.
In hindsight, I would have taken more time to investigate the pages I was going to scrape before doing so. In my specific case, Wikipedia has updated/altered how it identifies certain items in the HTML over time, meaning the same items on separate pages may need to be looked up differently. Making my function account for this used up a lot of the time I had before my deadline.
After a first trial with my code, it became apparent that the election result tables I wanted from each page did not carry the state’s information once scraped. I would need a (creative) solution that did not involve scraping a different source. That solution came by way of each page’s table of contents, which listed each state’s election for that year by name.
Keeping this in mind, here is the for-loop portion of the code I came up with for my function:
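What follows is a minimal sketch of what a loop like this could look like. Every selector, variable name, and the year-extraction rule here are assumptions for illustration, not the author’s actual code; the extraction is split into a helper so it can be exercised without network access.

```python
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org"

def extract_page_data(page):
    """Pull election tables and table-of-contents links from one
    already-parsed page. The class/id values are assumptions; as noted
    above, Wikipedia's markup varies between pages."""
    # attrs narrows the search to tables whose class marks them as
    # data tables (save some edge cases).
    tables = page.find_all("table", attrs={"class": "wikitable"})

    # The table of contents names each state's election; the author
    # additionally filtered these links with a custom is_state function.
    toc = page.find(attrs={"id": "toc"})
    link_toc = toc.find_all("a") if toc else []

    # pandas needs the table HTML as a string to convert properly.
    dfs = [pd.read_html(StringIO(str(t)))[0] for t in tables]
    return dfs, link_toc

def scrape_elections(start_sen_links):
    """The for-loop portion: visit each page, extract, store by year."""
    results = {}
    for link in start_sen_links:
        href = link.get("href")              # e.g. "/wiki/1976_..."
        resp = requests.get(BASE_URL + href)
        page = BeautifulSoup(resp.text, "html.parser")
        year = href.split("/wiki/")[-1][:4]  # assumes year-first titles
        results[year] = extract_page_data(page)
    return results
```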
One thing to point out in this code is my use of the attrs parameter in the Beautiful Soup methods. When using the attrs parameter, you only need to pass a dictionary with:
- The key(s) set as the HTML attribute(s) to select for
- The value(s) set as one or more strings (in a list if multiple) specifying which versions of the attribute to match
By using this parameter I was able to only collect tables that contained election data (save some edge cases)!
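As a small self-contained illustration of the attrs dictionary (the class names and mini page below are invented examples):

```python
from bs4 import BeautifulSoup

# A made-up page: two data tables and one sidebar we don't want.
html = """
<table class="wikitable"><tr><td>election data</td></tr></table>
<table class="infobox"><tr><td>sidebar, not wanted</td></tr></table>
<table class="wikitable sortable"><tr><td>more data</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Key = the HTML attribute to select on; value = one or more strings
# (in a list if multiple) naming the attribute versions to match.
tables = soup.find_all("table", attrs={"class": ["wikitable", "sortable"]})
print(len(tables))  # → 2 (the infobox is skipped)
```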
Secondly, I took advantage of a custom filter called is_state. This enabled Beautiful Soup to look at each of the links in the table of contents and, based on the link, decide whether or not to save it into the link_toc variable. Here’s a link to the documentation that showed me how to do it: Filtering with functions.
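The post doesn’t show the actual is_state logic, so the matching rule below (checking the href fragment against a list of state names) is an invented stand-in; it demonstrates the technique of passing a function to find_all, which calls it on every tag and keeps those where it returns True.

```python
from bs4 import BeautifulSoup

def is_state(tag):
    """Keep only TOC links whose target names a state election.
    The real test the author used is not shown; this rule is one
    plausible version for illustration."""
    STATES = {"Alabama", "Alaska", "Arizona"}  # ...and so on, all 50
    if tag.name != "a" or not tag.get("href"):
        return False
    return tag["href"].lstrip("#").split("_")[0] in STATES

# A made-up table of contents with one non-state entry mixed in.
html = """
<div id="toc">
  <a href="#Alabama_election">Alabama</a>
  <a href="#See_also">See also</a>
  <a href="#Arizona_election">Arizona</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup applies is_state to every tag under the TOC and
# returns only the tags it approves.
link_toc = soup.find(id="toc").find_all(is_state)
print([t["href"] for t in link_toc])
```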
Lastly, the table-conversion step is very important to this process, as it turns the HTML for each table into a much closer representation of what it looks like in a browser: a typical Pandas DataFrame. Tip: this method requires the HTML to be in string format for it to convert properly.
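A sketch of that conversion using pandas’ read_html, which fits the string-format tip: it wants the table as a string (hence the str() call), and recent pandas versions additionally prefer the string wrapped in StringIO. The mini table here is made up.

```python
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

# A made-up results table standing in for one scraped from a page.
html = """
<table class="wikitable">
  <tr><th>Candidate</th><th>Votes</th></tr>
  <tr><td>Jane Doe</td><td>10000</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# read_html needs a string, not a Tag object -- hence str(table);
# it returns a *list* of DataFrames, one per table found.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```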
Checking Our Work
Putting together the code from part 1 with this, we now have a function that will:
- Connect to a web page with links to more pages and scrape them
- Combine the ending URLs with the base URL to access these pages
- Use Beautiful Soup methods to collect every table with senate election data, plus each page's table of contents for later state referencing
- Convert the tables from HTML into easily workable Pandas DataFrames
- Store table of contents and DataFrames in a dictionary with the corresponding year for the elections as the keys
During each step, we’ve verified the code will take a table from Wikipedia like this:
Turn it into its base HTML elements:
Then return a Jupyter Notebook friendly version via Pandas:
The next step(s) in creating a data set from this would be to reformat and/or clean the data as necessary. The specific methods required will vary depending on the source material. The more consistent the source is with its HTML, the easier this will be; if the HTML is inconsistent (as in my case), work through the process in manageable chunks with respect to any deadlines you may have.
In part 3, I will show the final results of my scraping and scrubbing, along with some of the cool visuals I was able to create with it!