Fili-busted!

Part 3: Exploring the data

Darius Fuller
The Startup
7 min read · Sep 9, 2020


Creator: Guzaliia Filimonova | Credit: Getty Images/iStockphoto

In the previous two blogs, I detailed how I used the Requests, Beautiful Soup and Pandas packages to turn multiple Wikipedia pages covering United States Senate general elections into a data set for use in a machine learning model.

  • Part 1 went over how to web scrape a single web page and store its corresponding HTML components locally
  • Part 2 explained how I used that web page to access and scrape other pages using one function

This part will explore the final data set that I was able to put together using the methods detailed in the previous two posts. In addition, I will show some of the visuals I created using Matplotlib’s pyplot module.

Where Were We?

In this series’ previous post, I left off with a dictionary full of the information scraped from the links found on the List of United States Senate Elections Wikipedia page. It contained each year stored as the key, with a list of Pandas DataFrames as the values, representing all of the U.S. Senate general elections for that year.
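As a refresher, that structure looked roughly like the miniature below. The values here are made up purely for illustration; the real DataFrames held the full election tables scraped from each year's page.

```python
import pandas as pd

# Hypothetical miniature of the scraped structure: each election year
# maps to a list of DataFrames, one per Senate general election that year.
election_tables = {
    "2016": [
        pd.DataFrame({"Candidate": ["John Doe", "Jane Roe"],
                      "Party": ["Republican", "Democratic"],
                      "Votes": [1200, 1100]}),
    ],
    "2014": [
        pd.DataFrame({"Candidate": ["Al Example"],
                      "Party": ["Independent"],
                      "Votes": [900]}),
    ],
}

# Each year's list length equals the number of races scraped for that year.
print(len(election_tables["2016"]))
```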

In order to keep this series shorter than Star Wars, I will not focus in on the process that I used to clean and reorganize the data into one single CSV file. I will save that for the future, as it was quite the experience and would require its own trilogy.

Knowing this, I must make it clear that there were many decisions I made during my process to arrive at the data set used in the final product. The beauty of using such raw data is how flexible it is, meaning that what I did with this information is by no means absolute and that the possibilities are endless.

The Data Set

Without further ado, here are the final results of my web scraping and cleaning:

Last 20 data points from final DataFrame

This CSV includes the following features on the 5,588 candidates collected:

  • %: Percentage of votes. The portion of total turnout the candidate received.
  • Turnout: Total voter turnout. The sum of all votes cast in a given state’s Senate general election.
  • Incumb_Y: Incumbency, encoded. Indicates numerically whether or not a given candidate is the incumbent senator (1/True, 0/False).
  • State: Where the election is. Identifies which state’s Senate seat is up for election.
  • Cln_name/First_name: Who is running. Name of the candidate running in a given election.
  • Year: When it happened. The year a given election took place. This data set covers the years 1920–2016 (inclusive).
  • Terms_in_office: How long it lasted. Indicates how many terms an incumbent senator has sat in office.
  • Party_enc: Candidate’s political affiliation. Encoded representation of the candidate’s party on the ballot: Democratic (D), Republican (R), Socialist (S), Independent (I), or Third-Party (T).

These distinctions were based on the all-time distribution of parties in the data; the Third-Party designation is essentially an “Other” category here.

  • Seats_up%: Seats up for election. How many of a party’s seats are up for grabs in the current election year, computed as the number of seats up divided by the total seats held at the beginning of the elections.
  • Seats_before%: Seats held last cycle. The number of seats held by the party at the end of the last election cycle, expressed as that number divided by the total Senate seats available (at that time).
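The arithmetic behind those two seat-share features is just a pair of ratios. A minimal sketch, with illustrative numbers and argument names (not necessarily the ones used internally):

```python
# Illustrative computation of the two seat-share features.
# seats_up:          the party's seats contested this cycle
# seats_held_start:  the party's seats at the beginning of the elections
# seats_held_end:    the party's seats at the end of the last cycle
# seats_total:       all Senate seats at that time (96 before 1959, 100 after)
def seat_shares(seats_up, seats_held_start, seats_held_end, seats_total):
    seats_up_pct = seats_up / seats_held_start
    seats_before_pct = seats_held_end / seats_total
    return seats_up_pct, seats_before_pct

up, before = seat_shares(seats_up=8, seats_held_start=46,
                         seats_held_end=48, seats_total=100)
print(round(up, 3), round(before, 2))  # 0.174 0.48
```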

Exploring the Data

Initially, I was concerned about the breadth of features when starting the EDA (Exploratory Data Analysis) process, because the last two data sets I used for projects had over 20 features each. To my surprise, I was able to create some interesting visuals after playing around a bit with the Pandas package, especially the “groupby” objects created using the DataFrame method of the same name.

What’s in a name?

While looking over my features, I wanted to work in some that wouldn’t be easily encoded as part of my exploration of the data set. With this in mind, I decided to see if there was anything interesting in the relationship between a candidate’s name and their success in elections. Here’s what I came up with:

Number of occurrences of senators with a given first name

The style of names on this list did not surprise me, but the overwhelming lead the name John holds over the runner-up was quite the finding! On another note, the first traditionally female name does not appear until the 34th spot, with only 16 terms served (Barbara). To keep the project moving forward, I did not spend much more time exploring this, nor did I use it in my modeling, but I definitely intend to as I fill this project out further.

Here is how I did it:

Creation of groupby object plotted
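Since that code lives in an image, here is the gist of it as a sketch. The toy rows stand in for the real CSV, and the plot styling is illustrative rather than a copy of mine:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Toy stand-in for the real data set (the actual file has 5,588 rows).
df = pd.DataFrame({
    "First_name": ["John", "John", "James", "Barbara", "John"],
    "Terms_in_office": [2, 1, 3, 1, 4],
})

# Count occurrences of each first name, most common first, and plot them.
name_counts = df.groupby("First_name").size().sort_values(ascending=False)
name_counts.head(20).plot(kind="bar", figsize=(10, 4))
plt.ylabel("Occurrences")
plt.title("Senators by first name")
plt.tight_layout()

print(name_counts.index[0], name_counts.iloc[0])  # John 3
```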

Who you repping?

Having 96 years’ worth of Senate election data in front of me, I thought it worthwhile to find out how the different political parties have participated in the elections during this period. This proved worthwhile indeed: the total number of candidates running each cycle showed peaks and valleys for some parties while remaining fairly constant for others.

The two encoded parties with the most fluctuation were the Socialist and Independent designations. Keep in mind that the former is/was organized as a party in the traditional sense, while the latter represents candidates not aligned with any party and therefore “independent”.

Raw count of candidates per year for Independent & Socialist designations

Prior to doing this, I would have guessed that the number of declared Socialist candidates would decline as time went on (especially once the Cold War period hit), but I was surprised by how many there were in some of the earlier years. Additionally, the volatility in declared Independent candidates was interesting. There is a roughly 40-year stretch of ten or fewer candidates, bordered by the two spikes in participation in 1938 and 1978, which invites further investigation. For example, maybe those few candidates hanging around were incumbent senators winning re-election campaigns.

Here’s the code to make the whole plot:

For loops to create groupby object for each party and plot them
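In spirit, that loop does something like the following sketch. The toy data here is invented for illustration; the real version iterates over all five party codes against the full data set:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Toy stand-in for the real data set.
df = pd.DataFrame({
    "Year": [1976, 1976, 1978, 1978, 1978, 1980],
    "Party_enc": ["I", "S", "I", "I", "D", "S"],
})

fig, ax = plt.subplots(figsize=(10, 4))
for party in ["I", "S"]:
    # Raw count of candidates per election year for this party.
    counts = df[df["Party_enc"] == party].groupby("Year").size()
    counts.plot(ax=ax, label=party)
ax.set_ylabel("Candidates")
ax.legend()

ind_counts = df[df["Party_enc"] == "I"].groupby("Year").size().to_dict()
print(ind_counts)  # {1976: 1, 1978: 2}
```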

Side note: I also came up with some plots of the same data presented in a more zoomed-in manner. The following graph is an example covering five election cycles (1990–1998); the code can be found here.

Number of candidates per party by year

It’s not what you know

The last set of graphs I want to share has to do with the value senatorial experience brought to those candidates who had it. My intuition and previous knowledge of the topic led me to believe that incumbent senators would hold an advantage over their challengers in most elections. This notion proved true; however, there were some surprising results once you draw the proverbial “party lines”.

The following graphs show the difference in the average percentage of total votes received (i.e., share of voter turnout) between incumbents and challengers, per party. Simply put, they demonstrate how a defending Republican’s performance (on average) differs from a challenging Republican’s.

Speaking of, here is the graph for the Republican party:

Difference in % of votes earned, Republican ~15%

Next, we have the Democratic party:

Difference in % of votes earned, Democratic ~20%

Finally, Third-Party (Other) designations:

Difference in % of votes earned, Third-Parties ~50%

As the above graphs clearly demonstrate, challengers, and especially Third-Party candidates, have historically underperformed severely relative to their incumbent counterparts. Thus, the evidence suggests that it pays for challengers to “fall in line” with one of the big two parties if they want a good shot.

Here is the code I used to make the plots for all the parties included:
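The code itself lived in a screenshot, so here is a minimal sketch of the idea instead, using a few invented rows with the column names from the feature list above. The grouping logic is the point; the plotting and styling are left out:

```python
import pandas as pd

# Toy rows mirroring the features described earlier.
df = pd.DataFrame({
    "Party_enc": ["R", "R", "R", "R", "D", "D"],
    "Incumb_Y":  [1,   0,   1,   0,   1,   0],
    "%":         [60., 40., 58., 44., 62., 41.],
})

# Average share of the vote, split by party and incumbency; the gap
# between the two rows per party is what each graph visualizes.
avg = df.groupby(["Party_enc", "Incumb_Y"])["%"].mean()
gap_r = avg[("R", 1)] - avg[("R", 0)]
print(round(gap_r, 1))  # 17.0
```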

Where to now?

We’re almost done! If you’ve made it this far in my Fili-busted! series detailing my journey creating a data set straight from Wikipedia, I want to say thank you, and I hope you are learning something. In this post I showed a few ways that I decided to explore what information this data set can provide, and I am sure there is more I missed. All that is left now is to show how well it worked when I tried to use it in a supervised machine learning capacity. The next post will discuss this, going over how well a scikit-learn regression model was able to predict the percentage of votes a given candidate would receive in their respective election.

Links

Fili-busted!

  • Part 1 — A web scraping introduction
  • Part 2 — Web scraping multiple pages
  • Part 4 — Predicting the vote

Link to full project
