Predicting if a Democrat will win their primary

Data Science Bootcamp : Post #3 : Classification Algorithms

Sami Ahmed
Nov 3 · 7 min read

This is the third installment in a series about my data science bootcamp experience at Metis. Please read my earlier posts on predicting opioid deaths and identifying peak traffic for New York’s subway stations before this one.

My Project 3 Motivation

In the U.S., if you want to become president, you first have to get nominated. To become a party's presidential nominee, a candidate must win a majority of delegates, which are awarded through the party's primaries and caucuses. I have no plans to run for president, but I do want to answer the question bouncing around almost every American's head (including mine): who will the Democrats nominate?

So many candidates, so many issues

As of Nov 2nd, 2019, more than 16 Democrats are running for the U.S. presidency. I went into this project with some assumptions. The current political climate in the U.S. is staunchly partisan, so it seems natural that if you want to be the nominee, you should embrace the partisan current and lean hard into polarizing topics, Bernie-or-bust style. My findings support that hypothesis; however, some of the other attributes that bubbled to the top of my analysis raised my eyebrows. Among the winners in my dataset, a few political action committees (PACs) pop up again and again. Read on to learn what I discovered.

For this third project, I took a magnifying glass to the open primaries of the Democratic Party. The question I sought to answer: which candidate attributes have enabled Democrats to win their respective primaries?

Data rundown

I looked at a dataset hosted on Kaggle, provided by FiveThirtyEight, The New York Times, and Ballotpedia. It included 811 Democratic candidates who appeared on the ballot in 2018 Democratic primaries for Senate, House, and Governor, excluding races featuring a Democratic incumbent, as of August 7, 2018. Each observation (i.e., row) represented a unique candidate.

Instead of loading flat files straight into my Jupyter Notebook, I set up a schema, loaded the data into a PostgreSQL database housed on my AWS cloud instance, and queried the data from there. This was a great way to learn about the challenges of a schema-first system, how to neatly organize tables in a database, and how to query a cloud-hosted database from my local machine.
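The schema-first pattern looks roughly like the sketch below. SQLite stands in here for the PostgreSQL instance on AWS so the example is self-contained, and the table and column names are illustrative, not the project's actual schema:

```python
import sqlite3

# Schema-first: define the table before loading any rows.
# (SQLite stands in for PostgreSQL; table and column names
# are hypothetical, not the dataset's real fields.)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE candidates (
        candidate TEXT,
        office_type TEXT,
        won_primary INTEGER
    )
""")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [("A", "House", 1), ("B", "Senate", 0), ("C", "Governor", 0)],
)

# Query the database instead of reading flat files directly
rows = conn.execute(
    "SELECT candidate FROM candidates WHERE won_primary = 1"
).fetchall()
print(rows)  # → [('A',)]
```

With PostgreSQL, the same queries would run through a driver like psycopg2 against the cloud host instead of an in-memory connection.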

Initial Insights

Some notable artifacts from exploring my data (before any modeling):

  1. Class imbalance — Of the 811 candidates, only about 29% won their primary. This is what we call “class imbalance,” and it poses its own unique challenges in the modeling process.
  2. Nominal features — The data had a prevalence of nominal features such as “Endorsed by Biden: yes or no.” I one-hot encoded these values so that models could consume them.
  3. Missing values — Many of these nominal features had missing values. I chose to set them to 0, i.e., ‘no.’ There are plenty of rigorous methods for imputing values — you can use the feature’s mean, or even run a regression — however, most of the missing data in my case had only two potential values: ‘yes’ or ‘no.’ So we can pretty safely assume that if it is not a ‘yes,’ it’s a ‘no.’
Subset of my features before cleaning — lots of nominal data with missing values
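The cleaning steps in items 2 and 3 can be sketched with pandas. The column names and endorsement values below are hypothetical stand-ins for the real fields:

```python
import pandas as pd

# Hypothetical slice of the candidate table (columns are illustrative)
df = pd.DataFrame({
    "candidate": ["A", "B", "C"],
    "biden_endorsed": ["yes", None, "no"],
    "emily_endorsed": [None, "yes", None],
})

flag_cols = ["biden_endorsed", "emily_endorsed"]

# Missing endorsement flags are treated as 'no'
df[flag_cols] = df[flag_cols].fillna("no")

# One-hot encode the yes/no features; drop_first leaves a single
# 0/1 column per flag that models can consume
encoded = pd.get_dummies(df, columns=flag_cols, drop_first=True)
print(encoded.columns.tolist())
# → ['candidate', 'biden_endorsed_yes', 'emily_endorsed_yes']
```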

Modeling workflow

If Metis has taught me anything so far, it is the value of iterating over your process end to end instead of perfecting each step along the way. With that in mind, after getting my data into SQL, cleaning it up, and splitting it properly into train/test sets, I got right to modeling. I set up a dummy model, then a pipeline of naive algorithms, and finally optimized the naive algorithms that best captured the ‘wins’ in my dataset (my target) to arrive at my final model.

  1. Dummy model — for this project, my dummy model simply always predicts the majority class (primary loser). The value of a dummy model is that you can benchmark your tuned models against it. I used Sklearn’s DummyClassifier with the strategy set to ‘most_frequent.’
  2. Naive algorithms — models before they have been tailored to my data. There are many kinds of classification algorithms out there — logistic regression, random forest, Gaussian naive Bayes, K-nearest neighbors, and support vector machines, to name a few. I built a pipeline that let me quickly loop over 7 distinct algorithms and see how each performed. I evaluated the confusion matrix of each naive model to see which ones best captured wins. The top two performers were then optimized.
ROC AUC, one of the metrics I used to weed out weaker naive algorithms
  3. Optimized model — a model specifically tuned to my data. My tailoring included omitting features, scaling features, tuning model hyperparameters through Sklearn’s RandomizedSearchCV and GridSearchCV, and finally SMOTE. The technique that produced a model adept at capturing primary wins was SMOTE, which makes sense given how imbalanced my classes were. In a sentence, SMOTE synthetically oversamples a dataset’s minority class.
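The first two steps of the workflow can be sketched as follows. The data here is synthetic (sklearn's make_classification, with roughly the post's 29% positive rate), and the model list is a subset of the seven algorithms, so the scores are illustrative only; SMOTE would be layered onto the winners via the imbalanced-learn package:

```python
# Dummy baseline plus a loop of untuned ("naive") classifiers,
# compared on ROC AUC. All data and scores are illustrative.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# ~71% negative class mimics the dataset's imbalance
X, y = make_classification(n_samples=811, weights=[0.71], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "gaussian_nb": GaussianNB(),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC AUC = {aucs[name]:.3f}")
```

The dummy model's constant prediction yields an AUC of 0.5, the benchmark the real models must beat.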

Model interpretability

No matter how much fancy hyperparameter tuning and synthetic oversampling you do, what makes a good data science presentation great? Model interpretability. From my years on the front lines selling data science software B2B, I know how impactful it is to generate plain-language statements straight from your model coefficients. Here are a few I generated from my logistic regression model for this project:

Indivisible (PAC) endorsed? You have ~23% higher odds of winning.

Mom’s Demand Action/Everytown (PAC) endorsed? Your odds of winning are 24% higher.
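Statements like these come from exponentiating the logistic regression coefficients: exp(beta) is the odds ratio for flipping a 0/1 endorsement feature. The coefficient values below are hypothetical, chosen only to reproduce the percentages quoted above:

```python
import math

# Hypothetical coefficients from a fitted logistic regression
# (illustrative values, not the project's actual model output)
coefs = {
    "indivisible_endorsed": 0.207,
    "everytown_endorsed": 0.215,
}

for feature, beta in coefs.items():
    # exp(beta) is the odds ratio; subtract 1 for "% higher odds"
    pct = (math.exp(beta) - 1) * 100
    print(f"{feature}: ~{pct:.0f}% higher odds of winning")
```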

Pretty cool, right? Armed with this information, I dug a bit further and learned a bit about each of those PACs:

  • EMILY’s List is primarily concerned with bolstering pro-choice candidates.
  • Indivisible is focused on electing progressive leaders who oppose anything Trump-related.
  • Everytown works to prevent gun deaths by advocating for background checks and stronger gun-trafficking legislation, and by opposing NRA lobbying in Washington.

What do all of these PACs have in common? They all sit squarely on the left side of the aisle. They reinforce the message heard loud and clear across America: partisan, partisan, partisan.

To bring it home, FiveThirtyEight’s calculated field ‘Partisan Lean’ had even more predictive power in determining primary wins, according to my logistic regression:

In addition to logistic regression, my random forest was strong at capturing primary wins. I have an affinity for random forests and classification and regression tree (CART) models because humans can interpret ‘Gini importance.’ Gini importance has its own shortcomings that I plan to expand on in a future article; however, for this use case, we will interpret it as a measure of how important each feature is, relative to the others, in my algorithm’s decision making. A feature ranks higher on the Gini scale the more it decreases Gini impurity at the nodes where it is used across the forest. The three PACs and the Partisan Lean feature that had significant odds effects in my logistic model also topped the Gini importance scores from my random forest.
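Reading Gini importance off a fitted random forest is a one-liner in sklearn. The feature names below are stand-ins for the real columns and the data is synthetic, so the ranking itself is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for the dataset's columns
feature_names = [
    "partisan_lean", "emily_endorsed",
    "indivisible_endorsed", "everytown_endorsed", "veteran",
]

X, y = make_classification(
    n_samples=811, n_features=5, n_informative=3,
    n_redundant=1, random_state=0,
)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity,
# normalized so the scores sum to 1
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```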

Data Visualization

Data visualization of the features deemed important by two of my models

Now that we know a handful of PAC endorsements and Partisan Lean are strong features for candidates winning their primary, what can we do with that information? We can make it actionable. I created a Tableau dashboard with these features for all of the winners in my dataset. A ‘1’ indicates that the candidate was endorsed by the PAC in question, and you can filter the map by office type.

Conclusions : Technical

Through a rigorous modeling workflow, I homed in on the two models that were most adept at predicting my target — primary wins. The best method for optimizing these models was SMOTE. The high interpretability of logistic regression and random forest models made for a strong presentation of results.

Conclusions : Sociopolitical

Ultimately, the results of my project can be summarized in one word: polarized. As much as you or I want to believe we are autonomous, eyes-wide-open voters guided by our own opinions that happen to align with our candidate’s, we need to investigate where exactly the endorsements and funding are coming from. In 2016 alone, EMILY’s List raised around $60M, mostly earmarked for Hillary Clinton. It is truly a vicious cycle: more PAC funding gives a candidate more latitude to reach voters through ads, speeches, and so on, and the viewpoints of those PACs (in this case) reflect ideals that I would not characterize as centrist. America is divided down the middle by a thick partisan line, fueled by PAC money and endorsements.

Thank you

Thank you to my instructors at Metis, Lara, Alice, and John, for the great lectures on classification.

Thank you Jay Qi for continuing to help peer review these posts. Your selfless commitment to fostering a thoughtful community of data scientists is truly inspiring.

