Creating a Binary Classifier to Predict Eligible Voter Participation in Presidential Elections

Using results from the 2018 General Social Survey

Avonlea Fisher
Analytics Vidhya
6 min read · Oct 6, 2020


Photo by Tiffany Tertipes on Unsplash

Introduction

The upcoming presidential election is widely considered to be among the most important elections in recent US history. Voter outreach organizations are working tirelessly to ensure a high turnout, especially amid new voter suppression concerns related to the COVID-19 pandemic. Past research has suggested that socioeconomic variables such as education, wealth, and occupation are strong predictors of voter turnout. In this article, I’ll present how I used a selection of the most recent data from the General Social Survey (GSS) to build a binary classifier to predict voter participation. Specifically, the objective of the model will be to predict whether a respondent didn’t vote in the 2016 presidential election.

The ability to predict whether an eligible voter abstained can be used to increase overall turnout in future elections. It's important to note that any model designed to predict human behavior is inherently limited: an individual's attitudes, motivations, and circumstances can change within a given period of time, especially over the course of an entire presidential term. Furthermore, there is a diversity of views about the value and efficacy of voting, and some eligible voters may abstain on principle. This project doesn't assume that every non-voter can or should choose to vote; it's intended only as a guide for voter engagement organizations interested in identifying likely non-voters.

Cleaning and Encoding the Data

For this dataset, I chose variables related to respondents' socioeconomic status, political identity and attitudes, and participation in the 2012 and 2016 elections. I dropped rows with responses such as "not applicable" and "don't know," excluded data for respondents who were ineligible to vote, and dropped any column in which more than a third of the values were missing.

Before encoding, nearly all of the data were categorical.
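The DataFrame preview isn't reproduced here, but a quick dtype check illustrates the point. This is a sketch: the file name is hypothetical, and the example column VOTE16 (the GSS item for 2016 turnout) may be labeled differently in a given extract.

```python
import pandas as pd

# Hypothetical file name for the GSS extract
gss = pd.read_csv("gss_2018_extract.csv")

# Nearly every column arrives as an object (string) dtype
print(gss.dtypes.value_counts())
print(gss["VOTE16"].unique())  # e.g. ['Voted', 'Did not vote', ...]
```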

I performed two types of numerical encoding:

  • Binary encoding: values ‘0’ and ‘1’ for variables with only two values, where ‘0’ indicated ‘no’/absence and ‘1’ indicated ‘yes’/presence.
  • Integer encoding: integer values that corresponded to a value’s level on a scale.

Fortunately, every feature could be transformed to fit either binary encoding or a numeric scale, so I wrote a single function to encode each feature.
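The original code screenshot isn't reproduced here, but a minimal sketch of such an encoder might look like the following. The name encode_column and the behavior of dropping unmapped rows are my assumptions, not the article's exact implementation.

```python
def encode_column(df, column, mapping, new_name=None):
    """Map a categorical column's string responses to integer codes.

    `mapping` is a dict from response strings to integers; rows whose
    response isn't in the mapping (e.g. "don't know") are dropped.
    """
    df = df[df[column].isin(mapping)].copy()
    df[column] = df[column].map(mapping)
    if new_name is not None:
        df = df.rename(columns={column: new_name})
    return df
```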

And here’s an example of how I used it to transform the “Father’s highest degree” variable into the binary variable “Father Attended College.”
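Something along these lines, using PADEG (the GSS variable for father's highest degree); the exact category labels may differ from the raw file, and the new column name is mine:

```python
# Collapse father's highest degree into a binary college indicator.
# Category labels follow the GSS codebook but may differ in the raw data.
degree_to_college = {
    "Lt high school": 0,
    "High school": 0,
    "Junior college": 1,
    "Bachelor": 1,
    "Graduate": 1,
}
gss = encode_column(gss, "PADEG", degree_to_college,
                    new_name="father_attended_college")
```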

Upon repeating this process for each of the columns (or using pandas.get_dummies where appropriate), I had a data frame consisting only of integer values.
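For nominal variables without a natural ordering, the get_dummies step might look like this (the column name is illustrative):

```python
# One-hot encode any remaining nominal columns (column name illustrative)
gss = pd.get_dummies(gss, columns=["marital_status"], drop_first=True)

# Sanity check: no string columns should remain
assert gss.select_dtypes(include="object").empty
```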

In this final dataset, there were 18 columns with data from 1,083 respondents.

Exploratory Analysis

A correlation matrix of the variables showed that, overall, correlations were weaker than one might expect; none exceeded .75.
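The heatmap itself isn't reproduced here; a sketch of how such a figure could be generated, assuming seaborn as the plotting library:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = gss.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Correlations among encoded GSS features")
plt.tight_layout()
plt.show()
```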

I was interested in the percentage of voters within each class of the variables most strongly correlated with voting in 2016. The following helper function builds a dictionary keyed by the unique values in a given column and calculates, for each value, the percentage of respondents who voted.
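A sketch of that helper, assuming the encoded target column is named voted_2016 with 1 meaning the respondent voted (both names are mine):

```python
def voter_perc_by_value(df, column, target="voted_2016"):
    """Return {value: % of respondents with that value who voted in 2016}."""
    return {
        value: df.loc[df[column] == value, target].mean() * 100
        for value in df[column].unique()
    }
```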

I could then plug this function into another, plot_voter_perc, which generates a plot of voter percentages given a column and plot labels.
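A possible shape for plot_voter_perc, building on the helper above; the matplotlib usage and default labels are my assumptions:

```python
def plot_voter_perc(df, column, title, xlabel, ylabel="% who voted in 2016"):
    """Bar plot of the voter percentage for each unique value in `column`."""
    percentages = voter_perc_by_value(df, column)
    plt.bar([str(k) for k in percentages], list(percentages.values()))
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
```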

I used the function to create bar plots for each of these variables.

Some of these variables showed a clear monotonic pattern: voter turnout rose with respondents' age and level of education. Upper- and middle-class respondents had higher turnout rates than those who identified as lower or working class, and liberals and conservatives voted at higher rates than moderates. Perhaps most interesting is the extremely high percentage of 2012 voters who also voted in 2016.

Building and Optimizing the Model

I tested several classifier types on the data, including logistic regression, random forests, and gradient boosting. Of these, the random forest classifier had the best performance. I first split the data into training and test sets:
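A sketch of the split; the target column name (carried over from the earlier sketches), test size, and stratification are my assumptions:

```python
from sklearn.model_selection import train_test_split

# Positive class (1) = did not vote in 2016
y = 1 - gss["voted_2016"]
X = gss.drop(columns=["voted_2016"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```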

The target variable had fairly imbalanced classes, with the majority of respondents reporting that they voted in 2016.
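A one-line check of that balance, continuing with the names from the split above:

```python
# Share of each class in the target: the majority class (0, voted) dominates
print(y.value_counts(normalize=True))
```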

When attempting to predict instances of a minority class, overall accuracy is not the best measure of a classifier's performance: a model could predict that 100% of respondents voted and still score relatively high on accuracy. Instead, model performance was evaluated on recall, the true positive rate. A true positive, in this case, means the model correctly classified a respondent who didn't vote in 2016.

After importing the correct dependencies and fitting each model to the data, I wrote a function to plot a confusion matrix next to the model’s accuracy, recall, precision and F1 scores.
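A sketch of such an evaluation function, using scikit-learn's ConfusionMatrixDisplay (available in scikit-learn 1.0+); the function name evaluate is mine:

```python
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             f1_score, precision_score, recall_score)

def evaluate(model, X_test, y_test):
    """Plot a confusion matrix and print the headline metrics."""
    y_pred = model.predict(X_test)
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
    plt.show()
    print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
    print(f"Precision: {precision_score(y_test, y_pred):.2f}")
    print(f"F1:        {f1_score(y_test, y_pred):.2f}")
```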

Because of the imbalanced classes, the default model predicted that every respondent voted, which drove its recall to zero.

To improve the model, I used RandomizedSearchCV, which takes an estimator and a grid of parameters as arguments. Rather than exhaustively trying every combination, it samples a fixed number of parameter settings at random and returns the best-performing model.
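A sketch of that search, with an illustrative parameter grid rather than the article's exact one; scoring on recall matches the evaluation choice above:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space, not the article's exact grid
param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
    "class_weight": [None, "balanced"],  # counteracts the class imbalance
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    scoring="recall",  # optimize for the true positive rate
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
evaluate(best_rf, X_test, y_test)
```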

The search returned a tuned model with a substantially improved confusion matrix.

While a true positive rate of .74 isn't amazing, it's substantially better than random guessing: the model correctly identified 59 of the 79 respondents who didn't vote in 2016. ROC curves also showed how the tuned random forest compared with the other models that were trained and optimized.
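The original figure isn't shown here, but a comparison like it could be drawn as follows; best_logreg and best_gb are hypothetical names standing in for the tuned logistic regression and gradient boosting models:

```python
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
for name, model in {
    "Random forest": best_rf,
    "Logistic regression": best_logreg,  # hypothetical fitted model
    "Gradient boosting": best_gb,        # hypothetical fitted model
}.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.set_title("ROC curves for the tuned classifiers")
plt.show()
```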

Conclusions

The random forest model, which had the highest AUC and a recall score that tied with logistic regression, was the best-performing classifier overall.

Respondents’ participation in the 2012 election appeared to be the strongest predictor of voting in the 2016 election, which suggests that past voting behavior may predict an eligible voter’s participation in future elections. Political identity was also a strong predictor: respondents who thought of themselves as conservative or liberal were more likely to have voted than their moderate counterparts. Voter turnout also appeared to rise steadily with age.

With these observations in mind, organizations seeking to increase voter turnout by engaging potential non-voters should prioritize those who haven’t voted in past elections, younger voters, and political moderates.

Full code available on GitHub. Note: Some minor modifications have been made to the above code in the GitHub repository since this article was written, but the random forest classifier was still the best-performing model.
