Explaining the 2016 Democratic Primary with Machine Learning.

Jan 2, 2019

What was the 2016 Democratic Primary about?

Based on the following survey analysis, the main distinguishing features between Clinton and Sanders supporters are age, party identification, support for the Trans-Pacific Partnership, and Obama approval.

Looking at the 10,690 Sanders and Clinton voters who took the 2016 CCES survey, the answer is largely not policy, race, or even gender. While the CCES has 563 questions, I reduced and transformed them into 145 potentially relevant predictors. I then built a statistical model, specifically a gradient boosted model (gbm), to figure out which of those 145 are predictive in practice. Because I only wanted to find the substantive predictors, I chose model settings most likely to surface them and omit the more marginal ones.

Here is what I found, including the relative influence of each variable:

As this plot is not easy to interpret, I then took those predictors and ran a logistic regression to more easily find the magnitude and direction of each variable's effect.
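To illustrate why a logistic regression makes direction and magnitude easier to read, here is a minimal sketch in Python. It is not the author's R code: the `fit_logistic` helper, the two predictors, and the ten toy rows are all invented for illustration, and the fit uses plain gradient ascent rather than any particular library routine.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit an intercept plus one weight per column by batch gradient ascent
    on the log-likelihood. Hypothetical helper, for illustration only."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(Sanders)
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Toy data, invented for illustration (NOT CCES responses).
# Columns: standardized age, strongly approves of Obama (1/0).
X = [[-1.2, 0], [-0.8, 0], [-0.5, 1], [-0.3, 0], [0.0, 1],
     [0.2, 0], [0.6, 1], [0.9, 1], [1.1, 0], [1.4, 1]]
y = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]  # 1 = Sanders, 0 = Clinton

intercept, b_age, b_obama = fit_logistic(X, y)

# The sign of a coefficient gives the direction of the association;
# exp(coefficient) is the odds ratio for a one-unit increase.
print("age coefficient:", b_age, "odds ratio:", math.exp(b_age))
print("Obama approval coefficient:", b_obama)
```

In this toy data the older respondents mostly chose Clinton, so the age coefficient comes out negative. A relative-influence plot from a boosted model ranks variables by importance but does not show that sign.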

What I found predictive, ordered:

youth (age, 2016-birthyr),
weak or no identity as a Democrat (pid7),
opposition to the Trans-Pacific Partnership (CC16_351B),
relative disapproval of Obama (CC16_320a, i.e. 67% of Clinton voters strongly approved of Obama while 41% of Sanders voters did),
religion (religpew, i.e. identifying as non-religious, Muslim, or Buddhist),
economic pessimism (CC16_304),
race/ethnicity (race2 [combined race & hispanic], identifying as non-black and non-asian, especially white or other),
preferring to cut military spending over other spending or raising taxes (CC16_337_1),
liberal identity (ideo5),
social media use (CC16_300_2) / TV non-use (CC16_300_5),
and general opposition to military use of force (scoreantiwar, a standardized average of the CC16_414 series of questions).
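The scoreantiwar variable is described only as a standardized average of the CC16_414 questions. One plausible reading, sketched below in Python, is to z-score each question and then average the z-scores per respondent. This is an assumption: the author may instead have averaged first and standardized afterwards, and both the 0/1 coding and the four toy respondents are invented.

```python
from statistics import mean, pstdev

def standardized_average(rows):
    """Z-score each question (column), then average the z-scores per respondent.
    Hypothetical helper; one possible reading of 'standardized average'."""
    n_items = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_items)]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumes every question has some variation
    return [
        mean([(row[j] - mus[j]) / sds[j] for j in range(n_items)])
        for row in rows
    ]

# Toy answers, invented: 1 = opposes using military force in that scenario, 0 = does not.
answers = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
scores = standardized_average(answers)
print(scores)
```

Under this reading, each question contributes on the same scale regardless of how common opposition is, and the resulting scores are centered on zero.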

Additionally, there are some state-level patterns where, for example, Sanders performed most disproportionately well among NH & VT respondents and worst among DC & AR respondents. Some of the more marginal predictors of interest include opposing the proposal to “Increase the number of police on the street by 10 percent, even if it means fewer funds for other public services”, but not support for crime reform in general.

Naturally, all these predictors are conditional on having voted in the 2016 Democratic primary. There are also questions that may be predictive, such as (almost certainly) feelings towards socialism or (maybe) hostile sexism, but that aren’t in the 2016 CCES dataset. Nonetheless, an out-of-the-box gbm is able to distinguish between the two camps with an accuracy rate and AUC of ~72%. The logistic regression based on the gbm’s results also performs well.
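For readers unfamiliar with the two metrics, the sketch below shows how accuracy and AUC can be computed from predicted probabilities. It is written in Python with invented labels and scores, not the model's actual output; the function names and the 0.5 threshold are assumptions for illustration.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Share of respondents classified correctly at the given probability cutoff."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(a == b) for a, b in zip(y_true, preds)) / len(y_true)

def auc(y_true, p_hat):
    """Probability that a randomly chosen positive is scored above a randomly
    chosen negative (ties count half). Pairwise form, fine for small samples."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = Sanders voter, 0 = Clinton voter.
y_true = [1, 1, 0, 0, 1, 0]
p_hat = [0.9, 0.4, 0.35, 0.6, 0.7, 0.2]  # hypothetical predicted P(Sanders)

acc = accuracy(y_true, p_hat)
area = auc(y_true, p_hat)
print(acc, area)
```

Accuracy depends on the chosen cutoff, while AUC summarizes how well the scores rank Sanders voters above Clinton voters across all cutoffs; an AUC of 0.5 would be no better than chance.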

Out of curiosity, as someone who, unlike most of my peers, felt very cross-pressured by the race at the time, I input my responses into the model and found I would have had either a 52% (using OH, from where I had recently moved) or 32% (using DC, to where I had recently moved) chance of having supported Sanders.
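The reason one set of answers can yield two different probabilities is that state enters the model as its own term. The Python sketch below shows the mechanics with made-up coefficients; the variable names, the intercept, and every number are hypothetical and are not taken from the fitted model.

```python
import math

def predict_prob(intercept, coefs, respondent):
    """Logistic prediction: P(Sanders) = 1 / (1 + exp(-(intercept + sum(b * x))))."""
    z = intercept + sum(coefs[name] * value for name, value in respondent.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, invented for illustration only.
intercept = 1.0
coefs = {"age_decades": -0.3, "state_OH": 0.0, "state_DC": -0.8}

# Same respondent, differing only in the state indicator.
me_oh = {"age_decades": 3.0, "state_OH": 1, "state_DC": 0}
me_dc = {"age_decades": 3.0, "state_OH": 0, "state_DC": 1}

p_oh = predict_prob(intercept, coefs, me_oh)
p_dc = predict_prob(intercept, coefs, me_dc)
print(p_oh, p_dc)
```

With a negative coefficient on the DC indicator, switching the state from OH to DC lowers the predicted probability while every other answer stays the same.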

I began this project to find out how relevant policy differences were to the primary. I find that only opposition to the TPP and opposition to military spending/use of force are relevant in explaining support for Bernie Sanders to any substantive degree. Instead, it’s mostly a function of age and feelings of warmth towards Obama’s Democratic Party. Especially age.

R Code: https://github.com/zachacrowell/samples/tree/master/code_samples/demprimary2016


Written by Zachery Crowell