New Questions, New Answers? A Preliminary Micro-level Statistical Analysis of the 2016 US Presidential Election
Who are the 59.6 million voters who voted for Donald Trump, and who are the 59.9 million voters who voted for Hillary Clinton? And what about the voting age citizens, all ~97.9 million of them, who did not vote?
In a work-in-progress manuscript which I invite you to read and comment on, my colleagues and I describe our efforts to begin answering these questions, by asking new questions of available data. We do this, in short, by combining data from the most recent United States census with the electoral vote counts by county. We are thus able to draw conclusions across the entire United States and at a local level, about voters and non-voters, for interesting and novel combinations of predictor variables. Exit polls, by comparison, focus on national and state results (for only some states) and only on people who went to vote — they lack both breadth and detail.
For the sake of this post, I will delve into just a subset of the many correlations we make, by probing the data on voting patterns by sex and income. For example, our preliminary results suggest an interesting relationship between personal income and voting. As shown in the table below, those making less than $50,000 per year were 65% of the population, and they voted for Clinton over Trump 51% to 49% (we have excluded third parties for this part of the analysis):
PERSONAL INCOME Clinton Trump % of electorate
< $50,000 51% 49% 65%
$50,000 to $100,000 46% 53% 24%
> $100,000 42% 58% 10%
As we move up the income scale, this pattern reverses: the 24% of the population making $50,000 to $100,000 supported Trump, as did the 10% of the population making above $100,000. Much has been written and will be written about the role that the working class played in this election. But these results, and the exit polls, both show that Clinton won at the lower end of the income scale. (Our results differ from the exit polls for the higher end of the income scale, but this may simply be because our results are based on personal income while the exit polls asked about family income.) An intriguing follow-up question is how race or geography interacted with income, especially at this lower end of the income scale.
Similarly, our method shows that among women, support for Clinton was 56%, while among men, support for Trump was 55%. Since women were 53% of the electorate and supported Clinton at this higher rate, this reminds us that if US elections were based on the popular vote, rather than on the Electoral College, Clinton would have won (she did in fact win the popular vote, and her margin is likely to rise from what I listed above).
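As a rough check on that reasoning, the figures quoted above can be combined into an implied overall two-party share. This is only a sketch: it assumes men make up the remaining 47% of the electorate and that, with third parties excluded, Clinton's support among men is the complement of Trump's 55%. The variable names are illustrative and not taken from the manuscript.

```python
# Two-party figures quoted in the text (third parties excluded).
share_women = 0.53     # women as a fraction of the electorate
clinton_women = 0.56   # Clinton support among women
trump_men = 0.55       # Trump support among men

# Assumptions: men are the remaining share of the electorate, and
# two-party support within each group sums to 1.
share_men = 1.0 - share_women
clinton_men = 1.0 - trump_men

# Overall support is the electorate-weighted average of the group rates.
clinton_overall = share_women * clinton_women + share_men * clinton_men
print(f"Implied Clinton two-party share: {clinton_overall:.1%}")
```

Under these assumptions the implied share comes out just above one half (about 50.8%), which is consistent with a narrow popular-vote lead. Because the inputs are rounded percentages, the result should be read as approximate.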
Our method allows us to dig deeper, looking at how these relationships play out together:
PERSONAL INCOME and SEX Clinton Trump % of electorate
Women, < $50,000 55% 45% 37%
Women, $50,000 to $100,000 56% 44% 11%
Women, > $100,000 52% 47% 3%
Men, < $50,000 47% 53% 27%
Men, $50,000 to $100,000 42% 58% 13%
Men, > $100,000 41% 59% 7%
Here we see something interesting. At all levels of income, women supported Clinton while men supported Trump. But the composition of the electorate changes as income increases, as shown in the last column: under $50,000, women are a larger percentage of the electorate, but over $50,000, they are a smaller percentage. This gives one possible explanation for the first table: Clinton won among those making less than $50,000 because there were more women making less than $50,000. But an important caveat applies to this explanation: we have no way of knowing about the relative importance of sex versus income or any other variable in understanding why voters voted the way they did.
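The composition effect described above can be illustrated by recombining the income-by-sex rows into one figure per income bracket. This is a minimal sketch using the rounded percentages from the table; it takes the first three rows to be women and the last three to be men, as the surrounding text implies, and the dictionary keys and function name are hypothetical.

```python
# (Clinton share, share of electorate) for each sex within an income bracket.
# Assumption: rows 1-3 of the table are women and rows 4-6 are men.
table = {
    "under_50k":   {"women": (0.55, 0.37), "men": (0.47, 0.27)},
    "50k_to_100k": {"women": (0.56, 0.11), "men": (0.42, 0.13)},
    "over_100k":   {"women": (0.52, 0.03), "men": (0.41, 0.07)},
}

def combine(bracket):
    """Return (women's fraction of the bracket, Clinton share in the bracket)."""
    (clinton_w, weight_w), (clinton_m, weight_m) = bracket["women"], bracket["men"]
    total = weight_w + weight_m
    clinton = (clinton_w * weight_w + clinton_m * weight_m) / total
    return weight_w / total, clinton

summary = {income: combine(rows) for income, rows in table.items()}
for income, (women_frac, clinton) in summary.items():
    print(f"{income:>12}: women {women_frac:.0%} of bracket, Clinton {clinton:.0%}")
```

Women make up more than half of the lowest bracket and less than half of the two higher ones, and the recombined Clinton share falls as income rises. Since the published inputs are rounded, the recombined shares agree with the first table only to within a couple of percentage points.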
What we can do with data like this is ask which demographic variables are particularly useful in finding patterns in (that is, in “predicting”) the results. We undertook an exploratory analysis to see which census variables describing people were most predictive of the outcome, again using geography to link the two datasets. (Note that this is not an explicitly spatial analysis: we do not include location variables like urban/suburban/rural. We have spatial analyses planned for the future.) Here are the categories of most-predictive variables that we found:
the interaction of race/ethnicity with "has a degree"
level of education
type of work and industry
age interacted with hours worked per week
These relationships are at the “group”-level, unlike in the tables above. What this means is that when considering education, for example, we are looking at the relative fractions of people in a region with less than high school, high school, some college, college, and graduate education. When considering age as a predictor, we are looking at the full age distribution — how many millennials versus baby boomers, etc.
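To make the idea of a group-level predictor concrete, here is a small sketch of turning individual records into per-region education fractions. The records, region labels, and function name are made up for illustration; this is not the feature construction used in the manuscript.

```python
from collections import Counter, defaultdict

# Hypothetical individual-level records: (region id, education level).
records = [
    ("A", "high school"), ("A", "college"), ("A", "college"), ("A", "graduate"),
    ("B", "less than high school"), ("B", "high school"), ("B", "high school"),
    ("B", "some college"), ("B", "college"),
]

# The education categories named in the text, in a fixed order.
LEVELS = ["less than high school", "high school", "some college",
          "college", "graduate"]

def education_fractions(records):
    """Map each region to the fraction of its people at each education level."""
    counts = defaultdict(Counter)
    for region, level in records:
        counts[region][level] += 1
    return {
        region: [c[level] / sum(c.values()) for level in LEVELS]
        for region, c in counts.items()
    }

features = education_fractions(records)
for region, fractions in features.items():
    print(region, [f"{x:.0%}" for x in fractions])
```

Each region is then described by a vector of fractions that sums to one, rather than by any single person's education level.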
Below, we plot the census variable “Detailed Ancestry”, with Clinton/Trump vote share on the horizontal axis (Clinton on the left, Trump on the right) and participation rate on the vertical axis. As more than 40% of voting age citizens did not vote, we thought it would be interesting to understand who they were as well. The census allows over 200 options for this question, and we included the most common choices.
The size of each circle indicates the proportion of that group in the electorate. Regions where people listed their ancestry as “American” were very supportive of Trump and likely to vote; regions where people listed their ancestry as “White” were also supportive of Trump, but unlikely to vote.
Here is a map that we made using our method, showing the gender gap in support for Trump, calculated as support for Trump among men minus support for Trump among women:
29 regions had gaps of more than 20 percentage points and 54 regions had negative gender gaps (higher support for Clinton among men than women).
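The gap calculation itself is simple, and can be sketched as follows. The region names and support figures here are invented purely for illustration; they are not estimates from our analysis.

```python
# Hypothetical per-region Trump support: (among men, among women).
regions = {
    "region_1": (0.62, 0.38),
    "region_2": (0.55, 0.50),
    "region_3": (0.40, 0.44),
    "region_4": (0.70, 0.49),
}

# Gender gap in percentage points: support among men minus support among women.
gaps = {
    name: round(100 * (men - women), 1)
    for name, (men, women) in regions.items()
}

# Regions with a gap above 20 points, and regions where the gap is negative.
large_gaps = [name for name, gap in gaps.items() if gap > 20]
negative_gaps = [name for name, gap in gaps.items() if gap < 0]

print("gaps:", gaps)
print("more than 20 points:", large_gaps)
print("negative:", negative_gaps)
```

A negative value means Trump's support was higher among women than among men in that region, which in a two-party comparison is the same as higher support for Clinton among men.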
We see our novel statistical method as a new way of understanding voter patterns, including the patterns of those who did not vote and relationships between group-level characteristics and voting. We are able to fill in the incomplete (and potentially misleading) picture painted by exit polls, especially before large-scale surveys and the voter file become available. There are many, many more relationships remaining to examine, just a small fraction of which are currently in our manuscript. We have focused above on relationships at the level of the entire country, but local effects could be very informative as well. I encourage statisticians, public policy experts, humanists, social scientists, and journalists to explore the correlations presented by our statistical analysis, raise new questions, and provide new answers with an in-depth analysis of the meaning of these numbers.
Acknowledgments: thanks to my collaborators, Prof Yee Whye Teh (Department of Statistics), Yu-Xiang Wang (Carnegie Mellon University), and Dougal J. Sutherland (University College London), and to Oxford historian Jaclyn Granick for extensive comments.