Using Data Science to Predict Who Will Win the 2020 Presidential Election

A Comprehensive Analysis

Vinod B

11 min read · Oct 13, 2019

The Business of Forecasting Elections

When forecasting presidential elections, there are roughly two distinct kinds of models: (1) fundamentals based models and (2) polls based models.

Fundamentals based models use economic data like GDP growth, demographic data like the fraction of White residents in a state, and other “fundamental” factors to forecast elections. Polls based models, on the other hand, leverage state and national polls to make predictions.

Historically, polls based models have outperformed fundamentals based models in forecasting elections, but it is currently too early in the election cycle for polls based models. Most states have not had any polling yet, and any polling at this stage is notoriously poor at predicting the general election result, let alone each party’s eventual nominee (e.g. who will win the Democratic primary).

This makes any current head to head general election polls interesting but ultimately not indicative of how the electorate will vote on election day, especially because they are biased towards candidates with high name recognition.

To understand the state of the election today, we therefore need to turn to fundamentals based models. One concern (and one reason why they tend to perform worse compared to polls based models) is that there have been very few presidential elections in US history, providing little training data.

This is especially true given that the political party structure has changed repeatedly over time. The modern versions of the Democratic and Republican parties only came into existence around 1980, leaving just 10 past elections, including 2016, to learn from.

As a result fundamentals based models tend to overfit past data and perform poorly when trying to predict the future. We can, however, overcome these concerns of overfitting and build a model that has performed quite well in forecasting past election outcomes even this far out from election day. This suggests the insights within the model are useful for understanding the electorate going into the 2020 Presidential election.

Building the Best Model

When trying to build a fundamentals based forecast of the election, there are many different methods to use. Here we use a technique called linear mixed models because they allow us to easily simulate the election by adding in historical error to greatly reduce the overfitting problem noted above.

We regress each state’s (including DC) Democratic vote share in Presidential elections from 1980 to 2016, denoted Y, on a set of state level features, denoted X (see below for the specific set). We add what are known as random effects for the election year and for the region the state falls into, as defined by the US Census, to capture national error and regional error respectively.

The model ends up taking the following form:

Y = B * X + National Error + Regional Error + State Error

It predicts the expected Democratic vote share in a state (B * X) based on the fundamental factors in the model and decomposes the estimation error into three components at the national, regional, and state levels. The estimation error tries to account for the historical accuracy of the model and random changes in how people might vote.
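Written out with indices, the same decomposition might look like the following. The subscripts and symbols are notation introduced here for illustration (s for a state, t for an election year, r(s) for the Census region containing state s). Indexing the regional error by election year is an assumption based on how the simulation below draws fresh regional errors for each election.

```latex
y_{s,t} = \mathbf{x}_{s,t}^{\top}\boldsymbol{\beta}
        + \underbrace{\alpha_{t}}_{\text{national error}}
        + \underbrace{\gamma_{r(s),t}}_{\text{regional error}}
        + \underbrace{\varepsilon_{s,t}}_{\text{state error}}
```

Here the first term corresponds to B * X, and each of the three error terms is centered at zero.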

The national error captures things like national swings towards one party or another such as in the 1980 election when almost every state turned Republican because of the unpopularity of incumbent Jimmy Carter. National error is modeled to be the same for all states.

The regional error, which is the same for all states in a region, and the state error capture the correlation between states. We know, for example, that states in the same region tend to vote similarly (such as states in the South) and that specific states in different regions, like Washington and Massachusetts, also vote similarly.

To predict the election with this model, we first compute the expected Democratic vote share in each state based on the selected fundamental factors. Then we add in the national, regional, and state errors based on the historical accuracy of the model.

Each of the three errors is drawn from a separate probability distribution (we use a t distribution that has fatter tails than the normal) centered at zero with a variance set to the historical error of the model. We first draw the national error, which is the same for every state, and add this to the expected democratic vote share. Then we draw the four regional errors and add the appropriate one to each state and finally draw state errors, accounting for the historical correlation between states, adding these to the expected vote shares as well.
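To make the drawing procedure concrete, here is a minimal sketch of one simulated election using only the Python standard library. It is not the code behind the forecast. The function names, the degrees of freedom, the scale values, and the example states and shares are all hypothetical. It also draws state errors independently, which is a simplification: the procedure described above accounts for the historical correlation between states.

```python
import math
import random


def draw_t(rng, df, scale):
    """Draw from a zero-centered Student t distribution with the given scale."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return scale * z / math.sqrt(chi2 / df)


def simulate_vote_shares(expected, region_of, scales, df=5, rng=None):
    """Simulate Democratic vote shares for one election.

    expected:  state -> expected Democratic share (the B * X part)
    region_of: state -> region name
    scales:    error scale for the "national", "regional", and "state" levels
    df:        t-distribution degrees of freedom (hypothetical choice)
    """
    rng = rng or random.Random()
    # One national error shared by every state.
    national = draw_t(rng, df, scales["national"])
    # One error per region, shared by every state in that region.
    regions = sorted(set(region_of.values()))
    regional = {r: draw_t(rng, df, scales["regional"]) for r in regions}
    simulated = {}
    for state, share in expected.items():
        # Assumption: independent state errors (the article uses correlated ones).
        state_error = draw_t(rng, df, scales["state"])
        value = share + national + regional[region_of[state]] + state_error
        simulated[state] = min(1.0, max(0.0, value))  # keep shares in [0, 1]
    return simulated


# Hypothetical inputs, for illustration only.
expected = {"Minnesota": 0.51, "Michigan": 0.49, "Virginia": 0.52}
region_of = {"Minnesota": "Midwest", "Michigan": "Midwest", "Virginia": "South"}
scales = {"national": 0.03, "regional": 0.02, "state": 0.02}
print(simulate_vote_shares(expected, region_of, scales, rng=random.Random(42)))
```

Note that the scale of a t distribution is not its standard deviation, so matching a target variance would require rescaling by the degrees of freedom.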

This gives us the simulated Democratic vote share in each state for the election and one minus this is the Republican vote share. We can then compute the number of states each party wins and therefore the number of electoral votes it would get to determine the winner (ties or cases when both parties fail to get to 270 electoral votes are broken by randomly choosing one party as the winner).

Repeating this process a large number of times provides an estimate of how likely the model thinks each party is to win the election. We therefore make probabilistic forecasts to account for historical uncertainty and estimation error.
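The repeated simulation can be sketched roughly as follows. This is an illustrative outline rather than the actual forecast code. The names `tally` and `win_probability`, the `simulate_once` callback, and the toy inputs are hypothetical. The coin flip for ties follows the tie-breaking rule described above.

```python
import random


def tally(shares, electoral_votes):
    """Return (Democratic, Republican) electoral votes for one simulated election."""
    # A state goes Democratic when its simulated Democratic share exceeds 50%.
    dem = sum(ev for state, ev in electoral_votes.items() if shares[state] > 0.5)
    rep = sum(electoral_votes.values()) - dem
    return dem, rep


def win_probability(simulate_once, electoral_votes, n_sims=10000, needed=270, rng=None):
    """Estimate the probability that Democrats win.

    simulate_once: callable taking an rng and returning state -> Democratic share
    needed:        electoral votes required to win outright
    """
    rng = rng or random.Random()
    dem_wins = 0
    for _ in range(n_sims):
        dem, rep = tally(simulate_once(rng), electoral_votes)
        if dem >= needed:
            dem_wins += 1
        elif rep < needed and rng.random() < 0.5:
            # Neither party reached the threshold: pick the winner at random.
            dem_wins += 1
    return dem_wins / n_sims


# Toy example with two hypothetical states and made-up noise.
toy_votes = {"State A": 3, "State B": 3}


def toy_simulation(rng):
    return {"State A": 0.52 + rng.gauss(0, 0.03), "State B": 0.49 + rng.gauss(0, 0.03)}


print(win_probability(toy_simulation, toy_votes, n_sims=2000, needed=4, rng=random.Random(0)))
```

The Republican win probability is simply one minus the returned value, since every simulated election is assigned to exactly one party.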

Historical Results and Looking Ahead to 2020

To predict the results of an election we follow the procedure outlined above, first training a model on all elections from 1980 to the election prior to the year being forecast. We then use that model to forecast the following election. This means when forecasting, we only use information that would have been available at that time, providing a view of how accurate the model would have been historically.
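This kind of backtest is often called walk-forward validation. A minimal sketch of how the training and test years could be paired is shown below. The function name `walk_forward_splits` is hypothetical, and fitting the model itself is left out.

```python
def walk_forward_splits(years, first_test_year):
    """Yield (training_years, test_year) pairs that never use future elections."""
    years = sorted(years)
    for test_year in years:
        if test_year < first_test_year:
            continue
        # Train only on elections strictly before the one being forecast.
        training_years = [y for y in years if y < test_year]
        if training_years:
            yield training_years, test_year


election_years = range(1980, 2017, 4)  # 1980, 1984, ..., 2016
for train, test in walk_forward_splits(election_years, first_test_year=2000):
    print(f"train on {train[0]}-{train[-1]}, forecast {test}")
```

Each forecast year gets its own model fit, so the training set grows by one election at every step.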

We make two sets of predictions: one based on data available as of September in the year prior to the election (i.e. the state of the 2020 election today) and another using data just prior to election day. Below is a table showing the historical results of the model based on which party it would have predicted to win each election since 2000 (along with the probability to reflect how confident the model was) and the number of states it called correctly each time.

As of the September prior to the election year, the model would have called every election correctly except for 2016, and at that point in time in 2016 it really did look like the Democrats were going to win the election.

Switching to just before election day, we see that the model would have called every election correctly, including 2016, which many polls based models actually missed.

In terms of state accuracy, the model does fairly well, especially for more recent elections, averaging close to 90% accuracy, meaning it misses on average only around 5 states each election.

To give a more graphical sense of the model’s accuracy, here is a plot showing for the two prediction times (the September prior to the election year and just before election day) the model’s predicted Democratic vote share in each state (x axis) against the actual vote share (y axis). Points are color coded by election year with light colors representing later election years.

If the model were perfect, everything would line up on the dotted black line, as the predicted vote share and actual vote share would be equal. While the points do not perfectly line up, they are consistently close to one another, within 4–5 percentage points on average, showing the model is fairly accurate.

What does the model think will happen in 2020 then? It currently predicts a close election similar to 2000 or 2016, with Republicans slightly favored in a virtual toss up when predicting from September in the year prior to the election. If the election were held today, however, Republicans would stand a decent chance of being reelected at around 64%.

What Factors Explain Election Results

When building the model initially, we combed through many different economic, demographic, and other factors like approval ratings (over 50 variables total) to build the best set of features in forecasting elections.

Across many potential combinations, the model with the features below yielded the best results (which are detailed above). The variables are PVI, monthly state unemployment rates, state house price growth, national GDP growth, and the fraction of White residents in a state.

By far the most important factor is a variable called PVI (partisan voter index). This is defined as the popular vote margin in a state minus the national popular vote margin from the previous election (where the vote margin is the Democratic vote share minus the Republican vote share). It captures the partisan lean of a state and how much more Democratic (if positive) or Republican (if negative) a state is compared to the nation as a whole.
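As a small worked example of this definition, the calculation can be sketched as below. The function name and the vote shares are made up for illustration. Shares are expressed in percent.

```python
def pvi(state_dem, state_rep, national_dem, national_rep):
    """Partisan lean: state vote margin minus national vote margin.

    All inputs are vote shares in percent from the previous election.
    Positive means more Democratic than the nation, negative more Republican.
    """
    state_margin = state_dem - state_rep
    national_margin = national_dem - national_rep
    return state_margin - national_margin


# Hypothetical state: Democrats won it 55 to 45 while winning nationally 51 to 49.
print(pvi(55.0, 45.0, 51.0, 49.0))  # 10-point state margin minus 2-point national margin
```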

The next most important variable is the fraction of White residents in a state. We know that politics is increasingly polarized today and demographics are becoming increasingly predictive of how people vote. The electorate is polarizing across race, education, and age. It is no surprise that the two most important variables relate to this then. As the fraction of White residents in a state increases, on average, we expect the support for Democrats to go down.

The last set of variables that matter all have to do with the economy, especially the change over the last year and locally (at the state level). We know that voters tend to treat elections as a referendum on the party in power with respect to how the economy is doing, despite the fact that Presidents actually have little control over it. Better economic conditions favor Republicans. If US GDP growth is higher, local house price growth is higher, and the local unemployment rate is low, the Democratic vote share in a state goes down.

For forecasting elections then, what seems to matter is the partisan lean of the state, recent economic conditions, and whether the demographics in a state are favorable to Democrats or Republicans.

Impact on the 2020 Election

What does all this imply for the 2020 election then? To get a sense of that we can compare the current values of each factor in the model across states to the historical values.

Below we show boxplots of the z-score distribution for each feature listed above. Brown points are from 2016 and orange points from 2020 while grey/black dots are other elections since 1980. A positive z-score means the current factor values are higher than their historical average while negative z-scores mean the current values are below the historical average.
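For readers unfamiliar with z-scores, here is a minimal sketch of the calculation. It assumes each feature is standardized against its own historical values. Whether the original plots used the population or sample standard deviation is not stated, so the population version here is an assumption, and the numbers are made up.

```python
import statistics


def z_score(value, history):
    """Number of standard deviations that value sits above the historical mean."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("history has no variation, so the z-score is undefined")
    return (value - mean) / spread


# Hypothetical historical unemployment rates (percent) and a current value.
past_rates = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_score(3.0, past_rates))  # negative: below the historical average
```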

We see that for 2020 overall (and relative to 2016) conditions favor Republicans. States are more polarized today than in the past, leaving fewer close states. Because of the oddities of the Electoral College, winning a state with 51% vs. 75% of the vote makes no difference, making it likely that Democrats could win the popular vote but lose the Electoral College, especially because their voters are concentrated in more populous areas and not spread out geographically across states.

In terms of economic conditions, unemployment is currently low and state house prices are growing along with a healthy overall GDP growth rate. Things could change as fears of a recession grow, and indicators do seem to forecast an upcoming recession (such as an inverted yield curve, where long term interest rates are lower than short term interest rates, though there are reasons to suspect this is due more to demographics and the trade war than an underlying weakness in the economy). Historically, worse economic times have benefitted Democrats, so a recession or weakening economy would hurt Republicans’ reelection bid.

Finally, while states are getting less White over time, as we will see below, White voters are concentrated in key states, which hurts Democrats’ electoral prospects because the Electoral College advantages certain voters more than others.

States to Focus on in 2020

We can use our model to identify what states will be key for the 2020 election by looking at states where the model forecasts the popular vote margin to be within two percent (i.e. very close). Note we define popular vote margin here to be the Democratic share minus the Republican share, so positive values mean the state is expected to go Democratic while negative values mean the state is expected to go Republican.
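The filter described here can be sketched as follows. The function name, the state names, and the predicted shares are hypothetical, and shares are assumed to be two-party fractions so that the Republican share is one minus the Democratic share.

```python
def close_states(predicted_dem_share, threshold=0.02):
    """Return (state, margin) pairs with |margin| <= threshold, closest first.

    Margin is the Democratic share minus the Republican share, so positive
    values lean Democratic and negative values lean Republican.
    """
    margins = {
        state: dem - (1.0 - dem) for state, dem in predicted_dem_share.items()
    }
    close = [(state, m) for state, m in margins.items() if abs(m) <= threshold]
    return sorted(close, key=lambda pair: abs(pair[1]))


# Hypothetical predicted Democratic shares, for illustration only.
predictions = {"State A": 0.505, "State B": 0.600, "State C": 0.492}
for state, margin in close_states(predictions):
    print(f"{state}: {margin:+.3f}")
```

Because the margin equals twice the distance of the Democratic share from 50%, a two point margin corresponds to a Democratic share between 49% and 51%.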

Here are the closest states for both the 2016 and 2020 elections from the model. For 2016 we show the actual vote margin as well as the predicted vote margin. While the model incorrectly thought Colorado and Iowa would be close (they were important states in past elections), the model did correctly see that Wisconsin, New Hampshire, and Minnesota would be very close.

For 2020, the model expects many of the same states to be important again. Indeed, the Midwest as a region will again be central to the results of the election as Michigan, Minnesota, and Pennsylvania are again expected to be very close. The model expects these to be slightly leaning towards Republicans right now. Other predicted key states are New Hampshire, Virginia, Colorado, and Nevada. Look for these states to decide 2020.

For each of the states above, we compute the share of each variable’s contribution to the Democratic vote share to shed light on how conditions in each state affect the party’s chances of winning.
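The exact way these shares were computed is not spelled out, so the following is only one plausible reading. It treats each variable’s contribution as its coefficient times its value and normalizes by the total absolute contribution. The function name, coefficients, and feature values are entirely hypothetical.

```python
def contribution_shares(coefficients, values):
    """Signed share of each variable's contribution (coefficient * value).

    Shares are normalized by the sum of absolute contributions, so their
    absolute values add up to 1 and the sign shows the direction of the effect.
    """
    contributions = {name: coefficients[name] * values[name] for name in coefficients}
    total = sum(abs(c) for c in contributions.values())
    if total == 0:
        raise ValueError("all contributions are zero")
    return {name: c / total for name, c in contributions.items()}


# Made-up coefficients and standardized feature values for one state.
coefficients = {"pvi": 0.5, "unemployment": 0.1, "white_fraction": -0.2}
values = {"pvi": 1.0, "unemployment": -1.0, "white_fraction": 2.0}
print(contribution_shares(coefficients, values))
```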

We see that Minnesota, Virginia, Colorado, and Nevada lean Democratic while Michigan, New Hampshire, and Pennsylvania lean Republican based on PVI.

Economic conditions overall are mostly a wash, as unemployment rates as a whole are low, house prices are growing decently, and GDP is growing moderately. The big downside for Democrats, though, is demographic: in many of these states the fraction of White residents is fairly high.

Here is a state by state plot of the fraction of White residents. Many of the states above, especially in the key region of the Midwest, have more White residents than average, suggesting unfavorable demographics for Democrats in those states.

The hesitation then for Democrats in Congress to call for impeachment makes sense in hindsight. Impeachment is a highly political process and, given the forecast above for the 2020 election, what matters is how voters in the expected close states perceive the whole affair. These states contain unfavorable demographic groups and some have a Republican lean.

The fate of the 2020 election will hinge on the state of the economy over the next year (which is outside of anyone’s control) and the sentiment of swing voters and overall turnout across demographics in Michigan, Pennsylvania, New Hampshire, and Minnesota (which is the forecasted tipping point state right now). Democrats have their work cut out for them to convince these key voters why Trump is unsuitable for office heading into 2020.

In addition, because the key states contain less favorable demographic groups for Democrats, it suggests the party should moderate and simplify its policy platform. Spending time discussing the intricacies of taxes to fund Medicare for All (it is not a coincidence that the Warren campaign has not released a specific healthcare plan yet) or the complexities of immigration reform (like the Flores Agreement or offering healthcare to undocumented immigrants) will likely confuse and turn off most voters outside of the party’s base.

There are plenty of widely popular issues outside of the extreme policies favored by the party’s base, which if enacted, could improve the lives of millions. Things like allowing Medicare to negotiate drug prices, a public option for healthcare, increasing the supply of affordable housing, and reducing occupational licenses share wide appeal. Creating a unified platform focused on these policies centered around a message of fixing government for the average American will give the Democrats a better shot to win than a far left, progressive agenda.