Chapter 2: Forecasting a Presidential Election with Monte Carlo Simulation

Phillip Compeau
Programming for Lovers
27 min read · Nov 4, 2019

The Los Angeles Times had a poll which was interesting because I was always up in that poll. They had something that is, I guess, a modern-day technique in polling, it was called “enthusiasm”. They added an enthusiasm factor and my people had great enthusiasm, and [Clinton’s] people didn’t have enthusiasm. — Donald Trump (Source: NY Times)

If you enjoy this work, please join my ongoing programming class for charity: https://canvas.instructure.com/enroll/RKFKKP

An Introduction to Monte Carlo Simulation

One of the many bizarre aspects of the 2016 US Presidential Election was that so few people correctly predicted the outcome from polling data, including statisticians at media outlets like FiveThirtyEight and the New York Times. Although these outlets consistently predicted a Clinton victory, they disagreed significantly about the likelihood of that victory (see figure below).

Election projection results from FiveThirtyEight (top) and The New York Times (bottom) from June 2016 to November 8, 2016, the day of the US presidential election. The blue line shows Clinton’s forecast percentage chance of winning, and the red line shows Trump’s forecast percentage chance of winning (the yellow line in the top plot corresponds to Libertarian Party candidate Gary Johnson, who was not included in the Times forecast). Note that the projections disagree about the likelihood of a Clinton victory at any given time, but they indicate the same trend of this likelihood rising and falling over time.

STOP: Note in the figure above that the two charts appear to have the same shape over time. Why do you think this is, and what do you think caused the projections to fluctuate?

When we observe differences between election forecasts, it should give us pause about the reliability of these forecasts. It also raises the question, “What caused the discrepancies?” To find an answer, we should first consider how we might forecast an election from polls in the first place.

Polls are inherently volatile. Because we are only able to sample a tiny percentage of the voting population, we can only estimate the true percentage of voters favoring a candidate. For this reason, most polls include a margin of error, or a percentage value on either side of the polling percentage. For example, a poll may indicate 51% support for one candidate with a 3% margin of error, in which case the poll projects with some degree of confidence (say, 95% confidence) that the true percentage supporting that candidate is between 48% and 54%.

We will take a moment to review how the U.S. presidential election works. Each of the 50 states, along with the District of Columbia (D.C.), receives two electoral votes, plus a number of additional electoral votes directly related to its population. With two exceptions (Maine and Nebraska), each state is winner-take-all: the candidate who receives the most votes in a state earns all of that state’s electoral votes. There are 538 total votes (hence FiveThirtyEight’s name), and so a candidate must win at least 270 votes to win the presidency.

The actual process is even more complicated, since a state’s electoral votes are cast by electors chosen from the winning candidate’s party. This is by design — the Electoral College was created to provide a final barrier to prevent Americans from electing a highly undesirable candidate. In 2016, a record seven of the 538 electors defected (five Democrats and two Republicans), all of whom voted for someone who was not running in the election.

Because the presidential election is determined by the Electoral College, it makes sense to use state-by-state polling data to forecast the election. Our plan is therefore to simulate the election state-by-state many times, determining who wins the most Electoral College votes each time (or whether there was a tie). At the end, we can estimate our confidence in a candidate winning the election as the fraction of all simulations that candidate wins (see figure below).

A FiveThirtyEight figure showing the variability in the number of electoral votes won by Clinton and Trump across thousands of simulations. The height of each peak represents the percentage of simulations in which a given candidate won the corresponding number of electoral votes in a simulation. The charts are mirror images of each other because in a given election simulation, there are always a fixed number (538) of electoral votes to be divided between the two candidates.

A key point is that we must allow for some variation in our simulations by incorporating randomness, or else the simulations will all turn out the same. The principle of using a large number of randomized simulations to provide an estimate of a desired result is called Monte Carlo simulation.

Monte Carlo simulation is ubiquitous in practical applications. When your phone tells you that there is a 40% chance of rain at 3 PM, it is because meteorologists have run many simulations and found that it is raining at 3 PM in approximately 40% of these simulations. If you want to gain an edge in a daily fantasy sports contest, you might run many Monte Carlo simulations based on previous performance to identify a strong lineup of players. And for another political example, if you wanted to prove that legislative districts have been drawn unfairly, then you could draw many political maps randomly and compare how many districts favor one party in the simulations and in reality (see figures below).

Pennsylvania’s 7th Congressional District over six decades, as it changed from a relatively convex shape to a barely contiguous region that has been likened to Goofy kicking Donald Duck. Source: https://www.fairdistrictspa.com/the-problem/about-gerrymandering.
The 2016 (left) and 2018 (right) Congressional district maps of Pennsylvania. The 2018 district map was drawn by the Supreme Court of Pennsylvania, which ruled that the 2016 map had been unfairly drawn to benefit Republicans. Pivotal testimony was given by a Carnegie Mellon mathematics professor, who ran trillions of random simulations to show that the map on the left was more unfair than practically every map that had been randomly generated.

In this chapter, our goal is to apply Monte Carlo simulation to implement and interpret our own election forecast algorithm. But before we get ahead of ourselves, let’s learn a bit more about Monte Carlo simulation by applying it to a simpler example.

Estimating the House Edge of Craps

Monte Carlo simulation earns its name from the famed casino in Monaco, and so let’s see how random simulations can answer a question about a game of chance. In particular, if you play a given game with a fixed strategy, what is your average payout or loss in a single game over time? Because games are constructed in favor of the casino (or “house”), this average result is always a loss to the player and is called the house edge.

We can compute the house edge of a casino game via Monte Carlo simulation by programming a computer to “play” the game millions of times with the same strategy. The computer sums the net payout/loss over all of these trials and then divides this total by the number of trials to obtain the average result per game.

The simplest type of casino game, which we will call a binary game, has two equal and opposite outcomes: the player either loses the game along with their wager, or the player wins and receives back their wager in addition to an amount equal to their wager. If we assume that the player wagers a single unit of currency each time, then a win adds one unit to the player’s holdings, and a loss subtracts one unit from them.

Most casino games are not binary games, because their payouts vary depending on the outcome as well as the player’s decisions. We will instead focus on a simple binary formulation of craps, in which a player makes a wager and rolls two six-sided dice. Denote the sum of the dice on the first roll by x; there are three possibilities.

  1. If x is 7 or 11, then the player wins, and receives their wager back in addition to the amount of the wager.
  2. If x is 2, 3, or 12, then the player loses the wager.
  3. If x has some other value, then the player continues to roll the dice. On these subsequent rolls, the game stops if x is rolled, in which case the player wins, or if 7 is rolled, in which case the player loses. (Note that this is different from the first roll, on which 7 is a winner.)
This video serves no real purpose in the exposition. I just love this skit and wanted to add it.

We can compute the house edge of craps with the following pseudocode. This function assumes a subroutine that we will write called PlayCrapsOnce that simulates one game of craps and returns true if the player wins and false if the player loses. Note also that the for loop does not name an explicit variable since it isn’t needed within the loop.

ComputeHouseEdge(numTrials)
    count ← 0
    for numTrials total trials
        outcome ← PlayCrapsOnce()
        if outcome = true
            count ← count + 1
        else
            count ← count − 1
    return count/numTrials

There is still no randomness immediately apparent in ComputeHouseEdge, as we have passed this detail to PlayCrapsOnce. Furthermore, the ComputeHouseEdge function could be used for any binary game; implementing the randomness and the rules of craps will be passed down to the subroutine. This exemplifies a paradigm of program design that we will discuss in a later chapter.
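To make this generality concrete, here is a minimal Go sketch of ComputeHouseEdge (a preview of the coding component, not the chapter’s official solution). It accepts any binary game as a function parameter; the fair coin flip in main is a hypothetical stand-in until we write PlayCrapsOnce.

package main

import (
    "fmt"
    "math/rand"
)

// computeHouseEdge plays any binary game numTrials times and returns
// the average payout per unit wagered (negative means a house edge).
func computeHouseEdge(playOnce func() bool, numTrials int) float64 {
    count := 0
    for i := 0; i < numTrials; i++ {
        if playOnce() {
            count++ // player wins one unit
        } else {
            count-- // player loses one unit
        }
    }
    return float64(count) / float64(numTrials)
}

func main() {
    // A stand-in binary game: a fair coin flip, whose house edge is 0.
    fairCoin := func() bool { return rand.Float64() < 0.5 }
    fmt.Println(computeHouseEdge(fairCoin, 1000000)) // prints a value near 0
}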

STOP: Before continuing, try writing a PlayCrapsOnce function that takes no parameters as input and that returns true or false based on whether the simulated player wins a single game of craps. You should assume a subroutine called SumTwoDice that takes no input and that returns an integer between 2 and 12 based on the sum of two simulated dice.

We will now write the PlayCrapsOnce function. Note that this function does not take any inputs — we are only interested in the outcome of playing the game, and the rules of the game are built into the function. Implementing these rules will rely on a collection of nested loops and if statements. Also, we are still not concerned about randomization details, as we have passed these to the SumTwoDice subroutine. You may want to review each line of the function below as an exercise in interpreting control flow.

PlayCrapsOnce()
    firstRoll ← SumTwoDice()
    if firstRoll = 2, 3, or 12
        return false (player loses)
    else if firstRoll = 7 or 11
        return true (player wins)
    else
        while true
            newRoll ← SumTwoDice()
            if newRoll = firstRoll
                return true
            else if newRoll = 7
                return false

STOP: Does using while true not produce an infinite loop? (Hint: do you think that anyone who plays craps ever worries about getting stuck between their first roll and the end of the game?)

We already know all the control structures needed to implement ComputeHouseEdge and PlayCrapsOnce on a computer, except for the two lines that ask us to compute the sum of rolling two dice. To be overly precise, we could even state this as a computational problem.

Sum of Two Dice Problem

Input: (No input)

Output: The sum of the values on two simulated six-sided fair dice.

We just need to implement the randomized process of rolling two dice; how hard could that be?

Pitfalls of (Pseudo)Random Number Generation

Von Neumann’s Middle-Square Pseudorandom Number Generator

Take a moment to think up a few random digits. (Here’s mine: 6, 1, 2, 8, 0, 3, 3, 0, 8, 0, 0, 4, 5, 7, 9.) You might think that the numbers you are generating are random, but there is some process going on in your brain to produce these numbers that is anything but random. In fact, experiments have shown that individuals tend to underrepresent the presence of repeated numbers. (When picking lottery numbers, would you pick the winning numbers from the previous day?)

STOP: Recalling the Dobzhansky quote that “nothing in biology makes sense except in the light of evolution” from a previous chapter, what evolutionary purpose would being able to quickly make a (seemingly) random decision serve?

If we are thinking like the ancient Greeks, then we might hope for a mathematical formula to save us. This was also the wish of John von Neumann, a computational pioneer who in the late 1940s was working on an early computer called ENIAC (see figure below). At that time, random number generation was so primitive that the state of the art was A Million Random Digits with 100,000 Normal Deviates, a book that was in development by the RAND Corporation. This book eventually contained 400 pages of a million random digits between 0 and 9 generated by spins of an electronic roulette wheel; 20 of the 32 resulting numbers were converted to digits, whereas the other 12 were ignored. If you wanted a random number, you opened the book to an arbitrary page and started transcribing digits. Needless to say, a better computational approach for generating random numbers was sorely needed.

One method that von Neumann considered for random number generation is called the Middle-Square approach. It starts with an arbitrary n-digit integer called the seed and generates a new “random” n-digit integer by first squaring the number to produce a 2n-digit number, and then taking the middle n digits of the result. For example, if n = 4 and our seed is 1600, then 1600² = 2,560,000; this number viewed as an 8-digit number is represented as 02,560,000, whose middle four digits are 5600. This number becomes our first randomly generated integer; to generate a new random integer, we square 5600 and take its middle four digits.

The numbers generated by the Middle-Square approach may appear random, but they are generated by a deterministic method that always produces the same sequence of numbers from a given initial seed. We will therefore use the term pseudorandom number generator, or PRNG for short, to refer to a deterministic method like the Middle-Square approach that generates a sequence of numbers in an effort to mimic randomness.

STOP: With a seed of 1600, we saw that the next four-digit random number generated by the Middle-Square approach was 5600. What are the next three numbers that will be generated?

An ideal PRNG would, over the long term, produce each n-digit number with equal frequency, independently of the seed. But as the preceding exercise indicates, the problem with the Middle-Square approach is that it often generates short cycles, in which the same numbers come up repeatedly. Consider that 8100² = 65,610,000, 6100² = 37,210,000, 2100² = 04,410,000, and 4100² = 16,810,000, so we are only able to generate four numbers from the seed 8100. Worse, if we pick 3792 as our seed, then 3792² = 14,379,264, and so we generate the same number over and over in the name of randomness.
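Here is a minimal Go sketch of the Middle-Square approach for n = 4 digits; running it from the seed 1600 makes the short cycle plain.

package main

import "fmt"

// middleSquare squares a four-digit number and returns the middle four
// digits of the resulting (at most) eight-digit square.
func middleSquare(x int) int {
    square := x * x
    return (square / 100) % 10000 // drop the last two digits, keep the next four
}

func main() {
    x := 1600
    for i := 0; i < 8; i++ {
        x = middleSquare(x)
        fmt.Println(x) // 5600, 3600, 9600, 1600, 5600, ... a cycle of length 4
    }
}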

The poor performance of the Middle-Square approach is unfortunately the norm when we try to use functions based on arithmetic operations to generate random numbers. As von Neumann himself lamented,

Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number.

Lagged Fibonacci Generators

In a truly random sequence of numbers, we would expect to see repeated numbers — it would be anything but random if a PRNG were to produce a sequence containing every four-digit number between 0 and 9999 exactly once without any repeats. However, the Middle-Square approach suffers from the critical weakness that whenever a number is encountered that has already been generated, the entire sequence of numbers following that number will be generated again.

In other words, we currently think of a PRNG as generating a new number y from only one preceding number x, which means that whenever we encounter the same x, we will produce the same y. What if the PRNG instead produced a new integer from more than one previous integer? “Remembering” more than one previous number would allow the generator to produce the same number x multiple times without necessarily causing a cycle.

A simple PRNG with a longer memory is based on the Fibonacci sequence. The first two terms of this sequence are defined as equal to 1, F(0) = F(1) = 1. Subsequent terms are defined as the sum of the previous two terms; that is, for n ≥ 2, F(n) = F(n-1) + F(n-2). Of course, the terms of this sequence continue on toward infinity (1, 1, 2, 3, 5, 8, 13, 21, 34, …), but we could form a PRNG called a Fibonacci generator by letting F(n) be the remainder of F(n-1) + F(n-2) after dividing by some fixed value m. For example, if m = 1,000,000, then for F(n-1) = 832,040 and F(n-2) = 346,269, we would set F(n) equal to Remainder(832,040 + 346,269, 1,000,000), which in turn is equal to 178,309.

STOP: How many initial seeds will we need for the Fibonacci generator?

Note that the Fibonacci generator will require not one but two seeds. If we change F(0) and F(1), then this will likely cause F(2) to change, and as a result we will obtain a different sequence of integers generated by the PRNG.

STOP: Do you see any issues with the Fibonacci generator?

Even though we now have a PRNG with a longer memory, there are substantial flaws with the Fibonacci generator, chiefly that once we have seen two integers in a sequence, we will immediately know the next number in the sequence. Hardly random!

We can make a slight adjustment to the Fibonacci generator if we ask it to “remember” farther back in the sequence of random integers.

That is, for some fixed positive integers j and k, we set F(n) equal to the remainder of the sum F(n−j) + F(n−k) divided by m. The resulting PRNG is called a lagged Fibonacci generator, and it is surprisingly practical given its simplicity. We generally avoid mention of specific languages, but it is worth pointing out that Go’s built-in PRNG uses a lagged Fibonacci generator with j = 273, k = 607, and m = 2³¹. That is, F(n) is equal to Remainder(F(n−273) + F(n−607), 2³¹).

STOP: How many seeds do you think we will need for a lagged Fibonacci generator? Why?

Programming languages typically have a simple PRNG like this built in, because most applications don’t need something state of the art.
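As an illustration, here is a toy lagged Fibonacci generator in Go. The lags j = 3 and k = 7 and the modulus m = 10,000 are small values chosen for readability, not the parameters of any real library; note that the generator must be seeded with k values before it can produce anything.

package main

import "fmt"

// laggedFib holds the state of a lagged Fibonacci generator with
// F(n) = Remainder(F(n-j) + F(n-k), m). It must be seeded with k values.
type laggedFib struct {
    values  []int // all values generated so far, beginning with the seeds
    j, k, m int
}

// next generates and records the next pseudorandom value. (A real
// implementation would keep only the last k values in a ring buffer.)
func (g *laggedFib) next() int {
    n := len(g.values)
    v := (g.values[n-g.j] + g.values[n-g.k]) % g.m
    g.values = append(g.values, v)
    return v
}

func main() {
    g := &laggedFib{
        values: []int{7241, 193, 5088, 6047, 912, 4370, 2958}, // k = 7 seeds
        j:      3,
        k:      7,
        m:      10000,
    }
    for i := 0; i < 10; i++ {
        fmt.Print(g.next(), " ")
    }
    fmt.Println()
}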

This discussion makes us wonder: what makes a PRNG acceptable for use in applications like Monte Carlo simulation? For one, we would hope that different seeds yield sequences of numbers that are different from each other. We would also hope to avoid repeating sequences of numbers like those we saw with the Middle-Square approach. The German Federal Office for Information Security identified four “levels” of security that PRNGs should attain to be suitable for use in applications.

Level 1: Different sequences generated by the PRNG should usually be different from each other.

Level 2: The PRNG should generate numbers with the same properties of a sequence of truly random numbers (e.g., how often a number repeats, or how often a given triplet of numbers repeats).

Level 3: A hacker shouldn’t be able to use the sequence of numbers to guess any future numbers.

Level 4: A hacker shouldn’t be able to use the sequence of numbers to guess any previous numbers.

PRNGs satisfying all four levels of security do exist, although they are slower than simpler PRNGs. Furthermore, it took decades of research to obtain advanced PRNGs, which are mathematically complicated — yet one more reason why mathematics is vital for an understanding of computer science.

Finally, before returning to our work with Monte Carlo simulation, we note that part of the reason we emphasize an understanding of the fundamentals of PRNGs is that they are still relevant today. A Wired story from 2017 tells a fascinating tale of a team of hackers based in St. Petersburg who allegedly made millions on a particular make of slot machine by using an understanding of the machine’s PRNG to predict when the machine would make large payouts.

STOP: If we are able to use past results to time a slot machine payout, then what level does the machine’s PRNG not achieve?

Simulating Craps

Rolling a single die

Recall that to estimate the house edge of craps, we need to simulate rolling two dice. First, we will consider the simpler process of simulating a single die roll, which amounts to uniformly generating a pseudorandom number between 1 and 6.

Programming languages contain packages, or collections of pre-written code that are likely to be reused in multiple contexts. In particular, most languages have a “random” package with a built-in pseudorandom number generator. (In Go, this package, built upon the lagged Fibonacci PRNG mentioned previously, is called rand.) A random package should, in addition to other functions, include three useful functions:

  1. RandInt: takes no inputs and returns a pseudorandom integer (within the range of integers allowable for the language and operating system).
  2. RandIntn: takes a single integer n as input and returns a pseudorandom integer between 0 and n − 1.
  3. RandFloat: takes no inputs and returns a pseudorandom floating-point decimal number in the interval [0, 1).
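In Go, for example, the math/rand package offers close analogues of these three functions:

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    fmt.Println(rand.Int())     // RandInt: an arbitrary pseudorandom non-negative integer
    fmt.Println(rand.Intn(10))  // RandIntn(10): a pseudorandom integer between 0 and 9
    fmt.Println(rand.Float64()) // RandFloat: a pseudorandom decimal in [0, 1)
}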

STOP: Which of these functions should we use to simulate rolling a die?

Students typically wish to use RandIntn to simulate a die roll, but in fact, we can use any of these three functions. For example, if we generate a pseudorandom decimal number from [0, 1), then we can divide this interval into six equally-sized subintervals of length 1/6 and assign the die a value based on which subinterval our number falls into.

RollDie()
    roll ← RandFloat()
    if roll < 1/6
        return 1
    else if roll < 2/6
        return 2
    else if roll < 3/6
        return 3
    else if roll < 4/6
        return 4
    else if roll < 5/6
        return 5
    else
        return 6

A shorter function uses RandIntn to generate a pseudorandom integer between 0 and 5, inclusive, and then adds 1 to the resulting number.

RollDie()
    roll ← RandIntn(6)
    return roll + 1

But we can also apply RandInt, generating an arbitrary pseudorandom integer and then taking its remainder when dividing by 6 to obtain an integer between 0 and 5, inclusive. As in the preceding function, we then add 1 to the resulting integer.

RollDie()
    roll ← RandInt()
    return Remainder(roll, 6) + 1
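For reference, here is a sketch of all three variants in Go. In the first, multiplying roll by 6 and truncating to an integer finds the subinterval directly, replacing the chain of if statements.

package main

import (
    "fmt"
    "math/rand"
)

// rollDieFloat divides [0, 1) into six equal subintervals of width 1/6.
func rollDieFloat() int {
    roll := rand.Float64()
    return int(roll*6) + 1 // subinterval index 0..5, shifted to 1..6
}

// rollDieIntn generates a pseudorandom integer in 0..5 and adds 1.
func rollDieIntn() int {
    return rand.Intn(6) + 1
}

// rollDieRemainder takes an arbitrary pseudorandom integer modulo 6 and adds 1.
func rollDieRemainder() int {
    return rand.Int()%6 + 1
}

func main() {
    fmt.Println(rollDieFloat(), rollDieIntn(), rollDieRemainder())
}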

STOP: Say that we have a weighted die that produces each of 1, 3, 4, and 5 with probability 1/10, and that produces each of 2 and 6 with probability 3/10. Write pseudocode for a RollWeightedDie function that models this weighted die.

We now return to the Sum of Two Dice Problem. The table below shows the sum on two dice for each of the 36 = 6² different outcomes of rolling two six-sided dice. Assuming that the dice are fair, each of these outcomes is equally likely.

A table showing all 36 outcomes for rolling two dice and the sum of the dice for each outcome.

STOP: Write a function SumTwoDice that takes no input parameters and returns the simulated sum of two six-sided dice.

If we are thinking mathematically, then we notice that the probability of rolling an x is the number of ways in which an x can be rolled, divided by the total number of outcomes (36). So, the probability of rolling a 2 is 1/36, the probability of rolling a 3 is 2/36, and so on. This reasoning leads us to apply the same principle we saw earlier to divide the interval [0, 1) into subintervals. We could divide [0, 1) into 36 subintervals of equal width, but our function will be shorter if we divide [0, 1) into 11 subintervals, where the width of each subinterval equals the probability that the sum of two dice is the corresponding one of the 11 integer values between 2 and 12. This idea is implemented by the following pseudocode for a SumTwoDice function.

SumTwoDice()
    roll ← RandFloat()
    if roll < 1/36
        return 2
    else if roll < 3/36
        return 3
    else if roll < 6/36
        return 4
    ... (etc.)
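In Go, the eleven subintervals can be handled compactly with a table of outcome counts instead of eleven if statements; the sketch below fills in the “(etc.)” cases this way.

package main

import (
    "fmt"
    "math/rand"
)

// sumTwoDice picks one of the 11 sums with probability equal to
// (number of outcomes giving that sum)/36.
func sumTwoDice() int {
    // counts[i] is the number of the 36 outcomes whose sum is i + 2.
    counts := [11]float64{1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1}
    roll := rand.Float64()
    cumulative := 0.0
    for i, c := range counts {
        cumulative += c / 36.0
        if roll < cumulative {
            return i + 2
        }
    }
    return 12 // guard against floating-point rounding; effectively unreachable
}

func main() {
    fmt.Println(sumTwoDice())
}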

In this work, we have seen the power of mathematical rigor for helping us to write efficient programs. But summing two dice offers an example of how we can miss a much simpler solution to a problem by being overly mathematical. In this case, modularity will make our work very easy, since rolling two dice is equivalent to rolling a single die twice.

SumTwoDice()
    return RollDie() + RollDie()

STOP: Write a function in pseudocode called SumMultipleDice that takes an integer numDice as input and returns the sum of the outcomes of rolling a die numDice times.
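If you would like a preview of the coding component, here is one way all of the craps functions might fit together in Go; consider it a sketch rather than the official solution. Since our binary craps follows the casino’s classic pass-line rules, whose house edge is known to be 7/495 ≈ 1.41%, the program should print a value near -0.014.

package main

import (
    "fmt"
    "math/rand"
)

func rollDie() int { return rand.Intn(6) + 1 }

func sumTwoDice() int { return rollDie() + rollDie() }

// playCrapsOnce returns true if the simulated player wins one game.
func playCrapsOnce() bool {
    firstRoll := sumTwoDice()
    switch firstRoll {
    case 2, 3, 12:
        return false // player loses on the first roll
    case 7, 11:
        return true // player wins on the first roll
    }
    // Otherwise, roll until firstRoll appears again (win) or a 7 appears (loss).
    for {
        newRoll := sumTwoDice()
        if newRoll == firstRoll {
            return true
        }
        if newRoll == 7 {
            return false
        }
    }
}

// computeHouseEdge returns the average payout per unit wagered.
func computeHouseEdge(numTrials int) float64 {
    count := 0
    for i := 0; i < numTrials; i++ {
        if playCrapsOnce() {
            count++
        } else {
            count--
        }
    }
    return float64(count) / float64(numTrials)
}

func main() {
    fmt.Println(computeHouseEdge(1000000)) // should land near -0.0141
}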

If you are interested in seeing how to implement the above functions and estimate the house edge of craps, join us for the coding component of this chapter. For now, we return to forecasting an election from polling data, where we will also see the power of modularity.

Simulating a Presidential Election

Planning election functions in pseudocode

We can learn from our work with computing the house edge of craps that a natural way to run many election simulations is to pass the heavy lifting to a function that simulates a single election. The pseudocode below for a SimulateMultipleElections function does just this.

In practice, different polls have different margins of error, reflecting variables like the number of respondents and the historical reliability of the poll. However, we will assume that the margin of error is a fixed parameter called marginOfError. We will use this parameter to incorporate random bounce into our simulations. If this parameter is 0.06, then we are approximately 95% confident that our state poll is accurate within 6% in either direction. For example, if candidate 1’s polling percentage in California is 64%, then we are 95% sure that the true percentage of voters supporting candidate 1 is between 58% and 70%.

We will also not specify yet how the polling data will be represented; for the time being, we will just call these data pollingData, a parameter that we will pass along with marginOfError to the SimulateOneElection subroutine, which returns the number of Electoral College votes each candidate receives in a single simulated election.

We will now present SimulateMultipleElections, which calls SimulateOneElection numTrials times, keeping track of the number of wins each candidate accumulates (as well as the number of ties). It then divides each count by the total number of trials to obtain estimated “probabilities” of each candidate winning, along with the estimated probability of a tie.

SimulateMultipleElections(pollingData, numTrials, marginOfError)
    winCount1 ← 0
    winCount2 ← 0
    tieCount ← 0
    for numTrials total trials
        votes1, votes2 ← SimulateOneElection(pollingData, marginOfError)
        if votes1 > votes2
            winCount1 ← winCount1 + 1
        else if votes2 > votes1
            winCount2 ← winCount2 + 1
        else (tie!)
            tieCount ← tieCount + 1
    probability1 ← winCount1/numTrials
    probability2 ← winCount2/numTrials
    probabilityTie ← tieCount/numTrials
    return probability1, probability2, probabilityTie

A natural way to represent pollingData is as two maps. An electoralVotes map associates state name keys with the (positive) integer number of electoral votes for that state. A polls map associates state name keys with the percentage at which candidate 1 is polling in that state. See the figure below.

(Left) An abbreviated map of state names to the state’s number of Electoral College votes. (Right) A hypothetical abbreviated map of state names to the polling percentage for candidate 1 in the state.
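In Go, the two maps might be declared as follows; the electoral vote counts are real 2016 values, while the polling percentages are hypothetical, echoing the figure above.

package main

import "fmt"

func main() {
    // Number of Electoral College votes per state (abbreviated).
    electoralVotes := map[string]uint{
        "California": 55,
        "Texas":      38,
        "Wyoming":    3,
        // ... one entry for each state, plus D.C.
    }
    // Candidate 1's polling percentage per state (hypothetical values).
    polls := map[string]float64{
        "California": 0.64,
        "Texas":      0.40,
        "Wyoming":    0.29,
        // ...
    }
    fmt.Println(electoralVotes["California"], polls["California"])
}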

To simulate a single election, we need to use these polling data along with marginOfError to produce a randomly adjusted polling percentage for each state. If this adjusted percentage is greater than or equal to 50%, then we assume that candidate 1 wins the state (a tie will be extremely unlikely). We will return to discuss the details of this adjustment soon; for now, we will call the subroutine that achieves this objective AddNoise. The SimulateOneElection function could return which candidate wins (or if there was a tie), but we will have this function return the number of electoral votes that each candidate wins in the simulated election.

SimulateOneElection(polls, electoralVotes, marginOfError)
    votes1 ← 0
    votes2 ← 0
    for every key state in polls
        poll ← polls[state] (candidate 1's polling percentage)
        adjustedPoll ← AddNoise(poll, marginOfError)
        if adjustedPoll ≥ 0.5 (candidate 1 wins state)
            votes1 ← votes1 + electoralVotes[state]
        else (candidate 2 wins state)
            votes2 ← votes2 + electoralVotes[state]
    return votes1, votes2
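Here is how SimulateOneElection might look in Go. The addNoise helper below is a temporary uniform stand-in so that the sketch runs; we refine it in what follows.

package main

import (
    "fmt"
    "math/rand"
)

// addNoise is a placeholder for the AddNoise subroutine developed below;
// for now it nudges the poll uniformly within the margin of error.
func addNoise(poll, marginOfError float64) float64 {
    return poll + (2*rand.Float64()-1)*marginOfError
}

// simulateOneElection returns the number of electoral votes won by each
// candidate in a single simulated election.
func simulateOneElection(polls map[string]float64, electoralVotes map[string]uint, marginOfError float64) (uint, uint) {
    var votes1, votes2 uint
    for state, poll := range polls {
        adjustedPoll := addNoise(poll, marginOfError)
        if adjustedPoll >= 0.5 {
            votes1 += electoralVotes[state] // candidate 1 wins the state
        } else {
            votes2 += electoralVotes[state] // candidate 2 wins the state
        }
    }
    return votes1, votes2
}

func main() {
    polls := map[string]float64{"California": 0.64, "Texas": 0.40, "Wyoming": 0.29}
    electoralVotes := map[string]uint{"California": 55, "Texas": 38, "Wyoming": 3}
    fmt.Println(simulateOneElection(polls, electoralVotes, 0.06))
}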

As with our work with craps, we see that the randomization is lurking within a subroutine, which in this case is the AddNoise function.

STOP: How could we write AddNoise? What random function would be helpful?

Unlike our previous work, using RandInt or RandIntn is unnatural because we need to generate a random decimal number x that we will be adding to candidate 1’s polling percentage. One idea is to generate a noise value between -marginOfError and marginOfError by first generating a pseudorandom decimal number between -1 and 1 and then multiplying the result by marginOfError.

STOP: We have a built-in function RandFloat that returns a pseudorandom decimal number between 0 and 1. How could we use this function to produce a pseudorandom decimal number between -1 and 1?

Once we generate a random decimal between 0 and 1, we can multiply it by 2, and then subtract 1 from the result, in order to obtain a random decimal between -1 and 1. Multiplying this number by marginOfError produces the desired x.

AddNoise(poll, marginOfError)
    x ← RandFloat()
    x ← 2 * x (now x is between 0 and 2)
    x ← x − 1 (now x is between -1 and 1)
    x ← x * marginOfError (now it’s in range 😊)
    return x + poll

However, this AddNoise function suffers from a flaw: the noise it generates is uniform, meaning that the noise value is just as likely to land at the “tails” of the range of possible values as in the middle. It also does not allow for the desired 5% of extreme cases in which the noise value falls outside the range between -marginOfError and marginOfError.

In practice, the probability of generating a given noise value should be weighted so that values closer to 0 are more likely than values at the tails. One way of achieving this is with a standard normal density function, a “bell-shaped” curve that will be familiar to anyone who has taken a statistics course (see figure below). The probability of generating a number between numbers a and b is equal to the area under the function between the x values of a and b; that is, the integral of the standard normal density function. Because the density function is taller near zero, the probability of generating a number near zero is higher, and this probability decreases the farther we move from zero. There is approximately a 68.3% probability of generating a number between -1 and 1, and a 95.4% probability of generating a number between -2 and 2.

The standard normal density function. The area under the curve between x-values of a and b is equal to the probability of generating a pseudorandom number between a and b.

STOP: What does the density function look like if we are generating decimal numbers uniformly between 0 and 1?

We can therefore obtain a random noise value in the range from -marginOfError to marginOfError with approximately 95% probability by generating a pseudorandom decimal according to the standard normal density function, and then multiplying this number by marginOfError/2. But how can we produce these numbers?

Remember our classic text for computing random numbers in the pre-computing era, A Million Random Digits with 100,000 Normal Deviates? The book earns the second half of its name because it contains 100,000 random numbers sampled according to the standard normal density function. Fortunately, we won’t need to consult a 70-year-old book, because programming languages typically include a function RandNormal that returns a random number according to the standard normal density function, allowing us to update our implementation of AddNoise as follows.

AddNoise(poll, marginOfError)
    x ← RandNormal()
    x ← x/2 (95% chance of x being between -1 and 1)
    x ← x * marginOfError (now x is in range)
    return x + poll

And just like that, we are done! Note how we saved the hardest part of our work, incorporating randomness into the polling data, until last. We will say more about this design strategy in the next chapter.
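Before moving on, here is how the finished forecast might fit together in Go, as a hedged sketch rather than the official solution. Go’s real math/rand package provides rand.NormFloat64, which plays the role of RandNormal; the three-state dataset in main is purely illustrative.

package main

import (
    "fmt"
    "math/rand"
)

// addNoise perturbs a poll by normally distributed noise whose
// two-standard-deviation range matches the margin of error.
func addNoise(poll, marginOfError float64) float64 {
    x := rand.NormFloat64() // standard normal: mean 0, standard deviation 1
    x = x / 2               // ~95% chance that x now lies in [-1, 1]
    x = x * marginOfError   // scale the noise to the margin of error
    return x + poll
}

// simulateOneElection returns the electoral votes won by each candidate.
func simulateOneElection(polls map[string]float64, electoralVotes map[string]uint, marginOfError float64) (uint, uint) {
    var votes1, votes2 uint
    for state, poll := range polls {
        if addNoise(poll, marginOfError) >= 0.5 {
            votes1 += electoralVotes[state]
        } else {
            votes2 += electoralVotes[state]
        }
    }
    return votes1, votes2
}

// simulateMultipleElections estimates each candidate's probability of
// winning (and the probability of a tie) over numTrials simulations.
func simulateMultipleElections(polls map[string]float64, electoralVotes map[string]uint, numTrials int, marginOfError float64) (float64, float64, float64) {
    winCount1, winCount2, tieCount := 0, 0, 0
    for i := 0; i < numTrials; i++ {
        votes1, votes2 := simulateOneElection(polls, electoralVotes, marginOfError)
        switch {
        case votes1 > votes2:
            winCount1++
        case votes2 > votes1:
            winCount2++
        default:
            tieCount++
        }
    }
    n := float64(numTrials)
    return float64(winCount1) / n, float64(winCount2) / n, float64(tieCount) / n
}

func main() {
    // A purely illustrative three-state "election".
    polls := map[string]float64{"California": 0.64, "Texas": 0.40, "Wyoming": 0.29}
    electoralVotes := map[string]uint{"California": 55, "Texas": 38, "Wyoming": 3}
    p1, p2, pTie := simulateMultipleElections(polls, electoralVotes, 1000000, 0.06)
    fmt.Println(p1, p2, pTie)
}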

Implementing Our Work

In the video below, we use Go to implement the craps simulator and election forecast algorithm that we have discussed above. We then apply this forecast algorithm to real polling data from the 2016 US election. What will we find?

If you’re interested in seeing the results of this forecast without implementing it yourself, feel free to skip ahead to the next section.

Conclusion: A Reflection on the Nature of Election Forecasting

Improving our election forecast algorithm

Designing an algorithm, converting it to code, and getting the program to run does not mean that we have solved a problem. Often these tasks represent just the first steps of analyzing a dataset. To be fully fledged programmers, we must also interpret our results.

To put it more bluntly, we must ask, “Is our forecasting approach any good?” Running our forecast on real 2016 polling data (see figure below), we cannot help but notice that our prediction is more confident in Clinton’s victory than both the New York Times and FiveThirtyEight were. What has caused this discrepancy?

A simulation of election polling data with a margin of error set to 10%, for three different polling data sets taken from the 2016 election. Our forecast is consistently more confident in a Clinton victory than either of the media outlets from the introduction.

After all, the margin of error of 10% used to produce the table above is much wider than the margin of error of most polls. It means that even if a candidate is polling at 60% in a state, a veritable landslide, there is a greater than 2% chance that our forecast will award that state to the opposing candidate. So if our forecast is more conservative in this regard than professional forecasts, why does it lean more heavily in favor of Clinton? What did the media outlets add to their forecasts that we are missing?

We will highlight three examples of how to make our forecast more robust. First, the polling data used for our forecast, available at the course website, provides an aggregate polling value for a given state at a given time. Yet some polls may be more reliable than others, whether because they survey more people or because the poll has traditionally shown evidence of less bias. (See figure below.)

Several polls used by FiveThirtyEight, along with polling statistics, an estimated bias in the poll toward one party or another, and a grade of the poll. Polls with higher grades have a greater influence over the forecast. Source: FiveThirtyEight.

Second, because of the small sample sizes of polls compared to the population, forecasters typically use a bell-shaped curve that is “shorter” around 0 and that has “fatter” tails when adding noise to a polling value. Sampling randomly from a density function with fat tails allows a higher likelihood of obtaining an adjusted polling value farther from its current value in AddNoise.

Finally, when simulating the election we have thus far treated the states as independent. In practice, states are heavily correlated; there is a very good chance, for example, that North Dakota and South Dakota will vote the same way. Accordingly, if in a single simulation AddNoise adjusts a given state poll in favor of one candidate, then we should most likely adjust the polls of any correlated states in favor of this candidate as well.

The correlation of state results makes it much more likely for the underdog to win — in 2016, Trump won most of the “Rust Belt” states stretching from western New York across the Great Lakes region into Wisconsin, even though polls indicated that he had a relatively small chance of winning each of these states.

Is forecasting an election hopeless?

We could easily add these features to improve our own forecast algorithm. However, it will be more fruitful to pause our coding and instead reflect on the inherent weaknesses of forecasting any election from polling data.

A hidden assumption of our work thus far is that responses to a poll adequately reflect the decisions that respondents will make in the privacy of the ballot box. One way in which this assumption can fail is if one candidate’s supporters have higher voter turnout. One national poll, the Los Angeles Times “Daybreak Poll”, weighted responses based on how enthusiastic a voter was about their respective candidate. That is, a respondent could indicate 60% support for Trump, or 80% support for Clinton, rather than providing a binary response. The Daybreak Poll, shown in the figure below, consistently favored Trump in 2016.

The Los Angeles Times Daybreak Poll, in which polling responses were weighted based on enthusiasm, was often panned in 2016 because it consistently showed Trump as leading.

STOP: Does the fact that the Daybreak Poll forecast the correct winner make it a good poll?

Just because a forecast is correct does not make it a well-designed forecast. For example, say that your local meteorologist predicts rain tomorrow, while your neighbor predicts sunshine based on the outcome of a coin flip. If it turns out sunny, you would hardly call your neighbor an expert. The Daybreak Poll’s biggest flaw is that it was a national poll despite the election being decided state by state. In fact, Clinton won the national “popular vote” by about three million ballots, so in this sense the Daybreak Poll was just as wrong as the other media forecasts — although weighting polls by enthusiasm is an interesting idea.

We continue the weather analogy by asking you to consider the following question.

STOP: If a meteorologist tells you that there is a 70% chance of rain on a given day three months from now, would you believe them?

We would never trust a weather forecast three months in advance, but we could probably trust knowing what time the sun will rise three months in advance. Understanding the flaw in all election simulations requires us to understand what is different about these two examples, and will take us on a detour to the stock market.

The figure below shows the prices of two stocks (called “stock A” and “stock B”) over a five-year period from January 2012 to January 2017. Imagine that it is January 2017 and we play a binary game. You wager $1 and pick one of the two stocks. If in six months’ time your stock’s price has increased by 30%, then you win $5; otherwise, you lose your $1. Which of the two stocks would you choose?

The price of mystery stock A in dollars per share from January 2012 to January 2017.
The price of mystery stock B in dollars per share from January 2012 to January 2017.

Many students will pick Stock A for their wager because it has had the better performance over the previous five years. Yet you are more likely to win your bet if you choose Stock B, because its price is much more volatile than Stock A’s. In fact, in July 2017, six months after the time of the bet, Stock A (Colgate-Palmolive) was trading at around $73 per share, a meager gain over its price in January 2017, whereas Stock B (First Solar) had rocketed up to around $45 (see figure below). In this case, you would have won your bet, although the stock price could just as easily have plummeted — which it did in the middle of 2018.

Stock A (Colgate Palmolive) rose a few percentage points in 2017 and then continued its low-volatility trend. Note: the stock price does look like it is bouncing around, but note how small the change in price is on the y-axis; the stock essentially stayed between $60 and $75 for a five-year period.
Stock B (First Solar) skyrocketed upward in 2017, and then continued climbing until spring 2018, when it lost half its value in just a few months, and then jumped back up by 50% in 2019.

If you think about our binary game, in which we wager on a stock going up 30% in six months, there is not a single point in time over the last five years at which you would have won this bet on Colgate-Palmolive. The case is different for First Solar, where we find times in mid-2016, early to mid-2017, and again in late 2018 and early 2019 at which this bet would have been a winner.

These types of bets may seem contrived, but they are an example of a derivative, an investment whose price is tied to that of some underlying asset. Our wager that a stock will increase to a certain amount K by time T is a simplified form of a European call option: if the price of the stock is x at time T, then the payout of the option is $0 if x is less than K, and x − K if x is greater than or equal to K. The key point is that a stock with higher historical volatility will have more expensive call options when K is far above the stock’s current price.

We can be sure about the time that the sun will rise in three months because, barring armageddon, sunrise time has very low volatility. In contrast, weather is very volatile, which makes forecasting it very difficult for the next few days, and impossible for several months in the future.

And what of election forecasting? We reproduce the FiveThirtyEight and New York Times projections below. Notice how wildly the forecasts swing back and forth; in the most notable case, an early August FiveThirtyEight forecast that was 50/50 swung back to an 80% forecast for Clinton in just a few days. These swings in the forecasts reflect very volatile polls that are influenced by news events, as well as scheduled occurrences during the campaigns such as party conventions and debates.

These media outlets built pretty data visualizations on top of a fun method of forecasting an election, one that a novice programmer can implement with a little help. But they also made a critical error by forecasting an event far in the future based on volatile data, all while sometimes selling their work as estimating a “probability” of victory. A far better election forecast in the summer of 2016 would have been to shrug one’s shoulders and declare that predicting any future event in an environment of high volatility is no different from playing a game of chance.

Thanks for reading! If you like this craziness, check out my ongoing course: https://canvas.instructure.com/enroll/RKFKKP
