A Not-So-Secret Ballot

How differential privacy can maintain voter anonymity

Nicolas Grislain
Sarus Blog
Oct 28, 2020 · 9 min read


Biden/Harris and Trump/Pence signs, placed in neighboring yards — photo credit: goerie.com

If you are a US citizen, you will soon vote, or have already voted, in the presidential election. Over the past few months, as the election approached, you may have felt strong social pressure to vote one way or another.


Perhaps your employer has endorsed a particular party or candidate. Maybe your neighbors are vocal about what a disaster it would be if the candidate from party D or R were to stay in, or get to, the White House.

Luckily, the ballot is secret, so you won’t be intimidated into voting against your beliefs. Or is it? Actually, the way vote counts are made public can be problematic from a privacy standpoint and can inadvertently reveal how an individual voted.


In this post, we will first see how merely revealing election results leaks some information about individual votes, and then how adding a small random perturbation could be a solution, protecting voters’ political choices while still allowing the computation of very precise results at the state level.

What is made public about your vote?

Digging into the precinct-level 2016 presidential election results collected by the MIT Election Data and Science Lab (MEDSL), we can find many precincts in which a vote is far from secret.

Unanimous precincts

For the 2016 presidential election, in some precincts, all citizens voted for the same candidate. Let’s call those precincts unanimous. This is the case for precinct 930016 in Comanche County, Texas, which reported 69 votes for Trump out of 69 votes cast.

Precinct 930016 in Comanche County, Texas voted 100% Trump — image by author

It is also the case for Detroit City precinct 181 in Wayne County, Michigan, which reported 143 votes for Clinton out of 143.

Detroit City, precinct 181, Wayne County, Michigan voted 100% Clinton — image by author

Even if they account for a small share of the votes (23k out of 130M), as many as 4,604 precincts (out of 172,638) are unanimous. If we know Alice voted in one of those precincts, we know exactly whom she voted for.

Precincts with overwhelming majorities

For the same election, one candidate received more than 95% of the votes in 9,118 precincts, accounting for 2M votes.

In precinct 3930002 in Roberts County, Texas, Trump received 134 of the 135 votes cast. Similarly, Clinton received 425 of the 426 votes cast in the CLEVELAND-07-Q precinct in Cuyahoga County, Ohio.

If we only know that Alice lives in one of those precincts, it is harder to deduce whom she voted for: we need auxiliary information. If, for example, we know with near-certainty how everyone else in the precinct voted, then we can be certain about Alice’s vote.

Let’s try to formalize what is learned about Alice when vote counts from her precinct are published.

Bayesian learning from election results

Let’s consider a precinct of size n, whose citizens are indexed by the integers 1 through n. Alice, indexed by a, is one of them.

For simplicity we assume the existence of 2 parties, D and R, which are the only 2 options available (no abstention is allowed). A citizen i either votes for party D (i.e. i ∈ D, i belongs to the set D of D-voters) or for party R (i.e. i ∉ D). We start by assuming nothing about the political preferences of the voters. If we know nothing at all, our best guess is that each citizen is equally likely to vote for either party. In Bayesian terms, we say that our prior probability is 0.5.
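In symbols, with \Pr denoting probability, this prior reads:

\Pr(i \in D) = 1/2 \quad \text{for every citizen } i \text{, including Alice } (i = a)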

Suppose that the results are published and that d votes go to party D, and n−d to party R. Using Bayes’ rule, we can update our beliefs about Alice’s vote once we know d.
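Spelled out, the update is a direct application of Bayes’ rule under the notation above:

\Pr(a \in D \mid d) = \frac{\Pr(d \mid a \in D)\,\Pr(a \in D)}{\Pr(d \mid a \in D)\,\Pr(a \in D) + \Pr(d \mid a \notin D)\,\Pr(a \notin D)}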

We can then deduce the intuitive and rather obvious probability that Alice voted for D given the number d of votes for D.
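Indeed, under the uniform prior the number of D votes among the other n−1 citizens is binomial with parameter 1/2, so the expression simplifies:

\Pr(a \in D \mid d) = \frac{\binom{n-1}{d-1}}{\binom{n-1}{d-1} + \binom{n-1}{d}} = \frac{d}{n}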

This formula simply says that the publication of the precinct-level vote count d allowed you to update your knowledge about Alice. You thought she was equally likely to vote for either party (which is reasonable if you know nothing about her), and now (after the publication of d) you estimate the probability that she voted D at d/n. Being able to infer such personal information about Alice can be concerning when you care about preserving privacy.

From this formula, we can trivially deduce that, if all votes in one precinct went to one candidate, Alice voted for that one candidate. This applies to the unanimous precincts listed above.

In other cases, we cannot be sure about Alice’s vote. Besides, she could deny any supposition we make, unless we know more about the other citizens. To model our knowledge about the other citizens, let’s introduce non-trivial priors for Alice and her neighbors. Let’s say that, a priori (before we see the results), we believe Alice will vote for D with probability p, and each of the others will vote for D with probability q. This prior knowledge may have been derived from the sociology of the precinct, from a poll, or from previous election results.
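In symbols, assuming the votes are drawn independently of one another (an assumption implicit in the rest of the computation):

\Pr(a \in D) = p, \qquad \Pr(i \in D) = q \quad \text{for every } i \neq a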

If we don’t know anything about Alice, we can set the prior p around 0.5 (each candidate is equally likely to win Alice’s vote). If we think most of the population in the precinct will vote for party R, we will set q to a small value.

Now the probability that Alice voted for D given the count d, also known as the posterior probability, is slightly more complex.
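One way to write it, using the fact that the number of D votes among the other n−1 citizens follows a binomial law with parameter q:

\Pr(a \in D \mid d) = \frac{p\,\binom{n-1}{d-1}\, q^{d-1} (1-q)^{n-d}}{p\,\binom{n-1}{d-1}\, q^{d-1} (1-q)^{n-d} + (1-p)\,\binom{n-1}{d}\, q^{d} (1-q)^{n-1-d}}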

An interesting application of this formula is when we know the behavior of the other voters with a high degree of certainty. If, for example, q goes to 0, we can show that if d=1 Alice certainly voted D, if d=0 Alice certainly voted R, and any other value of d is (almost) impossible.
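To see this from the posterior above, let q tend to 0: the binomial factors vanish unless they count zero D votes among the other citizens, so

\Pr(a \in D \mid d = 1) \to \frac{p \cdot 1}{p \cdot 1 + (1-p)\cdot 0} = 1, \qquad \Pr(a \in D \mid d = 0) \to \frac{p \cdot 0}{p \cdot 0 + (1-p)\cdot 1} = 0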

This illustrates the situation in precincts with overwhelming majorities. If we know that Alice lives in precinct 3930002 in Roberts County and we are sure that all the other citizens voted for Trump, then the single Clinton vote must be hers: we can be sure Alice voted Clinton.

Introducing Differential Privacy

Let’s consider the problem of publishing detailed election results for informative purposes, and its implications for privacy. We are not trying to make statements about voting mechanics, ballot transparency or legal disclosure obligations here, so let’s assume precinct-level results need not be published for any reason other than information and science.

Common sense would dictate publishing only the results of larger precincts, or grouping many smaller ones into larger entities, but this simple approach is flawed. First, if applied conservatively, it destroys information, because groups must be large enough that overwhelming majorities become very unlikely. Second, it does not prevent re-identification attacks using auxiliary information. As we have seen, leveraging auxiliary information enables precise inferences about individuals. Of course, the larger the precinct, the harder it gets to know the vote of everyone except Alice, but it is still a risk, and one that tends to increase as tracking technologies become more efficient and agents gathering data about individuals become more prevalent.

Luckily, a theoretical framework called Differential Privacy was developed to solve this kind of problem. It was introduced in 2006 by Cynthia Dwork and her co-authors. Explaining the concept in detail is beyond the scope of this post but, in short, a publication is 𝜀-differentially private if it has been perturbed randomly enough that one cannot infer anything significant about any single individual. The parameter 𝜀 is to be understood as a tolerance level: the smaller it is, the stronger the privacy.
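Formally, a randomized mechanism M is 𝜀-differentially private if, for any two datasets x and x′ that differ by a single individual’s data, and for any set S of possible outputs,

\Pr[M(x) \in S] \le e^{\varepsilon}\,\Pr[M(x') \in S]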

The vote count d of a precinct can be made 𝜀-differentially private by adding some Laplace random noise.
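Since adding or removing one ballot changes the count by at most 1 (sensitivity 1), the standard Laplace mechanism releases

\tilde{d} = d + \eta, \qquad \eta \sim \mathrm{Laplace}(1/\varepsilon), \quad \text{with density } \frac{\varepsilon}{2}\, e^{-\varepsilon |\eta|}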

By leveraging Bayes’ rule again, we can try to deduce something about Alice given the noisy count.
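Writing f(\tilde d \mid d) for the density of the noisy count given the exact one, the posterior takes the form:

\Pr(a \in D \mid \tilde d) = \frac{\sum_{d=0}^{n} f(\tilde d \mid d)\,\Pr(a \in D,\, d)}{\sum_{d=0}^{n} f(\tilde d \mid d)\,\Pr(d)}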

This can be rewritten more explicitly.
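With the Laplace density and the binomial terms B(k) = \binom{n-1}{k}\, q^{k} (1-q)^{n-1-k} written out (and the convention B(-1) = 0), it becomes:

\Pr(a \in D \mid \tilde d) = \frac{\sum_{d=0}^{n} e^{-\varepsilon |\tilde d - d|}\; p\, B(d-1)}{\sum_{d=0}^{n} e^{-\varepsilon |\tilde d - d|}\; \bigl[\, p\, B(d-1) + (1-p)\, B(d) \,\bigr]}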

Because we added random noise to the exact count, d is only imperfectly known, hence the exponentially weighted average over all possible values of d, which makes deductions about any one individual very imprecise.

Note that, when 𝜀 becomes very large (low privacy), the formula reduces to the simpler formula above (without differential privacy). More interestingly, if 𝜀 becomes small (high privacy), the formula gives the following limit.
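As 𝜀 goes to 0, the exponential weights all tend to 1, the binomial terms sum to 1, and the posterior collapses back to the prior:

\Pr(a \in D \mid \tilde d) \;\longrightarrow\; p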

Which means that absolutely nothing was learned about Alice from the election results.

Application

We can apply the formula above to 4 precincts with problematic features.

  • the 2 unanimous precincts above (precinct 930016 in Comanche County, Texas, and Detroit City precinct 181 in Wayne County, Michigan);
  • the 2 precincts with overwhelming majorities above (precinct 3930002 in Roberts County, Texas, and the CLEVELAND-07-Q precinct in Cuyahoga County, Ohio).

In both cases, the noise is sufficient to erase Alice’s contribution while not affecting the result at the scale of a state, as the small simulation sketched below illustrates.
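As an illustration, here is a minimal sketch (assuming numpy and the counts quoted earlier in this post, with 𝜀 = 0.2) of what a noisy publication of these four counts could look like:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
eps = 0.2                 # privacy parameter used in this post
scale = 1.0 / eps         # Laplace scale for a count of sensitivity 1

# (votes for the local winner, total votes), as quoted above
precincts = {
    "Comanche County TX, 930016 (Trump)": (69, 69),
    "Wayne County MI, Detroit 181 (Clinton)": (143, 143),
    "Roberts County TX, 3930002 (Trump)": (134, 135),
    "Cuyahoga County OH, CLEVELAND-07-Q (Clinton)": (425, 426),
}

for name, (winner_votes, total) in precincts.items():
    noisy = winner_votes + rng.laplace(0.0, scale)  # 𝜀-DP noisy count
    print(f"{name}: exact {winner_votes}/{total}, noisy ~ {noisy:.1f}")
```

With a scale of 1/𝜀 = 5, a typical draw moves each published count by a few votes, which is enough to give Alice plausible deniability in precincts this small.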

With 𝜀=0.2, the standard deviation of the noise added to the total number of votes is:

  • around 1,100 votes out of 14M for California
  • around 550 votes out of 9.5M votes for Florida
  • around 670 votes out of a total of 9M votes for Texas

It can be further reduced by taking a larger 𝜀, trading some privacy for more precision.
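A back-of-the-envelope sketch of where these orders of magnitude come from: the noise added to each precinct is an independent Laplace draw of scale 1/𝜀, whose standard deviation is √2/𝜀 ≈ 7 votes, and independent noise only grows with the square root of the number of precincts when summed over a state (the precinct counts below are rough figures assumed for illustration):

```python
import math

eps = 0.2
per_precinct_std = math.sqrt(2) / eps   # std of one Laplace(1/eps) draw, ~7.1 votes

# Approximate numbers of precincts per state, assumed for illustration only
precincts_per_state = {"California": 24_000, "Florida": 6_000, "Texas": 9_000}

for state, n_precincts in precincts_per_state.items():
    # Independent noise adds up in variance, hence the square root
    state_std = per_precinct_std * math.sqrt(n_precincts)
    print(f"{state}: ~{state_std:,.0f} votes of noise (std) on the state total")
```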

Conclusion

The main takeaway of this post is that publishing exact, detailed data always destroys privacy to some extent. Reducing the level of detail and adding a degree of randomness protects privacy but destroys some utility. Differential Privacy provides a systematic framework for making the best privacy-utility tradeoff.

In the case of election results, useful statistics published without some random perturbation are subject to re-identification risk. The risk is high when very little auxiliary information is necessary, e.g. in the case of unanimous precincts. It is more moderate when re-identification requires more auxiliary information, e.g. when the precinct is far from unanimous and the votes of many other citizens have to be known to deduce Alice’s vote.

Adding some random noise can significantly improve privacy protection while still allowing detailed analysis. Of course, the true counts could still be used to determine the results, or be published at an aggregate level (county or state) where the risk of re-identification is small. But precinct-level results should probably be published in a differentially private way.

Sarus Technologies provides tools for data practitioners to work on sensitive data assets without revealing the underlying data, with the formal guarantees provided by Differential Privacy.
