MIT Election Lab
Jul 11 · 5 min read

The MIT Election Data and Science Lab helps highlight new research and interesting ideas in election science, and is a proud co-sponsor of the Election Sciences, Reform, & Administration Conference (ESRA).

Seo-young Silvia Kim, Spencer Schneider, and Michael Alvarez recently presented a paper at the 2019 ESRA conference titled “Evaluating the Quality of Changes in Voter Registration Databases.” Here, they summarize their analysis from that paper.

Photo: Annie Bolin

Voter files are an important avenue for political research, and they are also crucial to the integrity of election administration, as they dictate who votes. The files change constantly, but not all changes are intentional or welcome. External intrusions into voter files were not a salient public issue until the 2016 presidential election (Sanger, 2018). Various media outlets and federal officials reported that foreign actors attempted to access voter data in a number of states, possibly to alter it in order to manipulate election outcomes or undermine public trust. Internal quality deterioration, on the other hand, occurs because voter files are large, dynamic, and complex. For example, in the June 2018 primary in California, as many as 77,000 records in the state’s system were inadvertently duplicated by the Department of Motor Vehicles; in the 2018 primary election in Los Angeles County, 118,000 voters were left off precinct rosters due to a merge error.

While election officials work to guard against cyberattacks and human error, there are calls for independent auditing of voter files and, more generally, for improving their quality (Alvarez et al., 2005, 2009; Ansolabehere & Hersh, 2010). Existing papers have analyzed the quality of voter files in their static form but, again, voter files are dynamic. In this paper, we present two methods that evaluate the quality of voter registration data as it changes over time, and that (1) increase assurance of voter file quality, and (2) provide new windows into the behavior and interaction of election administrators.

Using data from Orange County, California, we develop two methods for evaluating the quality of voter registration data as it changes over time:

  1. generating audit data by repeated record linkage across periodic snapshots of a given database, and monitoring it for sudden anomalous changes; and
  2. identifying duplicates via efficient, automated duplicate detection, and tracking new duplicates and deduplication efforts over time.

For the first, we use record linkage on 154 daily “snapshots” of the Orange County voter file from the 2018 election cycle. Record linkage is the task of identifying records from distinct databases that refer to the same real-world entity; in our case, a single voter. Because exact matching often fails due to typographical errors, we use probabilistic record linkage (Fellegi & Sunter, 1969; Herzog, 2007; Christen, 2012) as implemented in the open-source R package fastLink (Enamorado, Fifield, & Imai, 2018). Then, for variables such as voter ID, first and last name, street address, and partisan affiliation, we analyze the trend of changes over time and check for outliers using an interquartile range (IQR) method, one of the simplest first-stage checks that can be performed to detect anomalies.
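To make the IQR check concrete, here is a minimal sketch in Python. It assumes the daily number of changes to one field has already been counted from the linked snapshots. The counts, the function name `iqr_outliers`, and the conventional multiplier of 1.5 are illustrative assumptions, not values or code from the paper.

```python
from statistics import quantiles

def iqr_outliers(daily_counts, k=1.5):
    """Return the indices of days whose count falls outside the IQR fences."""
    # Quartiles of the daily counts; "inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(daily_counts, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, count in enumerate(daily_counts) if count < lower or count > upper]

# Hypothetical counts of address changes per daily snapshot (not data from the paper).
address_changes = [12, 15, 11, 14, 13, 16, 12, 240, 14, 13]
flagged = iqr_outliers(address_changes)
print(flagged)  # the day with 240 changes (index 7) is flagged
```

A flagged day is only a prompt for inspection. As the next paragraph notes, most spikes of this kind can turn out to be scheduled administrative work rather than a problem.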

Fortunately, the “anomalies” we detected turned out to be mostly pre-designated administrative schedules or traces of internal list maintenance. However, we discovered that internal IDs were not always perfectly consistent. The same person may have been assigned a new voter ID when submitting a re-registration. Our estimates show that there are more than 13,000 such cases. This is less than 1% of registrants, but researchers who plan to use voting history as a key covariate should take care, because it may skew their estimates: a single voter would be split into one voter who stopped voting after a certain point and another who newly started to vote. Most of the ID changes we observe arose when new information from statewide registration files was sent to the county’s election management system, relative to the ratio of statewide to countywide registrations among records without ID changes. Since internal IDs should ideally stay consistent, this is a possible indication that, despite the Registrar’s best efforts, some records are not merged as smoothly as they would be in a unified system.
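One rough way to surface candidate ID changes between two snapshots is to compare records whose ID disappeared with records whose ID newly appeared. The sketch below is a simplified stand-in for the probabilistic linkage used in the paper: it requires an exact match on date of birth and a fuzzy match on name using Python's standard `difflib`. The field names, the threshold of 0.85, and the sample records are hypothetical.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity between two names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def possible_id_changes(old, new, threshold=0.85):
    """Pair IDs that vanished from `old` with IDs that first appear in `new`."""
    dropped = {vid: rec for vid, rec in old.items() if vid not in new}
    added = {vid: rec for vid, rec in new.items() if vid not in old}
    pairs = []
    for old_id, o in dropped.items():
        for new_id, n in added.items():
            if o["dob"] != n["dob"]:
                continue  # assumption: date of birth must match exactly
            score = name_similarity(o["name"], n["name"])
            if score >= threshold:
                pairs.append((old_id, new_id, round(score, 2)))
    return pairs

# Hypothetical snapshots keyed by voter ID (not data from the paper).
old_snapshot = {
    "A1": {"name": "Maria Gonzalez", "dob": "1980-03-02"},
    "A2": {"name": "John Smith", "dob": "1975-07-19"},
}
new_snapshot = {
    "A2": {"name": "John Smith", "dob": "1975-07-19"},
    "B9": {"name": "Maria Gonzales", "dob": "1980-03-02"},
    "B10": {"name": "Lee Park", "dob": "1990-01-01"},
}
id_changes = possible_id_changes(old_snapshot, new_snapshot)
print(id_changes)  # A1 and B9 are paired as a likely single voter
```

Pairs found this way are candidates, not confirmed matches. Merging the voting histories of two IDs would still call for review.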

For the second, we develop an algorithm that efficiently determines which “blocks” to use in deduplication and minimizes the set of false positives, so that a valid voter is not accidentally deleted. Given a budget constraint, the algorithm determines the set of potential duplicates to inspect manually. Deduplication is a form of record linkage performed within a single database, and a block is a fixed set of criteria that determines whether a pair of records is potentially duplicated. For instance, if the block is date of birth and last name, only records that share the same date of birth and last name would be considered potential duplicates. The algorithm enables us to track these potential duplicates in all later snapshots, and to monitor the quantities of incoming duplicates and deduplication efforts, just as we monitored changes in the first part of the paper.
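The blocking idea can be illustrated with a short Python sketch. This is not the block-selection algorithm from the paper: the block here is fixed in advance, and the ranking by first-name similarity, the `budget` parameter, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def candidate_duplicates(records, block_fields, compare_field, budget):
    """Return up to `budget` record-ID pairs from within blocks, most similar first."""
    # Group records by the block key, e.g. (date of birth, last name).
    blocks = defaultdict(list)
    for rec in records:
        blocks[tuple(rec[f] for f in block_fields)].append(rec)

    scored = []
    for members in blocks.values():
        # Only records that share a block are ever compared with each other.
        for a, b in combinations(members, 2):
            score = SequenceMatcher(
                None, a[compare_field].lower(), b[compare_field].lower()
            ).ratio()
            scored.append((score, a["id"], b["id"]))

    scored.sort(reverse=True)
    return [(a_id, b_id) for _, a_id, b_id in scored[:budget]]

# Hypothetical records (not data from the paper).
voters = [
    {"id": 1, "first": "Anh", "last": "Nguyen", "dob": "1990-05-01"},
    {"id": 2, "first": "Ahn", "last": "Nguyen", "dob": "1990-05-01"},
    {"id": 3, "first": "Victoria", "last": "Nguyen", "dob": "1990-05-01"},
    {"id": 4, "first": "Carlos", "last": "Lopez", "dob": "1985-11-23"},
]
to_inspect = candidate_duplicates(voters, ("dob", "last"), "first", budget=1)
print(to_inspect)  # the pair (1, 2) is the single candidate within budget
```

Blocking keeps the number of comparisons manageable, because pairs are formed only inside each block rather than across the whole file. The budget then caps how many candidates reach a human reviewer.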

The time series of duplicates and deduplication efforts shows that the Registrar is actively working to counter incoming duplicates. Upon scrutiny, we found that most duplicates came from state-driven changes in the data, again relative to the ratio of statewide to countywide sources among non-duplicate new registrations. Again, this could be a signal of the strain from multiple governmental agencies sending updated information per the National Voter Registration Act of 1993, something that has yet to be explored in the election literature.
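A time series of this kind can be sketched by comparing the sets of candidate duplicate pairs found in consecutive snapshots. The Python example below is a minimal illustration under the assumption that record IDs stay stable between snapshots; the dates, pairs, and function name `duplicate_flow` are hypothetical.

```python
def duplicate_flow(pairs_by_snapshot):
    """Count candidate pairs that appear and disappear between consecutive snapshots."""
    flow = []
    dates = sorted(pairs_by_snapshot)  # ISO dates sort chronologically as strings
    for prev, curr in zip(dates, dates[1:]):
        before, after = pairs_by_snapshot[prev], pairs_by_snapshot[curr]
        flow.append({
            "date": curr,
            "new": len(after - before),       # pairs first seen in this snapshot
            "resolved": len(before - after),  # pairs no longer present
        })
    return flow

# Hypothetical candidate pairs per snapshot date (not data from the paper).
snapshots = {
    "2018-04-26": {(1, 2), (5, 9)},
    "2018-04-27": {(5, 9), (7, 8), (10, 11)},
    "2018-04-28": {(7, 8)},
}
flow = duplicate_flow(snapshots)
print(flow)
```

In this toy series, the second day brings two new pairs and resolves one, and the third day resolves two with none added. Plotting the two counts over time gives a view of incoming duplicates against cleanup effort.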

In conclusion, we show that audit data generated as in this paper can serve not only to evaluate voter file quality over time, but also as a novel source of data on election administration practices, such as the implementation of election laws and administrators’ interactions with other state-level agencies.

Seo-young Silvia Kim is a Ph.D. candidate in Social Sciences at the California Institute of Technology with research interests in American Politics and Political Methodology.

Spencer Schneider is a current undergraduate at the California Institute of Technology.

Michael Alvarez is a professor of political science at the California Institute of Technology, as well as the co-director of the Caltech/MIT Voting Technology Project.

MIT Election Lab

The MIT Election & Data Science Lab uses scientific principles to examine how elections are administered. We aim to improve the democratic experience for all U.S. voters, and serve as a bridge to like-minded researchers and practitioners. Visit us at

