Entity Resolution on Voter Registration Data

An Iterative Process

Published in

District Insights

9 min readMar 21, 2018

Entity resolution is a field that aims to find records in data sets that refer to the same entity by grouping and linking. Entity resolution is also called deduplication, merge purge, patient matching, etc. depending on the application. In voter registration, it is a useful technology to make sure voter rolls are up to date and can be used to see if a voter is registered in multiple states.

There are many challenges to applying entity resolution. Different data sources have varying schema, collection standards and methodologies. Even the task of standardizing the data to link entities from these sources can be a cumbersome task. To further complicate matters, there can be spelling errors, transposed characters, missing values, and other anomalies.

Why Voter Registration Data?

There has been a lot of discussion around voter fraud during the most recent Presidential election. The following article stated that there were 3 million people who were registered to vote in multiple states. Entity resolution is a tool that can be used to clean up voter registration data, ensure integrity in the election process, and prevent voter fraud, thereby helping to restore people’s confidence in the voting process.

Voter Registration Data Source

For the purposes of this project, DC voter registration data from 2014 was used. The data is available at this website.

All of the information was compiled into a zipped-up csv file that contained a header row and 469,610 rows of voter registration data. The following are the most significant fields related to a particular voter:

Since the data was provided as a csv file, it was an easy task to import the records into a table in a MySQL database. The data was imported into a database called dcvoterinfo and the table was named voterinfo. To make the task of entity resolution easier, certain fields were combined to form new fields based on logical groups. The fields LASTNAME, FIRSTNAME, MIDDLE, and SUFFIX were combined to create a new field called name. Similarly, the RES_HOUSE, RES_FRAC, RES_APT, RES_STREET, RES_CITY, RES_STATE, RES_ZIP, and RES_ZIP4 fields were combined to create a field called address.

There was one record with a value in the REGISTERED field where the recorded year was 2223, which was clearly an error. Also, there were 40 records with REGISTERED year 1900, all with the date 01/01/1900, which was most likely a collection error. Grouping records by year reveals that prior to 1968, there was very little data. In fact, the total number of records prior to 1968 was 123, while the number of records in just 1968 alone was 8,628. These were left in the data set and did not appear to cause any issues with the results.

Studying the data set revealed that the following fields were most likely to identify a unique entity: name, address, REGISTERED. Fields such as precinct and ward are dependent on the address field and are, therefore, redundant. One quick and dirty way to determine if there were possible duplicates was to use a SQL group by query to see if there was more than one record that matched an attribute or a set of attributes. The following is a query that was run to identify potential duplicates using the name and address fields:

select name, address, count(*)
from voterinfo
group by 1, 2
having count(*) > 1;

The above query generated 703 results — most resulted in two matches per group, and there were four groups that had three matches. The following is a snapshot of the results:

Looking at the records for RUSSELL D SIMMONS revealed that the value for REGISTERED was different in these records: the first REGISTERED date was 3/23/1968, the second was 10/12/1996, and the third was 5/9/2006. It is possible that a father and son lived at the same address, but since we had three matches, it was possible that two of the records refer to the same entity.

Adding REGISTERED to the sql query found 10 results with the same values for name, address and REGISTERED. The following are the records:

The records for AKIL A LASTER showed that there were two records with a REGISTERED date of 11/4/2014 and another record with a REGISTERED date of 11/12/2014. It was very likely that these three records refered to the same person.

Using DeDupe

Without domain knowledge regarding the data set, we had to make some assumptions on how to identify duplicate records. Since the data spans several years, it was possible for someone to have moved and to have more than one address. But in the absence of the residential address history and to simplify the analysis, we made the assumption that name and address will uniquely identify an entity.

While the sql queries revealed duplicates, the data set probably had other duplicates that were not caught because of misspellings in the name and/or address fields. This was where Dedupe came in. Dedupe is a Python library that performs entity resolution.

Since the data set is in MySQL, it was a pretty straightforward task to use the mysql_example.py code from the dedupe-examples package. Some changes had to be made to the code, namely the data source, table name, fields etc.

The following were the fields selected from the voterinfo table:

VOTER_SELECT = "SELECT voterid, REGISTERED, name, address, " \
               "precinct, ward, anc, smd from voterinfo"# ## Trainingif os.path.exists(settings_file):
    print('reading from ', settings_file)
    with open(settings_file, 'rb') as sf :
        deduper = dedupe.StaticDedupe(sf, num_cores=4)
else:    # Define the fields dedupe will pay attention to
    #
    fields = [{'field' : 'name', 'variable name' : 'name',
               'type': 'String'},
              {'field' : 'address', 'variable name' : 'address',
              'type': 'String'},
              {'field' : 'REGISTERED', 'variable name' : 'REGISTERED',
              'type': 'String'},
              {'type' : 'Interaction',
               'interaction variables' : ['name', 'address']}
              ]

Running the code against the data set resulted in 8,111 clusters. Several of these clusters contained 5 or more duplicate entities, which seemed excessive. It also took several hours to run. Upon examining the results, it was clear that name was over weighted. All the clusters had either the exact same name or names that were very close (such as Craig C Johnson Sr and Craig C Jonhson) regardless of the value in the address field. The following is a sample of the results:

Forest Gregg, one of the creators of Dedupe, suggested installing and using dedupe-variable-name as the data set contained American names and would perform better in this example because there were multiple people who share the same address. He also suggested removing the Interaction variable.

For people with Mac operating systems who have trouble installing dedupe-variable-name because of a gcc error, the following modification can be made to get past the error:

env CC=/usr/bin/gcc pip install dedupe-variable-name

In order to use dedupe-variable-name, the following additional import statement was added to the code and the name field was modified to use the “Name” type from this package.

from dedupe.variables.name import WesternNameType
    fields = [{'field' : 'name', 'variable name' : 'name',
               'type': 'Name'},
              {'field' : 'address', 'variable name' : 'address',
              'type': 'String'},
              {'field' : 'REGISTERED', 'variable name' : 'REGISTERED',
              'type': 'String'}
              ]

This iteration produced 1,250 clusters and took 52 minutes to run, which was a significant improvement in performance.

Since Dedupe also provides a special variable type called dedupe-variable-address, in the third iteration, the code was modified to include this package to see if the results can be further improved. The address field was modified to use the “Address” type.

from dedupe.variables.address import USAddressTypeThe field list was modified as follows:    fields = [{'field' : 'name', 'variable name' : 'name',
               'type': 'Name'},
              {'field' : 'address', 'variable name' : 'address',
              'type': 'Address'},
              {'field' : 'REGISTERED', 'variable name' : 'REGISTERED',
              'type': 'String'}
              ]

Adding dedupe-variable-address resulted in 1,337 clusters and took 50 minutes to run, which was somewhat similar to the previous example.

Comparison of Results

Remember that we made an assumption earlier that name and address will uniquely identify an entity.

Taking that approach, the first set of results (produced without using dedupe-variable-name and dedupe-variable-address) was not meaningful because it identified duplicates solely on name.

The second and third sets of results were very similar — both matched on 1,100 clusters containing a total of 2,224 records. There were 150 clusters in the second result set that were not in the third result set while there were 237 clusters in the third result set that were not in the second result set.

Overall, both did a pretty good job in identifying records that had the same name and address, such as the following:

There were also a number of records that they identified where there were spelling mistakes and inverted name mistakes. The following are a couple of examples.

However, there were also a number of records where names that had a suffix were matched up with names that did not. The following are a few cases.

There were also cases where records matched on last name and address but had completely different first names that were put together in a cluster.

Looking at the results that were only in the second result set, it appeared that it contained a number of records that had similarities in name but were not exact. The following figure shows some examples. All these records should have been in different clusters.

What was interesting was that there were some exact matches on name and address (but different values for REGISTERED) that were identified as belonging to the same cluster in the second result set but not so in the third result set. The following is an example:

There were similar patterns in the records that were only in the third result set. Examples include: clusters with similarities in name, and clusters that matched on name and address but had different values for REGISTERED. In other words, there were cases where the third result set identified clusters while the second result set did not.

The third result set identified a couple of clusters where the address was the same but the names were not the same. The following is an example.

Conclusion

Overall, Dedupe does a very good job of identifying duplicate entities based on name and address. It is able to effectively identify slight variations in the name such as spelling errors and inverted first name and last name. In that sense, it does a much better job than simply identifying exact matches using a group by SQL query or an Excel remove duplicates function. It appears, however, that it can use some improvement in discarding cases where there is more variation, such as examples where the first name is clearly different. The dedupe-variable-name package accounts for suffixes but it tends to cluster names with suffixes along with names that don’t. A potential enhancement can be made to prevent such entities from getting clustered together.

There have been several articles such as this one, which stated that people were registered to vote in more than one state. This was possible as people moved they failed to tell the voter registration agency in their old state that they have moved. The results that Dedupe produced seem to suggest that people have also registered to vote more than once in the same state. Further investigation needs to be conducted to determine if this was truly the case. If so, states can begin to use entity resolution to take measures to prevent this double-registering from happening in the future. In addition, entity resolution tools such as Dedupe can be combined with record linkage to identify voters who are registered in multiple states. Utilizing entity resolution to clean up our voter registries would be an important first step towards making our elections more fair and transparent.

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!