Basics of Entity Resolution with Python and Dedupe

District Data Labs
District Insights
Published in
18 min readJan 3, 2018


By Kyle Rossetti and Rebecca Bilbro

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let’s define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren’t the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

  1. Deduplication: eliminating duplicate (exact) copies of repeated data.
  2. Record linkage: identifying records that reference the same entity across different sources.
  3. Canonicalization: converting data with more than one possible representation into a standard form.

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

About Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn’t the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it’s primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they’ve never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let’s start by walking through the from the dedupe-examples. To get Dedupe running, we’ll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone
cd dedupe-examples
pip install unidecode
pip install future
pip install dedupe

Then we’ll run the file to see what dedupe can do:


Blocking and Affine Gap Distance

Let’s imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we’ve been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we’re a great business, so we have a lot of repeat customers! We’d like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer’s information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
{'field' : 'Name', 'type': 'String'},
{'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
{'field' : 'Address', 'type': 'String', 'has missing' : True},
{'field' : 'Purchases', 'type': 'String'},

Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking , affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe’s method of blocking involves engineering subsets of feature vectors (these are called ‘predicates’) that can be compared across records. In the case of our people dataset above, the predicates might be things like:

  • the first three digits of the phone number
  • the full name
  • the first five characters of the name
  • a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper.

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

  • (y)es: confirms that the two references are to the same entity
  • (n)o: labels the two references as not the same entity
  • (u)nsure: does not label the two references as the same entity or as different entities
  • (f)inished: ends the active learning session and triggers the supervised learning phase

As you can see in the example above, some comparisons decisions are very easy. The first contains zero for zero hits on all four attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity; Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It’s important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Address', 'type': 'String'},
{'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
{'field' : 'Phone', 'type': 'String', 'has missing' : True},

A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors’ log. Our hypothesis was that it would be interesting to be able to answer questions such as “How many times has person X visited the White House during administration Y?” However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here’s a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Loading the Data

Using the API, the White House Visitor Log Requests can be exported in a variety of formats to include, .json, .csv, and .xlsx, .pdf, .xlm, and RSS. However, it’s important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requestsdef getData(url,fname):
Download the dataset from the webpage.
response = requests.get(url)
with open(fname, 'w') as f:
DATAURL = ""ORIGFILE = "fixtures/whitehouse-visitors.csv"getData(DATAURL,ORIGFILE)

Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

Tailoring the Code

Next, we’ll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime
conn = NoneDATABASE = your_db_name
USER = your_user_name
HOST = your_hostname
PASSWORD = your_password
conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST, password=PASSWORD)
print ("I've connected")
print ("I am unable to connect to the database")
cur = conn.cursor()

The other challenge with our dataset are the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python’s dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data in as input, parses the datetime fields we’re interested in (‘lastname’,’firstname’,’uin’,’apptmade’,’apptstart’,’apptend’, ‘meeting_loc’.), and outputs a database table that retains the desired columns. Keep in mind this will take a while to run.

def dateParseSQL(nfile):
cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
lastname varchar,
firstname varchar,
uin varchar,
apptmade varchar,
apptstart varchar,
apptend varchar,
meeting_loc varchar);''')
with open(nfile, 'rU') as infile:
reader = csv.reader(infile, delimiter=',')
next(reader, None)
for row in reader:
for field in DATEFIELDS:
if row[field] != '':
dt = parser.parse(row[field])
row[field] = dt.toordinal() # We also tried dt.isoformat()
sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
VALUES (%s,%s,%s,%s,%s,%s,%s)"
cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
print ("All done!")

About 60 of our rows had ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';

For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov’s adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os
import dedupe
import psycopg2
from psycopg2.extras import DictCursor

Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'
FIELDS = [{'field': 'firstname', 'variable name': 'firstname',
'type': 'String','has missing': True},
{'field': 'lastname', 'variable name': 'lastname',
'type': 'String','has missing': True},
{'field': 'uin', 'variable name': 'uin',
'type': 'String','has missing': True},
{'field': 'meeting_loc', 'variable name': 'meeting_loc',
'type': 'String','has missing': True}

We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
lset = set
block_id = None
records = []
i = 0
for row in result_set:
if row['block_id'] != block_id:
if records:
yield records
block_id = row['block_id']
records = []
i += 1
if i % 10000 == 0:
print ('{} blocks'.format(i))
smaller_ids = row['smaller_ids']
if smaller_ids:
smaller_ids = lset(smaller_ids.split(','))
smaller_ids = lset([])
records.append((row[KEY_FIELD], row, smaller_ids)) if records:
yield records

And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
deduper = dedupe.Dedupe(FIELDS)
with psycopg2.connect(database=args.dbname,
cursor_factory=DictCursor) as con:
with con.cursor() as c:
c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
row = c.fetchone()
count = row['count']
sample_size = int(count * args.sample)
print ('Generating sample of {} records'.format(sample_size))
with con.cursor('deduper') as c_deduper:
c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
temp_d = dict((i, row) for i, row in enumerate(c_deduper))
deduper.sample(temp_d, sample_size)
if os.path.exists(
print ('Loading training file from {}'.format(
with open( as tf:
print ('Starting active learning')
print ('Starting training')
deduper.train(ppc=0.001, uncovered_dupes=5)
print ('Saving new training file to {}'.format(
with open(, 'w') as training_file:
deduper.cleanupTraining() print ('Creating blocking_map table')
CREATE TABLE blocking_map
(block_key VARCHAR(200), %s INTEGER)
""" % KEY_FIELD)
for field in deduper.blocker.index_fields:
print ('Selecting distinct values for "{}"'.format(field))
c_index = con.cursor('index')
""" % (field, SOURCE_TABLE))
field_data = (row[field] for row in c_index)
deduper.blocker.index(field_data, field)
print ('Generating blocking map')
c_block = con.cursor('block')
full_data = ((row[KEY_FIELD], row) for row in c_block)
b_data = deduper.blocker(full_data)
print ('Inserting blocks into blocking_map')
csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
csv_writer = csv.writer(csv_file)
f = open(, 'r')
c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
os.remove( con.commit() print ('Indexing blocks')
CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
c.execute("DROP TABLE IF EXISTS plural_key")
c.execute("DROP TABLE IF EXISTS plural_block")
c.execute("DROP TABLE IF EXISTS covered_blocks")
c.execute("DROP TABLE IF EXISTS smaller_coverage")
print ('Calculating plural_key')
CREATE TABLE plural_key
(block_key VARCHAR(200),
INSERT INTO plural_key (block_key)
SELECT block_key FROM blocking_map
GROUP BY block_key HAVING COUNT(*) > 1
print ('Indexing block_key')
CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
print ('Calculating plural_block')
CREATE TABLE plural_block
AS (SELECT block_id, %s
FROM blocking_map INNER JOIN plural_key
USING (block_key))
""" % KEY_FIELD)
print ('Adding {} index'.format(KEY_FIELD))
CREATE INDEX plural_block_%s_idx
ON plural_block (%s)
CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
ON plural_block (block_id, %s)
print ('Creating covered_blocks')
CREATE TABLE covered_blocks AS
string_agg(CAST(block_id AS TEXT), ','
ORDER BY block_id) AS sorted_ids
FROM plural_block
print ('Indexing covered_blocks')
CREATE UNIQUE INDEX covered_blocks_%s_idx
ON covered_blocks (%s)
print ('Committing')
print ('Creating smaller_coverage')
CREATE TABLE smaller_coverage AS
(SELECT %s, block_id,
TRIM(',' FROM split_part(sorted_ids,
CAST(block_id AS TEXT), 1))
AS smaller_ids
FROM plural_block
INNER JOIN covered_blocks
USING (%s))
print ('Clustering...')
c_cluster = con.cursor('cluster')
FROM smaller_coverage
USING (%s)
ORDER BY (block_id)
clustered_dupes = deduper.matchBlocks(
candidates_gen(c_cluster), threshold=0.5)
print ('Creating entity_map table')
c.execute("DROP TABLE IF EXISTS entity_map")
CREATE TABLE entity_map (
canon_id INTEGER,
cluster_score FLOAT,
print ('Inserting entities into entity_map')
for cluster, scores in clustered_dupes:
cluster_id = cluster[0]
for key_field, score in zip(cluster, scores):
INSERT INTO entity_map
(%s, canon_id, cluster_score)
VALUES (%s, %s, %s)
""" % (KEY_FIELD, key_field, cluster_id, score))
print ('Indexing head_index')
c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
parser.add_argument('-t', '--training', default='training.json', help='name of training file')
args = parser.parse_args()

Active Learning Observations

We ran multiple experiments:

  • Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
  • Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) — one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames or that could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe for use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase. Thus, the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH
lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB
lastname : DeSimone
firstname : Daniel
meeting_loc : WH
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, have the same meeting location and the same uin (which confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House like staffers and other political appointees.

During the dedupe process, we computed there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671
lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692
lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085
lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121
lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331


In this beginners guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Recommended Reading

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!



District Data Labs
District Insights

Data science consulting and corporate training. Take your analytics to the next level.

Recommended from Medium


See more recommendations