Entity Resolution — An Introduction

Finding records that refer to the same real-world object

Adrian Evensen
Jan 24, 2024
Spider-Man pointing at Spider-Man. From: Spider-Man (1967) — Double Identity

What is Entity Resolution?

Entity Resolution, also known as Data Matching, addresses the challenge of matching and merging records that correspond to the same real-world object. It offers valuable insights, efficiency, and accuracy in data management and analysis, making it highly useful for various applications such as data integration in healthcare, fraud detection, product matching, mergers and acquisitions, risk assessment, supply chain management, and data quality improvement. The process of Entity Resolution involves selecting and comparing pairs of records to determine whether they are a match or not.

Entity resolution can be further divided into the subcategories of Record Linkage and Deduplication (See Figure 1). Record Linkage focuses on finding similar records in two or more datasets, while Deduplication aims to identify matches within a single dataset [1].

Fig 1: Illustration of Record Linkage and Deduplication. By author.

The workflow of Entity Resolution is typically constructed as a sequence of the following processes: Preprocessing, Blocking (aka Indexing), Comparison, Matching (aka Classification), and Clustering (See Figure 2). The main objective of this pipeline is to achieve high accuracy in identifying matches while controlling the time complexity of comparing candidate pairs.

Fig 2: Illustration of an entity resolution pipeline. By author.

Preprocessing

When working with multiple datasets, each dataset might be structured differently, making it challenging to compare records. Different names might be used for the same type of attribute, or the format of attribute values might vary, as with dates (MM/DD/YY vs DD-MM-YY). The datasets therefore have to be standardized so that they share a common structure suitable for comparison.

In addition to standardizing the datasets, there are several preprocessing, data cleaning, and normalization techniques that can be applied. To name a few: converting strings to lower case, removing stopwords, stemming, or using regular expressions to manipulate the data.
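To make this concrete, here is a minimal sketch of such normalization with pandas. The datasets, column names, and date formats are made up for the example; the point is simply that each source is mapped onto one shared schema and format.

```python
import pandas as pd

# Hypothetical example: two sources that store the same attributes differently.
df_a = pd.DataFrame({"first_name": ["  Anna ", "JON"], "dob": ["01/31/99", "12/07/01"]})      # MM/DD/YY
df_b = pd.DataFrame({"given_name": ["anna", "Jon B."], "birth_date": ["31-01-99", "07-12-01"]})  # DD-MM-YY

def normalize(df, name_col, date_col, date_fmt):
    out = pd.DataFrame()
    # Harmonize attribute names and clean up strings (trim, lower case, drop punctuation).
    out["first_name"] = (
        df[name_col].str.strip().str.lower().str.replace(r"[^a-z ]", "", regex=True)
    )
    # Parse the source-specific date format into one canonical representation.
    out["dob"] = pd.to_datetime(df[date_col], format=date_fmt).dt.strftime("%Y-%m-%d")
    return out

df_a = normalize(df_a, "first_name", "dob", "%m/%d/%y")
df_b = normalize(df_b, "given_name", "birth_date", "%d-%m-%y")
print(pd.concat([df_a, df_b], ignore_index=True))
```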

Phonetic Encoding

Phonetic encoding is a technique used to identify records that sound similar despite being spelled differently. This is done by applying a set of rules that compress a value into a short expression. Phonetic encoding is essentially feature engineering: it can serve as a precursor for Blocking, for grouping similar encodings, or for comparing encodings for similarity.

The Soundex algorithm is the most commonly used phonetic encoding algorithm. It is based on American-English pronunciation and works by retaining the first letter of a string and converting the remaining characters into numbers according to a set of rules. These rules map similar-sounding letters to the same number, ignore certain characters (vowels, ‘h’, ‘w’, and ‘y’), and eliminate repetitions of the same number. The resulting encodings are normalized to three digits, either by truncation or by appending ‘0’ (See Figure 3).

Fig 3: Image of strings and their phonetic encodings using different algorithms including Soundex. The table is taken from [1] Table 4.1.
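To illustrate the rules described above, here is a simplified pure-Python Soundex; it skips some edge cases of the official specification, and libraries such as jellyfish ship production-ready implementations of Soundex and related encodings.

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter followed by three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    digits = []
    prev = codes.get(name[0], "")   # the first letter's code is tracked so it is not repeated
    for ch in name[1:]:
        if ch in "hw":              # 'h' and 'w' are ignored and do not break a run of equal codes
            continue
        code = codes.get(ch, "")    # vowels and 'y' map to "" and do break a run
        if code and code != prev:
            digits.append(code)
        prev = code
    return name[0].upper() + "".join(digits)[:3].ljust(3, "0")

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))  # R163 R163 A261
```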

Blocking

Blocking is a process where records are grouped together according to specific criteria, creating blocks of candidate pairs. Entity Resolution suffers from quadratic time complexity, as each record potentially needs to be compared with every other record. This leads to many unnecessary computations, since very few of these comparisons are likely to be true matches. Blocking aims to reduce the number of candidate pairs to compare by grouping records that are more likely to match (See Figure 4).

Fig 4: Example of Standard Blocking on the attribute ‘first_name’. By author, inspired by the Splink GitHub repository.

The main objective of blocking is to cover as many true matches as possible (keeping false negatives low) while generating as few non-matching candidate pairs (false positives) as possible. This can be achieved via non-learning methods (most common) or via learning-based methods, both supervised and unsupervised. Blocking can be further divided into Block Building and Block Processing (See Figure 5) [2, 3].

Fig 5: Pipeline illustrating Block Building and Block Processing. By author.

The Block Building stage takes one or more sets of records stored as dataframes and constructs blocks of candidate pairs from them. Some well established Block Building methods are:

  • Standard Blocking: Also known as Attribute Equivalence; for every distinct Blocking Key Value (BKV), or attribute value, a block is created containing all corresponding records (a small sketch follows this list).
  • Suffix Blocking: Converts each BKV into the list of its suffixes longer than a minimum length and generates blocks from these suffixes.
  • Sorted Neighborhood: Sorts the records alphabetically according to a BKV, then slides a fixed-size window over the sorted list; the records inside the window form a block.
  • Q-grams Blocking: Each BKV is converted into a list of q-grams. Sub-lists are then generated by combining these q-grams down to a certain minimum length, and blocks are generated from them.
  • Canopy Clustering: Creates blocks by randomly selecting a record from a pool. It then compares the BKV of the selected record with those of the remaining records using a low-cost similarity function. Records that score above a loose threshold are clustered together with it, those above a tight threshold are also removed from the pool, and the rest are returned to the pool. This creates overlapping blocks and therefore more candidate pairs, but it can capture matches that would otherwise be overlooked.
  • Locality Sensitive Hashing (LSH): Transforms a BKV into a bag of shingles (either tokens or q-grams). The shingles are transformed via MinHash into a hash code, which is used to generate blocks. Similar items have a much higher probability of being mapped to the same hash than dissimilar ones.
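To make the idea concrete, below is a rough sketch of Standard Blocking for deduplication with pandas; the records and column names are made up for the example.

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "first_name": ["john", "john", "jon", "mary", "john"],
    "surname": ["smith", "smyth", "smith", "jones", "smith"],
})

# Standard Blocking: the Blocking Key Value (BKV) is the 'first_name' attribute.
candidate_pairs = []
for _, block in df.groupby("first_name"):
    # Only records sharing the same BKV are paired with each other.
    candidate_pairs.extend(itertools.combinations(block["id"], 2))

print(candidate_pairs)  # e.g. [(1, 2), (1, 5), (2, 5)]
```

Blocking on a single attribute is crude; in practice one would often block on several keys (or on a phonetic encoding of a key) and take the union of the resulting candidate pairs.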

Block Processing is an optional stage that refines the selection of candidate pairs through additional optimizations. This may involve discarding entire blocks that contain unnecessary comparisons (e.g. very large blocks) and/or discarding individual candidate pairs within certain blocks (e.g. candidate pairs that appear in multiple blocks). The former is known as Block Cleaning, the latter as Comparison Cleaning [2].
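As a trivial, self-contained sketch of Block Cleaning (the block contents and cut-off are made up), oversized blocks can simply be skipped before candidate pairs are generated:

```python
import itertools

def pairs_from_blocks(blocks, max_block_size=100):
    """Block Cleaning: skip oversized blocks entirely before forming candidate pairs."""
    for block in blocks:
        if len(block) > max_block_size:
            continue  # too many comparisons for too little gain
        yield from itertools.combinations(block, 2)

# One small block and one block that is too large to be worth comparing exhaustively.
blocks = [[1, 2, 3], list(range(1000))]
print(len(list(pairs_from_blocks(blocks))))  # only the small block contributes: 3 pairs
```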

Filtering

Filtering is a technique that attempts to quickly discard candidate pairs that are very unlikely to match. The goal is to remove as many false-positive candidate pairs as possible before computing the actual similarity between records (See Figure 6). Filtering methods usually work by comparing a candidate pair’s signatures using a cheap similarity function. If the similarity score falls below a certain threshold, the corresponding candidate pair is discarded [2, 3].

Fig 6: Illustration of Blocking and Filtering motivation as a Venn diagram. By author, inspired by [2], Figure 1(c).

Filtering is fairly similar to Blocking: both aim to reduce the number of candidate pairs before computing their actual similarity. Filtering can be applied directly to the input data or used in conjunction with Blocking, further reducing the number of candidate pairs within a block. Filtering is a less recognized technique for generating candidate pairs due to the popularity and efficiency of Blocking. However, there is a wide selection of filtering methods to choose from [2].
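As a simple illustration of the idea, the sketch below uses a size-based bound (one of the cheapest signatures available) to discard pairs whose token sets cannot reach a chosen Jaccard threshold before the full similarity is ever computed. The names and threshold are made up.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def passes_size_filter(a: set, b: set, threshold: float) -> bool:
    """Cheap upper bound: Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|).
    If even the bound is below the threshold, the pair cannot match."""
    small, large = sorted((len(a), len(b)))
    return small / large >= threshold

threshold = 0.6
pairs = [
    ({"jonathan", "smith"}, {"jonathan", "a", "smith"}),  # plausible match
    ({"jonathan", "smith"}, {"j"}),                       # sizes 2 vs 1 cannot reach 0.6
]
for a, b in pairs:
    if passes_size_filter(a, b, threshold):
        print("compare:", round(jaccard(a, b), 2))  # only survivors get the real comparison
    else:
        print("discarded by filter")
```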

Comparison

In the Comparison step, the similarity between the records of each candidate pair is computed. This step is the most computationally expensive and essentially the bottleneck of Entity Resolution. Computing similarity scores quickly becomes costly for larger datasets, as the number of candidate pairs grows quadratically unless blocking or filtering is applied.

For each candidate pair generated from previous steps, a comparison vector is computed. The comparison vector holds the similarity score for every attribute compared (see Figure 7).

Fig 7: Illustration going from Candidate pairs to Comparison vector. By Author.

The comparison vector forms the basis for deciding whether a candidate pair should be classified as a match or a non-match. Depending on the attribute’s data type, there are a handful of similarity functions to choose from (a small sketch of building a comparison vector follows these lists):

Text and set similarity functions:
Overlap, Jaccard, Dice and TF-IDF + Cosine.

String similarity functions:
Jaro-Winkler, Exact, Levenshtein, Smith-Waterman, Affine, Soft-TFIDF, Monge-Elkan and Longest Common Substring.

Numerical similarity functions:
Maximum Absolute Difference and Maximum Percentage Difference.

Date, Age and Time:
Both string-based and numerical-based functions can be used to estimate similarity of date, age and time.
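As a rough sketch of this step, the snippet below builds a comparison vector for one hypothetical candidate pair, using a normalized Levenshtein similarity for strings and a simple score based on a maximum allowed absolute difference for numbers; the attributes and parameters are chosen purely for illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def string_sim(s1: str, s2: str) -> float:
    """Normalize edit distance into a similarity in [0, 1]."""
    if not s1 and not s2:
        return 1.0
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

def numeric_sim(x: float, y: float, max_abs_diff: float) -> float:
    """1 if identical, decreasing linearly to 0 at the maximum allowed difference."""
    return max(0.0, 1 - abs(x - y) / max_abs_diff)

# One hypothetical candidate pair and its comparison vector.
rec_a = {"first_name": "jonathan", "surname": "smith", "birth_year": 1985}
rec_b = {"first_name": "jonathon", "surname": "smyth", "birth_year": 1986}

comparison_vector = [
    string_sim(rec_a["first_name"], rec_b["first_name"]),
    string_sim(rec_a["surname"], rec_b["surname"]),
    numeric_sim(rec_a["birth_year"], rec_b["birth_year"], max_abs_diff=10),
]
print(comparison_vector)  # [0.875, 0.8, 0.9]
```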

Matching

The matching step is essentially a binary classification task that determines whether candidate pairs are matches or non-matches. Any machine learning model, whether supervised or unsupervised, can be used to make predictions on the comparison vector. Entity Resolution-specific models have been developed for classifying candidate pairs. These models can be roughly categorized into collective-based methods, which combine information across multiple records to make decisions, and attribute-based methods (also known as learning-based methods), which make decisions based on individual entity attributes [3].

One machine learning model for Entity Resolution worth mentioning is the Fellegi-Sunter (FS) model. This model offers a framework to estimate the weight an attribute contributes when predicting a match or non-match (See Figure 8).

Fig 8: A waterfall chart of match weights given different attributes for an Entity Resolution problem. The figure is taken from Splink’s documentation.

The framework can also be extended to take attribute value frequencies into account. For instance, the surname ‘Smith’ is usually more common than ‘Dijkstra’, hence a match on the latter should be weighted more heavily. These are known as frequency-based weights. Additionally, attribute weights can be estimated in an unsupervised manner using the Expectation-Maximization algorithm [4]. This is very convenient, as in most Entity Resolution cases the true matches are unknown. (For interested readers, I recommend Robin Linacre’s blog series on the FS model.)

The Fellegi-Sunter model has been implemented in multiple Python libraries. Splink fully revolves around the FS model, while RecordLinkage provides an API to implement it.
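To give a flavour of the model (this is not Splink’s or RecordLinkage’s API, just the underlying arithmetic), the sketch below computes match weights from assumed m- and u-probabilities: m is the probability that an attribute agrees given that the records truly match, and u is the probability that it agrees given that they do not.

```python
import math

# Assumed m/u probabilities per attribute (in practice estimated, e.g. with EM).
params = {
    "first_name": {"m": 0.95, "u": 0.01},
    "surname":    {"m": 0.97, "u": 0.005},
    "dob":        {"m": 0.99, "u": 0.001},
}

def match_weight(attribute: str, agrees: bool) -> float:
    """log2 Bayes factor contributed by one attribute comparison."""
    m, u = params[attribute]["m"], params[attribute]["u"]
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

# Candidate pair where first name and dob agree but surname does not.
observed = {"first_name": True, "surname": False, "dob": True}
total = sum(match_weight(attr, agrees) for attr, agrees in observed.items())
print(round(total, 2))  # overall log2 Bayes factor; higher means more likely a match
```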

Active Learning

The idea behind Active Learning is to iteratively guide a machine learning model with human feedback. The goal is to obtain a model with relatively high performance while labeling as few training samples as possible. This is achieved by having the model present the training samples it struggles to classify. Active Learning can be applied both to the classification step and to Blocking.

Active Learning is a very viable approach to Entity Resolution, which often lacks ground truth, or a ‘gold standard’. The Python library Dedupe uses active learning as its primary strategy for entity resolution.
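A bare-bones sketch of such an uncertainty-sampling loop (not Dedupe’s actual API) could look as follows, using scikit-learn on made-up comparison vectors: in each round the model is retrained and the pair it is least certain about is sent to a (here simulated) human for labeling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pool of unlabeled comparison vectors (one row per candidate pair).
pool = rng.random((200, 3))

# Seed with a few labeled examples (in practice labeled by a human reviewer).
X_train = np.array([[0.9, 0.95, 1.0], [0.1, 0.2, 0.0], [0.85, 0.9, 1.0], [0.2, 0.1, 0.0]])
y_train = np.array([1, 0, 1, 0])

for _ in range(5):  # a handful of active-learning rounds
    model = LogisticRegression().fit(X_train, y_train)
    probs = model.predict_proba(pool)[:, 1]
    idx = int(np.argmin(np.abs(probs - 0.5)))  # the pair the model is least sure about
    label = int(pool[idx].mean() > 0.5)        # stand-in for the human reviewer's answer
    X_train = np.vstack([X_train, pool[idx]])
    y_train = np.append(y_train, label)
    pool = np.delete(pool, idx, axis=0)
```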

Clustering

Clustering identifies groups of records that likely refer to the same entity. The collection of matched candidate pairs is represented as a similarity graph, where nodes correspond to records and edges represent the likelihood of a match between records (See Figure 9). Clustering can identify additional connections based on indirect relations between nodes, accounting for variations, errors, or inconsistencies in the data, and can remove weaker connections. The result is clusters of records that likely refer to the same real-world entity.

Some clustering methods include Markov Clustering, Cut Clustering, and Center- and Merge-Center Clustering [3].

Fig 9: Illustration of how records can be linked as the same entity. By author, inspired by the Splink GitHub repository.
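The simplest clustering strategy is to take the connected components of the similarity graph formed by the matched pairs; the methods listed above refine this idea, for instance by pruning weak edges first. Here is a small union-find sketch on made-up pairs:

```python
def connected_components(num_records: int, matched_pairs):
    """Group record ids (0..num_records-1) into entity clusters via union-find."""
    parent = list(range(num_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in range(num_records):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

# Records 0, 1, 2 are linked transitively; records 3 and 4 form a second cluster.
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4]]
```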

Canonicalization

Canonicalization (aka Canonical Entities) refers to the process of merging a cluster of records belonging to the same entity into a unified representation with maximal information. In certain cases, it can also accelerate Entity Resolution by reducing the number of potential comparisons as records are merged. Although canonicalization is not extensively covered, [5] provides some methods for interested readers.
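As a naive illustration, one way to build a canonical record is to take, for each attribute, the most frequent non-empty value within the cluster, falling back to the longest value to prefer the most complete entry; the records below are made up, and real systems use more careful merge rules.

```python
from collections import Counter

def canonicalize(cluster: list[dict]) -> dict:
    """Merge a cluster of records into one canonical record by majority vote,
    falling back to the longest value when there is no majority."""
    canonical = {}
    for attr in cluster[0]:
        values = [rec[attr] for rec in cluster if rec.get(attr)]
        most_common, count = Counter(values).most_common(1)[0]
        canonical[attr] = most_common if count > 1 else max(values, key=len)
    return canonical

cluster = [
    {"first_name": "jon", "surname": "smith", "city": "oslo"},
    {"first_name": "jonathan", "surname": "smith", "city": ""},
    {"first_name": "jon", "surname": "smyth", "city": "oslo"},
]
print(canonicalize(cluster))  # {'first_name': 'jon', 'surname': 'smith', 'city': 'oslo'}
```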

Summary

In this blog post, we have covered Entity Resolution and its fundamental concepts for identifying records that refer to the same real-world entity. The process starts with standardizing the datasets and preprocessing their attributes. Blocking is employed to minimize the number of candidate pairs to compare, without sacrificing too many true matches. Comparing candidate pairs involves computing similarity scores with functions suited to each attribute type. Matching classifies pairs as matches or non-matches, and finally, records are grouped into clusters representing entities.

References

[1]: P. Christen, Data Matching, 2012
[2]: G. Papadakis et al., A Survey of Blocking and Filtering Techniques for Entity Resolution, 2020
[3]: V. Christophides et al., End-to-End Entity Resolution for Big Data: A Survey, 2020
[4]: T. Herzog et al., Data Quality and Record Linkage Techniques
[5]: L. Getoor et al., Entity Resolution: Theory, Practice & Open Challenges, 2012
