Real-time Entity Resolution with Kafka and Spark
Introduction to Entity Resolution (ER)
Welcome to our real-time Entity Resolution (ER) project overview, where we explore the dynamic field of streaming ER. This project focuses on resolving entities not only within a static dataset but also within the continuous influx of incoming streaming data.
Entity Resolution (ER) aims to identify which records in a single dataset or across multiple datasets refer to the same real-world entity so that these records can be aggregated and better understood. ER techniques can help solve complicated and costly problems like data corruption, unnecessarily duplicated records, and most importantly, discordant datasets within the same overarching system. With these challenges in mind, ER is a fundamental tool for big data applications across industries and is often essential for improving data quality and increasing the reliability of data analytics.
Consider a simplified ER task: we want to know which pieces of information can be linked together between two datasets in order to draw insights about the records in each.
Even with only three entities, we run into a problem: although we have (apparently) unique IDs for each person, they vary by database, and we have multiple John Does, so we cannot combine our databases because there is no shared key with unique information to join on. Still, we can easily see that John Doe and Tom Jones are probably distinct because they have different names and IDs. However, despite our best efforts, we have no way to determine whether each Tom Jones record is unique compared to the other Tom Jones records, and similarly no way to tell whether each John Doe record is unique compared with the other John Doe records. So even in a very simple case, where our system has three people in it and our databases carry only three features each, this hands-on approach has distinct limitations. As we scale to enterprise levels with millions or billions of records, the problem becomes intractable almost immediately! Because of this, academic and corporate researchers have worked on scaling this problem since the 1950s, establishing approaches that range from deterministic, rules-based matching to machine learning-based methods.
For this project, ML models were chosen because they can efficiently handle large datasets and streaming data while remaining much more flexible to changes in data and even schema compared to rules-based matching. These models can learn from data and improve over time as they see new data patterns, and they excel at identifying complex patterns and subtle relationships between data features for more accurate matching, especially when the data is messy.
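To make the idea of learning-based matching concrete, here is a minimal, self-contained sketch (not the project's production model) that compares record pairs with simple string-similarity features and scores them with logistic regression. The field names and the tiny labeled training set are illustrative assumptions.

```python
# Illustrative sketch of learning-based pairwise matching (not the project's model).
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def similarity(a: str, b: str) -> float:
    """Normalized character-level similarity between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """Turn a pair of records into a vector of per-field similarities."""
    fields = ["first_name", "last_name", "ssn"]
    return [similarity(rec_a[f], rec_b[f]) for f in fields]

# Hypothetical labeled pairs: 1 = same person, 0 = different people.
training_pairs = [
    ({"first_name": "John", "last_name": "Doe", "ssn": "123-45-6789"},
     {"first_name": "Jon", "last_name": "Doe", "ssn": "123-45-6789"}, 1),
    ({"first_name": "John", "last_name": "Doe", "ssn": "123-45-6789"},
     {"first_name": "Tom", "last_name": "Jones", "ssn": "987-65-4321"}, 0),
]

X = [pair_features(a, b) for a, b, _ in training_pairs]
y = [label for _, _, label in training_pairs]
model = LogisticRegression().fit(X, y)

# Probability that a new pair of records refers to the same entity.
new_pair = pair_features(
    {"first_name": "Johnny", "last_name": "Doe", "ssn": "123-45-6789"},
    {"first_name": "John", "last_name": "Doe", "ssn": "123-45-6789"},
)
print(model.predict_proba([new_pair])[0][1])
```

The key point is that the matching threshold is learned from examples rather than hand-coded, which is what makes the approach resilient to messy or evolving data.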
The Data Surge Solution
We are particularly interested in techniques for linking personal records together. Fortunately for all of us, there are strict rules on the sharing of private, personally identifiable information (PII). To emulate working with such data without handling real PII, we generated millions of synthetic records covering the PII-type fields of interest.
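As an illustration of how such records can be produced, here is a small sketch using the Faker library; the exact field names and generation logic are assumptions, not the project's actual data generator.

```python
# A minimal sketch of generating synthetic PII-style records; field names are
# illustrative, not the exact schema used by the project.
import json
from faker import Faker

fake = Faker("en_US")

def synthetic_record() -> dict:
    return {
        "first_name": fake.first_name(),
        "middle_name": fake.first_name(),
        "last_name": fake.last_name(),
        "ssn": fake.ssn(),
        "dl_number": fake.bothify(text="?#######").upper(),  # fake license number
        "dl_state": fake.state_abbr(),
        "birth_state": fake.state_abbr(),
    }

# Generate a small batch; the real dataset contained millions of such records.
records = [synthetic_record() for _ in range(5)]
print(json.dumps(records[0], indent=2))
```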
The following is an overview of our ER solution:
- First, synthetic user records are submitted via a React form, which provides fields like first name, middle name, last name, social security number, driver’s license number, driver’s license state, and birth state. When the form is submitted, an HTTP request is made to a Spring Boot microservice called Ingest (a sketch of this submission step follows this list).
- Next, within Ingest, the submitted fields are packaged into a plain JSON message and published to Kafka within our Confluent Docker environment.
- Afterward, we perform entity resolution using trained machine learning models deployed in Spark. The resolved linkages are routed with Kafka to an instance of ksqlDB.
- The record linkages are then joined to the original ingested records to form a queryable set of resolved entities in ksqlDB.
- Finally, a separate React form paired with a second Spring Boot microservice, Search, allows a user to search the resolved records in ksqlDB for canonicalized entities.
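For a sense of what the submission step looks like from a client's perspective, here is a hedged sketch of the HTTP request the React form effectively makes to the Ingest microservice; the endpoint path, port, and field names are hypothetical.

```python
# Hypothetical submission to the Ingest microservice; the real client is a React form.
import requests

record = {
    "first_name": "Tom",
    "middle_name": "Lee",
    "last_name": "Jones",
    "ssn": "987-65-4321",
    "dl_number": "D1234567",
    "dl_state": "MD",
    "birth_state": "VA",
}

# Assumed endpoint; the actual route exposed by the Ingest service may differ.
resp = requests.post("http://localhost:8080/ingest", json=record, timeout=5)
resp.raise_for_status()
print("Record accepted:", resp.status_code)
```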
The deployment of our machine learning system on Spark and the inclusion of real-time streaming with Kafka mean that this solution can scale to both the temporal and spatial needs of a given use case!
Under the Hood
Now that we have covered the overview of our system, we’ll spend more time detailing how we use Kafka, ksqlDB, and Spark to make our entity resolution system both real-time and scalable.
Kafka for Streaming
Kafka supports streaming of flexible data payloads, called messages, from system to system. Further, Kafka allows configuration of anticipated data schemas on these messages and, importantly, provides uniquely named topics that let us keep raw ingested records separate from resolved entities later on.
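To illustrate what publishing a message to a topic looks like, here is a minimal sketch of producing one ingested record to the IngestRecords topic. The actual Ingest service is a Spring Boot application, so the Python client and broker address below are stand-ins for illustration only.

```python
# Conceptual sketch of publishing a record to Kafka; the real producer lives in
# the Spring Boot Ingest service. Broker address is an assumption.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

record = {"first_name": "John", "last_name": "Doe", "ssn": "123-45-6789"}

# Publish to the raw-ingest topic; resolved entities go to a separate topic.
producer.produce("IngestRecords", value=json.dumps(record).encode("utf-8"))
producer.flush()
```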
- As data enters our system, the Ingest microservice publishes it to Kafka on the IngestRecords topic.
- These Kafka messages are read into Spark via Structured Streaming and deserialized using the records’ known schema, as sketched after this list.
- After machine learning has been completed, the resolved record linkages are streamed directly to ksqlDB on the ResolvedRecords topic. This prevents cross-contamination of unresolved and resolved records and allows for multiple immutable streams.
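A rough sketch of the second step, reading the IngestRecords topic into Spark Structured Streaming and deserializing the JSON payload against a known schema, might look like the following; the schema fields and broker address are assumptions based on the overview above.

```python
# Sketch of consuming IngestRecords with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("entity-resolution").getOrCreate()

# Assumed record schema, mirroring the fields collected by the React form.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("middle_name", StringType()),
    StructField("last_name", StringType()),
    StructField("ssn", StringType()),
    StructField("dl_number", StringType()),
    StructField("dl_state", StringType()),
    StructField("birth_state", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "IngestRecords")
    .load()
)

# Kafka delivers bytes; cast to string and parse the JSON against the known schema.
records = raw.select(
    from_json(col("value").cast("string"), schema).alias("record")
).select("record.*")

# From here, a writeStream sink would hand the parsed records to the resolution step.
```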
Spark for Resolving
Spark is a natural choice for scalable operations because of its support for parallelized computation. At scale, our entity resolution models should perform similarly to how they perform when we have only a few records to resolve! For this project we used the dedupe library, which works as follows (a condensed code sketch follows these steps):
- First, a query set of records entering our system through Kafka is compared against all previous records in the system, and a subset of likely candidate matches is chosen through a process called blocking to minimize the number of comparisons that have to be made.
- Next, each query record is compared against each candidate match using character distance metrics.
- Afterward, machine learning techniques such as hierarchical clustering or logistic regression learn from presented examples of true record linkages to establish predicted record linkages between the query set and the candidate matches, producing linkage output like the following:
{
  "Resolved Entity ID": "A",
  "User Records": [1, 3, 5, 7]
}
- In this example, records 1, 3, 5, and 7 have been linked together to form resolved entity A!
- Finally, the record linkage JSONs are streamed to ksqlDB using Kafka.
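For reference, a condensed sketch of this workflow using the dedupe 2.x API is shown below; the field definitions, labeling step, and threshold are placeholders rather than the project's actual configuration.

```python
# Condensed dedupe workflow: blocking, pairwise comparison, clustering.
# Uses the dedupe 2.x API; data, fields, and threshold are placeholders.
import dedupe

# Records keyed by ID, as dedupe expects.
data = {
    1: {"first_name": "John", "last_name": "Doe", "ssn": "123-45-6789"},
    3: {"first_name": "Jon", "last_name": "Doe", "ssn": "123-45-6789"},
    5: {"first_name": "Tom", "last_name": "Jones", "ssn": "987-65-4321"},
}

fields = [
    {"field": "first_name", "type": "String"},
    {"field": "last_name", "type": "String"},
    {"field": "ssn", "type": "Exact"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)   # builds blocking predicates and candidate pairs
dedupe.console_label(deduper)    # interactive labeling of example pairs
deduper.train()                  # fits the matching model on the labeled examples

# Cluster records whose match score clears the threshold.
for entity_id, (record_ids, scores) in enumerate(deduper.partition(data, threshold=0.5)):
    print({"Resolved Entity ID": entity_id, "User Records": list(record_ids)})
```

The blocking step is what keeps the comparison count tractable: only candidate pairs that share a blocking predicate are scored, rather than every possible pair of records.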
ksqlDB for Querying
An essential part of this project is interaction with a structured database for querying resolved entities. ksqlDB functions much like any SQL database; however, the important distinction is that ksqlDB interfaces directly with Kafka to support streaming operations! ksqlDB inherits the same nomenclature from Kafka, and queries against tables can be submitted to the database instance as though the table names were Kafka topics. In our system, this works as follows (a sketch of the corresponding ksqlDB statements appears after the list):
- Streaming “tables” are generated both for the initial records routed through the Kafka topic IngestRecords and for the resolved records routed through ResolvedRecords.
- These streams are then linked together as a queryable table in ksqlDB to give the most recent entity ID associated with each unique record in our system.
- All information that pertains to a unique entity is collected in a self-contained row of the resolved table.
- This allows the end user to query any field associated with a person and receive back all of the pertinent information in the database.
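As a rough sketch of how these streams might be registered and queried over ksqlDB's REST API, consider the following; the statement details (columns, keys, and the join that links records to entities) are simplified assumptions and may differ from the project's actual DDL.

```python
# Hedged sketch: register the two topics with ksqlDB via its REST API, then
# follow resolved linkages with a push query. Host, columns, and keys are assumptions.
import requests

KSQLDB = "http://localhost:8088"
headers = {"Content-Type": "application/vnd.ksql.v1+json"}

# Simplified stream definitions over the two Kafka topics (topics must already exist).
statements = """
CREATE STREAM ingest_records (
    record_id VARCHAR KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    ssn VARCHAR
) WITH (KAFKA_TOPIC='IngestRecords', VALUE_FORMAT='JSON');

CREATE STREAM resolved_records (
    record_id VARCHAR KEY,
    entity_id VARCHAR
) WITH (KAFKA_TOPIC='ResolvedRecords', VALUE_FORMAT='JSON');
"""

resp = requests.post(f"{KSQLDB}/ksql",
                     json={"ksql": statements, "streamsProperties": {}},
                     headers=headers, timeout=10)
resp.raise_for_status()

# A push query the Search service could run to follow resolved linkages live.
query = {"ksql": "SELECT record_id, entity_id FROM resolved_records EMIT CHANGES;",
         "streamsProperties": {}}
with requests.post(f"{KSQLDB}/query", json=query, headers=headers,
                   stream=True, timeout=30) as push:
    for line in push.iter_lines():
        if line:
            print(line.decode())  # raw chunked JSON rows from ksqlDB
```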
Conclusion
The Data Surge Real-Time Entity Resolution System provides a uniquely useful implementation of ER that is responsive to both temporal and scaling needs. We are now empowered to handle ER on massive datasets and can readily solve the question posed at the beginning of this article: which Tom Jones went to Meeting Y and is working on Task A.
Further implementation details, outputs, and use cases can be explored in the Project GitHub Repository real-time-ers.
Connect with us on LinkedIn for updates and more information about Data Surge LLC.
Project Components and Further Reading
The GitHub repository for the project includes various components:
- Infra: Contains Docker Compose files for Kafka and Spark clusters setup.
- Ingest: Spring Boot app responsible for collecting user data from React and sending it to Kafka.
- React: React app for submitting data for ingestion and querying resolved entities.
- Query: Spring Boot app for processing user search requests on the resolved entity database.
These links will be helpful in starting your containerized entity resolution journey:
- End-to-End Entity Resolution: Overview of End-to-End ER