A scalable fuzzy approach to a deduplication problem

Karan Mitroo
Credit Saison (India)
Feb 28, 2020
Identifying similar records in a system

Introduction

We at Credit Saison (India) provide capital to corporate, MSME, and retail consumers. Before we hand out a loan, we do a thorough job of due diligence to ensure the borrower has the full capacity to repay it. One aspect of this due diligence involves deduplication of our customer base.

What is Deduplication?

Deduping is the process of identifying and dealing with entries in a data set that have similar characteristics. As a lender, we want to identify potential customers who might already exist in our system.


To find out whether a new customer who has just joined the platform is a repeat customer, we match their data against the data of all existing customers in the system — like the name of the new customer against the names of all existing customers. Apart from the name, we also match their KYC data, contact, address, and some other information.

The logic here cannot be limited to a simple case-insensitive string match. We also do fuzzy string matches, where entities that are not exactly equal but very similar to each other are considered the same. We run a fuzzy match and keep only those results whose match score is above a certain threshold. We also realized that no single fuzzy algorithm gives us the best match, so we decided to use multiple algorithms combined with a weighted mean, based on the confidence we assigned to each algorithm.
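The weighted-mean idea can be sketched in a few lines of Python using only the standard library. The two similarity functions, their weights, and the threshold below are illustrative stand-ins, not the algorithms or values we used in production:

```python
from difflib import SequenceMatcher

def ratio_score(a: str, b: str) -> float:
    # difflib's Ratcliff/Obershelp similarity, in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def bigram_jaccard(a: str, b: str) -> float:
    # Jaccard similarity over character bigrams, in [0, 1]
    def grams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

# (algorithm, confidence weight) pairs — weights are assumptions
ALGORITHMS = [(ratio_score, 0.6), (bigram_jaccard, 0.4)]

def weighted_match_score(a: str, b: str) -> float:
    # Weighted mean of all algorithm scores
    total = sum(w for _, w in ALGORITHMS)
    return sum(f(a, b) * w for f, w in ALGORITHMS) / total

def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return weighted_match_score(a, b) >= threshold
```

With this sketch, `is_probable_duplicate("Jon Smith", "John Smith")` comes out as a match, while clearly unrelated names fall below the threshold.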

Note — this was an initial, pre-launch development effort, when we had neither existing data nor data scientists to decide what the algorithm weights should be.

How we started building it — Approach #1

The microservice this feature was being added to is written in Java, so we used libraries that help us do the fuzzy match. Since we do both exact and fuzzy matches, we used database queries for the exact matches and Java libraries for the fuzzy matches.

This is what our pseudo code for the util methods looked like:

def get_exact_name_match(input_name):
    return db.find.where(name = input_name)

def get_exact_contact_match(input_contact):
    return db.find.where(contact = input_contact)

def get_fuzzy_name_match(input_name):
    all_names = db.get_all_names()
    names_with_fuzzy_score = []
    for each_name in all_names:
        fuzzy_score = library.fuzzy_match(each_name, input_name)
        names_with_fuzzy_score.append((each_name, fuzzy_score))
    return names_with_fuzzy_score

And this is what our pseudo code for the business logic looked like:

exact_name_match = get_exact_name_match('myName')
if len(exact_name_match) != 0:
    return True

exact_contact_match = get_exact_contact_match('myContact')
if len(exact_contact_match) != 0:
    return True

fuzzy_name_match = get_fuzzy_name_match('myName')
if len(fuzzy_name_match) != 0:
    return True

return False

It worked but what went wrong?

In the pseudo-code above, the method get_fuzzy_name_match(input_name) gets all the names from the database into a list and then traverses them to generate a list of names and their fuzzy match scores. And there are multiple such methods that fetch objects from the DB and convert them to lists of Java class objects — like the address and contact matches.

This implementation works just fine for a small data set. But once the number of rows fetched grows into the thousands, every API call starts to experience high latency. Embarrassingly enough, with around 12,000 rows in our tables, the API call took no less than 4 minutes to perform the dedupe operation for one user. This is when we realized that our approach needed a major rework; it simply would not scale.

We did some analysis and found the following reasons behind the delay.

  1. Converting all the DB rows to Java objects and then filtering out objects that don’t cross the fuzzy match threshold.
  2. Fetching data from some other tables related to the table in the query because it had foreign keys in place.
  3. Not using proper indexes when using exact string match queries.

How we tried improving it — Approach #2

From the first approach, we understood that doing fuzzy matches at the service level would not scale. The data would keep growing, and so would the response time (as it did even in the brief time we were live). The next approach was to do the fuzzy matches at the database level.

We experimented with Elasticsearch for a while. Though its response times were super fast, it could not serve our purpose of combining multiple weighted fuzzy matching algorithms. Moreover, streaming data from our relational DB to Elasticsearch just for the purpose of deduping seemed like premature optimization at this stage.

After some research, we hit upon our solution in PostgreSQL itself: an extension called pg_similarity that supports fuzzy string matching inside PostgreSQL. pg_similarity supports more than 15 different fuzzy match algorithms, and most of the algorithms we had been using from Java libraries were available in it.
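As a sketch of what this enables, a weighted multi-algorithm match can be expressed directly in SQL. The table and column names below are illustrative, and we assume pg_similarity's jarowinkler() and levenshtein() functions, which return a normalized similarity in [0, 1]; the weights and threshold are placeholders:

```sql
CREATE EXTENSION IF NOT EXISTS pg_similarity;

-- Weighted mean of two similarity functions, filtered by a threshold.
-- Weights (0.6 / 0.4) and threshold (0.85) are illustrative.
SELECT name,
       (0.6 * jarowinkler(name, 'Jon Smith')
      + 0.4 * levenshtein(name, 'Jon Smith')) AS score
FROM customers
WHERE (0.6 * jarowinkler(name, 'Jon Smith')
     + 0.4 * levenshtein(name, 'Jon Smith')) > 0.85;
```

The key point is that only rows above the threshold ever leave the database.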

Now we were able to perform both exact and fuzzy matches at the database level. This solved the first two issues we listed above.

  1. Getting only the useful data from the database: now only the rows that passed our fuzzy match threshold were returned, which we expected to be a small number. So instead of converting all the rows to Java objects, we only converted the rows that matched the input data. Converting all the rows would have grown linearly with the number of rows; converting only the required rows takes a roughly constant time, which recovered a major chunk of our performance.
  2. No longer fetching related tables: Since we were now using SQL to fetch data, we only fetched the columns that were required. No extra columns or tables related by foreign keys were fetched.

Some Metrics

We needed some stats to make sure we were headed in the right direction before doing a production implementation — to know whether our design would work at a larger scale.

Database indexes: we created a table with 100 million rows and 4 columns, where every column had 10 million distinct values, and built an index on the column we wanted to query. Say our table was named contacts; our query looked something like this:

SELECT * FROM contacts WHERE contact_value = '9988111222';
  • Without an index, it took ~31 seconds for this query to execute.
  • With an index, it took ≤ 15 milliseconds for this query to execute.
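The index behind the second number can be sketched as an ordinary B-tree index on the queried column (the table and column names match the query above; the index name is our own choice):

```sql
-- A standard B-tree index on the column used in the WHERE clause.
CREATE INDEX idx_contacts_contact_value ON contacts (contact_value);

-- EXPLAIN ANALYZE shows whether the planner actually uses the index:
EXPLAIN ANALYZE
SELECT * FROM contacts WHERE contact_value = '9988111222';
```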

Converting SQL output to Java POJOs: some of the data was initially stored as a byte array in PostgreSQL, because Spring Boot does not support PostgreSQL's jsonb type by default. Also, as we mentioned earlier, we were fetching some unneeded columns and mapping them to member variables in the Java object. On converting 1,500 rows to Java objects, we observed the following:

  • Mapping 10 columns, one of them containing byte array data, took > 9 seconds.
  • Mapping 4 columns containing primitive data types took < 250 milliseconds.

Based on the above metrics, we knew that PostgreSQL could handle our current requirements and scale far better, and that we could start refactoring our code to perform exact and fuzzy searches in the database.

Time to refactor and the Result

We could foresee the benefits of shifting our business logic from the service level to the database level. So we developed stored procedures in PostgreSQL that accepted the same input our service had been accepting, performed the dedupe operation, and returned the relevant matches.
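A minimal sketch of such a procedure, assuming a hypothetical customers table and pg_similarity's jarowinkler() and levenshtein() similarity functions (the real procedures, weights, and thresholds are specific to our schema and business logic):

```sql
CREATE OR REPLACE FUNCTION dedupe_match(in_name text, in_contact text)
RETURNS TABLE (customer_id bigint, score double precision) AS $$
  -- Exact matches first, with a perfect score...
  SELECT id, 1.0::double precision
  FROM customers
  WHERE name = in_name OR contact = in_contact
  UNION ALL
  -- ...then fuzzy name matches above an illustrative threshold.
  SELECT id,
         (0.6 * jarowinkler(name, in_name)
        + 0.4 * levenshtein(name, in_name))::double precision
  FROM customers
  WHERE (0.6 * jarowinkler(name, in_name)
       + 0.4 * levenshtein(name, in_name)) > 0.85;
$$ LANGUAGE sql STABLE;
```

The service then calls one function per dedupe request instead of pulling whole tables over the wire.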

Our service now just beautified the response from the database and sent it back as a response to the API call. This code was now ready to scale.

  • We saw more than 90% improvement in our API response times.

At this point in time, we have more than 20,000 rows in each of our tables and 18 hierarchy levels in our dedupe business logic. Of these 18 levels:

  • The first 5 require data only from 1 table to do exact matches
  • The next 3 require data from 2 tables to do an exact match and
  • Finally, the next 10 require data from 2 tables to do fuzzy matches

Result: The business logic now responds in less than 500ms.

The Learnings

We learned plenty in the short journey of implementing this feature:

  • Do quick POCs and fail fast (yes, it's the agile way), and capture data points along the way!
  • Consider more than 1 option while evaluating a solution (especially when developing a brand new feature).
  • We became better at separating tasks at a service level and using a Relational DB to do more than just store data.
  • Greater appreciation of PostgreSQL capabilities.
  • While most problem statements seem similar from 30,000 feet, the solution and implementation are unique to your business need. Remember, copy/paste solutions from tech forums will only give you direction.
