Pull a fast one with graph and identity resolution.

Louis Pienaar
3 min read · Oct 4, 2022


courtesy craiyon.com — Flash running fast across the universe

I finally got tasked with finding duplicate client records. In my line of work, I often have to deal with duplicate records, but now I get to do something about it.

Identity resolution is a big topic, and I will deal with only a small portion: Edit Distance.

The edit distance between Johnathan and Jonathan is 1 (we only have to alter one character). Hence, in this case, these two names are close to each other, and it can be used, together with other evidence, to eventually conclude that these two records are, in fact, from the same client.
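To make the metric concrete, here is a minimal Python sketch of the classic dynamic-programming Levenshtein edit distance (this is illustrative code, not the implementation used in the article's pipeline):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # prev[j] holds the edit distance between the empty prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Johnathan", "Jonathan"))  # 1 (delete one 'h')
```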

So with a set of names, you calculate the edit distance for all unique pairs and then you are done.

Problem solved, right?

That is what I thought, but it turns out the problem becomes too big very quickly. For a set of names, the number of possible unique pairs is n*(n-1)/2, with n the number of names. So for a set of 100 000 names, we would have to calculate the edit distance for 4 999 950 000 pairs (just under 5 billion). Even as I’m writing this, I’m thinking I must be missing something here. The calculation time becomes infeasible. Even Flash would take forever to run across the entire universe.
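You can check the pair count yourself in a couple of lines of Python:

```python
from math import comb

n = 100_000
# Number of unique unordered pairs: n*(n-1)/2, i.e. "n choose 2".
print(n * (n - 1) // 2)  # 4999950000
print(comb(n, 2))        # 4999950000 (same value)
```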

Using a graph to solve this calamity.

Good graph database engines are great at figuring out adjacency. In other words, if you redefine your problem in terms of adjacency, you can likely solve complex problems quickly and efficiently with a graph engine. In this case, we only want to generate pairs of names for comparison when those names are relatively close to each other. Why even calculate the edit distance between Elizabeth and Kate when they are that far apart?

The secret sauce is in the graph schema.

To illustrate the solution, I will simplify the problem to three names of length five or less.

Kate, Diana and Nate.

At first glance, we can see that we should calculate the edit distance for {Kate, Nate}, but not for the other two possible pairs, {Diana, Kate} and {Diana, Nate}. This is obvious to you because you can see that Kate and Nate are close and the other two pairs are not.

To solve this, we create new nodes that represent a combination of a letter of the alphabet and a possible position (1 to 5 in this case, as we limited ourselves to names of max length 5). Hence we get nodes a1, a2, a3, a4, a5, b1, b2, …, z3, z4, z5. Each name is connected to the alphabet nodes that match its letters and their positions. For example, Diana would be represented like this in our graph:

Example Name Alphabet Schema in TigerGraph
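In plain code, the mapping from a name to its alphabet nodes might look like the following sketch (a hypothetical helper for illustration, not the actual TigerGraph loading job):

```python
def letter_position_nodes(name: str, max_len: int = 5) -> list[str]:
    """Return the (letter, position) node IDs a name connects to,
    e.g. 'Diana' -> ['d1', 'i2', 'a3', 'n4', 'a5'].
    Names longer than max_len are truncated, matching the article's
    simplification to names of length five or less."""
    return [f"{ch.lower()}{i}"
            for i, ch in enumerate(name[:max_len], start=1)]

print(letter_position_nodes("Diana"))  # ['d1', 'i2', 'a3', 'n4', 'a5']
print(letter_position_nodes("Kate"))   # ['k1', 'a2', 't3', 'e4']
```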

Nate and Kate would look like this in our graph:

Nate and Kate's connections in TigerGraph

We can see that Nate and Kate are close to each other, as they share three (out of four possible) connections: three corresponding letters in matching positions. In production this gets more complicated, as you should allow for position offsets etc., but that too can be done using edges/connections.

Using this adjacency and other attributes like word length, we can filter the number of pairs before running our expensive algorithms, like a Levenshtein edit distance calculation, on them. In this case, we filter by returning only the pairs that share at least 60% of the possible connections between them.
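Outside a graph engine, the same blocking idea can be sketched with an inverted index in Python. The function name and the way the 60% threshold is applied here are my own illustrative choices, not the article's actual GSQL query:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(names: list[str],
                    threshold: float = 0.6,
                    max_len: int = 5) -> set[tuple[str, str]]:
    """Return only the name pairs that share enough (letter, position)
    nodes to be worth an edit-distance calculation."""
    # Inverted index: (letter, position) node -> names attached to it.
    index: dict[str, set[str]] = defaultdict(set)
    for name in names:
        for i, ch in enumerate(name[:max_len], start=1):
            index[f"{ch.lower()}{i}"].add(name)

    # Count shared nodes per pair; pairs with zero shared nodes
    # (e.g. Diana/Kate) never even appear here.
    shared: dict[tuple[str, str], int] = defaultdict(int)
    for attached in index.values():
        for a, b in combinations(sorted(attached), 2):
            shared[(a, b)] += 1

    # Keep pairs whose shared nodes cover at least `threshold` of the
    # longer name's possible connections.
    return {pair for pair, k in shared.items()
            if k / max(len(pair[0]), len(pair[1])) >= threshold}

print(candidate_pairs(["Kate", "Diana", "Nate"]))  # {('Kate', 'Nate')}
```

Kate and Nate share three of four possible nodes (a2, t3, e4), which clears the 60% bar; the two Diana pairs share nothing and are never generated at all.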

In my real-world example of 10 000 names, I reduced the total number of pairs to be evaluated from 49 995 000 (just under 50 million) to 689 650. That is a whopping 98.6% reduction in the number of calculations.

Contact me if you would like to see this in action on TigerGraph.
