A Journey into BigQuery Fuzzy Matching — 2 of [1, ∞) — More Soundex and Levenshtein Distance

Brian Suk
Google Cloud - Community
8 min read · Aug 13, 2019

In the first post on this topic, we went over how to build a series of user defined functions (UDF) in order to implement the Soundex algorithm. To recap, the idea here is to build a toolbox that can be used to solve challenges around fuzzy matching and master data management. Now Soundex is the first one that we have!

Ever since I wrote that post, I have been inundated with an email asking about its performance. One of the nice things about Soundex is that it’s fairly quick, and when you add BigQuery’s massively parallel processing on top of it, it ends up not being a very heavy operation at all.

Must Soundex Faster

Previously we were using a handful of names to do some spot checking around the algorithm itself. Let’s add a few more names to that list and see what happens. Off we go to fetch some data!

A search for some name data led me to the US Census Bureau’s Genealogy Data website. A list of frequently occurring last names has been compiled from each of the last two censuses (2010 and 2000), and the 1990 census gave us both first and last names for men and women. I took all of the data published for public use and threw it into a BigQuery table. All in all, we were able to collect almost 400,000 last names to run our queries on.

The breakdown of names by year, and type of name.

Now, some of these will be repeated, so the real-world applicability of the fuzzy matching algorithms themselves is a bit limited. But here we are just trying to see how fast it goes, and 400,000 is a little bit bigger than 20. Putting on my on-premises data integration/data quality software hat for a second, I would normally expect this to take an hour, maybe? Tens of minutes? It’s BigQuery, though, so surely a few minutes at least?
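The run itself is nothing fancy: it just applies the Soundex UDF from the first post to every name in the merged table. A sketch, assuming the merged table is `dq.name_data` with an `lname` column (the same names used later in this post), looks something like this:

# Generate a Soundex code for every last name in the merged census table.
SELECT
  lname,
  `dq.dq_fm_Soundex`(lname) AS soundex_code
FROM
  `dq.name_data`
WHERE
  lname IS NOT NULL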

It’s a little faster than “a few minutes.”

Five seconds is pretty quick when processing almost half a million names. Doubling that count towards a million names didn’t seem to have all that much impact on the overall processing time.

Double the data doesn’t necessarily mean double the processing time.

The point here is that the increase in processing time from doubling an already fairly large list of names is so nominal that it may simply be due to compute warm-up rather than the increase in data processed. The idea of leveraging database horsepower to perform operations like this is nothing new, but being able to get these results this quickly with zero infrastructure and no extra tooling is pretty awesome, if I do say so myself.

Also, if you’d like to run these yourself, I’ve made the Census data that I’ve merged into a single table available as a CSV export.

Now that we have some basic phonetic classification going on, what’s next? We should be good, right? Well, not quite. Let’s talk about building another tool for our toolbox.

Introducing Vladimir Levenshtein

Vladimir Levenshtein was a Russian scientist who specialized in, among many things, information theory. In 1965 he created the Levenshtein Distance, a form of distance metric called edit distance, which at a high level measures the number of changes required to turn one string into another. Specifically, it counts how many insertions, substitutions, and deletions it takes to get from one string to the other. Here are a few examples of this in effect:

  • “Whisky” — “Whiskey”: Levenshtein Distance: 1. An ‘e’ is added.
  • “Drinjs” — “Drinks”: Levenshtein Distance: 1. The ‘j’ is substituted with a ‘k’.
  • “Chandler Bing” — “Miss Chanandler Bong”: Levenshtein Distance: 8. Add “Miss ” (5), add ‘an’ (2), substitute ‘i’ with ‘o’ (1).

Keep in mind that this is the base algorithm, which gives equal weight to insertions, substitutions, and deletions. There are variants that give different weights to different operations, which you may want depending on your use case (for example, in OCR use cases you may want to give a lower weight to ‘l’/‘1’ and ‘O’/‘0’ substitutions). If you are curious about how the algorithm works, there is an excellent Medium post that walks you through the matrix operations to get the right distance. It’s definitely worth a read! When you run two strings through the algorithm, the output is a single number: the net number of edits.

So how do we get this working? Another UDF! For the Soundex function we were able to string together a series of SQL statements, but this algorithm needs loops, so we will have to use some JavaScript. If you’re not into reading the thorough explanation linked above, the folks at the University of Pittsburgh boiled it down to these steps:

1 — Set n to be the length of s. Set m to be the length of t.

If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.

2 — Initialize the first row to 0..n.
Initialize the first column to 0..m.

3 — Examine each character of s (i from 1 to n).

4 — Examine each character of t (j from 1 to m).

5 — If s[i] equals t[j], the cost is 0. If s[i] doesn’t equal t[j], the cost is 1.

6 — Set cell d[i,j] of the matrix equal to the minimum of:

a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.

7 — After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

The previous link that explains how the algorithm works also has step-by-step matrix illustrations showing how it iterates through the data, and there’s code available online that implements this algorithm in JavaScript. As it’s distributed under the MIT license, let’s put it into a BigQuery UDF!

CREATE OR REPLACE FUNCTION
  dq.dq_fm_LevenshteinDistance(in_a string, in_b string)
  RETURNS INT64
  LANGUAGE js AS """
/*
 * Data Quality Function - Fuzzy Matching
 * dq_fm_LevenshteinDistance
 * Based off of https://gist.github.com/andrei-m/982927
 * input: Two strings to compare the edit distance of.
 * returns: Integer of the edit distance.
 */
var a = in_a.toLowerCase();
var b = in_b.toLowerCase();

if (a.length == 0) return b.length;
if (b.length == 0) return a.length;

var matrix = [];

// increment along the first column of each row
var i;
for (i = 0; i <= b.length; i++) {
  matrix[i] = [i];
}

// increment each column in the first row
var j;
for (j = 0; j <= a.length; j++) {
  matrix[0][j] = j;
}

// Fill in the rest of the matrix
for (i = 1; i <= b.length; i++) {
  for (j = 1; j <= a.length; j++) {
    if (b.charAt(i-1) == a.charAt(j-1)) {
      matrix[i][j] = matrix[i-1][j-1];
    } else {
      matrix[i][j] = Math.min(matrix[i-1][j-1] + 1, // substitution
                     Math.min(matrix[i][j-1] + 1,   // insertion
                              matrix[i-1][j] + 1)); // deletion
    }
  }
}

return matrix[b.length][a.length];
""";

Let’s give this a whirl!

WITH
  data AS (
  SELECT
    'Whiskey' a,
    'whisky' b
  UNION ALL
  SELECT
    'drinjs' a,
    'Drinks' b
  UNION ALL
  SELECT
    'Chandler Bing' a,
    'Miss Chanandler Bong' b)
SELECT
  a,
  b,
  `dq.dq_fm_LevenshteinDistance`(a, b) AS ldistance
FROM
  data

It looks like it works!

So let’s do a few things to see how these two work together.

Better Together!

With Soundex, we are essentially appending metadata to an existing data element, so it’s pretty straightforward. With edit distance, we are doing one-by-one comparisons. Comparing each of the 380,000 names in our dataset against the other 379,999(ish) names would be significantly more comparisons than we want, or need, to do. This is where combining Soundex with edit distance makes things easier. Taking each element and comparing its edit distance against only the other elements within the same Soundex code significantly reduces the number of comparisons that need to be made. It might look a little something like this:

WITH
  # Create base table of last names with Soundex codes
  name_data AS (
  SELECT
    lname,
    `dq.dq_fm_Soundex`(lname) soundex_code
  FROM
    `dq.name_data`
  WHERE
    lname IS NOT NULL),
  # Create table that's joined with itself on the Soundex code
  joined_data AS (
  SELECT
    p.lname AS p_lname,
    c.lname c_lname,
    p.soundex_code
  FROM
    name_data p,
    name_data c
  WHERE
    p.soundex_code = c.soundex_code
    AND p.lname != c.lname)
# Apply Levenshtein Distance to each pair
SELECT
  *,
  `dq.dq_fm_LevenshteinDistance`(p_lname, c_lname) AS ldistance
FROM
  joined_data
ORDER BY
  soundex_code,
  p_lname ASC

It looks like it worked! There are a few things from the query execution details that I would like to point out.

Our query execution details!

The first thing is the number of rows produced by the Join+ stage. It tells us that the Levenshtein calculation had to run over 111 million rows, the full set of comparisons to be made. The next thing I want to point out is the elapsed time: 8 minutes and 43 seconds. Pretty quick, if you ask me.
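If you want to sanity-check that row count yourself, you can estimate it from the size of each Soundex group before running the expensive part. This is just a rough sketch; it treats every name within a group as unique, so it gives an upper bound on the number of pairs the self-join can produce:

# Estimate the number of name pairs the Soundex self-join will produce.
# Each group of n names yields at most n * (n - 1) ordered pairs.
SELECT
  SUM(name_count * (name_count - 1)) AS estimated_pairs
FROM (
  SELECT
    `dq.dq_fm_Soundex`(lname) AS soundex_code,
    COUNT(*) AS name_count
  FROM
    `dq.name_data`
  WHERE
    lname IS NOT NULL
  GROUP BY
    soundex_code)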

Now let’s check out the Job Information tab.

Consumption-based pricing at its best.

Here are the on-demand pricing metrics for BigQuery, just for reference.

Going back to the data, let’s save the results of the previous query into another table so it’s easier for us to work with. We’ll take a look at an example Soundex code and see what it looks like, using SELECT DISTINCT to eliminate the duplicates that come from overlaps in “common names” across census years, as in the sketch below.
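Here’s roughly what that looks like. I’m assuming the results of the previous query were saved to a table I’ll call `dq.name_match_results`, and the Soundex code in the WHERE clause is just an illustrative pick; swap in whichever group you’d like to inspect:

# Inspect one Soundex group, de-duplicating names that appear in
# multiple census years.
SELECT DISTINCT
  p_lname,
  c_lname,
  soundex_code,
  ldistance
FROM
  `dq.name_match_results`
WHERE
  soundex_code = 'G620'  # illustrative; pick any code you like
ORDER BY
  ldistance DESC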

That’s a pretty high edit distance.

When looking at the resulting data here, we see results across all possible ldistance values. As shown here, that casts too wide a net and isn’t very useful; it’s pretty clear that “Garcy” and “Grohowski” aren’t the same name. Let’s tune this a little bit by looking only at the pairs with a Levenshtein Distance of less than four, as in the query below.
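Filtering the same saved table on the edit distance does the trick (again using the hypothetical `dq.name_match_results` table from the sketch above):

# Keep only pairs within the same Soundex group that are fewer than
# four edits apart.
SELECT DISTINCT
  p_lname,
  c_lname,
  soundex_code,
  ldistance
FROM
  `dq.name_match_results`
WHERE
  ldistance < 4
ORDER BY
  soundex_code,
  p_lname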

Ahh, these are much closer.

You’ll immediately notice that by reducing the range of the edit distance you start to get things that are closer together. By narrowing the range of comparisons to groups that phonetically match, we improve performance as well. The example with “Georgiou” shows results that are more in the realm of possibly being the same.

Again, this isn’t a complete method of fuzzy matching, but it shows how quickly we can start to assemble the parts needed to come up with a solid matching strategy. These are two possible elements that you could use as a part of a multi-criteria matching strategy.

Once again, off to the next one…
