A Journey into BigQuery Fuzzy Matching — 1 of [1, ∞) — Soundex
“That’s Not How My Name is Spelled”
Getting a name right is important when making decisions about people, and that holds true across every aspect of how a company engages with a person. Human resources, targeted marketing, and customer engagement are all use cases where getting the name right matters.
Just think back to the last time a barista wrote your name on a cup.
In the context of enterprise systems, the need to match similar names becomes even more important. When a single transaction happens in isolation, there might be a bit of tolerance for error. When it says “Bryan” instead of “Brian” on my cup, I can let it go (usually). But if I appear in a CRM system under multiple permutations of my name, that’s a bigger issue. A good example is bank acquisitions.
I was once a customer of a regional retail bank, with a checking and savings account. It was bought by a larger regional bank, which was then bought by a major bank holding company that also made other large acquisitions in parallel, some of which I was a customer of as well. The result was that I would be directly marketed savings accounts, checking accounts, and credit cards I already held through another acquisition, with variations and typos in my name. The lack of a unified identifier for a single entity is the core problem that Master Data Management (MDM) tries to solve, and it has been around for a while.
Vendors have been trying to build applications to solve this for quite some time; MDM and identity resolution systems are nothing new. One of the more common architectures leverages middleware to perform fuzzy matching and consolidation in bulk, then writes the consolidated data to a target repository. While the algorithms are time-tested, one of the big issues is always performance: you tend to be limited by the infrastructure your middleware application runs on.
Having been in the enterprise information management space for a while, and having worked at vendors that sold data integration (DI), data quality (DQ), and MDM tools, I'm always interested in this topic.
I have also come to appreciate the ability to push transformations down into the database layer. With Google BigQuery being a fully managed, petabyte-scale data platform, it got me wondering: can we perform these tasks while leveraging everything BigQuery brings for scale?
So, why not give it a shot.
What I don’t intend to do is architect an entire MDM system (for now, at least). As part of an overall DQ and/or MDM strategy, you’ll want a toolkit of cleansing and matching techniques you can assemble into a solution for your particular problem, and that toolkit is what I’m hoping to help build. I’m always interested in hearing whether something works or not, and in any feedback on how to make it better. I also don’t know how many of these I’ll get to, as there are a ton of algorithms out there solving for different things. Life is full of small adventures; let’s see where it goes.
As a general rule, I’m going to stick as close as I can to BigQuery Standard SQL. I’ve always believed that sticking with native dialects and data structures is fastest, since it doesn’t require additional engines and lets BigQuery do its thing optimally. Besides, the fewer languages one needs to know, the easier adoption becomes.
So let’s try one out.
Soundex
This algorithm was first patented in 1918, and is one of the simpler ones in common use. The goal is to encode names so that names with the same phonetic pronunciation get the same code. That way, entries with different spellings can be identified and grouped together, which helps overcome typos and misspellings.
Here’s what the algorithm looks like, from Wikipedia:
1 — Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2 — Replace consonants with digits as follows (after the first letter):
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
3 — If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter.
4 — If the word has too few letters to assign three numbers, append zeros until there are three numbers. If it has more than three numbers, retain only the first three.
For example, “Robert” retains the R, drops the o and e, and maps b, r, and t to 1, 6, and 3, yielding R163.
For the character substitutions, I found that wrapping a series of REPLACE() (for single-character substitutions) and REGEXP_REPLACE() (for multi-character substitutions) functions would do the trick. This could be abstracted into a User Defined Function (UDF) that handles both cases, but there are really only six cases, and it wouldn’t clean up the query all that much.
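To see what that substitution chain does on its own, here’s a minimal standalone sketch (not the final function) that translates a single name into its raw digit string, vowels and all. By hand, “Robert” should come out as 601063:
-- Standalone substitution chain, before any de-duplication
-- or zero-stripping. 'Robert' becomes '601063'.
SELECT
  REPLACE(
    REGEXP_REPLACE(
      REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(LOWER('Robert'), '[aeiouyhw]', '0'),
              '[bfpv]', '1'),
            '[cgjkqsxz]', '2'),
          '[dt]', '3'),
        'l', '4'),
      '[mn]', '5'),
    'r', '6') AS raw_digits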
The only other issue I had was removing duplicate adjacent letters. There were solutions for removing duplicate characters anywhere in a string, but the requirement that they be consecutive was where I got stuck. That’s when Elliott Brossard, one of the smart folks in BigQuery engineering, came up with a solution that uses offsets. We’re also going to be using Persistent UDFs, a newly released capability, to make our lives a bit easier.
CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_RemoveDuplicateChars(s STRING) AS (
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_RemoveDuplicateChars
 * input: Any string to clean up.
 * returns: String with duplicate characters removed. This
 * refers to consecutively repeated characters.
 */
(SELECT
   STRING_AGG(
     -- Drop a character when it matches the one immediately before it;
     -- SAFE_OFFSET returns NULL for the first character, so it is kept.
     IF(c = SPLIT(s, '')[SAFE_OFFSET(off - 1)], NULL, c),
     '' ORDER BY off)
 FROM
   UNNEST(SPLIT(s, '')) AS c
 WITH OFFSET off));
Let’s throw a few strings at it for a quick test.
WITH
  input_data AS (
    SELECT ["hello", "parallelogram", "sapphire", "shhhhh"] AS words)
SELECT
  words,
  dq.dq_hf_gh_RemoveDuplicateChars(words) AS deduped
FROM
  input_data, UNNEST(words) AS words
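If the function is doing its job, working the test strings out by hand gives:
-- Expected output:
-- hello         → helo
-- parallelogram → paralelogram
-- sapphire      → saphire
-- shhhhh        → sh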
There are a couple of helper functions we also need to create. With this implementation, we need to remove the first number only if the translated first letter matches the leading number, because we can’t just unilaterally remove the first number. For example, after de-duplication and translation, the full name “Gonzalez” becomes 25242; G translates to 2, so we need to remove that leading 2. However, names like “Ward” and “Hasnain” translate to 63 and 255, respectively, and since W and H do not map to numbers, we have to keep all the digits.
Let’s create those helpers now. The first takes a letter and returns the corresponding Soundex number, falling back to a 7 if it does not find a match.
CREATE OR REPLACE FUNCTION
dq.dq_hf_soundex_GetSoundexNumber(inchar string) AS (
/*
* (Helper) Data Quality Function
* dq_hf_soundex_GetSoundexNumber
* input: Single letter.
* returns: Equivalent Soundex code.
*/
CASE
WHEN STRPOS('bfpv', LOWER(inchar)) > 0 THEN '1'
WHEN STRPOS('cgjkqsxz', LOWER(inchar)) > 0 THEN '2'
WHEN STRPOS('dt', LOWER(inchar)) > 0 THEN '3'
WHEN STRPOS('l', LOWER(inchar)) > 0 THEN '4'
WHEN STRPOS('mn', LOWER(inchar)) > 0 THEN '5'
WHEN STRPOS('r', LOWER(inchar)) > 0 THEN '6'
ELSE '7'
END
);
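A quick sanity check (letters outside the map, like w, should fall through to the '7' default):
SELECT
  dq.dq_hf_soundex_GetSoundexNumber('g') AS g_code, -- returns '2'
  dq.dq_hf_soundex_GetSoundexNumber('r') AS r_code, -- returns '6'
  dq.dq_hf_soundex_GetSoundexNumber('w') AS w_code  -- returns '7' (no match)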
Then we create another one that simply returns the string minus its leading character whenever that character matches the one passed in.
CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_CheckRemoveFirstChar(full_numbers STRING,
                                 first_letter STRING) AS (
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_CheckRemoveFirstChar
 * input: A string of digits, plus the character to compare
 * against its first position.
 * returns: Input string without the first character if the
 * stored first character is the same.
 */
IF(STARTS_WITH(full_numbers, first_letter),
   SUBSTR(full_numbers, 2, LENGTH(full_numbers)),
   full_numbers));
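Plugging in the “Gonzalez” and “Ward” examples from above shows the conditional removal at work:
-- 'Gonzalez' translates to 25242 and G maps to '2', so the leading 2 goes.
-- 'Ward' translates to 63 and W maps to '7', so the string is kept intact.
SELECT
  dq.dq_hf_gh_CheckRemoveFirstChar('25242', '2') AS gonzalez_digits, -- '5242'
  dq.dq_hf_gh_CheckRemoveFirstChar('63', '7') AS ward_digits         -- '63'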
So with the dq_hf_gh_RemoveDuplicateChars(), dq_hf_soundex_GetSoundexNumber(), and dq_hf_gh_CheckRemoveFirstChar() UDFs working, it’s time to layer the rest of the character replacements on top of them.
CREATE OR REPLACE FUNCTION dq.dq_fm_Soundex(instring STRING) AS (
/*
 * Data Quality Function - Fuzzy Matching
 * dq_fm_Soundex
 * input: String to encode.
 * returns: Soundex code of the string.
 */
CONCAT(
  -- Keep the first letter, uppercased.
  SUBSTR(UPPER(instring), 0, 1),
  RPAD(
    SUBSTR(
      -- Drop the leading digit when it matches the first letter's code.
      `dq.dq_hf_gh_CheckRemoveFirstChar`(
        -- Strip the zeros (vowels plus y, h, w) after de-duplication.
        REPLACE(
          `dq.dq_hf_gh_RemoveDuplicateChars`(
            -- Translate every letter into its Soundex digit.
            REPLACE(
              REGEXP_REPLACE(
                REPLACE(
                  REGEXP_REPLACE(
                    REGEXP_REPLACE(
                      REGEXP_REPLACE(
                        REGEXP_REPLACE(LOWER(instring), '[aeiouyhw]', '0'),
                        '[bfpv]', '1'),
                      '[cgjkqsxz]', '2'),
                    '[dt]', '3'),
                  'l', '4'),
                '[mn]', '5'),
              'r', '6')),
          '0', ''),
        `dq.dq_hf_soundex_GetSoundexNumber`(SUBSTR(instring, 0, 1))),
      0, 3),
    3, '0')));
After creating that, let’s throw some names at it and give it a whirl!
WITH
name_table AS (
SELECT
["Suk","Suck","Souk",
"Price","Prise","Pryce",
"Welch","Welsh","Walch",
"Stamos","Stamoos",
"Ward","Werd","Whard",
"Gonzalez","Gonzolezz",
"McGovern","McGooovern","MacGavern"] AS lname)
SELECT
lname,
dq.dq_fm_Soundex(lname) as soundex_code
FROM
name_table,
UNNEST(lname) AS lname
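If everything is wired together correctly, names within each group should land on the same code. Worked out by hand, the expected groupings are:
-- Expected codes:
-- Suk, Suck, Souk                 → S200
-- Price, Prise, Pryce             → P620
-- Welch, Welsh, Walch             → W420
-- Stamos, Stamoos                 → S352
-- Ward, Werd, Whard               → W630
-- Gonzalez, Gonzolezz             → G524
-- McGovern, McGooovern, MacGavern → M216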
Now, while Soundex is fast and easy to use and implement, it does sometimes suffer in precision. The base encoding we are using here is typically referred to as “American Soundex”: in the 1930s, the original Russell Soundex was modified into the form shown here and used by the US Census Bureau to index names in the English alphabet. Names in languages such as Korean, Japanese, or Chinese first need to be transliterated, which comes with its own challenges.
Even in English, we still sometimes encounter false negatives and false positives due to the loss of precision. Take the names “Wozniak”, “Wiggins”, “Wegonge”, and “Weiscmowsky.” Phonetically, at least, you wouldn’t expect these to be categorized the same. However…
WITH
name_table AS (
SELECT
["wozniak","wiggins","wegonge","weiscmowsky"] AS lname)
SELECT
lname,
dq.dq_fm_Soundex(lname) AS soundex_code
FROM
name_table,
UNNEST(lname) AS lname
Sure enough, all four come back with the same code, W252. So while Soundex is helpful and fast, it’s definitely not infallible. It’s one dimension of a property that can be used to quickly create groupings for an entity, which is a big piece of the fuzzy matching puzzle. In a typical data quality flow, a number of different algorithms are used in conjunction with each other, with different weightings depending on the use case, to truly determine whether duplicates can be resolved. Hopefully this is helpful as the first piece of that puzzle in BigQuery.