String Data Normalization and Similarity Matching Algorithms

Szulicki
12 min read · Mar 31, 2019


How I tackled a data quality check between string values sharing a common ID, and eventually matched the remaining items with a similarity score

Photo by Rick Mason on Unsplash

String value reconciliation is highly case-specific; still, I believe in best practices that work across programming languages, ETL tools, and our beautiful minds. Below you will find a fairly generalized concept, parts of which can be used in pretty much any data quality check. It is a compilation of different methods that I found on the net, tested against different scenarios, and eventually accepted as useful.

After I was given the business task to reconcile string values, I made a series of unfortunate and even desperate attempts to apply every basic string function I knew to hit the nail on the head (I work in one of the ETL tools). After multiple kinds of lefts and rights, I decided to calm down and look at the string data type from a different perspective. It is just… a combination of symbols, right? In my case, company names. Like any text-based data, these tend to carry a bunch of unnecessary elements that distract the data quality check: corporate extensions, whitespace characters, digits, non-word characters, diacritical signs, and common words that do not add much meaning. The second problem is the similarity as such.

To normalize the data, I mostly used regular expressions: they are quick and pretty much user-defined. If you are not yet familiar with the concept, explore it!

Below is the list of actions that will normalize the data; it is better to keep the exact order. The running example is two company names representing the same entity, but written differently, since we reconcile the name written in English against its Spanish equivalent.

First step: Adjust case

Photo by Markus Spiske on Unsplash

Depending on the language/ETL tool, some functions can be case-sensitive, hence ‘A != a’. To get rid of this problem from the beginning, make all words uppercase.

Going further, we will rely on regex. However, an ETL solution may not implement the regular expression engine fully, so the case-insensitive flag (the ‘i’ modifier) can be ignored by the tool, leaving the regex case-sensitive.
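As a minimal sketch in Python (the same idea translates to any ETL tool), using a made-up company name:

```python
import re

name = "Alpha Trading Limited"   # hypothetical input value
name = name.upper()              # 'ALPHA TRADING LIMITED': 'A' and 'a' now compare equal

# Where the regex engine supports it, case-insensitivity can also be requested explicitly:
print(re.search(r"limited", name, flags=re.IGNORECASE) is not None)  # True
```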

Second step: Diacritics

Photo by Maria Freyenbacher on Unsplash

Characters like umlauts create a lot of mess: some people simply do not know how to type them, so they often end up replaced with basic Latin equivalents, e.g. Ä Ö Ü become A O U. Even if you do not spot these troublemakers at first glance, believe me, if you have thousands of company names, they are there. And, fairly enough, the matching algorithm treats such variants as different words. What we are going to do is use regular expressions to equalize all possible diacritics with their Latin equivalents:

+----------------------------------------+------------+
| Regex | Replace by |
+----------------------------------------+------------+
| [ÁĂẮẶẰẲẴǍÂẤẬẦẨẪÄǞȦǠẠȀÀẢȂĀĄÅǺḀÃǼǢ] | A |
| [ḂḄḆ] | B |
| [ĆČÇḈĈĊ] | C |
| [ĎḐḒḊḌḎ] | D |
| [ÉĔĚȨḜÊẾỆỀỂỄḘËĖẸȄÈẺȆĒḖḔĘẼḚÉ] | E |
| [Ḟ] | F |
| [ǴĞǦĢĜĠḠ] | G |
| [ḪȞḨĤḦḢḤẖ] | H |
| [ÍĬǏÎÏḮİỊȈÌỈȊĪĮĨḬ] | I |
| [ǰĴ] | J |
| [ḰǨĶḲḴ] | K |
| [ĹĽĻḼḶḸḺ] | L |
| [ḾṀṂ] | M |
| [ŃŇŅṊṄṆǸṈÑ] | N |
| [ÓŎǑÔỐỘỒỔỖÖȪȮȰỌŐȌÒỎƠỚỢỜỞỠȎŌṒṐǪǬÕṌṎȬǾØ] | O |
| [ṔṖ] | P |
| [ŔŘŖṘṚṜȐȒṞ] | R |
| [ŚṤŠṦŞŜȘṠẛṢṨ] | S |
| [ŤŢṰȚẗṪṬṮ] | T |
| [ÚŬǓÛṶÜǗǙǛǕṲỤŰȔÙỦƯỨỰỪỬỮȖŪṺŲŮŨṸṴ] | U |
| [ṾṼ] | V |
| [ẂŴẄẆẈẀẘ] | W |
| [ẌẊ] | X |
| [ÝŶŸẎỴỲỶȲẙỸ] | Y |
| [ŹŽẐŻẒẔ] | Z |
+----------------------------------------+------------+

I’ve equalized A = Æ and O = Ø for the purposes of simplicity, don’t get angry.

You may want to use the codes of diacritical symbols instead of the actual symbols (e.g. \u00c4 instead of Ä). However, keep in mind the kilometers of code that you will need to write (and the different codes for small/capital letters, just in case).

You may ask: in the previous step we capitalized all the letters, so why did we add some lowercase letters? It is because some of these snitches do not have capital equivalents, so uppercasing will not always work. Feel free to check it yourself and remove unnecessary lowercase letters from the regular expression where possible.

Keep in mind: this whole problem of replacing symbols can sometimes be fixed by applying a different encoding (various options are available depending on the operating system, programming language, ETL tool, etc.). However, this option does not seem reliable to me, as you would have to be sure that every combining diacritical sign has an equivalent among the basic Latin characters.
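Here is a minimal sketch of the regex-replacement approach in Python; the character classes are a shortened, hypothetical subset of the table above and should be extended to the full list:

```python
import re

# Shortened subset of the table above; extend each class with the remaining characters.
DIACRITICS = [
    (r"[ÁÂÄÀÅÃÆ]", "A"),
    (r"[ÉÊËÈ]", "E"),
    (r"[ÍÎÏÌ]", "I"),
    (r"[ÓÔÖÒÕØ]", "O"),
    (r"[ÚÛÜÙ]", "U"),
    (r"[Ñ]", "N"),
    (r"[Ç]", "C"),
]

def strip_diacritics(name):
    for pattern, latin in DIACRITICS:
        name = re.sub(pattern, latin, name)
    return name

print(strip_diacritics("MÜLLER & CÍA"))  # MULLER & CIA
```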

A deeper hell: Corporate extensions

Photo by NeONBRAND on Unsplash

The next thing you may want to do (not really want, but should) is to normalize corporate extensions. What I mean is that “SOCIETE ANONYME” has to become “SA”, “LIMITED LIABILITY COMPANY” -> “LLC”, etc. This will prevent us from failing to match obviously equal values. It can be done with a regular replace, just like the previous step. Below you can see the list of possible corporate extensions, feel free to incorporate it into your code! It is long but useful. I prepared the list for usage in an ETL tool; however, if you are working in a programming language, you may want to use dictionaries instead.

Since I was doing this with regex, I put “\b” (a word boundary) at the beginning and at the end of each expression, plus “(\s{1,}|)” between the words, to cover cases where there is either one or more spaces, or nothing at all, between the corporate extension words. Depending on your views, you can delete “\b” (if you are afraid of merged words in your input files/flows); deleting “(\s{1,}|)” is not recommended.
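A minimal sketch of the same replace logic in Python, with a hypothetical three-entry excerpt of the extension list (the real list is far longer):

```python
import re

# Hypothetical excerpt; each pattern uses \b word boundaries and (\s{1,}|) between words.
EXTENSIONS = [
    (r"\bLIMITED(\s{1,}|)LIABILITY(\s{1,}|)COMPANY\b", "LLC"),
    (r"\bSOCIETE(\s{1,}|)ANONYME\b", "SA"),
    (r"\bPUBLIC(\s{1,}|)LIMITED(\s{1,}|)COMPANY\b", "PLC"),
]

def normalize_extensions(name):
    for pattern, short in EXTENSIONS:
        name = re.sub(pattern, short, name)
    return name

print(normalize_extensions("ACME LIMITED LIABILITY COMPANY"))  # ACME LLC
```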

Step four: Frequently used words

Photo by Silas Baisch on Unsplash

These will add nothing but distraction to your matching algorithm. However, I found it useful to delete only the two most frequent words: “THE” and “AND”. From my perspective, deleting other common words can harm the company name.
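A one-line sketch of the same idea in Python (hypothetical example value):

```python
import re

name = "THE ACME TRADING AND SHIPPING LLC"   # hypothetical example
name = re.sub(r"\b(THE|AND)\b", "", name)    # drop the two filler words
print(name)  # ' ACME TRADING  SHIPPING LLC' -- leftover spaces are removed in the last step
```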

Last but not least: symbols (and digits)

Photo by Adam Birkett on Unsplash

These are a source of chaos and confusion for matching. You do not need them to assess whether strings are equal, and the same goes for spaces. Deleting all of them is nice and easy with regular expressions: find everything that matches the regex “\W|\d” and replace it with nothing.

Our data normalization is generally finished; the second challenge is to handle everything that did not match at this point (checking similarity).

You may want to add an intermediary step to remove double (or more) spaces, for whatever reason. Again, a regular expression will help: replace the regex “\s{2,}” with “ ” (a single space).

Example: https://regex101.com/r/RNhEqe/1
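A minimal sketch of both replacements in Python (hypothetical example value):

```python
import re

name = "ACME  TRADING  S.A. (2019)"    # hypothetical example
cleaned = re.sub(r"\W|\d", "", name)   # drop non-word characters, digits and spaces
print(cleaned)  # ACMETRADINGSA

# the optional intermediary step: collapse runs of whitespace into a single space
spaced = re.sub(r"\s{2,}", " ", name)
print(spaced)  # ACME TRADING S.A. (2019)
```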

Keep in mind that you should not worry about matching pairs with “AND” versus ampersand “&”, because “AND” will be deleted as an unnecessary word and “&” will be eliminated in the last step as a non-word character.

Some of the items will match anyway after the 1st, 2nd, 3rd, etc. steps; hence, if you think it will speed up your system, feel free to run the match after each step and continue only with the items that have not matched yet.

Obviously, your field of interest may not be matching company names but something else entirely, so if you have ever wondered how to strip suffixes, check out the Porter Stemming Algorithm; it is very interesting.

Link to original Porter’s article:

https://tartarus.org/martin/PorterStemmer/def.txt

Link to enhancement:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136.7185&rep=rep1&type=pdf

Following data normalization, I bet there are still non-matching leftovers

Photo by Rainy Wong on Unsplash

Company names with typos, totally different names, similar names, names in a local language, etc. As people say, it takes two to tango: we need the second step, data similarity measurement.

When I started to struggle with these, I was sure I was alone. Fortunately, after just a few minutes of googling I found out that people have been fighting this problem for about a century, so there are some nice practices already developed. There is no cure-all, and it is your call what to apply to your specific case. Before jumping into what I applied, it is worth saying that there are multiple libraries in different languages for fuzzy matching (e.g. FuzzyWuzzy in Python), so you can use them and skip what follows. However, if you want to get a bit of the flavor, continue reading!

Soundex

This nice algorithm exists across multiple languages, libraries, and tools. It transforms a string value into a series of characters depending on how it sounds in English. In case you do not have this function in your arsenal and cannot get a library (I see no reason why, but still), you can replicate the enhanced Soundex yourself (it is a bit different from the classic one, and I like it more):

  1. Leave the first letter aside
     >> M-Margaret : M-Margerette
  2. Delete h and w if not the first letter
     >> M-Margaret : M-Margerette
  3. Replace b, p by 1
     >> M-Margaret : M-Margerette
  4. Replace f, v by 2
     >> M-Margaret : M-Margerette
  5. Replace c, k, s by 3
     >> M-Margaret : M-Margerette
  6. Replace g, j by 4
     >> M-Mar4aret : M-Mar4erette
  7. Replace q, x, z by 5
     >> M-Mar4aret : M-Mar4erette
  8. Replace d, t by 6
     >> M-Mar4are6 : M-Mar4ere66e
  9. Replace l by 7
     >> M-Mar4are6 : M-Mar4ere66e
  10. Replace m, n by 8
     >> M-8ar4are6 : M-8ar4ere66e
  11. Replace r by 9
     >> M-8a94a9e6 : M-8a94e9e66e
  12. If you see the same consecutive numbers, remove the duplicates, i.e. a22 becomes a2
     >> M-8a94a9e6 : M-8a94e9e6e
  13. Delete a, e, i, o, u, y if not the first letter
     >> M-89496 : M-89496
  14. Replace the first symbol by the letter set aside in the first step (initially the first letter)
     >> M9496 : M9496
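If you want to replicate it yourself, here is a minimal sketch in Python of the steps above (my own rendition, not a library function):

```python
import re

def refined_soundex(name):
    """Minimal sketch of the enhanced Soundex steps described above."""
    s = name.upper()
    first, rest = s[0], s[1:]             # step 1: set the first letter aside
    rest = re.sub(r"[HW]", "", rest)      # step 2: drop H and W outside the first position
    codes = {**{c: "1" for c in "BP"},
             **{c: "2" for c in "FV"},
             **{c: "3" for c in "CKS"},
             **{c: "4" for c in "GJ"},
             **{c: "5" for c in "QXZ"},
             **{c: "6" for c in "DT"},
             "L": "7",
             **{c: "8" for c in "MN"},
             "R": "9"}
    encoded = first + "".join(codes.get(ch, ch) for ch in rest)   # steps 3-11
    encoded = re.sub(r"(\d)\1+", r"\1", encoded)                  # step 12: collapse repeated digits
    encoded = encoded[0] + re.sub(r"[AEIOUY]", "", encoded[1:])   # step 13: drop vowels after position 1
    return encoded

print(refined_soundex("Margaret"), refined_soundex("Margerette"))  # M9496 M9496
```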

Although this version is much more accurate compared with the original Soundex, sometimes it is still too fuzzy, so use it carefully.

Recommendation:

Enhanced Soundex probably fits a data quality check with a common ID; however, it can be weak in the case of “blind matching”, without a common ID.

The original Soundex is rather for short string values, as the code is cut to 4 symbols, leaving a lot more room for unexpected and unneeded matches.

Note: there are multiple phonetic algorithms, and I am going to discuss them in a separate article; however, if you are trying to match names/surnames, you may want to check the NYSIIS algorithm instead of Soundex.

Levenshtein Distance

This method measures the distance between two words as the number of single-character edits needed to equalize them, e.g. “Art” and “Arc” have a Levenshtein distance of 1, because it takes changing one letter (t to c, or vice versa) to equalize these words.

I do not like this method much, precisely because of the above example. It does not really work on short strings, and in my case (company name reconciliation) that is quite crucial. However, it does help when it comes to mistakes, of course (swapped/skipped letters).
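For reference, a minimal sketch of the classic dynamic-programming implementation in Python:

```python
def levenshtein(a, b):
    """Minimal Levenshtein distance: the number of single-character edits between a and b."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("ART", "ARC"))  # 1
```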

Jaro/Jaro-Winkler

The absolute champion for me. Let’s start with Jaro. It is based on the idea that string similarity can be measured from the number of symbols in the strings, their common characters, and the number of out-of-order pairs of matching characters divided by 2. Below is the classic example, Martha against Marhta:

  1. The number of symbols in each string is 6 (s1 = s2 = 6)
  2. The number of matching characters is 6 (count the ones below): m-m, a-a, r-r, t-t, h-h, a-a (m = 6)
  3. However, the number of matched pairs that are out of order is 2: the 4th pair t-h and the 5th pair h-t (the off-diagonal ones in the matrix below), so the number of transpositions is t = 2/2 = 1
+---+---+---+---+---+---+---+
| | M | A | R | H | T | A |
+---+---+---+---+---+---+---+
| M | 1 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 0 | 0 | 0 | 0 |
| R | 0 | 0 | 1 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 1 | 0 |
| H | 0 | 0 | 0 | 1 | 0 | 0 |
| A | 0 | 0 | 0 | 0 | 0 | 1 |
+---+---+---+---+---+---+---+

The formula for the Jaro distance is the following:

dj = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)

where m is the number of matching characters and t is the number of transpositions (the out-of-order pairs divided by 2). Plugging in our example:

dj = (6/6 + 6/6 + (6 − 1)/6)/3, resulting in 0.94 or 94%.

Warning: when you compare strings that differ significantly in length, you may well get fewer matching characters than transpositions (m < t), an equal amount of both (m = t), or even no matching characters at all (m = 0). This can result in a negative dj in the first case, and in no result at all when m = 0 (do not try to divide by zero). Since these are extreme outliers, unlikely to be similar, I simply converted such scores to 0 at the end.
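A minimal sketch of the Jaro distance in Python (my own reading of the definition above, with the m = 0 case converted to 0 as just described):

```python
def jaro(s1, s2):
    """Minimal Jaro distance; returns 0 when there are no matching characters."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(0, max(len1, len2) // 2 - 1)   # how far apart two characters may still "match"
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, ch in enumerate(s1):                 # count matching characters within the window
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:                                  # no common characters: treat the score as 0
        return 0.0
    out_of_order = 0                            # count matched pairs that are in a different order
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro("MARTHA", "MARHTA"), 2))  # 0.94
```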

Further, to give more weight to items that are more likely to match, Winkler introduced his adjustment to the existing Jaro method. According to William Winkler, typos tend to appear after the 4th symbol, leaving a common prefix. Based on that, we need to calculate the number of symbols in the common prefix (L): for Martha vs Marhta the common prefix is “Mar”, which consists of 3 characters, so L = 3. To calculate the Jaro-Winkler distance, we also need to set an adjustment factor p, which should be greater than 0 and not more than 0.25 (because we will multiply it by L, which should not be greater than 4, and the score must not exceed 1). The standard value is 0.1.

The Jaro-Winkler distance is then:

djw = dj + L × p × (1 − dj)

djw = 0.94 + 3 × 0.1 × (1 − 0.94) = 0.96 or 96%
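Applied on top of the Jaro score from the previous sketch (dj, the common-prefix length L, and p are the only inputs):

```python
def jaro_winkler(dj, prefix_len, p=0.1):
    """Winkler boost applied to a precomputed Jaro distance dj; the prefix is capped at 4 symbols."""
    return dj + min(prefix_len, 4) * p * (1 - dj)

print(round(jaro_winkler(0.944, 3), 2))  # 0.96 for the MARTHA / MARHTA example
```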

So, that’s it, the matching score is ready. Results are quite impressive, btw!

Note: Jaro-Winkler is a good fit for fuzzy matching; however, if you need to perform a strict data quality check, better stay with Jaro.

Hint: equal strings with swapped words

Suppose you are matching addresses or anything else that may have swapped words/elements, e.g.

Via del Corso : Corso Via del

Some ETL tools (like mine) do not allow splitting a string and matching it word by word, or at least it would be too heavy. While playing with the Jaro distance method, I realized that when I compare strings under the same ID, there is an unbelievably low probability of facing two strings that have the same length and the same count of each particular letter and yet are totally different.

Via del Corso : Corso Via del

Length:
length1 = 13, length2 = 13

Number of letters for both:
v = 1, i = 1, a = 1, d = 1, e = 1, l = 1, c = 1, o = 2, r = 1, s = 1

So, I matched such pairs as well!
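A minimal sketch of that shortcut in Python, comparing lengths and per-letter counts (on already-normalized values):

```python
from collections import Counter

def same_letters(s1, s2):
    """True when both strings have the same length and the same count of every character."""
    return len(s1) == len(s2) and Counter(s1) == Counter(s2)

print(same_letters("VIADELCORSO", "CORSOVIADEL"))  # True: treated as a match under a common ID
```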

Other methods did not attract much of my attention; however, here are some useful sources:

Great summary with a use case by Coralie Collignon, Appaloosa: Trigram, Levenshtein, Cosine, Jaro-Winkler

FuzzyWuzzy in Python, an article by Susan Li

I’m going to post another summary, fully devoted to some other phonetic algorithms, namely NYSIIS, Soundex, Daitch-Mokotoff, Metaphone/Double Metaphone, Caverphone, and others. Stay tuned!
