Phonetics based Fuzzy string matching algorithms

A dive in Soundex & Metaphone

Mehul Gupta
Data Science in your pocket
7 min readSep 8, 2020

--

Carrying on where I left off in my last post after exploring some spelling-based fuzzy match algorithms, it's time to shift our focus to pronunciation-based fuzzy matching.

Before moving on, we must know a library in python i.e. fuzzywuzzy which internally uses Levenstein Distance to calculate the similarity between 2 strings on a scale of 0–100, the higher the value, the more similar the strings with 100 as exact match.

My debut book “LangChain in your Pocket” is out now !!

But in the 1st place,

  • What is the requirement of phonetics-based fuzzy matching?
  • Do spelling-based fuzz match algorithms have some drawbacks?

Consider the below cases

CASE 1

  • Should ‘Mehul’ & ‘Mahul’ map to the same individual given the same ID (let every family in the world be given an ID)?

Yes, as spelling mistakes are common in user data & it appears to be one. Observe what fuzzy-wuzzy returns

Not bad!!

But can we make it 100 i.e an exact match? will see in a short span

CASE 2

  • Should ‘Mehul Gupta’ & ‘Mehul’ be considered the same if they are mapped to the same ID (let every family be given an ID as above)?

The probability is definitely 90% if not 100% that they represent the same individual as different surnames in the same family are rare. Let's have a look at what fuzzy-wuzzy returns for this case

Just 62 !!!

This has to be around 90 for sure.

Hence, incomplete names can give us some headaches.

CASE 3

Also, what about these strings ‘Mehulllllllllllllllll’, ‘Mehul’? They are surely the same person’s name.

Oops !! just 53

So, you see, there exist some cases where spelling matches can be disastrous !!

Can these strings be first transformed into a standard pattern before going for Levenstein distance? this is what phonetics-based matching is all about. Before moving on, it must be kept in mind that the below algorithms are completely rule-based & discussing those rules won’t make much sense (as there are numerous rules for any algorithm being used) but will be discussing the basis for the rules. So let’s get started !!

Soundex

Soundex is amongst the early algorithms designed for phonetics-based matching which is still used in US Census. Basically what it does is it generates a 4-character code (like G123) for any string:

  • 1st character is the 1st character of the string. Like ‘A’ in apple
  • The remaining 3 characters are digits dependent on the next 3 sounds of the string. How? using rules
  • Let’s test it for our above cases

Case 1: Minor spelling errors (‘Mehul’ & ‘Mahul’)

Resolved easily !!

Case 2: Gibberish text by the end

Yeah!!

Now, fuzz.ratio() will return 100 as we found an exact match. One problem is solved.

Case 3: Incomplete names

As we can see, the fuzz.ratio() improved (75) as compared to the original(67) for the newly generated encodings. Hence a better option to opt for.

But how Soundex is generating these encodings?

Soundex considers just the 1st 4 sounds the string has.

  • Consider ‘Gaurav Gupta’ from the above screenshots. It has 6 different sounds (‘gau’,’r’,’v’,’gu’,’p’,’ta’) but Soundex will consider the 1st 4 which are: gau, r, v gu & give us a code for the string.
  • Similarly, ‘Gaurav’ has 3 sounds that are ‘gau’,’r’,’v’ hence the last digit goes 0 (in G610) as the 4th sound is missing.
  • So, 3 out of 4 sounds match for the 2 strings & hence the codes generated are pretty similar (G612 & G610) & differ in the last character that represents the 4th sound.

So far so good !!

But are there any drawbacks to considering just the 1st 4 sounds?

Many !!

As ‘Muralitharan’ has 6 sounds: Mu, R, Li, Th, R, N. just the 1st 4 help in forming the code hence whatever comes after ‘Muralith’ gets ignored which can put us in situations at times like in the below case where the same 1st name with different surname would generate same code

Hence, identifying different individuals as the same.

Also, as the encodings generated has just 4 characters, hence a possibility of 26(A-Z) X 10 (0–9) X 10 (0–9) X 10 (0–9)=26000 different patterns which can be very small when the pool of strings/name is vast.

Therefore if we have a pool of say 100k names, they all will get transformed to one of the 26k patterns & hence many collisions will be observed & different names will map to the same pattern which might not be representing the same individual. So a few points before wrapping up.

  • Soundex is pretty good with short names/strings in terms of length or where just first names are crucial. Like if we have a family ID grouping all family members together, you shouldn’t worry about surnames most of the time & hence Soundex is a definite savior.
  • It should only be used when a group's pool of possible names is small. Like, identifying an individual using a family can be easy but identifying individuals using city or state as ID will be troublesome due to a vast pool & hence collisions in the 4 character encoding generated.

Metaphone

Metaphone came into existence much after Soundex. It, similar to Soundex, converts any string in an encoding depending on the sounds present & outputs an all-alphabet code. The major advantage that Metaphone has over Soundex is it

  • Considers the entire string while generating code for a string & is not like Soundex which considers only 1st few sounds.
  • Code length has no restriction, hence even a large pool of words can be standardized without many collisions.

How are the codes for a string generated? using a newer set of rules than Soundex. Let’s observe its performance on a few edge cases

Minor spelling mistakes

Bingo!!

The case where Soundex is struggling is that it maps different names like ‘Gaurav Dwivedi’, ‘Gaurav Deshmukh’,’ Gaurav Dravid’ as the same. Metaphone solves this as well

Metaphone generates different encodings where 1st name is the same but the surname is different. But are there some issues with Metaphone?

Yes,

As it considers the entire string for generating codes,

  • Struggles with Gibberish tails in a string/name as compared to Soundex but is better than just fuzz.ratio()
  • Struggles with incomplete names compared to Soundex

Metaphone has two upgraded versions as well:

  1. DoubleMetaphone: Produces 2 encodings given a string, one major & another minor.
  2. Metaphone3 isn’t open source hence a very little knowledge of this.

One more thing, both these algorithms are disastrous with characters other than alphabets. So

  • Metaphone returns an empty encoding if all characters are numbers or special characters else if just some of the characters are not an alphabet, they will be ignored completely
  • Soundex ignores such characters if they aren’t the 1st character of the string else the 1st character of the encoding is that special character.

Now as we have gone through both Soundex & Metaphone both, which is better is a bit difficult to say & completely depends on your data.

For eg, I have been working with merging names given a family ID. Hence, as it is rare to have different surnames within an Indian family with the 1st name common (like Raj Gupta & Raj Mishra in the same family is a rare scenario), Soundex has a big edge over Metaphone. Though, Metaphone can be good with similar names within a city or state where consideration of full names is a must.

--

--