Phonetic Matching Algorithms

Szulicki
Apr 12, 2019

Let’s jump straight to the issue: you have a business task to match similar string values. Phonetic matching algorithms can help you with this, but with certain limitations: they rely on pronunciation and are not designed to give you a matching score. Hence, what you need to do is apply an algorithm first and then work out some matching score for the leftovers (the items without a 100% match). This summary covers phonetic matching algorithms only. It is worth mentioning that phonetic algorithms are mostly relevant for matching names and surnames, because this is where pronunciation problems appear; if you need to match other attributes, it may be worth checking out other methods.
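To make the "apply an algorithm, then score the leftovers" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `crude_key` is a deliberately simplistic stand-in and not one of the real phonetic algorithms covered here, `two_stage_match` is a hypothetical helper name, and the leftover score comes from the standard library's `difflib.SequenceMatcher`, which is only one of many possible similarity measures.

```python
from difflib import SequenceMatcher


def crude_key(name):
    """Stand-in encoder (NOT a real phonetic algorithm): keep the first
    letter, drop vowels plus y/h/w, and collapse repeated consonants."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "aeiouyhw":
            continue
        if c != key[-1]:
            key.append(c)
    return "".join(key).upper()


def two_stage_match(names, candidates, encode=crude_key):
    """Stage 1: exact match on the phonetic key.
    Stage 2: similarity score for names that found no key match."""
    by_key = {}
    for cand in candidates:
        by_key.setdefault(encode(cand), []).append(cand)

    results = {}
    for name in names:
        hits = by_key.get(encode(name))
        if hits:
            results[name] = (hits[0], 1.0)  # treated as a 100% match
        else:
            scored = [
                (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
                for c in candidates
            ]
            score, best = max(scored, default=(0.0, None))
            results[name] = (best, round(score, 2))
    return results


matches = two_stage_match(["Smith", "Katherine"],
                          ["Smyth", "Catherine", "Johnson"])
print(matches)
```

In this toy run "Smith" finds "Smyth" through the key alone, while "Katherine" shares no key with any candidate and falls through to the scoring stage. In practice you would pass a real encoder as `encode` and pick a threshold below which a scored pair is rejected.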

Soundex

Created by Robert C. Russell and Margaret King Odell in 1918, this algorithm was intended to match names and surnames based on the basic rules of English pronunciation, so similar-sounding names get the same value. It is quite basic. I will present the generalized schema of the original Soundex and of its enhanced version (both are easy-peasy to implement).

The original Soundex
The slightly enhanced version that gives better results
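For readers who prefer code to diagrams, below is a minimal sketch of the commonly documented American Soundex variant. It is an assumption that this corresponds to the schemas referred to above; details such as the handling of h and w differ between Soundex variants, and the function name `soundex` is just my choice.

```python
import re

# Letter groups that share a digit in the commonly documented American Soundex.
SOUNDEX_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return a 4-character code: first letter + up to three digits."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return ""
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            digits.append(digit)
        prev = digit  # a vowel resets prev, so a repeated digit is allowed
    return (letters[0].upper() + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both print R163
```

"Robert" and "Rupert" end up under the same code, which is exactly the kind of match the algorithm is meant to produce.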

New York State Identification and Intelligence System Phonetic Code (NYSIIS)

This one was developed in 1970 by the New York State Identification and Intelligence System (surprise, surprise). The idea is the same as for Soundex: if there are homophones, you match them by assigning specific codes to particular sounds. The good thing about its results is that it is more accurate than Soundex, as it returns fewer surnames under the same code.

Metaphone

Developed by Lawrence Philips in 1990, Metaphone is also more accurate than Soundex, as it takes groups of letters into consideration. The disadvantage shows up when you apply it to reconcile strings that are not in English, as it is based on the rules of English pronunciation. The logic is the following:

Following Metaphone, Philips also designed Double Metaphone. As its name suggests, it returns two codes, so you have more chances to match the items; at the same time, it means a higher probability of an error. According to the algorithm, there are three matching levels: primary key against primary key is the strongest match, secondary key against primary key is a normal match, and secondary key against secondary key is the weakest match.

Some time later (9 years, to be precise), Philips introduced his gem, Metaphone 3: an extended matching algorithm that, like its previous versions, is mostly intended to match personal information such as names and surnames, but with a significantly higher degree of precision. What’s wonderful is that it is adjusted to various language families (endings are hardcoded into the code), which allows it to match most European surnames (including, probably, the toughest ones: the Slavic ones). You can familiarize yourself with the code: it is very detailed, beautiful, and not free to use. I would highly recommend checking the code to everyone who has at least some interest in string data matching. By reading it you can follow the author’s logic, and that may lead you to the most fascinating discoveries, straight to the resolution of your business problem!

https://searchcode.com/codesearch/raw/2366000/
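Going back to the three Double Metaphone matching levels, they are easy to express in code once you have the two keys for each name. The sketch below does not compute the keys; it assumes you already got a (primary, secondary) pair from some Double Metaphone implementation. The function name `match_level` and the hard-coded key pairs in the usage lines are illustrative assumptions, so check the keys against whatever library you use.

```python
def match_level(keys_a, keys_b):
    """Compare two (primary, secondary) key pairs.

    Returns "strongest", "normal", "weakest", or None when nothing matches.
    An empty or None secondary key is treated as "no secondary key".
    """
    primary_a, secondary_a = keys_a
    primary_b, secondary_b = keys_b

    if primary_a and primary_a == primary_b:
        return "strongest"   # primary against primary
    if (secondary_a and secondary_a == primary_b) or \
       (secondary_b and secondary_b == primary_a):
        return "normal"      # secondary against primary
    if secondary_a and secondary_a == secondary_b:
        return "weakest"     # secondary against secondary
    return None


# Illustrative key pairs, hard-coded rather than computed.
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")
print(match_level(smith, schmidt))  # prints: normal
```

Checking the levels in this order means a pair is always reported at its strongest level, which is handy if you later want to turn the levels into a rough score.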

Caverphone

If you are from New Zealand, or you need to match names from there, you may want to pursue this algorithm.

Table 1:
+------+-------+-------+-------+--------+----+----+
| From | cough | rough | tough | enough | gn | mb |
+------+-------+-------+-------+--------+----+----+
| To   | cou2f | rou2f | tou2f | enou2f | 2n | m2 |
+------+-------+-------+-------+--------+----+----+

Table 2:
+------+----+----+----+----+-----+---+---+---+---+
| From | cq | ci | ce | cy | tch | c | q | x | v |
+------+----+----+----+----+-----+---+---+---+---+
| To   | 2q | si | se | sy | 2ch | k | k | k | f |
+------+----+----+----+----+-----+---+---+---+---+
+------+----+-----+-----+---+----+---+----+---+
| From | dg | tio | tia | d | ph | b | sh | z |
+------+----+-----+-----+---+----+---+----+---+
| To   | 2g | sio | sia | t | fh | p | s2 | s |
+------+----+-----+-----+---+----+---+----+---+

Table 3:
+------+---+-----+----+---+------+----+
| From | j | ^y3 | ^y | y | 3gh3 | gh |
+------+---+-----+----+---+------+----+
| To   | y | Y3  | A  | 3 | 3kh3 | 22 |
+------+---+-----+----+---+------+----+
+------+---+----+----+----+----+----+----+
| From | g | s+ | t+ | p+ | k+ | f+ | m+ |
+------+---+----+----+----+----+----+----+
| To   | k | S  | T  | P  | K  | F  | M  |
+------+---+----+----+----+----+----+----+
+------+----+----+-----+----+---+----+
| From | n+ | w3 | wh3 | w$ | w | ^h |
+------+----+----+-----+----+---+----+
| To   | N  | W3 | Wh3 | 3  | 2 | A  |
+------+----+----+-----+----+---+----+
+------+---+----+----+---+----+----+---+
| From | h | r3 | r$ | r | l3 | l$ | l |
+------+---+----+----+---+----+----+---+
| To   | 2 | R3 | 3  | 2 | L3 | 3  | 2 |
+------+---+----+----+---+----+----+---+

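* ^ indicates the beginning of the string, $ the end of the string, and s+ a run of adjacent identical letters. The digits 2 and 3 are temporary placeholders used by Caverphone: 3 stands for a vowel and 2 for a letter that is dropped later.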
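The substitution tables above translate almost directly into code. Below is a minimal sketch that applies them in order. Note what is assumed rather than taken from the tables: the steps marked "assumed" in the comments (dropping a final e, replacing vowels with 3, removing the 2 and 3 placeholders, padding with 1 to ten characters) and the anchoring of the Table 1 rules to the start or end of the name come from my reading of the published Caverphone 2.0 description. Treat it as an illustration, not a reference implementation.

```python
import re

# Table 1: rules for the start of the name (anchoring is an assumption).
START_RULES = [("cough", "cou2f"), ("rough", "rou2f"), ("tough", "tou2f"),
               ("enough", "enou2f"), ("gn", "2n")]

# Table 2: plain substitutions applied anywhere, in this order.
TABLE_2 = [("cq", "2q"), ("ci", "si"), ("ce", "se"), ("cy", "sy"),
           ("tch", "2ch"), ("c", "k"), ("q", "k"), ("x", "k"), ("v", "f"),
           ("dg", "2g"), ("tio", "sio"), ("tia", "sia"), ("d", "t"),
           ("ph", "fh"), ("b", "p"), ("sh", "s2"), ("z", "s")]

# Table 3: regular-expression substitutions, in this order.
TABLE_3 = [("j", "y"), ("^y3", "Y3"), ("^y", "A"), ("y", "3"),
           ("3gh3", "3kh3"), ("gh", "22"), ("g", "k"),
           ("s+", "S"), ("t+", "T"), ("p+", "P"), ("k+", "K"),
           ("f+", "F"), ("m+", "M"), ("n+", "N"),
           ("w3", "W3"), ("wh3", "Wh3"), ("w$", "3"), ("w", "2"),
           ("^h", "A"), ("h", "2"),
           ("r3", "R3"), ("r$", "3"), ("r", "2"),
           ("l3", "L3"), ("l$", "3"), ("l", "2")]


def caverphone(name):
    code = re.sub(r"[^a-z]", "", name.lower())  # assumed: letters only
    code = re.sub(r"e$", "", code)              # assumed: drop a final e
    for src, dst in START_RULES:
        if code.startswith(src):
            code = dst + code[len(src):]
    code = re.sub(r"mb$", "m2", code)           # Table 1: mb at the end
    for src, dst in TABLE_2:
        code = code.replace(src, dst)
    code = re.sub(r"^[aeiou]", "A", code)       # assumed: initial vowel -> A
    code = re.sub(r"[aeiou]", "3", code)        # assumed: other vowels -> 3
    for pattern, dst in TABLE_3:
        code = re.sub(pattern, dst, code)
    code = code.replace("2", "")                # assumed: drop placeholder 2
    code = re.sub(r"3$", "A", code)             # assumed: final 3 -> A
    code = code.replace("3", "")                # assumed: drop remaining 3s
    return (code + "1" * 10)[:10]               # assumed: pad to 10 chars


print(caverphone("Stevenson"), caverphone("Stephenson"))
```

Both spellings come out as the same code, because v becomes f, ph becomes fh, and the h is later dropped.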

Beider-Morse Phonetic Matching System

As stated by the authors, it is intended to solve the problem of a large number of irrelevant matches by taking into account the language in which the name is written and applying language-specific pronunciation rules. It is hard to explain even part of the rules here, as there are 200 of them. However, the main idea is that you first identify the language based on rules. Following that, the words are translated into phonetic tokens. After the language-specific rules, common rules are applied.

See the full description here:

Conclusion

If you need to construct/use a fairly easy algorithm, go for Soundex (still, probably the enhanced one), NYSIIS, or Metaphone/Double Metaphone, but expect weaker results. The same is true for specific cases: Caverphone for New Zealand, Daitch-Mokotoff Soundex for Slavic names, Cologne Phonetics for German.

If you are looking for more sophisticated logic, the best choice is probably either Beider-Morse Phonetic Matching (free of charge, sexy) or Metaphone 3 (not so free of charge, sexier).

Almost all existing algorithms in one library (Python)

A useful resource in Russian (including 2 more algorithms, for matching Russian names)

Soundex, NYSIIS, Double Metaphone in Python:

NYSIIS in C++, Java, Python:

NYSIIS library in Python:
