Soundex and Levenshtein distance in Python

Yash Agarwal
5 min readMay 2, 2018

--

Search. Photo by Annie Theby

What is Soundex? What is Levenshtein distance?
These are algorithms for comparing two strings. And often used to implement fuzzy search.
Soundex is a phonetic algorithm and is based on how close two words are depending on their English pronunciation while Levenshtein measure the difference between two written words.

Depending on your use case you can chose between Soundex or Levenshtein or other algorithms like Needleman–Wunsch algorithm or Metaphone. For the scope of this article we will limit ourselves to learning about Soundex and Levenshtein and how to implement them.

How does Soundex work?

Soundex works by converting your input string to a ‘4’ or more character output which can be compared to soundex value calculated for the other string.

Algorithm as defined in Wikipedia -

Step 1- Save the first letter. Remove all occurrences of a, e, i, o, u, y, h, w.
Step 2- Replace all consonants (include the first letter) with digits.
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
Step 3- Replace all adjacent same digits with one digit.
Step 4- If the saved letter’s digit is the same the resulting first digit, remove the digit (keep the letter).
Step 5- Append 3 zeros if result contains less than 3 digits. Remove all except first letter and 3 digits after it.

Soundex — https://en.wikipedia.org/wiki/Soundex

As examples, both “Robert” and “Rupert” yield “R163”.

How does Levenshtein distance work?

Levenshtein distances calculates the number of steps you need to take in order to reach from one string to another. These steps can be either addition, deletion or modification of character.

For e.g.
word1 = “Greatness”
word2 = “Graetnes”

“Greatness” → “Greetnes”→“Greatnes” → “Greatness”
So the Levenshtein distance is 3.

Levenshtein Distance — https://en.wikipedia.org/wiki/Levenshtein_distance

Soundex Implementation

If you would like to use a library for calculating Soundex you can go with https://pypi.org/project/Fuzzy/ . If you would like to check out implementation follow the code ahead.

def soundex(query: str):
"""
https://en.wikipedia.org/wiki/Soundex
:param query:
:return:
"""

# Step 0: Clean up the query string
query = query.lower()
letters = [char for char in query if char.isalpha()]

# Step 1: Save the first letter. Remove all occurrences of a, e, i, o, u, y, h, w.

# If query contains only 1 letter, return query+"000" (Refer step 5)
if len(query) == 1:
return query + "000"

to_remove = ('a', 'e', 'i', 'o', 'u', 'y', 'h', 'w')

first_letter = letters[0]
letters = letters[1:]
letters = [char for char in letters if char not in to_remove]

if len(letters) == 0:
return first_letter + "000"

# Step 2: Replace all consonants (include the first letter) with digits according to rules

to_replace = {('b', 'f', 'p', 'v'): 1, ('c', 'g', 'j', 'k', 'q', 's', 'x', 'z'): 2,
('d', 't'): 3, ('l',): 4, ('m', 'n'): 5, ('r',): 6}

first_letter = [value if first_letter else first_letter for group, value in to_replace.items()
if first_letter in group]
letters = [value if char else char
for char in letters
for group, value in to_replace.items()
if char in group]

# Step 3: Replace all adjacent same digits with one digit.
letters = [char for ind, char in enumerate(letters)
if (ind == len(letters) - 1 or (ind+1 < len(letters) and char != letters[ind+1]))]

# Step 4: If the saved letter’s digit is the same the resulting first digit, remove the digit (keep the letter)
if first_letter == letters[0]:
letters[0] = query[0]
else:
letters.insert(0, query[0])

# Step 5: Append 3 zeros if result contains less than 3 digits.
# Remove all except first letter and 3 digits after it.

first_letter = letters[0]
letters = letters[1:]

letters = [char for char in letters if isinstance(char, int)][0:3]

while len(letters) < 3:
letters.append(0)

letters.insert(0, first_letter)

string = "".join([str(l) for l in letters])

return string

Now you can calculate the Soundex of two words that you want to compare and figure out if the words sound similar.

Levenshtein distance Implementation

You can always use the library provided at https://pypi.org/project/python-Levenshtein/ . However if you want to learn more about implementing it, read ahead.

Let’s say you have two words -
word1 — lenght M
word2 — length N

First start by constructing a (M+1) x (N+1) matrix with all 0’s . Then set the first row and first column to have values from 0 to M and 0 to N.

def get_levenshtein_distance(word1, word2):
"""
https://en.wikipedia.org/wiki/Levenshtein_distance
:param word1:
:param word2:
:return:
"""
word2 = word2.lower()
word1 = word1.lower()
matrix = [[0 for x in range(len(word2) + 1)] for x in range(len(word1) + 1)]

for x in range(len(word1) + 1):
matrix[x][0] = x
for y in range(len(word2) + 1):
matrix[0][y] = y

print(matrix)

# TODO: Remaining logic

At this point your matrix will look this —

>> get_levenshtein_distance("great", "grae")
[[0, 1, 2, 3, 4],
[1, 0, 0, 0, 0],
[2, 0, 0, 0, 0],
[3, 0, 0, 0, 0],
[4, 0, 0, 0, 0],
[5, 0, 0, 0, 0]]

Now, the next step is to fill in the remaining values. The logic for this is

  • Finding the minimum of three values.
  • Two of those values are the the value on left of the current index and value on top of the current index.
  • The third value will be dependent on whether the characters at current index match for both the words.
  • If the characters match, then the third value is the immediate top-left index in the matrix.
  • If the character do not match, then the third value is the immediate top-left index in the matrix + 1.
  • Finally return the value at bottom-right corner index
def get_levenshtein_distance(word1, word2):
"""
https://en.wikipedia.org/wiki/Levenshtein_distance
:param word1:
:param word2:
:return:
"""
word2 = word2.lower()
word1 = word1.lower()
matrix = [[0 for x in range(len(word2) + 1)] for x in range(len(word1) + 1)]

for x in range(len(word1) + 1):
matrix[x][0] = x
for y in range(len(word2) + 1):
matrix[0][y] = y

for x in range(1, len(word1) + 1):
for y in range(1, len(word2) + 1):
if word1[x - 1] == word2[y - 1]:
matrix[x][y] = min(
matrix[x - 1][y] + 1,
matrix[x - 1][y - 1],
matrix[x][y - 1] + 1
)
else:
matrix[x][y] = min(
matrix[x - 1][y] + 1,
matrix[x - 1][y - 1] + 1,
matrix[x][y - 1] + 1
)

return matrix[len(word1)][len(word2)]

E.G. —

>> get_levenshtein_distance("greatness", "graetnes")
3

Conclusion

Hopefully this clarifies any confusion around Soundex and Levenshtein distance and their implementations. If you need any other clarifications do leave a comment. Thanks for reading as always!!!

--

--