# Soundex and Levenshtein distance in Python

What is Soundex? What is Levenshtein distance?

These are algorithms for comparing two strings. And often used to implement fuzzy search.**Soundex** is a phonetic algorithm and is based on how close two words are depending on their English pronunciation while **Levenshtein** measure the difference between two written words.

Depending on your use case you can chose between Soundex or Levenshtein or other algorithms like **Needleman–Wunsch algorithm or Metaphone. **For the scope of this article we will limit ourselves to learning about Soundex and Levenshtein and how to implement them.

# How does Soundex work?

Soundex works by converting your input string to a ‘4’ or more character output which can be compared to soundex value calculated for the other string.

Algorithm as defined in Wikipedia -

Step 1- Save the first letter. Remove all occurrences of a, e, i, o, u, y, h, w.

Step 2- Replace all consonants (include the first letter) with digits.

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2

d, t → 3

l → 4

m, n → 5

r → 6

Step 3- Replace all adjacent same digits with one digit.

Step 4- If the saved letter’s digit is the same the resulting first digit, remove the digit (keep the letter).

Step 5- Append 3 zeros if result contains less than 3 digits. Remove all except first letter and 3 digits after it.

Soundex — https://en.wikipedia.org/wiki/Soundex

As examples, both “Robert” and “Rupert” yield “R163”.

# How does Levenshtein distance work?

Levenshtein distances calculates the number of steps you need to take in order to reach from one string to another. These steps can be either addition, deletion or modification of character.

For e.g.

word1 = “Greatness”

word2 = “Graetnes”

“Greatness” → “Greetnes”→“Greatnes” → “Greatness”

So the Levenshtein distance is 3.

Levenshtein Distance — https://en.wikipedia.org/wiki/Levenshtein_distance

# Soundex Implementation

If you would like to use a library for calculating **Soundex** you can go with https://pypi.org/project/Fuzzy/ . If you would like to check out implementation follow the code ahead.

**def **soundex(query: str):

*"""*

https://en.wikipedia.org/wiki/Soundex

**:param*** query:*

**:return***:*

"""

# Step 0: Clean up the query string

query = query.lower()

letters = [char **for **char **in **query **if **char.isalpha()]

*# Step 1: Save the first letter. Remove all occurrences of a, e, i, o, u, y, h, w.*

# If query contains only 1 letter, return query+"000" (Refer step 5)

**if **len(query) == 1:

**return **query + **"000"**

to_remove = (**'a'**, **'e'**, **'i'**, **'o'**, **'u'**, **'y'**, **'h'**, **'w'**)

first_letter = letters[0]

letters = letters[1:]

letters = [char **for **char **in **letters **if **char **not in **to_remove]

**if **len(letters) == 0:

**return **first_letter + **"000"**

*# Step 2: Replace all consonants (include the first letter) with digits according to rules*

to_replace = {(**'b'**, **'f'**, **'p'**, **'v'**): 1, (**'c'**, **'g'**, **'j'**, **'k'**, **'q'**, **'s'**, **'x'**, **'z'**): 2,

(**'d'**, **'t'**): 3, (**'l'**,): 4, (**'m'**, **'n'**): 5, (**'r'**,): 6}

first_letter = [value **if **first_letter **else **first_letter **for **group, value **in **to_replace.items()

**if **first_letter **in **group]

letters = [value **if **char **else **char

**for **char **in **letters

**for **group, value **in **to_replace.items()

**if **char **in **group]

*# Step 3: Replace all adjacent same digits with one digit.*

letters = [char **for **ind, char **in **enumerate(letters)

**if **(ind == len(letters) - 1 **or **(ind+1 < len(letters) **and **char != letters[ind+1]))]

*# Step 4: If the saved letter’s digit is the same the resulting first digit, remove the digit (keep the letter)*

**if **first_letter == letters[0]:

letters[0] = query[0]

**else**:

letters.insert(0, query[0])

*# Step 5: Append 3 zeros if result contains less than 3 digits.*

# Remove all except first letter and 3 digits after it.

first_letter = letters[0]

letters = letters[1:]

letters = [char **for **char **in **letters **if **isinstance(char, int)][0:3]

**while **len(letters) < 3:

letters.append(0)

letters.insert(0, first_letter)

string = **""**.join([str(l) **for **l **in **letters])

**return **string

Now you can calculate the Soundex of two words that you want to compare and figure out if the words sound similar.

# Levenshtein distance Implementation

You can always use the library provided at https://pypi.org/project/python-Levenshtein/ . However if you want to learn more about implementing it, read ahead.

Let’s say you have two words -

word1 — lenght M

word2 — length N

First start by constructing a (M+1) x (N+1) matrix with all 0’s . Then set the first row and first column to have values from 0 to M and 0 to N.

**def **get_levenshtein_distance(word1, word2):

*"""*

https://en.wikipedia.org/wiki/Levenshtein_distance

**:param*** word1:*

**:param*** word2:*

**:return***:*

"""

word2 = word2.lower()

word1 = word1.lower()

matrix = [[0 **for **x **in **range(len(word2) + 1)] **for **x **in **range(len(word1) + 1)]

**for **x **in **range(len(word1) + 1):

matrix[x][0] = x

**for **y **in **range(len(word2) + 1):

matrix[0][y] = y

print(matrix)

*# **TODO: Remaining logic*

At this point your matrix will look this —

`>> get_levenshtein_distance("great", "grae")`

[[0, 1, 2, 3, 4],

[1, 0, 0, 0, 0],

[2, 0, 0, 0, 0],

[3, 0, 0, 0, 0],

[4, 0, 0, 0, 0],

[5, 0, 0, 0, 0]]

Now, the next step is to fill in the remaining values. The logic for this is

- Finding the minimum of three values.
- Two of those values are the the value on left of the current index and value on top of the current index.
- The third value will be dependent on whether the characters at current index match for both the words.
- If the characters match, then the third value is the immediate top-left index in the matrix.
- If the character do not match, then the third value is the immediate top-left index in the matrix + 1.
- Finally return the value at bottom-right corner index

**def **get_levenshtein_distance(word1, word2):

*"""*

https://en.wikipedia.org/wiki/Levenshtein_distance

**:param*** word1:*

**:param*** word2:*

**:return***:*

"""

word2 = word2.lower()

word1 = word1.lower()

matrix = [[0 **for **x **in **range(len(word2) + 1)] **for **x **in **range(len(word1) + 1)]

**for **x **in **range(len(word1) + 1):

matrix[x][0] = x

**for **y **in **range(len(word2) + 1):

matrix[0][y] = y

**for **x **in **range(1, len(word1) + 1):

**for **y **in **range(1, len(word2) + 1):

**if **word1[x - 1] == word2[y - 1]:

matrix[x][y] = min(

matrix[x - 1][y] + 1,

matrix[x - 1][y - 1],

matrix[x][y - 1] + 1

)

**else**:

matrix[x][y] = min(

matrix[x - 1][y] + 1,

matrix[x - 1][y - 1] + 1,

matrix[x][y - 1] + 1

)

**return **matrix[len(word1)][len(word2)]

E.G. —

`>> get_levenshtein_distance("greatness", "graetnes")`

3

# Conclusion

Hopefully this clarifies any confusion around Soundex and Levenshtein distance and their implementations. If you need any other clarifications do leave a comment. Thanks for reading as always!!!