Understanding the Levenshtein Distance Equation for Beginners

I recently came across a situation where I needed fuzzy string matching functionality for a command line application I built. After googling a way to do this in Ruby, I found a Stack Overflow post telling me to install a ruby gem called levenshtein-ffi and be on my way. However, my curiosity got the best of me, and I went down an internet rabbit hole of trying to understand what is really going on behind the scenes. Ultimately, I landed on the Levenshtein Distance Wikipedia page, where I saw this:

Image for post
Image for post
What?

I have minimal experience with matrices and have never taken Linear Algebra, so initially, I was bewildered. Eventually however, I was able to piece together an elementary understanding of what is happening, which I’ll attempt to explain below. This explanation is meant for beginners — anyone who is confused by the equation above or has never taken higher level mathematics courses. Warning: if you don’t fall into the above camp, you’ll probably find the explanation below needlessly tedious.

Introduction

The Levenshtein distance is a number that tells you how different two strings are. The higher the number, the more different the two strings are.

For example, the Levenshtein distance between “kitten” and “sitting” is 3 since, at a minimum, 3 edits are required to change one into the other.

An “edit” is defined by either an insertion of a character, a deletion of a character, or a replacement of a character.

Functions

A quick refresher if you haven’t looked at functions recently… The first thing to understand is that functions are a set of instructions that take a given input, follow a set of instructions, and yield an output. You probably saw lots of basic functions in your high school math courses.

Image for post
Image for post

Piecewise Functions

Piecewise functions are more complex functions. In a piecewise function, there are multiple sets of instructions. You choose one set over another based on a certain condition. Consider the example below:

Image for post
Image for post

In the above example, we use different sets of instructions based on what the input is. Piecewise function are denoted by the brace { symbol.

With that in mind, the Levenshtein Distance equation should look a little more readable.

Image for post
Image for post
Original
Image for post
Image for post
In other words…

What do a, b, i, and j stand for?

a = string #1

b = string #2

i = the terminal character position of string #1

j = the terminal character position of string #2.

The positions are 1-indexed. Consider the below example where we compare string“cat” with string “cap”:

Image for post
Image for post

The conditional (aᵢ ≠bⱼ)

aᵢ refers to the character of string a at position i

bⱼ refers to the character of string b at position j

We want to check that these are not equal, because if they are equal, no edit is needed, so we should not add 1. Conversely, if they are not equal, we want to add 1 to account for a necessary edit.

Image for post
Image for post

Solving Using a Matrix

The Levenshtein distance for strings A and B can be calculated by using a matrix. It is initialized in the following way:

Image for post
Image for post

From here, our goal is to fill out the entire matrix starting from the upper-left corner. Afterwards, the bottom-right corner will yield the Levenshtein distance.

Let’s fill out the matrix by following the piecewise function.

Image for post
Image for post
Image for post
Image for post

Now we can fill out the rest of the matrix using the same piecewise function for all the spots in the matrixes.

One more example:

Image for post
Image for post
Image for post
Image for post

If you feel comfortable with the equation at this point, try to fill out the rest of the matrix. The result is posted below:

Image for post
Image for post

Since the lower-right corner is 3, we know the Levenshtein distance of “kitten” and “sitting” is 3. This is what we expected since we already knew the Levenshtein distance was 3 (as explained in the introduction). This is also reflected in the matrix shown on the Levenshtein Distance Wikipedia page.

Image for post
Image for post

Conclusion

Image for post
Image for post
Oh, I see!

Hopefully, the above looks less intimidating to you now. Even better, I hope you understand what it means!

Image for post
Image for post
Vladimir Levenshtein, inventor of the Levenshtein algorithm

Now that you understand to a certain degree what is happening under the hood, you can really appreciate all that’s happening when you download a gem like levenshtein-ffi and run a method like Levenshtein.distance(“kitten”, “sitting”) and get a return value of 3. All the hard work has been abstracted away so that you get a simple numerical representation of how different two strings are. We should all thank Vladimir Levenshtein, who came up with his algorithm in 1965. The algorithm hasn’t been improved in over 50 years and for good reason. According to MIT, it may very well be that Levenshtein’s algorithm is the best that we’ll ever get in terms of efficiency.

Works Used / Cited

Written by

Teacher + programmer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store