An Introduction to Fuzzy String Matching

Julien Tregoat
Jan 9, 2018 · 5 min read

Starting out as a developer, one might not think of the term ‘fuzzy’ as applicable to anything they are doing. In fact, you might not think of the term at all (I didn’t. Except while looking at cat gifs on other people’s blogs).

However, we use ‘fuzzy’ to describe life regularly, because life can be extremely vague and general. As an object-oriented programmer, we strive to minimize the gap between our code and the real world through accurate models of this world. So why wouldn’t fuzzy apply? Of course, it already does.

Super cool vaguely original content.

Fuzzy logic?

The concept of ‘fuzzy logic’ was developed in the 20th century, elaborating on Jan Łukasiewicz’s proposition of many-valued logic in 1920. Jan specificlly pioneered negation and implication; you might know implication as an if statement. Many-valued logic is necessary because it allows for mathematical calculations around the ambiguous nature of life.

“Fuzzy logic… provides an effective conceptual framework for dealing with the problem of knowledge representation in an environment of uncertainty and imprecision.”

The importance of fuzzy logic has only become more apparent as science digs into computers and programming is taken further. It has become especially useful in the context of artificial intelligence. You may be thinking ‘But wait! Booleans are everywhere in programming. Isn’t that at odds with many-value logic?’ Well, yes and no.

We use many- (AKA infinite-) valued logic regularly when we code already. Every time you use an if statement, adding in multiple elsif clauses, you are using a form of infinite logic. You are accounting for multiple possible scenarios other than true or false, up to a potentially infinite amount of times. However, there are many other situations where fuzzy logic can become immediately helpful.

Edit Distance & Fuzzy String Matching

“Fuzzy String Matching is basically rephrasing the YES/NO ‘Are string A and string B the same?’ as ‘How similar are string A and string B?”

Fuzzy logic came in to play for me when I was working on a basic command line interface at the Flatiron School. If someone types a command or something incorrectly, it can break your program. The first step is to implement proper error handling, e.g. a message to let the user know their error and allow for another try. However, that isn’t forgiving when you’re a messy typer. You can have your program match the first three characters, but there is still room for mistakes there. Human error is consistently inconsistent. How do we account for incorrect letters, or extra ones, that can appear anywhere in the string? Enter edit distance.

An edit distance matrix from calculating the Levenshtein Distance.
The two possible paths for the above matrix. “=” Match; “o” Substitution; “+” Insertion; “-” Deletion

At its most basic, edit distance calculates how similar (or not) two strings are. This is calculated through the number of operations needed to transform one string into the other — e.g. with my name “Julien” and the common spelling “Julian”, only one operation is needed, a substitution of “a” for “e.”

The original algorithm can be attributed to Vladimir Levenshtein, who passed away in September of last year. It’s known as the Levenshtein Distance. It’s a recursive function that calculates the edit distance for every prefix and suffix. The algorithm results in a matrix of all possibilities.

Edit distance has found widespread use in the mapping and comparing of genomes. It has also become extremely useful in the namesake of this article — fuzzy string matching.

Fuzzy string matching, also known as approximate string matching, can be a variety of things; Regular expressions are a form of it, as are wildcards in the context of SQL. It is any form of attempting to match one string to another one.

Fuzzy string matching with regards to edit distance is the application of edit distance as a metric and finding the minimum edit distance required to match two different strings together. We encounter this on a daily basis in our interactions with computers — does the red line under a misspelled word ring a bell? Or maybe this one:

All of these implement some form of fuzzy string matching. The Levenshtein Distance is the most common metric, but there are other variations on the algorithm — Sellers, Damerau-Levenshtein, Hamming, and more. They all have different ways of computing the same thing.

Ok, how do I use this to my advantage?

The Levenshtein Distance in Ruby.

Much of the heavy lifting has already been done for us, as these algorithms have already been expressed in code in every conceivable language across the internet. I posted the Ruby version above, but for everyone else: Find the Levenshtein Distance expressed in other languages here.

In fact, the reason I found myself down this edit distance wormhole was thanks to the Ruby gem Amatch by Florian Frank. Check it out! It offers you a selection of algorithms to choose from depending on the data you’re inputting. Some are more suited to larger bodies of text, while others aren’t.

Amatch testing.

Julien Tregoat

Written by

creating decentralized applications @ThunderToken

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade