Fuzzy match algorithms explained

Mehul Gupta
Data Science in your pocket
10 min read · Aug 31, 2020




This post covers some of the important fuzzy (not exactly equal, but broadly the same, say Rajkumar & Raj Kumar) string matching algorithms, which include:

Hamming Distance

Levenshtein Distance

Damerau-Levenshtein Distance

N-Gram based string matching

BK Tree

Bitap algorithm

But we should first know why fuzzy matching is important, considering a real-world scenario. Whenever a user is allowed to enter data, things can get really messy, as in the below case!!

Let,

  • Your name is ‘Saurav Kumar Agarwal’.
  • You have been given a library card (with a Unique ID of course) which is valid for every member of your family.
  • You go to a library to issue some books every week. But for this, you first need to make an entry in a register at the library entrance where you enter your ID & name.

Now, the library authorities decide to analyze the data in this register to get an idea of the unique readers visiting every month. This can be done for many reasons (like whether to go for land expansion or not, should the number of books increase, etc.).

They planned to do this by counting the unique (name, Library ID) pairs registered in a month. So if we have

123 Mehul

123 Rajesh

123 Mehul

133 Rajesh

133 Sandesh

This should give us 4 unique readers.

So far so good

They estimated this number to be around 10k-10.5k, but the number surged to ~50k after analysis by exact string matching of names.

wooohhh!!

That's a pretty bad estimate, I must say!!

But should they be happy?

Or did they miss a trick here?

After a deeper analysis, a major issue came up. Remember your name, ‘Saurav Kumar Agarwal’? They found some entries leading to 5 different users against your family’s library ID:

  • Saurav K Aggrawal (notice the double ‘g’, the ‘ra’→‘ar’ swap, & Kumar shortened to K)
  • Saurav Kumar Agarwal
  • Sourav Aggarwal
  • SK Agarwal
  • Agarwal

Some ambiguities exist here, as the library authorities don't know

  • How many people from your family visit the library? Do these 5 names represent a single reader, or 2 or 3 different readers? Call it an unsupervised problem, where we have strings representing unstandardized names but their standardized names aren’t known (i.e. the mapping Saurav Agarwal → Saurav Kumar Agarwal is not known).
  • Should all of them be merged with ‘Saurav Kumar Agarwal’? We can’t say, as SK Agarwal could be Shivam Kumar Agarwal (say, your father’s name). Also, just ‘Agarwal’ can be anyone from the family.

The last 2 cases (SK Agarwal & Agarwal) would require some extra information (like gender, date of birth, and books issued), which could be fed to some ML algorithm to determine the reader’s actual name; those should be left for now. But one thing I am pretty confident about is that the first 3 names represent the same person, & I should not need extra information to merge them into a single reader, i.e. ‘Saurav Kumar Agarwal’.

How can this be done?

Here comes fuzzy string matching, which means an almost-exact match between strings that are slightly different from each other (mostly due to wrong spellings), like ‘Agarwal’ & ‘Aggarwal’ & ‘Aggrawal’. If we get a string within an acceptable error rate (say, a letter wrong), we consider it a match. So let’s explore →

Hamming Distance

Hamming distance is the most obvious distance between two strings of equal length: a count of the positions at which the corresponding characters don’t match. For example: ‘Bat’ & ‘Bet’ have hamming distance 1, as at index 1, ‘a’ is not equal to ‘e’.
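If you want to play with it, here is a minimal Python sketch (the function name is mine, not a library API):

def hamming_distance(s1, s2):
    # Hamming distance is only defined for strings of equal length
    if len(s1) != len(s2):
        raise ValueError('strings must be of equal length')
    # count the positions where the corresponding characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance('Bat', 'Bet'))  # 1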

Levenshtein Distance / Edit Distance

I assume you are comfortable with Levenshtein distance & recursion; if not, do read up on them first. The below explanation is more of a summarized version of Levenshtein distance.

Let ‘x’ be a (1-based) index into Str1 & ‘y’ an index into Str2 for this entire post.

Levenshtein Distance calculates the minimum number of edits required to convert ‘Str1’ into ‘Str2’ by adding, removing, or replacing characters. Hence ‘Cat’ & ‘Bat’ have edit distance 1 (replace ‘C’ with ‘B’), while for ‘Ceat’ & ‘Bat’, it would be 2 (replace ‘C’ with ‘B’; delete ‘e’).

Talking about the algorithm, it boils down to the below recursion:

def LevenshteinDistance(str1, str2, x, y):
    # str1 exhausted: the remaining characters of str2 must be inserted
    if x == len(str1) + 1:
        return len(str2) - y + 1
    # str2 exhausted: the remaining characters of str1 must be deleted
    if y == len(str2) + 1:
        return len(str1) - x + 1
    # characters match: no edit needed, advance both pointers
    if str1[x - 1] == str2[y - 1]:
        return LevenshteinDistance(str1, str2, x + 1, y + 1)
    # characters differ: 1 edit + cheapest of insert / remove / replace
    return 1 + min(LevenshteinDistance(str1, str2, x, y + 1),      # insert
                   LevenshteinDistance(str1, str2, x + 1, y),      # remove
                   LevenshteinDistance(str1, str2, x + 1, y + 1))  # replace

LevenshteinDistance('abcd', 'abbd', 1, 1)  # returns 1

What we are doing is: we start reading the strings from the 1st character.

  • If the character at index ‘x’ of str1 matches the character at index ‘y’ of str2, no edit is required & the edit distance is calculated by calling

LevenshteinDistance(str1, str2, x+1, y+1)

  • If they don’t match, we know an edit is required. So we add +1 to the current edit distance & recursively calculate the edit distance for 3 cases, taking the minimum of the three:

LevenshteinDistance(str1, str2, x, y+1) → insertion in str2

LevenshteinDistance(str1, str2, x+1, y) → deletion in str2

LevenshteinDistance(str1, str2, x+1, y+1) → replacement in str2

  • If either string runs out of characters during the recursion, the length of the remaining characters of the other string is added to the edit distance (the 1st two if statements)
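A quick aside: this plain recursion recomputes the same (x, y) pairs again & again, so it becomes exponentially slow on longer strings. A common fix is memoization; below is a 0-indexed sketch (names are mine) using Python’s functools.lru_cache:

from functools import lru_cache

def levenshtein(str1, str2):
    @lru_cache(maxsize=None)      # each (x, y) pair is now solved only once
    def dist(x, y):
        if x == len(str1):        # str1 exhausted: insert the rest of str2
            return len(str2) - y
        if y == len(str2):        # str2 exhausted: delete the rest of str1
            return len(str1) - x
        if str1[x] == str2[y]:    # match: no edit, advance both pointers
            return dist(x + 1, y + 1)
        return 1 + min(dist(x, y + 1),      # insert
                       dist(x + 1, y),      # remove
                       dist(x + 1, y + 1))  # replace
    return dist(0, 0)

print(levenshtein('abcd', 'abbd'))  # 1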

Damerau-Levenshtein Distance

Consider ‘abcd’ & ‘acbd’. If we go by Levenshtein Distance, this would be 2 (replace ‘b’→‘c’ & ‘c’→‘b’ at index positions 2 & 3 in str2). But if you look closely, both characters just got swapped relative to str1, & if we introduce a new operation ‘Swap’ alongside ‘addition’, ‘deletion’ & ‘replace’, this problem can be solved in a single edit. This introduction can help us with cases like ‘Agarwal’ & ‘Agrawal’.

This operation is added by the below pseudocode

1 + 
(x - last_matching_index_y - 1 ) +
(y - last_matching_index_x - 1 ) +
LevenshteinDistance(str1[:last_matching_index_x-1], str2[:last_matching_index_y-1])

Let’s run the above operation for ‘abcd’ & ‘acbd’, where we assume

initial_x=1 & initial_y=1 (Starting indexing from 1 & not from 0 for both strings).

current x=3 & y=3, hence str1 & str2 are at ‘c’ (abcd) & ‘b’ (acbd) respectively.

Upper limits are included in the slicing operation mentioned below (1-based, inclusive). Hence ‘qwerty’[:4] will mean ‘qwer’.

  • last_matching_index_y will be the last index in str2[:y-1] that matched the current character of str1, i.e. 2, as ‘c’ in ‘acbd’ (index 2) matched ‘c’ in ‘abcd’.
  • last_matching_index_x will be the last index in str1[:x-1] that matched the current character of str2, i.e. 2, as ‘b’ in ‘abcd’ (index 2) matched ‘b’ in ‘acbd’.
  • Hence, following the pseudocode above, DamerauLevenshtein(‘abc’,’acb’) =

1 + (3−2−1) + (3−2−1) + LevenshteinDistance(‘a’,’a’) = 1 + 0 + 0 + 0 = 1. This would have been 2 had we followed plain Levenshtein Distance.
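To make this concrete, here is a minimal sketch of the restricted Damerau-Levenshtein variant (also known as optimal string alignment), written as a dynamic program rather than the recursion above; the function name is mine:

def damerau_levenshtein(s1, s2):
    # d[i][j] = edit distance between s1[:i] & s2[:j]
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i                                # delete all of s1[:i]
    for j in range(len(s2) + 1):
        d[0][j] = j                                # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement
            # the extra 'Swap' operation: adjacent transposition
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(s1)][len(s2)]

print(damerau_levenshtein('abcd', 'acbd'))        # 1 (a single swap)
print(damerau_levenshtein('Agarwal', 'Agrawal'))  # 1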

N-Gram

An n-gram is basically a set of values generated from a string by pairing ‘n’ sequentially occurring characters (or words). For example, n-grams with n=2 for ‘abcd’ will be ‘ab’, ‘bc’, ‘cd’.

Now, these n-grams can be generated for strings whose similarity has to be measured & different metrics can be used to measure the similarity.

Let ‘abcd’ & ‘acbd’ be 2 strings to be matched.

n-grams with n=2 for

‘abcd’=ab,bc,cd

‘acbd’=ac,cb,bd

Component match (qgrams)

Count of n-grams matched exactly for the two strings. Here, it would be 0

Cosine Similarity

Converting the n-grams generated for each string into count vectors & calculating their dot product, normalized by the vectors’ magnitudes.

Component match (no order considered)

Matching n-gram components while ignoring the order of the elements within them, i.e. ‘bc’ = ‘cb’. Here, it would be 1 (‘bc’ matches ‘cb’).
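Here is a minimal Python sketch of the three metrics above for bigrams (all function names are mine):

from collections import Counter
import math

def ngrams(s, n=2):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def qgram_match(s1, s2, n=2):
    # count of n-grams shared exactly between the two strings
    return sum((Counter(ngrams(s1, n)) & Counter(ngrams(s2, n))).values())

def cosine_similarity(s1, s2, n=2):
    v1, v2 = Counter(ngrams(s1, n)), Counter(ngrams(s2, n))
    dot = sum(v1[g] * v2[g] for g in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def unordered_match(s1, s2, n=2):
    # sort the characters inside each gram so that 'bc' == 'cb'
    v1 = Counter(''.join(sorted(g)) for g in ngrams(s1, n))
    v2 = Counter(''.join(sorted(g)) for g in ngrams(s2, n))
    return sum((v1 & v2).values())

print(qgram_match('abcd', 'acbd'))        # 0
print(unordered_match('abcd', 'acbd'))    # 1 ('bc' matches 'cb')
print(cosine_similarity('abcd', 'acbd'))  # 0.0 (no exact bigram in common)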

There exist many other such metrics like Jaccard distance, Tversky Index, etc which haven’t been discussed here.

BK Tree

BK tree is amongst the fastest algorithms for finding similar strings (from a pool of strings) for a given string. A BK tree uses

  1. Levenshtein Distance
  2. Triangle inequality (we will cover this by the end of this section)

to figure out similar strings (& not just one best string).

A BK tree is created using a pool of words from an available dictionary. Say we have a dictionary D={Book, Books, Cake, Boo, Boon, Cook, Cape, Cart}

A BK Tree is created by

  • Select a root node (can be any word). Let it be ‘Book’
  • Go through every word W in the Dictionary D, calculate its Levenshtein distance with the root node, and make the word W a child of the root if no child with the same Levenshtein distance exists for the root. Inserting Books & Cake:

Levenshtein(Books, Book)=1 & Levenshtein(Book, Cake)=4. As no child with the same distance is found for ‘Book’, make them new children of the root.

(Tree diagrams in this section: https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/)
  • If a child with the same Levenshtein distance already exists for the parent P, consider that child as a new parent P1 for the word W, calculate Levenshtein(P1, W), and if no child with the same distance exists for P1, make the word W a new child of P1; else continue going down the tree.

Consider inserting Boo. Levenshtein(Book, Boo)=1, but there already exists a child with the same distance, i.e. Books. Hence, now calculate Levenshtein(Books, Boo)=2. As no child with the same distance exists for Books, make ‘Boo’ a child of ‘Books’.


Similarly, insert all words from the dictionary into the BK Tree.


Once the BK Tree is prepared, to search for words similar to a given word (the Query word) we need to set a tolerance_threshold, say ‘t’.

  • This is the maximum edit_distance we will consider for the similarity between any node & query_word. Anything above this is rejected.
  • Also, we will consider only those children of a parent node for comparison with the Query where Levenshtein(Child, Parent) lies within

[Levenshtein(Parent, Query)−t, Levenshtein(Parent, Query)+t]

Why? We will discuss this shortly.

  • If this condition holds for a child, only then do we calculate Levenshtein(Child, Query), & the child is considered similar to the Query if

Levenshtein(Child, Query) ≤ t.

  • Hence, we won’t be calculating the Levenshtein distance of the Query word with every node/word in the Dictionary, & hence save a lot of time when the pool of words is huge.

Hence, a Child is considered similar to the Query if

  • Levenshtein(Parent, Query)−t ≤ Levenshtein(Child, Parent) ≤ Levenshtein(Parent, Query)+t
  • Levenshtein(Child, Query) ≤ t

Let’s go through an example quickly. Let ‘t’=1

Let the word we need to find similar words for be ‘Care’ (the Query word).

  • Levenshtein(Book, Care)=4. As 4>1, ‘Book’ is rejected. Now we will consider only those children of ‘Book’ which have Levenshtein(Child, ‘Book’) between 3 (4−1) & 5 (4+1). Hence, we reject ‘Books’ & all its children, as 1 doesn’t lie within these limits.
  • As Levenshtein(Cake, Book)=4, which lies between 3 & 5, we check Levenshtein(Cake, Care)=1, which is ≤1, hence a similar word.
  • Readjusting the limits for Cake’s children, they become 0 (1−1) & 2 (1+1).
  • As Cape & Cart have Levenshtein(Child, Parent) of 1 & 2 respectively, both are eligible to be tested against ‘Care’.
  • As Levenshtein(Cape, Care)=1 ≤ 1 & Levenshtein(Cart, Care)=2 > 1, only Cape is considered a word similar to ‘Care’.

Hence, we found 2 words similar to ‘Care’: Cake & Cape.
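Putting construction & search together, here is a minimal runnable sketch of a BK Tree (class & function names are mine; the levenshtein helper is just an iterative version of the edit distance from earlier):

def levenshtein(s1, s2):
    # iterative DP edit distance, used as the tree's metric
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # replacement
        prev = curr
    return prev[-1]

class BKTree:
    # each node is a (word, {distance: child_node}) pair
    def __init__(self, words):
        words = iter(words)
        self.root = (next(words), {})
        for w in words:
            self.add(w)

    def add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return                        # word already in the tree
            if d in children:
                node, children = children[d]  # child at this distance exists: go down
            else:
                children[d] = (word, {})      # no child at this distance: attach here
                return

    def search(self, query, t):
        matches, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(query, node)
            if d <= t:
                matches.append(node)
            # triangle inequality: only children within [d - t, d + t] can match
            for dist, child in children.items():
                if d - t <= dist <= d + t:
                    stack.append(child)
        return matches

tree = BKTree(['Book', 'Books', 'Cake', 'Boo', 'Boon', 'Cook', 'Cape', 'Cart'])
print(tree.search('Care', 1))  # ['Cake', 'Cape']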

But on what basis are we deciding the limits

[Levenshtein(Parent, Query)−t, Levenshtein(Parent, Query)+t]

for any child node?

This follows from the Triangle Inequality, which states

Distance(A,B) + Distance(B,C) ≥ Distance(A,C), where A, B, C are the vertices of a triangle & the Distances are its side lengths.

Let’s see how this idea helps us get to these limits:

Let Query = Q, Parent = P & Child = C form a triangle with Levenshtein distances as edge lengths. Let ‘t’ be the tolerance_threshold. Then, by the above inequality:

Levenshtein(P,C) + Levenshtein(C,Q) ≥ Levenshtein(P,Q). As Levenshtein(C,Q) can’t exceed ‘t’ (the threshold set earlier for C to be similar to Q), hence Levenshtein(P,C) + t ≥ Levenshtein(P,Q)
Or
Levenshtein(P,C) ≥ Levenshtein(P,Q) − t, i.e. the lower limit is set

Again, by the Triangle Inequality,

Levenshtein(P,Q) + Levenshtein(C,Q) ≥ Levenshtein(P,C), and as Levenshtein(C,Q) ≤ t,
Levenshtein(P,Q) + t ≥ Levenshtein(P,C), i.e. the upper limit is set

Bitap algorithm

It is also a very fast algorithm that can be used for fuzzy string matching quite efficiently. This is because it uses bitwise operations, and as they are pretty fast, the algorithm is known for its speed. Without wasting a bit, let’s get started:

Let S0 = ‘ababc’ be the pattern to be matched with the string S1 = ‘abdabababc’.

First of all,

  • Figure out all unique characters in S1 & S0, which are a, b, c, d
  • Form a bit pattern T[x] for each character x. How?
  • Reverse the pattern S0 to be searched (‘ababc’ → ‘cbaba’)
  • For the bit pattern of character x, place a 0 wherever x occurs in the reversed pattern, else 1. In T[a], ‘a’ occurs at the 3rd & 5th places of ‘cbaba’, so we place 0 at those positions & 1 everywhere else, giving T[a]=11010.

Similarly, form the bit patterns for the other characters:

T[a]=11010, T[b]=10101, T[c]=01111, T[d]=11111

S0=ababc S1=abdabababc

  • Take an initial state of a bit pattern of length(S0) with all 1’s. In our case, it would be 11111 (State S).
  • As we start traversing S1, for each character we pass through,

shift a 0 into the lowest (rightmost) bit of S &

then take an OR of S with T[x].

Hence, as we encounter the 1st ‘a’ in S1, we will

  1. Left shift a 0 into S at the lowest bit to form 11111 → 11110
  2. S | T[a] = 11110 | 11010 = 11110

Now, as we move to the 2nd character ‘b’

  1. Shift a 0 into S to form 11110 → 11100
  2. S | T[b] = 11100 | 10101 = 11101

As ‘d’ is encountered

  1. Shift a 0 into S to form 11101 → 11010
  2. S | T[d] = 11010 | 11111 = 11111

Observe that as ‘d’ isn’t in S0, the state S gets reset. Once you traverse the entire S1, you get the below state S after each character:

a → 11110
b → 11101
d → 11111
a → 11110
b → 11101
a → 11010
b → 10101
a → 11010
b → 10101
c → 01111

How do I know whether I found any fuzz/full match?

  • If a 0 reaches the topmost bit of the State S bit pattern (as with the last character ‘c’), you have a full match
  • For a fuzzy match, observe 0s at other positions of State S. If a 0 is at the 4th place, we found 4/5 characters in the same sequence as in S0. Similarly for 0s at other places.
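Here is a minimal sketch of this exact-match bitap scan, following the same 0-means-match convention as above (function names are mine):

def bitap_states(text, pattern):
    m = len(pattern)
    all_ones = (1 << m) - 1
    # T[x]: bit pattern with 0s wherever x occurs in the reversed pattern
    T = {}
    for ch in set(text) | set(pattern):
        bits = all_ones
        for i, p in enumerate(reversed(pattern)):
            if p == ch:
                bits &= ~(1 << (m - 1 - i))  # clear the bit for this position
        T[ch] = bits
    S = all_ones                             # initial state: all 1s
    states = []
    for ch in text:
        S = ((S << 1) & all_ones) | T[ch]    # shift a 0 into the lowest bit, then OR
        states.append(S)
    return states

pattern, text = 'ababc', 'abdabababc'
for ch, S in zip(text, bitap_states(text, pattern)):
    note = ' <- full match' if S & (1 << (len(pattern) - 1)) == 0 else ''
    print(ch, format(S, '05b') + note)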

One thing you might have observed is that all these algorithms work on the spelling similarity of strings. Can there be any other criterion for fuzzy matching? Yes,

Phonetics

Will be covering phonetics-based (sound-based; e.g. Saurav & Sourav should be an exact match phonetically) fuzzy match algorithms next !!
