Fuzzy Matching

Karan Arya
Published in NLP Gurukool
Nov 13, 2019

Fuzzy matching finds strings that match approximately rather than exactly, i.e., matches that may be less than 100% perfect.

We will be using a library called fuzzywuzzy. Install it using either of the following commands:

pip install fuzzywuzzy
conda install fuzzywuzzy

Begin by importing the library:

import warnings
warnings.filterwarnings("ignore")  # silence fuzzywuzzy's "slow pure-python SequenceMatcher" warning
from fuzzywuzzy import fuzz

Ratio

The ratio function computes the standard Levenshtein distance similarity ratio between two sequences, returned as a score from 0 to 100.

Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)
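When the optional python-Levenshtein package is not installed, fuzzywuzzy falls back to the standard library's difflib. As a rough sketch of the idea (a hypothetical simple_ratio helper, not fuzzywuzzy's actual implementation):

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    # Similarity as a 0-100 score, in the spirit of fuzz.ratio
    return round(100 * SequenceMatcher(None, s1, s2).ratio())

print(simple_ratio("apple inc.", "apple inc"))  # 95
```

fuzzywuzzy's optimized Levenshtein backend can differ slightly from difflib on some inputs, but for simple strings like these the scores agree.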

Partial Ratio

It is a powerful function that allows us to deal with more complex situations, such as substring matching.

If the shorter string has length k and the longer string has length m, the algorithm seeks the score of the best matching length-k substring.

Str1 = "Los Angeles Lakers"
Str2 = "Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
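One way to picture the length-k substring search is a brute-force sliding window (a hypothetical sketch; fuzzywuzzy's real partial_ratio is smarter and uses matching blocks to pick candidate substrings):

```python
from difflib import SequenceMatcher

def simple_partial_ratio(s1, s2):
    # Ensure `shorter` really is the shorter of the two strings
    shorter, longer = sorted([s1, s2], key=len)
    k = len(shorter)
    # Compare the shorter string against every length-k window
    # of the longer one and keep the best score
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + k]).ratio()
        for i in range(len(longer) - k + 1)
    )
    return round(100 * best)

print(simple_partial_ratio("los angeles lakers", "lakers"))  # 100
```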

Token Sort Ratio

What happens when the strings we compare contain the same words, but in a different order?

The fuzz.token functions have an important advantage over ratio and partial_ratio: they tokenize the strings and preprocess them by lowercasing and stripping punctuation. In the case of fuzz.token_sort_ratio(), the string tokens are sorted alphabetically and then joined back together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage.

Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
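The preprocessing described above can be sketched in a few lines (a hypothetical simple_token_sort_ratio helper, not fuzzywuzzy's code; I assume tokens are alphanumeric runs):

```python
from difflib import SequenceMatcher
import re

def simple_token_sort_ratio(s1, s2):
    def normalize(s):
        # Lowercase, strip punctuation, sort the tokens, re-join
        tokens = re.findall(r"[a-z0-9]+", s.lower())
        return " ".join(sorted(tokens))
    return round(100 * SequenceMatcher(None, normalize(s1), normalize(s2)).ratio())

print(simple_token_sort_ratio("united states v. nixon",
                              "Nixon v. United States"))  # 100
```

After normalization both strings become "nixon states united v", so the plain ratio is 100.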

Token Set Ratio

Still, what happens if these two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection

s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens

s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens

The logic behind these comparisons is that since Sorted_tokens_in_intersection is always the same, the score will tend to go up as these words make up a larger chunk of the original strings or the remaining tokens are closer to each other.
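The construction of those three strings and their pairwise comparison can be sketched as follows (a hypothetical simple_token_set_ratio helper; fuzzywuzzy's actual implementation does its own preprocessing and scoring):

```python
from difflib import SequenceMatcher
import re

def simple_token_set_ratio(s1, s2):
    def tokens(s):
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    t1, t2 = tokens(s1), tokens(s2)
    inter = " ".join(sorted(t1 & t2))                 # sorted intersection
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()  # + rest of str1
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()  # + rest of str2
    def score(a, b):
        return SequenceMatcher(None, a, b).ratio()
    # The best of the three pairwise comparisons wins
    return round(100 * max(score(inter, combined1),
                           score(inter, combined2),
                           score(combined1, combined2)))
```

For the Nixon example below, the intersection "nixon states united" is nearly all of the shorter string, which is why the token set score comes out so high.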

Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

Fuzzy Process Extract

fuzzywuzzy also provides a module called process, which finds the string with the highest similarity to a query out of a list (vector) of strings.

from fuzzywuzzy import process
str2Match = "apple inc"
strOptions = ["Apple Inc.","apple park","apple incorporated","iphone"]
Ratios = process.extract(str2Match,strOptions)
print(Ratios)
# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match,strOptions)
print(highest)
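Conceptually, process.extract scores the query against every candidate and returns them ranked best first. A hypothetical difflib-based sketch of that behavior (simple_extract is not part of fuzzywuzzy):

```python
from difflib import SequenceMatcher

def simple_extract(query, choices, limit=5):
    # Score the query against every candidate string
    scored = [
        (choice, round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    # Return the top `limit` matches, best first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

print(simple_extract("apple inc", ["Apple Inc.", "apple park", "apple incorporated", "iphone"]))
```

The real process.extract also accepts a scorer argument (e.g. fuzz.token_sort_ratio), so the ranking criterion can be swapped out.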
