Fuzzy Matching with FuzzyWuzzy: A Comprehensive Guide

Alpha Iterations
Apr 30, 2024

Introduction

Fuzzy matching (also known as approximate string matching) is a technique used to compare strings for similarity, even when they are not exact matches. It’s particularly useful when dealing with textual data that may contain variations, such as typos, misspellings, abbreviations, or different word orders.

Imagine that you collect customer data from several sources. Some customers come from online shopping, while others filled in a paper form at an offline marketing campaign. Now you want to identify the customers common to both channels. You would normally match on a unique identifier such as a mobile number or email ID. But what about customers who did not provide an email ID or a mobile number? That is where you would do a fuzzy match based on customer names, locations, and so on.

Fuzzy String Matching Example 1. (Matching the similar names for the profile deduplication task)

Imagine another scenario, which I am sure every one of us has witnessed. If you search for shoes on Google but mistype the query as “shose”, Google still shows results for shoes. See the example below:

Fuzzy String Matching Example 2. (Google matches the misspelled keyword “shose” to correct keyword “shoes”)

This magic is made possible by fuzzy string matching.

FuzzyWuzzy, a powerful Python library, provides tools for comparing and matching strings based on their similarity.

Fuzzy Matching Algorithms

There are various fuzzy string matching algorithms available.

  1. Levenshtein Distance
  2. Cosine Similarity
  3. Hamming Distance
  4. n-gram Algorithm
  5. Bitap Algorithm
  6. BK Tree Algorithm

If the python-Levenshtein library is installed on your system, FuzzyWuzzy uses Levenshtein distance as its backend. Otherwise, it falls back to SequenceMatcher from difflib.

Levenshtein Distance

This algorithm measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.

Installation

pip install python-Levenshtein

Consider two strings “Heart” and “Hurt”

from Levenshtein import distance as lev
lev('Heart', 'Hurt')

#Output = 2

The edit distance of 2 comes from two single-character edits: substitute “e” with “u” (“Heart” becomes “Huart”), then delete “a” (“Huart” becomes “Hurt”).

Levenshtein Distance Flow
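
To make the flow concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach to edit distance. The function name levenshtein is my own and is not part of any library; python-Levenshtein is implemented in C and is much faster, so treat this only as an illustration of the idea.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with a cost of 1 for each insert, delete or substitute."""
    # prev[j] = distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the chars match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Heart", "Hurt"))   # 2
print(levenshtein("Heart", "heart"))  # 1
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.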

Note the following points about Levenshtein distance:

  1. It is based on character edits, so reordering words is expensive. The Levenshtein distance between “this is it” and “it is this” is 6.

from Levenshtein import distance as lev
lev('this is it', 'it is this')

#output = 6

2. It is case-sensitive. The Levenshtein distance between “Heart” and “heart” is 1.

from Levenshtein import distance as lev
lev('Heart', 'heart')

#output = 1

FuzzyWuzzy Library

By quantifying the similarity between strings, FuzzyWuzzy assigns a score out of 100, indicating how closely two strings match.

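The score, or ratio, is calculated as below:

a = len(string_1)

b = len(string_2)

total_length = a + b

lev_dist = levenshtein_distance(string_1, string_2), with each substitution counted as 2 edits (a deletion plus an insertion)

Ratio = (total_length - lev_dist)*100 / total_length

For “kitten” and “sitting”, total_length = 13 and lev_dist = 5 (two substitutions and one insertion), so Ratio = 8*100/13 ≈ 62.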
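
As a sanity check, the score can be reproduced by hand. The sketch below assumes the convention used by the Levenshtein-based ratio, where an insertion or deletion costs 1 and a substitution costs 2, and computes (total_length - distance) * 100 / total_length. The names weighted_distance and ratio_sketch are hypothetical helpers, not FuzzyWuzzy functions.

```python
def weighted_distance(a: str, b: str) -> int:
    """Edit distance where insert/delete cost 1 and substitute costs 2."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            sub_cost = 0 if ca == cb else 2
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + sub_cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def ratio_sketch(a: str, b: str) -> int:
    total_length = len(a) + len(b)
    if total_length == 0:
        return 0  # arbitrary choice of this sketch for two empty strings
    return round(100 * (total_length - weighted_distance(a, b)) / total_length)


print(weighted_distance("kitten", "sitting"))  # 5
print(ratio_sketch("kitten", "sitting"))       # 62
```

The result, 62, agrees with what fuzz.ratio reports for the same pair of strings.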

Installation

pip install fuzzywuzzy

Features of FuzzyWuzzy

  1. Ratio: Computes the similarity ratio between two strings. See the example below:

from fuzzywuzzy import fuzz
str1 = 'kitten'
str2 = 'sitting'
fuzz.ratio(str1, str2)

#output = 62

2. Partial Ratio: Scores the best match between the shorter string and any part of the longer one, so extra characters in the longer string are not penalized. See the example below:

from fuzzywuzzy import fuzz
str1 = 'the cat is sleeping'
str2 = 'the cat is sleeping on the table'
print("ratio = ", fuzz.ratio(str1, str2) )
print("partial_ratio = ", fuzz.partial_ratio(str1, str2) )

#output:
# ratio = 75
# partial_ratio = 100
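
The idea behind the partial ratio can be sketched with a brute-force sliding window: compare the shorter string with every same-length slice of the longer one and keep the best score. This is a simplification built on difflib (FuzzyWuzzy picks candidate alignments more cleverly), and partial_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    # Slide a window of len(shorter) across the longer string.
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return round(100 * best)


print(partial_ratio_sketch("the cat is sleeping",
                           "the cat is sleeping on the table"))  # 100
```

The first window is identical to the shorter string, which is why the partial score reaches 100 while the plain ratio stays lower.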

3. Token Sort Ratio: Compares words regardless of their order.

If full_process=True, the strings are first cleaned (converted to lowercase, with special characters replaced by spaces); otherwise the strings are used as they are.

Each word is then treated as a token. The tokens are sorted alphabetically and rejoined, and fuzz.ratio is applied to the results.

from fuzzywuzzy import fuzz
str1 = 'MY COUNTRY IS INDIA'
str2 = 'india is my country!'
print("ratio = ", fuzz.ratio(str1, str2) )
print("partial_ratio = ", fuzz.partial_ratio(str1, str2) )
print("token_sort_ratio (full_process=True) = ", fuzz.token_sort_ratio(str1, str2, force_ascii=True, full_process=True) )
print("token_sort_ratio (full_process=False) = ", fuzz.token_sort_ratio(str1, str2, force_ascii=True, full_process=False) )

#output:
# ratio = 15
# partial_ratio = 16
# token_sort_ratio (full_process=True) = 100
# token_sort_ratio (full_process=False) = 15
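
The clean, sort and compare steps can be sketched in a few lines with the standard library. This is an approximation, not FuzzyWuzzy's code: clean and token_sort_ratio_sketch are hypothetical names, and difflib's SequenceMatcher stands in for the scorer.

```python
import re
from difflib import SequenceMatcher


def clean(text: str) -> str:
    # Replace non-word characters with spaces, then lowercase and trim.
    return re.sub(r"\W", " ", text).lower().strip()


def token_sort_ratio_sketch(s1: str, s2: str, full_process: bool = True) -> int:
    if full_process:
        s1, s2 = clean(s1), clean(s2)
    # Split into tokens, sort them alphabetically and rejoin.
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    return round(100 * SequenceMatcher(None, sorted1, sorted2).ratio())


print(token_sort_ratio_sketch("MY COUNTRY IS INDIA", "india is my country!"))  # 100
```

After cleaning and sorting, both inputs become "country india is my", so word order, letter case and the trailing "!" no longer matter.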

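4. Token Set Ratio: Treats duplicate words as a single word and focuses on the words the two strings share.

If full_process=True, the strings are first cleaned (converted to lowercase, with special characters replaced by spaces); otherwise the strings are used as they are.

Each word is then treated as a token and duplicates are dropped. The tokens are split into those common to both strings and those unique to each string, and each group is sorted alphabetically. fuzz.ratio is applied to the combinations (the common tokens alone, and the common tokens plus each string's unique tokens), and the highest score is returned.

from fuzzywuzzy import fuzz
str1 = 'my country is India'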
str2 = 'India is my country, INDIA.'
print("ratio = ", fuzz.ratio(str1, str2) )
print("partial_ratio = ", fuzz.partial_ratio(str1, str2) )
print("token_sort_ratio = ", fuzz.token_sort_ratio(str1, str2) )
print("token_set_ratio (full_process=True) = ", fuzz.token_set_ratio(str1, str2, force_ascii=True, full_process=True) )
print("token_set_ratio (full_process=False) = ", fuzz.token_set_ratio(str1, str2, force_ascii=True, full_process=False) )

#output:
# ratio = 52
# partial_ratio = 65
# token_sort_ratio = 86
# token_set_ratio (full_process=True)= 100
# token_set_ratio (full_process=False)= 85
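
Here is a rough sketch of the set-based comparison, following the approach described in the SeatGeek post listed in the references: score the shared tokens against each string's "shared plus leftover" tokens and keep the best of the three comparisons. The helper names are hypothetical, and difflib's SequenceMatcher is used as the scorer, so scores may differ slightly from the library's.

```python
import re
from difflib import SequenceMatcher


def clean(text: str) -> str:
    # Replace non-word characters with spaces, then lowercase and trim.
    return re.sub(r"\W", " ", text).lower().strip()


def score(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    tokens1 = set(clean(s1).split())  # sets drop duplicate words
    tokens2 = set(clean(s2).split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{common} {only1}".strip()
    combined2 = f"{common} {only2}".strip()
    return max(
        score(common, combined1),
        score(common, combined2),
        score(combined1, combined2),
    )


print(token_set_ratio_sketch("my country is India",
                             "India is my country, INDIA."))  # 100
```

Because the repeated "INDIA" collapses into a single token, both strings reduce to the same set of words and the score is 100.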

5. WRatio: Weighted Ratio. It combines several of the scorers above, scales their results by weights that depend on the relative lengths of the strings, and returns the highest weighted score.

6. Partial Token Sort Ratio: This is the same as Token Sort Ratio, but with fuzz.partial_ratio applied instead of fuzz.ratio to the processed strings.

7. Partial Token Set Ratio: This is the same as Token Set Ratio, but with fuzz.partial_ratio applied instead of fuzz.ratio to the processed strings.

8. QRatio: Quick Ratio. It cleans both strings (as with full_process=True) and then applies fuzz.ratio to the cleaned strings.

9. UQRatio: Unicode version of the QRatio. [i.e. QRatio with force_ascii = False.]

10. UWRatio: Unicode version of the WRatio. [i.e. WRatio with force_ascii = False.]

11. Process: The process module provides an easy way to select the top matches based on the fuzzy score. The example below finds the best-matching song:

from fuzzywuzzy import process

choices = [ "Bohemian Rhapsody", "Hotel California", "Stairway to Heaven",
"Imagine", "Hey Jude", "Smells Like Teen Spirit", "Yesterday",
"Wonderwall", "Thriller", "Billie Jean" ]

process.extract("Billie Jeen", choices, limit=2)


# output
# [('Billie Jean', 91), ('Smells Like Teen Spirit', 58)]
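
Conceptually, extracting the top matches means scoring the query against every choice and keeping the highest-scoring ones. The sketch below shows that loop with a deliberately simple scorer based on difflib. The names simple_score and extract_sketch are hypothetical, and the scores will not necessarily match those returned by process.extract.

```python
import heapq
from difflib import SequenceMatcher


def simple_score(query: str, choice: str) -> int:
    # Case-insensitive similarity on a 0-100 scale.
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return round(100 * matcher.ratio())


def extract_sketch(query, choices, limit=5):
    # Score every choice, then keep the `limit` best (choice, score) pairs.
    scored = ((choice, simple_score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Bohemian Rhapsody", "Hotel California", "Stairway to Heaven",
           "Imagine", "Hey Jude", "Smells Like Teen Spirit", "Yesterday",
           "Wonderwall", "Thriller", "Billie Jean"]

best_match, best_score = extract_sketch("Billie Jeen", choices, limit=2)[0]
print(best_match)  # Billie Jean
```

Every choice is scored once, so the cost grows linearly with the number of choices.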

Note on force_ascii = True

If force_ascii = True, the strings are converted to ASCII: non-ASCII characters such as µ or ¥ are removed before the comparison.

If force_ascii = False, non-ASCII characters are kept and the strings are compared as Unicode text.

See the example below with force_ascii = True and force_ascii = False:

from fuzzywuzzy import fuzz
str1 = "tµble"
str2 = "t¥ble"
print("token_sort_ratio(force_ascii=True): ", fuzz.token_sort_ratio(str1, str2, force_ascii=True, full_process=True))
print("token_sort_ratio(force_ascii=False): ",fuzz.token_sort_ratio(str1, str2, force_ascii=False, full_process=True))

#output
# token_sort_ratio(force_ascii=True): 100
# token_sort_ratio(force_ascii=False): 60
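
To see why the two scores differ, it helps to look at what the strings turn into before they are compared. The sketch below is an approximation with hypothetical helper names: it assumes the ASCII step simply drops non-ASCII characters and the cleaning step replaces non-word characters with spaces. The library's exact handling may differ.

```python
import re


def drop_non_ascii(text: str) -> str:
    # Assumption: characters outside the ASCII range are simply discarded.
    return text.encode("ascii", "ignore").decode("ascii")


def replace_non_word(text: str) -> str:
    # Assumption: anything that is not a letter, digit or underscore becomes a space.
    return re.sub(r"\W", " ", text)


for s in ("tµble", "t¥ble"):
    print(repr(drop_non_ascii(s)), repr(replace_non_word(s)))
# 'tble' 'tµble'
# 'tble' 't ble'
```

With the ASCII step, both strings collapse to the same text, which is consistent with the score of 100. Without it, µ counts as a letter and survives while ¥ is a symbol and turns into a space, so the two strings no longer match exactly.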

Applications of FuzzyWuzzy

  1. Spell Checking: Identify misspelled words by comparing them to a dictionary.
  2. Data Cleaning: Standardize variations in data entries (e.g., company names, addresses).
  3. Record Deduplication: Detect duplicate records in databases.
  4. Natural Language Processing (NLP): Improve search functionality by handling typos and variations.
  5. Plagiarism Detection: Compare text passages for similarity.

Conclusion

FuzzyWuzzy is a versatile tool that simplifies string matching tasks. Whether you’re building search engines, data pipelines, or recommendation systems, FuzzyWuzzy can enhance your applications.

References:

  1. https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
  2. https://github.com/miohtama/python-Levenshtein/
  3. https://github.com/seatgeek/fuzzywuzzy/
