Clean Matching with FuzzyWuzzy
String Matching in Python Using the Levenshtein Distance
Python comes equipped with numerous ways to handle inconsistencies in strings and in language overall (cue Natural Language Processing), vectorizing and transforming words into usable formats for analysis. With methods such as .upper(), .title(), .strip(), and .replace() available to format text, matching two similar strings becomes quite simple. However, Python has proved to me time and time again that there is almost always a simpler method to the madness, or to receiving error messages.
Here is where I discovered the FuzzyWuzzy package. When attempting to merge two DataFrames whose names were mostly related but not identical, I wound up exhausting different types of joins and concatenations, as well as hours of mental clarity, just trying not to lose the majority of the rows that could not be matched directly. Merging with Pandas is more temperamental than with FuzzyWuzzy, which lets you set a threshold for the minimum similarity between two strings that you would accept as a match.
Named after the mathematician Vladimir Levenshtein, this distance counts the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one string into the other. The distance between two strings is computed cell by cell over a matrix indexed by the characters of each string, where each cell records the cheapest sequence of edits up to that position.
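Before the full implementation below, the definition can be sketched recursively. This is only a minimal illustration of the textbook recurrence (it runs in exponential time, so it is practical only for very short strings), not the matrix-based code used in the rest of this article:

```python
# Minimal recursive sketch of the Levenshtein distance (for illustration only)
def lev(s, t):
    if not s:
        return len(t)  # insert every remaining character of t
    if not t:
        return len(s)  # delete every remaining character of s
    cost = 0 if s[0] == t[0] else 1
    return min(lev(s[1:], t) + 1,         # deletion
               lev(s, t[1:]) + 1,         # insertion
               lev(s[1:], t[1:]) + cost)  # substitution (free on a match)

print(lev('kitten', 'sitting'))  # 3 edits: k->s, e->i, insert g
```

The matrix implementation that follows computes exactly these values, but bottom-up, so each subproblem is solved only once.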
To understand the implementation of the Levenshtein Distance Algorithm:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates the Levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        Levenshtein distance ratio of similarity between two strings.
        For all i and j, distance[i, j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t.
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Populate the first column and row with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row - 1] == t[col - 1]:
                # If the characters are the same in the two strings at this position, the cost is 0
                cost = 0
            else:
                # In order to align the results with those of the python-Levenshtein
                # package, the cost of a substitution is 2 if we choose to calculate
                # the ratio, and 1 if we calculate just the distance.
                cost = 2 if ratio_calc else 1
            distance[row][col] = min(distance[row - 1][col] + 1,         # Cost of deletions
                                     distance[row][col - 1] + 1,         # Cost of insertions
                                     distance[row - 1][col - 1] + cost)  # Cost of substitutions

    if ratio_calc:
        # Computation of the Levenshtein distance ratio
        ratio = ((len(s) + len(t)) - distance[row][col]) / (len(s) + len(t))
        return ratio
    else:
        # print(distance)  # Uncomment if you want to see the matrix showing how the
        # algorithm computes the cost of deletions, insertions and/or substitutions.
        # This is the minimum number of edits needed to convert string s into string t.
        return "The strings are {} edits away".format(distance[row][col])
Using this function, we can obtain either the similarity ratio between two strings or the number of edits required to turn one into the other.
print(levenshtein_ratio_and_distance('3 Bears OG','Three Bears og'))
print(levenshtein_ratio_and_distance('3 Bears OG','Three Bears og', ratio_calc=True))

The strings are 7 edits away
0.5833333333333334
Uncommenting the print(distance) portion of the above function returns a matrix that visually explains the computational cost of matching the strings. The number in the bottom right corner denotes the required number of edits to match the strings, with each horizontally and vertically adjacent value representing an insertion and a deletion respectively. The diagonally adjacent value carries a cost of 1 if the two characters in that row/column do not match, and a cost of 0 if they do match. Each cell minimizes the local cost of computing the differences. Below is the matrix of the character differences between the strings ‘3 Bears OG’ and ‘Three Bears og’.
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[ 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[ 2 2 2 3 4 5 5 6 7 8 9 10 11 12 13]
[ 3 3 3 3 4 5 6 5 6 7 8 9 10 11 12]
[ 4 4 4 4 3 4 5 6 5 6 7 8 9 10 11]
[ 5 5 5 5 4 4 5 6 6 5 6 7 8 9 10]
[ 6 6 6 5 5 5 5 6 7 6 5 6 7 8 9]
[ 7 7 7 6 6 6 6 6 7 7 6 5 6 7 8]
[ 8 8 8 7 7 7 6 7 7 8 7 6 5 6 7]
[ 9 9 9 8 8 8 7 7 8 8 8 7 6 6 7]
[10 10 10 9 9 9 8 8 8 9 9 8 7 7 7]]
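The same structure is easier to see on a tiny pair of strings. This sketch builds the matrix for ‘OG’ vs ‘og’ with the deletion/insertion/substitution recurrence described above (distance costs only, no ratio):

```python
import numpy as np

s, t = 'OG', 'og'
rows, cols = len(s) + 1, len(t) + 1
distance = np.zeros((rows, cols), dtype=int)
distance[:, 0] = range(rows)  # cost of deleting i characters of s
distance[0, :] = range(cols)  # cost of inserting j characters of t

for row in range(1, rows):
    for col in range(1, cols):
        cost = 0 if s[row - 1] == t[col - 1] else 1
        distance[row, col] = min(distance[row - 1, col] + 1,         # deletion
                                 distance[row, col - 1] + 1,         # insertion
                                 distance[row - 1, col - 1] + cost)  # substitution

print(distance)
print(distance[-1, -1])  # 2 edits: 'O'->'o' and 'G'->'g'
```

Case matters, so the two capitalization differences each cost one substitution, and the bottom right corner reads 2.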
Beautiful but verbose, so let’s make this even easier by just installing and importing the Levenshtein package.
pip install python-Levenshtein==0.12.0

import Levenshtein as lev

lev.distance('3 Bears OG','Three Bears og')  # returns 7
When calling its distance function, the Levenshtein package returns the number of edits required to match the strings directly, without the intermediate matrix. To put this distance to use, we apply FuzzyWuzzy to our strings or lists of strings to gather the associated terms. FuzzyWuzzy makes use of the Levenshtein distance through its ratio function, which scores the character differences between two strings.
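As a point of reference, fuzz.ratio reports similarity on a 0 to 100 scale, and when python-Levenshtein is not installed, fuzzywuzzy falls back to the standard library's difflib. A rough stand-in on that basis (the exact scores can differ slightly from the Levenshtein-based ratio, and simple_ratio is my name for this sketch, not a fuzzywuzzy function):

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Similarity on the 0-100 scale that fuzz.ratio uses
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio('3 Bears OG', '3 Bears OG'))      # identical strings score 100
print(simple_ratio('3 Bears OG', 'Three Bears og'))  # roughly the 0.58 ratio above, scaled
```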
pip install fuzzywuzzy

from fuzzywuzzy import fuzz

# Create a function that takes two lists of strings for matching
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the second list
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)
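A quick usage sketch of the same matching logic, with fuzz.ratio swapped for a stdlib difflib scorer so it runs without fuzzywuzzy installed; the candidate names are illustrative, not data from my DataFrames:

```python
from difflib import SequenceMatcher

# Stand-in version of match_name; fuzz.ratio would replace the scorer below
def match_name(name, list_names, min_score=0):
    max_score = -1
    max_name = ""
    for name2 in list_names:
        # difflib's ratio scaled to 0-100, approximating fuzz.ratio
        score = int(round(100 * SequenceMatcher(None, name, name2).ratio()))
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)

candidates = ['Three Bears og', 'Blue Dream', 'OG Kush']
print(match_name('3 Bears OG', candidates, 40))  # best match and its score
print(match_name('3 Bears OG', candidates, 95))  # ('', -1) when nothing clears the threshold
```

Raising min_score trades recall for precision: a stricter threshold drops weak matches entirely rather than keeping the least-bad candidate.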
The above function takes names from different DataFrames and finds their best matches for merging into one DataFrame based on similarity. For my particular use, I fed in names of various strains of cannabis to reconcile the textual deviations across different websites. For instance, the strain 3 Bears OG was present in all three of my DataFrames for the project, but was written as ‘Three Bears OG’ and ‘3 Bears Og’ in different databases, which caused merge failures due to the capitalization and spelling differences.
import pandas as pd

# List of dicts for easy DataFrame creation
dict_list = []

# Iterating over the df with more strains
for name in df1.strain:
    # Use our method to find the best match; we can set a threshold here
    match = match_name(name, df3.name, 80)

    # New dict for storing data
    dict_ = {}
    dict_.update({"strain": name})
    dict_.update({"name": match[0]})
    dict_list.append(dict_)

merge_table = pd.DataFrame(dict_list)
The above code does not include the similarity score between the two strings, but this can easily be added by updating each dictionary with the score from the match, which would appear as a new column in the DataFrame. Only the matched name is stored, as denoted by match[0]; the score sits in match[1].
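Keeping the score is a one-line change to that loop. A self-contained sketch with made-up match results standing in for the output of match_name (the names and scores here are illustrative only):

```python
import pandas as pd

# Illustrative (strain, (matched_name, score)) pairs; in practice these
# come from match_name() inside the loop above
results = [('3 Bears OG', ('Three Bears og', 86)),
           ('Blue Dream', ('Blue Dream', 100))]

dict_list = []
for name, match in results:
    dict_ = {}
    dict_.update({"strain": name})
    dict_.update({"name": match[0]})
    dict_.update({"score": match[1]})  # the extra column holding the match score
    dict_list.append(dict_)

merge_table = pd.DataFrame(dict_list)
print(merge_table)
```

With the score column in place, weak matches can be filtered or reviewed after the merge instead of being silently accepted.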
Here, we can see that strains with lowercase values were matched with those in title case without any text pre-processing. Similarly, apostrophes carry weight as well when calculating whether Blue’s Dream is more similar to Blue Dream or Blues Dream.
FuzzyWuzzy can be used alongside NLP to improve the handling of categorical variables in many areas of data science and analysis. In its simplest form, it can just be a fun way to see how much work is needed to make two strings the same.