String matching like a boss
If you’ve been handling large datasets, you might have come across misspelt strings, or partial strings that refer to the same entity. Fuzzy string matching is the solution to such problems.
Usually, we would do the following for matching strings in python:
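For example, plain comparison operators only catch strings that are exactly identical (a minimal illustration; the sample strings are made up):

```python
s1 = "Data Science"
s2 = "data science"

# Exact comparison fails on a mere case difference.
print(s1 == s2)                   # False

# Normalizing case helps, but only for that one kind of difference.
print(s1.lower() == s2.lower())   # True

# A single typo still breaks the match.
print("Dta Science".lower() == s1.lower())  # False
```

This works fine for clean data, but it returns a hard yes/no and cannot tolerate typos or word-order changes.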
But what about strings that are misspelt, reordered, or only partially match?
This is where the FuzzyWuzzy library comes in handy for data analysis.
According to pypi.org,
The FuzzyWuzzy library uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
The library, also described as an approximate string matching library, lets you match strings against a pattern even when they are not identical.
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one sequence into the other. (source: datacamp.com)
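To make the definition concrete, here is a small dynamic-programming sketch of the distance in plain Python (an illustration only, not the optimized implementation FuzzyWuzzy relies on):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

“kitten” → “sitting” takes three edits (k→s, e→i, insert g), hence the distance of 3.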
Leading consultancies perform string matching using FuzzyWuzzy for a lot of data processes:
- Software with fuzzy string matching is used to find different accounts that place orders to the same address despite varying address-entry styles or typos in the name (like Vidushi Gupta vs. Vidhushi G.). This helps firms see the geographic concentration of their brand, which in turn informs expansion plans.
- Its ability to detect reused text makes it helpful for plagiarism detection and spam filtering.
- Another simple application is search engines. The “Showing results for _____” suggestion checks for typos and similar strings to return better-customized results.
Comparing two columns with FuzzyWuzzy:
First, we determine the appropriate matching function for our dataset by applying each one to two sample strings from it.
Install fuzzywuzzy using
!pip install fuzzywuzzy
And then import the required package
from fuzzywuzzy import fuzz
Now we apply the 4 popular scorers to the two strings:
- fuzz.ratio: matches using the pure Levenshtein distance.
- fuzz.partial_ratio: matches on the best-matching substring.
- fuzz.token_sort_ratio: tokenizes the strings and sorts the tokens alphabetically before matching.
- fuzz.token_set_ratio: tokenizes the strings and compares the intersection and the remainder of the token sets.
Comparing the scores on sample strings makes it clear that the token set ratio is the right choice for our data, so we apply it to the dataframe for analysis.
df['ResultCol'] = df.apply(lambda x: yourFuzzyWuzzyFunction(x['col1'], x['col2']), axis=1)
This creates a new column in the existing dataframe holding the row-wise match percentage between the two columns. You can then filter out the entries above your desired threshold and draw conclusions from them.
Hence, FuzzyWuzzy is a go-to library for string matching in large datasets, easing the whole process. And it is not limited to Python: community ports of FuzzyWuzzy exist for other languages, such as Java and C++.
Kudos! Happy fuzzing :)