String matching like a boss
If you’ve been handling large datasets, you might have come across misspelt strings, or partial strings that refer to the same entity. Fuzzy string matching is the solution to such problems.
Usually, we would do the following for matching strings in python:
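For example, plain comparison operators only catch strings that are exactly identical (a minimal illustration; the sample strings are made up):

```python
s1 = "Data Science"
s2 = "data science"

# Exact comparison fails on a mere case difference.
print(s1 == s2)                   # False

# Normalizing case helps, but only for that one kind of difference.
print(s1.lower() == s2.lower())   # True

# A single typo still breaks the match.
print("Dta Science".lower() == s1.lower())  # False
```

This works fine for clean data, but it returns a hard yes/no and cannot tolerate typos or word-order changes.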
But what about strings that are misspelt, reordered, or only partially match?
This is where the FuzzyWuzzy library comes in handy for data analysis.
According to pypi.org,
The FuzzyWuzzy library uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
The library, also described as an approximate string matching library, lets you match strings against a pattern even when they are not identical.
The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one sequence into the other. (source: datacamp.com)
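To make the definition concrete, here is a small dynamic-programming sketch of the distance in plain Python (an illustration only, not the optimized implementation FuzzyWuzzy relies on):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

“kitten” → “sitting” takes three edits (k→s, e→i, insert g), hence the distance of 3.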
Leading consultancies perform string matching using FuzzyWuzzy for a lot of data processes:
- Software with fuzzy string matching is used to find different accounts that place orders to the same address despite varying address-entry styles or typos in the name (like Vidushi Gupta vs. Vidhushi G.). This helps firms see the geographic concentration of their brand, which in turn informs expansion plans.
- Its ability to detect reused text makes it helpful for plagiarism detection and spam filtering.
- Another simple application is search engines. The “Showing results for _____” suggestion checks for typos and similar strings to return better-customized results.
Comparing two columns with FuzzyWuzzy:
First, we determine the appropriate matching function for our dataset by applying each one to two sample strings from it.
Install fuzzywuzzy using
!pip install fuzzywuzzy
And then import the required package
from fuzzywuzzy import fuzz
Now we apply the 4 popular scorers to the two strings:
- fuzz.ratio: matches using the pure Levenshtein distance.
- fuzz.partial_ratio: matches on the best-matching substring.
- fuzz.token_sort_ratio: tokenizes the strings and sorts the tokens alphabetically before matching.
- fuzz.token_set_ratio: tokenizes the strings and compares the intersection and the remainder of the token sets.
Comparing the scores on sample strings makes it clear that the token set ratio is the right choice for our data, so we apply it to the dataframe for analysis.
df['ResultCol'] = df.apply(lambda x: yourFuzzyWuzzyFunction(x['col1'], x['col2']), axis=1)
This creates a new column in the existing dataframe holding the row-wise match percentage between the two columns. You can then filter out the entries above your desired threshold and draw conclusions from them.
Hence, FuzzyWuzzy is a go-to library for string matching in large datasets, easing the whole process. And it is not limited to Python: community ports of FuzzyWuzzy exist for other languages, such as Java and C++.
Kudos! Happy fuzzing :)