Fuzzy String Matching at Scale

Entity Resolution is Everywhere, and Sometimes All We Got Are Strings

Tim Black · Jul 3, 2019

More and more often, companies are blending data from different sources to enhance and enrich their value. Critical to this goal is entity resolution (or record linkage): making sure we are looking at the same record across multiple sources. In some cases the records carry enough distinct fields to build a probabilistic estimate of whether they refer to the same entity. In other cases, we may only have a single field, such as a name, and we need to decide whether it is close enough to count as a match.

As the Data Grows, so Does the Need for Speed

Fuzzy string matching is not a new problem, and several algorithms are commonly employed (Levenshtein distance, Jaro–Winkler distance). However, given the growing volume of data being matched, it is increasingly important to be able to perform this matching at scale. Instead of comparing every record to every possible match, we can employ a vectorized approach using Natural Language Processing (NLP) libraries that make this not only possible but relatively painless to implement.

TF-IDF Approach

I got excited about using NLP techniques for short string matching after reading about their success in Chris van den Berg's blog post. TF-IDF stands for "term frequency–inverse document frequency" and is a common approach to measuring similarity and dissimilarity among documents in a corpus (a collection of documents). The TF-IDF calculation consists of the following steps:

Pre-processing & Tokenization: Perform any cleaning on the data (case conversion, removal of stopwords & punctuation) and convert each document into tokens. Although tokenization is typically performed at the word level, we have the flexibility to define a token at a lower level, such as an n-gram, which is more useful for short string matching since we might only have a few words in each string.
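
For instance, a character-level trigram tokenizer might look like the following (a minimal sketch; the cleaning rules here are one reasonable choice, not the only one):

import re

def ngrams(string, n=3):
    """Split a string into overlapping character n-grams.

    Lower-cases the text and strips punctuation first, so
    'Confessional, The (1995)' yields trigrams over
    'confessional the 1995'.
    """
    string = re.sub(r'[^\w\s]', '', string.lower())
    return [string[i:i + n] for i in range(len(string) - n + 1)]

print(ngrams('Confessional, The (1995)'))
# ['con', 'onf', 'nfe', 'fes', 'ess', 'ssi', ...]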

Calculate the Term Frequency: The purpose of this step is to determine which words define the document; words that appear more frequently are indicative of the document’s subject matter. For each document (a string in our case), calculate the frequency for each term (token) in the document and divide by the total number of terms in the document. If we define a token as an n-gram, we will calculate the frequency of each n-gram in our string.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Calculate the Inverse Document Frequency: The purpose of this step is to calculate the appropriate weight for each term, depending on how often it appears across all documents. A token that appears in every document is less descriptive of any particular document, and therefore receives a lower weight, than a token that appears in only one of them.

IDF(t) = ln(Total number of documents / Number of documents with term t in it)

Calculate the TF-IDF Weights: Multiply the term frequency by the inverse document frequency.

TF-IDF(t) = TF(t) * IDF(t)
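
Putting these steps together, scikit-learn's TfidfVectorizer can compute the full TF-IDF matrix in one call. A small sketch, reusing the ngrams helper above (note that scikit-learn applies a smoothed variant of the IDF formula by default):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = ['The Batman', 'Batman', 'Playmobil: The Movie', 'Deadwood: The Movie']

# Pass the ngrams function above as the analyzer so tokens are
# character trigrams rather than whole words.
vectorizer = TfidfVectorizer(analyzer=ngrams)
tfidf_matrix = vectorizer.fit_transform(titles)

# Each row holds the TF-IDF weight of every trigram in that title.
print(tfidf_matrix.shape)  # (4, number of distinct trigrams)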

Calculate the Cosine Similarity: Cosine similarity is often used to compare two vectors (in this case, vectors of TF-IDF weights). Data scientists at ING developed a custom library, sparse_dot_topn, that computes top-n cosine similarities on sparse matrices much faster than scikit-learn's built-in cosine_similarity function, and we will use it here.
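
A minimal sketch of the top-n similarity call, assuming the library's awesome_cossim_topn function (its name in the releases available around the time of writing) and the tfidf_matrix from above:

from sparse_dot_topn import awesome_cossim_topn

# Compare the corpus against itself: keep only the top `ntop`
# matches per row with cosine similarity above `lower_bound`.
# Pruning inside the sparse multiplication is what makes this fast.
matches = awesome_cossim_topn(
    tfidf_matrix,                      # source vectors (sparse CSR)
    tfidf_matrix.transpose().tocsr(),  # target vectors, transposed
    ntop=2,
    lower_bound=0.1)

print(matches)  # sparse matrix of the retained cosine similarities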

Application: Matching Film Titles

Matching titles is a perfect use case: in many cases there is little more than a name to use for matching, and we need to find the best match against a medium-to-large dataset. For this example, I will demonstrate the TF-IDF string matching approach by matching titles from the MovieLens Kaggle dataset to the IMDB title dataset. I chose these datasets because (1) the titles in the MovieLens dataset are very similar to the IMDB titles, with only a couple of minor differences, as shown in the example below, and (2) the MovieLens Kaggle dataset contains the IMDB ID key, which we can use to look up the true match for every record and validate our best guess.

MovieLens Title: Confessional, The (Confessionnal, Le) (1995)
IMDB Title: The Confessional (1995)

The TF-IDF approach is well suited for matching movie titles, since some words in a title matter more for matching than others. Articles such as "The" and "A" are worth keeping around for matching, rather than being removed as stopwords (a common early step in the NLP process), since titles may differ by only an article (e.g., The Batman vs. Batman). Although we want to keep articles and other common words, we want to place less weight on them than on more distinctive words. For example, Playmobil: The Movie should be more similar to Playmobil than to Deadwood: The Movie. As mentioned earlier, because TF-IDF accounts for the distinctiveness of tokens across documents (here each title is a document), we get this benefit of variably weighted tokens.

Code

For this specific exercise, my goal is to find the top candidate IMDB primary title (target) among 868,517 titles for each of the 58,094 MovieLens titles (source), and then compare that result with the true match. Building off Chris van den Berg and the ING Advanced Analytics team's work, I created a Python StringMatch class to perform this matching.

With the StringMatch class, we can now match two lists of strings in just a few lines of code:
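
A minimal sketch of what such a class might look like, assembled from the pieces above (the ngrams helper, TfidfVectorizer, and awesome_cossim_topn); the details, including the movielens_titles and imdb_titles inputs, are assumptions rather than the exact original implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sparse_dot_topn import awesome_cossim_topn

class StringMatch:
    """Find the best target match for each source string via TF-IDF."""

    def __init__(self, source, target):
        self.source = source  # list of strings to match
        self.target = target  # list of candidate strings

    def match(self, ntop=1, lower_bound=0.3):
        # Fit one vocabulary over both lists so source and target
        # vectors live in the same n-gram space.
        vectorizer = TfidfVectorizer(analyzer=ngrams)
        vectorizer.fit(self.source + self.target)
        source_tfidf = vectorizer.transform(self.source)
        target_tfidf = vectorizer.transform(self.target)

        # Sparse top-n cosine similarity between every source row
        # and every target row.
        matches = awesome_cossim_topn(
            source_tfidf, target_tfidf.transpose().tocsr(),
            ntop=ntop, lower_bound=lower_bound)

        # Flatten the sparse result into a tidy DataFrame.
        coo = matches.tocoo()
        return pd.DataFrame({
            'source': [self.source[i] for i in coo.row],
            'match': [self.target[j] for j in coo.col],
            'similarity': coo.data})

# Usage (movielens_titles and imdb_titles are hypothetical lists):
matcher = StringMatch(movielens_titles, imdb_titles)
top_matches = matcher.match(ntop=1)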

Performance: Speed

This is where the TF-IDF method really shines. The chart below shows the dramatic difference between the Levenshtein distance approach (using Python's fuzzywuzzy package) and the TF-IDF approach advocated here when finding the top match among 868,517 titles.

The difference in speed is not even close. It took nearly 8 minutes on my laptop to find the top match for just 10 titles against the target dataset using the fuzzywuzzy package (I stopped increasing the sample size n for the fuzzywuzzy approach after this). Using TF-IDF, I was able to find the top match for all 58,094 source titles in a little over 18 minutes.
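
For reference, the brute-force baseline looks roughly like the following sketch, assuming fuzzywuzzy's process.extractOne and the hypothetical title lists from earlier:

from fuzzywuzzy import process

# Brute force: each call scans all ~868K target titles for the best
# Levenshtein-ratio match, which is why even 10 titles take minutes.
for title in movielens_titles[:10]:
    best_match, score = process.extractOne(title, imdb_titles)
    print(f'{title} -> {best_match} ({score})')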

Performance: Accuracy

Right off the bat, we are able to match 83% of the titles correctly; however, I wanted to look further at the mismatches. In 9 (45%) of the first twenty mismatches, the matching algorithm picked a title with the exact same name and year as the correct title. In other words, string matching is not enough when two candidate titles share the exact same name and year; to go further, we would need more information to differentiate them.

I would have liked to run an accuracy comparison with the Levenshtein distance approach using Python's fuzzywuzzy package, but, as shown in the plot above, it would simply have taken too much time. This is exactly why I recommend the TF-IDF approach for data at any serious scale.

Attribution

MovieLens Data Source: https://www.kaggle.com/grouplens/movielens-latest-full

IMDB Data Source: https://www.imdb.com/interfaces/
Information courtesy of IMDb (http://www.imdb.com).
Used with permission.
