Fuzzy matching at scale
From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets.
### Update December 2020: A faster, simpler way of fuzzy matching is now included at the end of this post, with the full code to implement it on any dataset
Data in the real world is messy. Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.
This article focuses on ‘fuzzy’ matching and how it can automate two significant challenges common to a large number of data science workflows (a short code sketch follows this list):
- Deduplication. Aligning similar categories or entities in a data set (for example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity).
- Record Linkage. Joining data sets on a particular entity (for example, joining records of ‘D J Trump’ to a URL of his Wikipedia page).
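To make the idea concrete, here is a minimal sketch of pairwise fuzzy scoring using only Python's standard library (the `similarity` helper is illustrative, not code from this post):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["D J Trump", "D. Trump", "Donald Trump"]

# Score every pair of names; a high ratio suggests the same entity.
for i, left in enumerate(names):
    for right in names[i + 1:]:
        print(f"{left} vs {right}: {similarity(left, right):.2f}")
```

Scoring every pair this way is O(n²) in the number of records, which is why naive fuzzy matching can take hours on large data sets; that is exactly the problem this post sets out to solve.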
By using a novel approach borrowed from the field of Natural Language Processing, we can perform these two tasks even on very large data sets.
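As a loose preview of that idea, the sketch below represents each string as TF-IDF weights over its character n-grams and compares strings with cosine similarity, a standard NLP technique. It assumes scikit-learn is installed, and it is an illustration of the general approach rather than the post's final implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngrams(text, n=3):
    """Break a string into overlapping lowercase character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

names = ["D J Trump", "D. Trump", "Donald Trump", "Boris Johnson"]

# Each name becomes a sparse TF-IDF vector over its character 3-grams.
vectorizer = TfidfVectorizer(analyzer=ngrams)
tfidf = vectorizer.fit_transform(names)

# Cosine similarity between every pair of vectors; similar spellings
# share many n-grams and therefore score highly.
print(cosine_similarity(tfidf).round(2))
```

Because the TF-IDF matrix is sparse, similarity can be computed with fast matrix operations instead of comparing every pair of raw strings, which is what makes this style of matching scale.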