
Fuzzy matching at scale

From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets.

Josh Taylor
TDS Archive
8 min read · Jul 1, 2019


Same but different. Fuzzy matching of data is an essential first step in a huge range of data science workflows.

### Update December 2020: A faster, simpler way of fuzzy matching is now included at the end of this post, with the full code to implement it on any dataset ###

Data in the real world is messy. Dealing with messy data sets is painful and burns through time that could be spent analysing the data itself.

This article focuses on ‘fuzzy’ matching and how it can help to automate two significant challenges in a large number of data science workflows:

  1. Deduplication. Aligning similar categories or entities in a data set (for example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity; see the sketch after this list).
  2. Record Linkage. Joining data sets on a particular entity (for example, joining records of ‘D J Trump’ to a URL of his Wikipedia page).
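
Before getting to the scalable approach, it helps to see what ‘fuzzy’ similarity means in practice. The snippet below scores the name variants from point 1 against a canonical entity using Python’s standard-library difflib; this is a toy illustration of the matching task, not the method this post develops.

```python
from difflib import SequenceMatcher

# Name variants from the deduplication example above
variants = ["D J Trump", "D. Trump", "Donald Trump"]
canonical = "Donald Trump"

for name in variants:
    # ratio() returns a similarity score between 0 and 1
    score = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
    print(f"{name!r} -> {score:.2f}")
```

Each variant scores well above an unrelated string would, which is exactly the signal deduplication and record linkage need. The catch, as we will see, is that pairwise scoring like this explodes on large data sets.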

By using a novel approach borrowed from the field of Natural Language Processing, we can perform these two tasks on even the largest data sets.
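
As a rough preview of where the post is heading: the idea is to represent each string as a TF-IDF vector of character n-grams and compare vectors with cosine similarity, which turns matching into fast sparse linear algebra. The sketch below shows the bare bones of that representation with scikit-learn; the parameter choices (the char_wb analyzer, 3-grams) are illustrative assumptions, not the settings used later in the post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["D J Trump", "D. Trump", "Donald Trump", "Angela Merkel"]

# Represent each string as a TF-IDF vector of character 3-grams;
# analyzer and ngram_range here are illustrative choices.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarity between all names (computed densely here
# for readability; at scale this becomes a sparse top-n multiplication).
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

The Trump variants cluster together with high similarity while ‘Angela Merkel’ scores near zero against them, all without any pairwise string comparison loop.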

The problem with Fuzzy Matching on large data
