Deduping data without a key

One of the challenges I’ve been working through is identifying unique person records in a blizzard of API data. The API purports to provide a unique key for each person but, in reality, not so much (or, honestly, not at all). As a result, I’m working to identify unique people by the attributes their records share. Email, for instance, can be a reliable unique identifier, but we have many families in our records where dad or mom shares an email address with the kids. So we’re taking a more holistic approach and examining all the attributes of each record. In the case below (credit: Melissa), do these two records represent one woman or two?

Two similar records (one for Beth Smithe, one for Elizabeth Smith), but do they refer to the same person?

Here are some observations about the challenge that illustrate why it’s not simply a byte-for-byte comparison.

  • Can we identify “Beth” as a nickname for “Elizabeth,” and would this woman use both names? Note that the record with the more-formal “Elizabeth” also contains the complete spelling of her city while the informal “Beth” record also uses an informal abbreviation for the city.
  • Could the last name of “Smithe” on the more-informal record represent a typo that should be considered a match for “Smith” on the more formal record?
  • Both records list the same address, so that’s looking positive.
  • The purchase history looks consistently upscale across both records.
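
To make the observations above concrete, here is a minimal sketch of a field-by-field comparison using only Python’s standard library. It is not how the dedupe library works internally. The nickname table, the helper names, and the street address are all made up for illustration; only the names Beth Smithe and Elizabeth Smith come from the example.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"beth": "elizabeth", "liz": "elizabeth", "bob": "robert"}


def canonical_first_name(name):
    """Lowercase a first name and map a known nickname to its formal form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compare_records(rec_a, rec_b):
    """Score each field separately so a person (or a model) can weigh them."""
    first_a = canonical_first_name(rec_a["first_name"])
    first_b = canonical_first_name(rec_b["first_name"])
    return {
        "first_name": similarity(first_a, first_b),
        "last_name": similarity(rec_a["last_name"], rec_b["last_name"]),
        "address": similarity(rec_a["address"], rec_b["address"]),
    }


# Illustrative records; the address is invented for this sketch.
beth = {"first_name": "Beth", "last_name": "Smithe", "address": "123 Main St"}
elizabeth = {"first_name": "Elizabeth", "last_name": "Smith", "address": "123 Main St"}

scores = compare_records(beth, elizabeth)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

Keeping one score per field, rather than a single yes/no answer, leaves room for the human judgment described next: a near-match on last name means more when the address matches exactly.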

So now we inject human judgment to determine whether these records represent one woman or two. If we judge both records to represent the same woman, we have created a rule we can reuse for other records about nicknames, typos, formality of record source, addresses, and purchase history.

As we create those rules, our algorithm grows smarter and more accurate as it applies them over the entire dataset. To make that happen, I’m using the dedupe library in Python. It approaches this initially as a clustering problem, and then as I provide rules from the examples it identifies, it can classify more of our blizzard of records into unique people.
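
Once pairs of records have been judged to be the same person, they still need to be grouped into unique people. The sketch below shows one simple way to do that, a union-find over matched pairs. This is my own stdlib illustration of the grouping idea, not the dedupe library’s API or its clustering method, and the record IDs and the function name are hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record IDs into clusters, given pairs judged to be the same person."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return list(clusters.values())


# Hypothetical IDs: r1 matches r2, and r2 matches r3, so all three are one person.
record_ids = ["r1", "r2", "r3", "r4"]
matched_pairs = [("r1", "r2"), ("r2", "r3")]
people = cluster_matches(record_ids, matched_pairs)
print(people)
```

One caveat with this approach: matches chain together, so a single bad pair can merge two different people into one cluster. That is a good reason to keep a human reviewing the borderline cases.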

I’ll keep you updated on what I learn from the process, and I welcome your ideas and feedback. Feel free to comment or tweet back your thoughts.

Originally published April 15, 2018.

Written by

Diving into oceans of data to discover pearls that help you make wiser decisions. Predictive data analyst, machine learner, data engineer. Disciple in Austin.
