Who’s Who? Identity Matching Using Applied Machine Learning

Patrick Dalton
Checkr Engineering
7 min read · Aug 7, 2019

Ask yourself — how many people have your first and last name? What about the same name and date of birth (DOB)? How many people have the same name and live in your state?

At Checkr, millions of records run through our system every day. These records are generally from court systems and contain varying amounts of PII (Personally Identifiable Information). Sometimes we get social security and driver’s license numbers, but often the information is limited to just name and DOB.

As a CRA (Consumer Reporting Agency), Checkr is responsible for delivering maximum accuracy without fail. In other words, we need to be really, really good at identity matching.

At what point are two identities the same?

Given Identity A and Identity B, what are the chances that they are the same person? Take myself, Patrick Dalton, born March 5, 1991. Now, examine the identities below. Which ones look like they may be the same person?

After looking at just a few examples, the complexities become apparent. The problem becomes even more complicated when dealing with common names. There are approximately 50,000 “Michael Smith”s in the United States! The famous birthday problem tells us that with 23 people in a room there is a 50% chance of 2 people having the same birthday, and this probability increases to 99.9% with 70 people in a room. With about 500 new “Michael Smith”s born each year, it becomes evident that a name and DOB alone are often weak identifiers.
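The birthday-problem numbers above are easy to verify with a few lines of Ruby (a quick sketch, assuming 365 equally likely birth dates):

```ruby
# Probability that at least two of n people share a birthday,
# assuming 365 equally likely birth dates (ignoring leap years).
def birthday_collision_probability(n)
  p_all_unique = (0...n).reduce(1.0) { |p, k| p * (365 - k) / 365.0 }
  1.0 - p_all_unique
end

birthday_collision_probability(23) # ≈ 0.507
birthday_collision_probability(70) # ≈ 0.999
```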

Codifying the solution

How do we automate identity matching with software? The simplest approach is to use string matching:

def match?(identity_a, identity_b)
  return false unless name_match?(identity_a.name, identity_b.name)
  return false unless dob_match?(identity_a.dob, identity_b.dob)
  true
end

However, a simple typo from a court researcher could sink our ship! Patrick != Patrikc, but I’m willing to bet that the latter is a misspelling.
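One common way to tolerate typos is an edit-distance check: treat two names as a likely match when only a couple of character edits separate them. Here is a minimal Levenshtein distance in Ruby — an illustrative sketch, not Checkr's actual matcher:

```ruby
# Classic dynamic-programming Levenshtein edit distance: the minimum number of
# single-character insertions, deletions, and substitutions to turn a into b.
def levenshtein(a, b)
  rows = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| rows[i][0] = i }
  (0..b.length).each { |j| rows[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1,        # deletion
                    rows[i][j - 1] + 1,        # insertion
                    rows[i - 1][j - 1] + cost] # substitution
                   .min
    end
  end
  rows[a.length][b.length]
end

levenshtein("Patrick", "Patrikc") # => 2, small enough to suspect a typo
```

A hypothetical `name_match?` could then accept names within a small distance threshold rather than requiring exact equality.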

The evolution of the solution

More `if` statements

After looking at live examples, we start realizing that reality is a bit more complicated:

As the edge cases pop up (green “x”s and red “+”s), we start to bail water with `if` statements. Perhaps we make a rule that the two identities must have similar address history:

if has_not_lived_in_same_state?(identity_a, identity_b)
  return 'no_match'
end

Then, a month later we may find that we are producing false negatives. For example, if our identity is Jimmer Taft Fredette from the state of Utah and we see a record with the same name from Wyoming, we still want to match the two identities, because such a rare name is strong evidence on its own.

if name.uncommon? && has_not_lived_in_same_state?(identity_a, identity_b)
  return 'match'
end

As time goes on, we start to find that we are incorrectly matching names that are somewhere between common and uncommon.

if name.average_commonality? &&
   has_not_lived_in_same_state?(identity_a, identity_b)
  return 'no_match'
end

If, else if, else if, else if, else if, else…..

🙊🙉🙈

These rules get out of hand very quickly, and all of a sudden no one knows what’s going on. We aren’t able to generate enough binary statements to fit a function that excludes all the mismatches and includes all the matches. In order to innovate, we realize that our fundamental understanding of the problem needs to change.

The world is continuous, not binary

It turns out there are two distinct match attributes for each piece of PII:

  1. Commonality — how many people share this information?
  2. Similarity — does the PII match?

These two features affect one another. For example, if my government name were McKeyboard Computerton III, then even with an egregious typing error (Keyboard McComp), we could still match the names based on their extreme rarity.

A probabilistic approach

For the commonality component of identity matching, we can estimate how many people share each piece of PII. At Checkr we use a representative sample of 40,000,000 names from the United States to benchmark name commonality. We also use population information at the county and state level. Middle names, four part names, and suffixes give us even more information.

In the U.S., 11,000 people are born every day and 4,000,000 people are born each year. That’s a lot of people with the same DOB!

Multiplying the commonality of a name and a date of birth offers a good proxy for the number of people that share that PII:

predicted_num_people_with_pii = smoothing_function(
  (num_people_with_name * num_people_with_dob) / num_people_in_america
)
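Plugging in the Michael Smith figures from earlier gives a feel for this. The sketch below omits the smoothing function, assumes name and DOB are independent, and uses a rough 330,000,000 U.S. population figure of my own:

```ruby
# Expected number of people sharing a given name and DOB, assuming the two
# attributes are independent. All inputs are rough, illustrative counts.
def predicted_num_people_with_pii(num_with_name, num_with_dob, population)
  num_with_name.to_f * num_with_dob / population
end

# ~50,000 "Michael Smith"s, ~11,000 people born on any given date:
predicted_num_people_with_pii(50_000, 11_000, 330_000_000) # ≈ 1.7
```

So a Michael Smith with a matching DOB is expected to collide with more than one other person, while a rare name with the same DOB would produce a value far below 1.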

Taking a probabilistic approach allows us to write some very powerful rules based on the level of matching PII. Instead of our previous if statements based on binary attributes, we are able to write smarter rules like:

if predicted_num_people_with_pii < 1
  return 'match'
end

Instead of hundreds of fragmented functions, we have a single function for commonality that is grounded in reality. However, a good commonality function alone isn't enough, because the data is often messy: typos, poor formatting, and differing information sources. What do we do in these cases?

A data driven approach

So far we have developed a solution for assessing whether two identities match, but now we need to build a more flexible model that can account for both deterministic name-matching rules and the fuzzier nuances of this problem. We need to set up a framework that allows us to experiment. The basic idea here is: let's turn this into a y = mx + b problem. Whether the m and b (our function) are determined by a classical ML model, a deep net, or a piece of custom code is up to us. As long as we have the inputs (x) mapped to labels (y), we are on the right path.

How do we get good labels? Using subject matter experts and consensus. At Checkr, our subject matter experts consist of our product quality operations, investigations, quality assurance, and legal teams. Without these subject matter experts, it’s hard to build anything useful. Once we have a sufficient number of labels, we can train our model or start coding the solution. Producing a representative distribution and precise labels can be a large effort, but it’s required to make a highly accurate system.

What happens if we don’t have enough information to discern whether two identities are the same? In this case we add a third label to our model: “need_more_pii”. A human reviews identities with this label, and we feed their evaluation back into the model for further tuning. This “human-in-the-loop” approach allows subject matter experts to provide the most value, while we see huge gains from automation on the easier samples.
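One way to picture the y = mx + b framing with three labels is a weighted feature score with two thresholds. The feature names, weights, and thresholds below are made up for illustration; in practice they would come from a trained model or tuning against labels:

```ruby
# Made-up weights (the "m") and bias (the "b") over illustrative features.
WEIGHTS = { name_similarity: 2.0, dob_match: 1.5, name_rarity: 1.0 }.freeze
BIAS = -2.0

def identity_score(features)
  BIAS + WEIGHTS.sum { |name, weight| weight * features.fetch(name, 0.0) }
end

# Two thresholds carve the score into three labels; the uncertain middle band
# is routed to a human reviewer.
def identity_label(features)
  score = identity_score(features)
  if    score > 1.0  then 'match'
  elsif score < -1.0 then 'no_match'
  else  'need_more_pii'
  end
end

identity_label(name_similarity: 1.0, dob_match: 1.0, name_rarity: 1.0) # => "match"
identity_label(name_similarity: 1.0, dob_match: 0.5)                   # => "need_more_pii"
```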

Putting it all together — identity matching feature vectors

As mentioned before, there are several approaches we can take to make a state-of-the-art identity matching system. Here is how Checkr does it: given two identities, we generate some features about each identity and some features about their relation.

For example, with names we have clues: commonality and common nicknames. For the relation of two names, we have clues like the Jaro-Winkler distance, whether one is a known nickname of the other, and how many matching parts each name has.
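A sketch of what such relation features might look like, with a tiny hypothetical nickname table standing in for a real nickname dataset (and omitting the Jaro-Winkler distance):

```ruby
# A tiny, hypothetical nickname table; a real system would use a large dataset.
NICKNAMES = {
  "patrick" => %w[pat paddy rick],
  "william" => %w[bill will billy]
}.freeze

# Relation features between two full names: how many whitespace-separated
# parts they share, and whether one first name is a known nickname of the other.
def name_relation_features(name_a, name_b)
  parts_a = name_a.downcase.split
  parts_b = name_b.downcase.split
  {
    matching_parts: (parts_a & parts_b).size,
    known_nickname: parts_a.any? { |p| NICKNAMES.fetch(parts_b.first, []).include?(p) } ||
                    parts_b.any? { |p| NICKNAMES.fetch(parts_a.first, []).include?(p) }
  }
end

name_relation_features("Pat Dalton", "Patrick Dalton")
# => { matching_parts: 1, known_nickname: true }
```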

As we saw in the example above, we can also take the commonality of the name parts. For date of birth, it’s helpful to define some common transcription mistakes. A 1 can look like a 7, day digits are often flipped (10/12 vs 10/21), and so on. We can also consider the time difference between two dates of birth. Are they 2 days apart, or 6 months apart? Given this information, we can create a vector for the identity match. The following example maps the Patrick Dalton vs. Patricia Dalton match pair described above.
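The date-of-birth clues mentioned above might translate into features like these — a hedged sketch; real transcription-error features would be richer:

```ruby
require "date"

# Illustrative DOB-pair features: exact match, flipped day digits (10/12 vs
# 10/21), and how many days apart the two dates are. Note that day_digits_flipped
# is deliberately naive: it is also true for palindromic days like 11.
def dob_features(dob_a, dob_b)
  {
    exact_match: dob_a == dob_b,
    day_digits_flipped: format("%02d", dob_a.day).reverse == format("%02d", dob_b.day) &&
                        dob_a.month == dob_b.month && dob_a.year == dob_b.year,
    days_apart: (dob_a - dob_b).abs.to_i
  }
end

dob_features(Date.new(1991, 10, 12), Date.new(1991, 10, 21))
# => { exact_match: false, day_digits_flipped: true, days_apart: 9 }
```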

What we are doing here is creating a feature set similar to one that subject matter experts would develop over time. When a human looks at PII, they likely subconsciously compute similar features and develop a decision tree based on those features. With software, we can develop a rich set of granular features that allows us to create a function that optimizes accuracy.

With our labels and the expressiveness of these features, our software engineers can now explore applying machine learning, or even quickly iterate on a bunch of if/else statements!

Now our function is much more complicated than our initial exact match function, resulting in a highly accurate identity matching engine. In the end it turns out that Patrick is rarely Patricia, and Patrick is sometimes Pat, but it’s better to just let the data decide.

Checkr’s mission is to build a fairer future. Building an accurate identity matching engine is one way we are supporting that mission. If you’re interested, come chat with us — we are hiring!

Patrick Dalton
Software Engineer @ Checkr. Board Game Player @ My Apartment.