Building software to manage the complexities of background checks

Ben Jacobson · Checkr Engineering · May 23, 2019

At Checkr we are building software to help companies hire at scale, and our background checks are a big part of what makes that possible. On the surface a background check seems straightforward — you might think it’s as simple as looking up records in a database and returning some data, but in practice it’s messy and slow.

At a very high level, a background check falls into two phases:

  1. Getting Data
  2. Reporting Data

The goal of this blog post is to reveal a little about how Checkr builds software to manage this complexity and to help modernize background checks for our customers and for people in search of work.

Getting Data

Getting data might seem like a simple thing to do. A couple of API calls, parse some JSON, a few for-loops, and we've got a background check going. At a very high level you'd be right; in practice, however, it's far from that clean or that simple.

There are national databases that haphazardly aggregate records from state and county reporting agencies. They include XML documents which contain tags which themselves contain base64 encoded XML documents 🤯. There is no standard for how to describe this data — we have to normalize it into something our system, customers, and applicants can understand.
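As a sketch of what that normalization can involve, here's a toy Ruby snippet that unwraps one of these nested payloads. The tag names here are hypothetical (every source formats this differently), not a real feed format:

require "nokogiri"
require "base64"

# Toy example: unwrap an XML tag whose text is a base64-encoded XML document.
outer = Nokogiri::XML(File.read("aggregated_records.xml"))
charges = outer.xpath("//EncodedRecord").flat_map do |node|
  inner = Nokogiri::XML(Base64.decode64(node.text))
  inner.xpath("//Charge").map { |charge| charge.text.strip }
end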

"Nolle Prosequi"? Yeah… that’s literally latin for “case dismissed” and is still used today in modern courts! And, it is sometimes (aka often) misspelled!

There are 54 DMVs (yes, 54! Don't forget Puerto Rico, Guam, American Samoa, and Washington, DC) that all handle records differently. There are hundreds of municipal and state databases. There are roughly 3,000 counties in the US, and they all have unique laws and regulations governing how we can report data.

Let's focus on counties, because they represent our largest surface area for data. Counties break down into two groups: counties with electronic records (fast 👏) and counties with physical records (slow 😨).

SELECT court_access_method, SUM(population)
FROM counties
GROUP BY court_access_method;
>>
court_access_method          | sum(population)
-----------------------------+----------------
electronic_records_available |     160,259,716
in_court_researcher          |      87,369,080
clerk_assisted_county        |      61,125,939

The above is a query of one of our proprietary internal databases managing metadata around counties. It's based on a variety of sources, including census data, voting records, postal data, and various government lists. It shows us that 48% of the US population (the 148,495,019 people in in-court-researcher and clerk-assisted counties, out of 308,754,735 total) currently lives in a county that lacks electronic records. (I'm looking at you, California… 😅)

Lacking electronic records means we need to physically send someone into the courthouse to wait in line, talk to the court clerk, go into the basement, open a filing cabinet, pull out a physical piece of paper, avoid paper cuts, read the records, type them into their phone, and finally submit the records back to us.

Let's say that for a given applicant we search 5 counties and 3 of them are offline. Our system splits the work for each county into long-lived asynchronous jobs:

  • Some might complete in seconds; others might take days if the county is offline.
  • Some might try an online source and fall back to an offline source if our quality bar isn't met.
  • Some might result in pointer or transferred cases, which means we need to start another county search elsewhere.

We use Kafka to manage these asynchronous tasks so that different services and teams can work on reliably implementing each search type. We are working on more explicit workflow frameworks, tooling, and visualizations to help manage these jobs in production.
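As a rough sketch (not our production code; the broker address, topic name, and payload shape are invented for illustration), fanning those county searches out as messages might look something like this with the ruby-kafka gem:

require "kafka"
require "json"

# Hypothetical sketch: enqueue one asynchronous search job per county.
kafka = Kafka.new(["kafka1:9092"], client_id: "county-search-producer")

%w[king_wa cook_il harris_tx fresno_ca kern_ca].each do |county|
  kafka.deliver_message(
    { report_id: "abc123", county: county }.to_json,
    topic: "county-searches"
  )
end

# Downstream consumers pick a strategy per county: electronic lookup,
# fall back to an in-court researcher, or kick off a follow-up search
# when a pointer/transferred case turns up.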

Once all of the jobs are done and we have a final list of results we begin the next phase…

Reporting Data

Federal, state, and county laws affect how background checks operate, and different government agencies are tasked with enforcing those laws. They generally do this for a very good reason: to protect consumers. Just because someone made a mistake some years ago doesn't always mean it should adversely impact their job opportunities. Our software is responsible for helping protect consumers and conforming to these laws, which means we need to analyze every result to make sure it is compliant.

There are even laws that affect how certain types of employers use the results uncovered in the background investigation. For example, doctors, nurses, financial advisors, home care providers, and transportation companies are often required by law to disqualify individuals with certain offenses. And, these laws can even contradict each other!

So, Checkr needs a system to apply these rules and correctly determine which data we can and can’t report.

Here’s an example rule:

if (record.arrest? || record.dismissed? ||
    record.alternative_adjudication?) &&
   record.date <= 7.years.ago
  record.display = false
  record.save!
end

This describes how we should treat arrests, dismissals, and alternative adjudications that are more than 7 years old: record.display = false means we shouldn't report them.

This type of system has worked well but there have been some growing pains over the years.

In the above example the rule only exists in code. There is no way for non-engineers to dissect this result and understand what happened and why. If someone from our quality team is investigating a record, we have no way to provide them context. "Why is this record hidden?" they might ask. We could maybe extend it like this:

record.display = false
record.reason = "Hide arrest/dismiss/alt that are older than 7yrs"
record.save!

This is an improvement: we at least have a reason associated with the outcome now. However, it's still tightly coupled to the record model.

  • There are side effects because it updates data on the record.
  • The compliance logic is hardcoded by engineers (who aren't compliance experts) in a place that is inaccessible to our legal team (who are the experts).
  • There is a subtle dependency on the current time. (What if we want to rerun this rule later? Or project what the outcome would be in 6 months?)

How can we solve this?

Rules Engine

A rules engine offers a simple contract and allows us to split apart the rules, data, results and code.

results = engine(rules, data)

Here’s the same rule represented as JSON for input to our rules engine:

{
  "event": {
    "params": {
      "display": false,
      "message": "Hide arrest/dismissed/alternative that are 7+ years old"
    }
  },
  "conditions": {
    "all": [
      {
        "fact": "applied_filter",
        "operator": "in",
        "value": [
          "arrest_filter",
          "dismissed_filter",
          "alternative_adjudication_filter"
        ]
      },
      {
        "fact": "years_since_context_date",
        "operator": "greaterThanInclusive",
        "value": 7
      }
    ]
  }
}
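To make the engine contract concrete, here's a toy evaluator; a minimal sketch, not our production engine. Note that the facts (including years_since_context_date) are computed ahead of time relative to an explicit context date, which removes the current-time dependency called out earlier:

require "json"
require "date"

# Toy rules engine: evaluate JSON rules against a hash of precomputed facts.
OPERATORS = {
  "in"                   => ->(fact, value) { value.include?(fact) },
  "greaterThanInclusive" => ->(fact, value) { fact >= value }
}.freeze

def engine(rules, facts)
  rules.filter_map do |rule|
    matched = rule["conditions"]["all"].all? do |cond|
      OPERATORS.fetch(cond["operator"]).call(facts[cond["fact"]], cond["value"])
    end
    rule["event"]["params"] if matched
  end
end

# Facts are precomputed relative to an explicit date, so a run can be
# reproduced later or projected into the future.
context_date = Date.new(2019, 5, 23)
facts = {
  "applied_filter"           => "dismissed_filter",
  "years_since_context_date" => ((context_date - Date.new(2010, 6, 1)) / 365.25).floor
}

rule = JSON.parse(File.read("hide_old_dismissals.json")) # the rule above
engine([rule], facts)
# => [{"display"=>false, "message"=>"Hide arrest/dismissed/alternative that are 7+ years old"}]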

Why is the rule represented in JSON?

  • If you have very complex rules (e.g. based on the geography of the candidate, account settings, the existence of other rules, etc.), then if/else statements become hard to manage.
  • We can now render these rules in a UI or PDF that we can provide to our legal team and regulators in a way they can understand.
  • We can snapshot the rules and run multiple rule sets against the same data. For example, when we make a change we might want to run the change against previous production results and compare the difference (we call this backtesting internally; see the sketch below).
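A hypothetical backtest, reusing the toy engine above (file names and sample facts are invented), could be as simple as running both rule sets over the same historical facts and diffing the outcomes:

require "json"

old_rules = JSON.parse(File.read("rules_v1.json")) # snapshot from production
new_rules = JSON.parse(File.read("rules_v2.json")) # proposed change

historical_facts = [
  { "applied_filter" => "dismissed_filter",  "years_since_context_date" => 9 },
  { "applied_filter" => "conviction_filter", "years_since_context_date" => 2 }
]

changed = historical_facts.select do |facts|
  engine(old_rules, facts) != engine(new_rules, facts)
end
puts "#{changed.size} of #{historical_facts.size} records change outcome"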

Here is a rule being rendered for our legal team:

[Image: a rule rendered in plain language for legal review]

And here is a more complex rule with a custom UI and test harness:

[Image: complex rule with a WIP UI / test harness]

The above is a proof of concept, but it represents the direction we are moving in; we still have a long way to go. Ultimately, though, it will allow us to move faster and more confidently when laws change, because we can easily test and reproduce results.

Every step in the background check process is messy, error prone, hard to reason about, scary for job seekers, and governed by a long list of different entities. At Checkr, our goal is to make this process fair for our applicants — and that starts with simple and intuitive interfaces into this complex problem.

If any of this sounds interesting, we are hiring! There are a lot more problems still to solve!

Send me an email at ben@checkr.com
