International expansion and multi-region deployments

Jason Brown
Checkr Engineering
Mar 8, 2021

Checkr’s fair chance hiring initiative unblocked 400k job candidates in 2019, and 1.5 million in 2020, who were previously ineligible due to criminal records irrelevant to their prospective positions. This was accomplished in partnership with 16,000+ US-based customers. Many of these customers expressed interest in managing not just domestic but global hiring within one platform; to enable this, Checkr needed expanded, robust support for international checks.

International checks require multi-region application deployments to meet data privacy and localization requirements, particularly those of the General Data Protection Regulation (GDPR) in the European Union (EU). The GDPR, a set of protections intended to keep individuals’ data secure, went into effect on May 25, 2018. For Checkr, meeting GDPR requirements means running applications and databases in both the US and EU regions. Supporting this required three engineering must-haves:

  1. Multi-region deployment capabilities
  2. Cross-region traffic routing
  3. Data store strategies (for cases where routing alone is not sufficient)

Pre-EU deployment… not supporting candidates in Europe

Multi-region deployment capabilities

Checkr has a 7-year-old monolith built with Ruby/Sinatra, along with dozens of supporting microservices in Ruby or Go. Until recently, these were deployed to staging and production environments with an internal open-source project, Codeamp. Checkr migrated to GitLab because it offers a single application for source code management, CI/CD, and security. In particular, its declarative CI/CD tooling simplifies multi-region deployments via the .gitlab-ci.yml file:
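A simplified sketch of how a multi-region deploy can be expressed declaratively (job names, cluster contexts, and scripts below are illustrative, not Checkr’s actual pipeline):

```yaml
stages:
  - build
  - deploy

build:
  stage: build
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

# One deploy job per region; a shared template fans out to US and EU clusters.
.deploy_template: &deploy
  stage: deploy
  script:
    - kubectl --context "$KUBE_CONTEXT" set image deploy/app
      app="$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

deploy_us_production:
  <<: *deploy
  environment: production-us
  variables:
    KUBE_CONTEXT: us-production

deploy_eu_production:
  <<: *deploy
  environment: production-eu
  variables:
    KUBE_CONTEXT: eu-production
```

Because each region is just another job against the same image, adding a region is a matter of adding a job, not re-architecting the pipeline.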

All services needed to be migrated from Codeamp to GitLab. Migrating the Checkr monolith alone took six weeks, split across multiple phases (move code and CI, deploy to staging, deploy to production, and bugfixes/cleanup). With customers initiating tens of thousands of reports per day, a clean cutover from the old deployment to the new one was imperative; it was accomplished by incrementally shifting traffic volume to the new deployment.

With services migrated from Codeamp to GitLab, the next step was for dev teams to run their applications in multiple regions. After considerable team coordination, dozens of backend services now deploy to both US and EU staging and production environments:

20+ dev teams supported their services in multiple clusters

Cross-region traffic routing

The Checkr APIs and customer dashboard must appear to users as a single, unified experience. We rely on Cloudflare and an API gateway (Kong) for API routing. A Cloudflare worker inspects each request to api.checkr.com and routes it to the correct region; the routing rules are based on the country encoding in the resource identifier. Kong takes over once the request is within the target region. A detailed review of Checkr’s setup can be found in the post “Hybrid API management with Kong”.

As an extra validation step, the application layer verifies the request as well:
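A minimal sketch of what such a check could look like in the Ruby monolith (the ID encoding, country list, and method names here are invented for illustration; the real scheme differs):

```ruby
# Illustrative only: assume EU resource IDs carry a country code suffix,
# e.g. "cand_123_de". Checkr's actual encoding is different.
EU_COUNTRIES = %w[de fr es it nl].freeze

# Derive the owning region from the country encoded in the identifier.
def region_for(resource_id)
  country = resource_id.split("_").last
  EU_COUNTRIES.include?(country) ? "eu" : "us"
end

# Application-layer guard: reject requests that Cloudflare misrouted.
# In a Sinatra app this would typically run in a `before` filter.
def verify_region!(resource_id, current_region)
  unless region_for(resource_id) == current_region
    raise "misrouted request: #{resource_id}"
  end
  true
end
```

The worker remains the primary router; this check is defense in depth, catching any request that reaches the wrong region despite the edge rules.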

Data store strategies

Cross-region routing is the preferred solution for managing multi-region network traffic. However, due to service or customer requirements, there were a few cases where a data-store replication or other strategy was necessary:

MySQL

Certain resources in the monolith’s MySQL database are region-specific, while others must be available across regions. For instance, candidate PII is region-specific, while customers have a single global account rather than separate US and EU accounts. Ideally the monolith would have been broken up into several microservices and cross-region routing used; this was not possible given time constraints and the disruption to other work. Instead, we used AWS Database Migration Service (DMS) to manage the replication task for certain resources between the source (US) and target (EU) data stores. Which models to replicate is determined by an annotation on the model:
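A hypothetical sketch of such an annotation (the macro name, schema, and DMS rule shape below are invented for illustration): models opt in with a class-level macro, and the collected table list can then be turned into DMS table-mapping rules for the US-to-EU replication task.

```ruby
# Illustrative only, not Checkr's actual code.
module CrossRegionReplication
  REPLICATED_TABLES = []

  # Class-level macro: mark this model's table for cross-region replication.
  def replicate_cross_region
    REPLICATED_TABLES << table_name
  end
end

class ApplicationModel
  # Stand-in for ActiveRecord-style table naming.
  def self.table_name
    name.downcase + "s"
  end
  extend CrossRegionReplication
end

# Customer accounts are global, so the model opts in:
class Account < ApplicationModel
  replicate_cross_region
end

# Candidate PII is region-specific, so Candidate does NOT opt in:
class Candidate < ApplicationModel
end

# Generate DMS selection rules from the annotated models.
def dms_table_mappings
  rules = CrossRegionReplication::REPLICATED_TABLES.each_with_index.map do |table, i|
    { "rule-type" => "selection", "rule-id" => (i + 1).to_s,
      "object-locator" => { "schema-name" => "checkr", "table-name" => table },
      "rule-action" => "include" }
  end
  { "rules" => rules }
end
```

Keeping the annotation on the model keeps replication policy next to the data it governs, rather than buried in infrastructure config.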

Kafka

Two services, Billing and API Gateway (Kong), require Kafka topic replication across regions. For this we utilize MirrorMaker2. The Billing service consumes EU billing events in the US cluster, while API Gateway requires some of its resources be synced from the US to the EU cluster. Both use cases had sufficiently manageable volume to allow this approach.
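The two replication flows above can be expressed in a single MirrorMaker 2 configuration along these lines (cluster names, bootstrap addresses, and topic names are invented for illustration):

```properties
clusters = us, eu
us.bootstrap.servers = kafka-us:9092
eu.bootstrap.servers = kafka-eu:9092

# Billing: EU billing events replicated into the US cluster
eu->us.enabled = true
eu->us.topics = billing-events

# API Gateway (Kong): selected resources synced from US to EU
us->eu.enabled = true
us->eu.topics = kong-config-updates
```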

Elasticsearch

Customers using the dashboard must be able to search across multiple clusters and view unified candidate search results. Since candidate personal data is involved, data replication across clusters was not feasible. Instead, the Search service (a service only recently split from the monolith) added a custom-built aggregation proxy that fans searches out to separate clusters, and then aggregates results before returning them to the dashboard. For index requests (create, update, delete), the routing is based on the country encoding in the resource identifier.
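The fan-out-and-merge behavior can be sketched as follows (simplified, with stub clients standing in for the regional cluster connections; Checkr’s actual proxy differs):

```ruby
# Query each regional cluster in parallel, then merge hits by
# relevance score before returning them to the dashboard.
def fan_out_search(query, clusters)
  hits = clusters
         .map { |cluster| Thread.new { cluster.search(query) } }
         .flat_map(&:value)
  hits.sort_by { |hit| -hit[:score] }
end

# Stub clients standing in for the US and EU Elasticsearch clusters.
FakeCluster = Struct.new(:hits) do
  def search(_query)
    hits
  end
end

US_CLUSTER = FakeCluster.new([{ id: "cand_us_1", score: 2.4 }])
EU_CLUSTER = FakeCluster.new([{ id: "cand_eu_1", score: 3.1 }])

puts fan_out_search("smith", [US_CLUSTER, EU_CLUSTER])
  .map { |h| h[:id] }.inspect # prints ["cand_eu_1", "cand_us_1"]
```

Because no candidate data crosses a regional boundary at rest, only the merged result set does, the proxy satisfies localization requirements while still presenting one unified search.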

Region Compliance Service

In addition to must-haves that are relevant to multi-region deployments in general, there were specific requirements for Checkr’s business: namely, that international checks must abide by region-specific legal regulations. For instance, to verify identity, Australia assigns point values to different forms of IDs, while China requires a candidate’s name in Chinese characters. For this we created a new service, Region Compliance Service, that acts as a centralized authority for country-specific regulations.
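As a flavor of what such a rule looks like, here is a sketch of a points-based identity check (the point values, threshold, and document names below are invented for illustration, loosely following Australia’s points-based scheme; they are not the real regulatory values):

```ruby
# Illustrative only: each accepted document type earns points, and a
# candidate's documents must reach a threshold to verify identity.
AU_ID_POINTS = {
  passport: 70,
  drivers_license: 40,
  birth_certificate: 70,
  utility_bill: 25,
}.freeze

AU_REQUIRED_POINTS = 100

def au_identity_verified?(documents)
  documents.sum { |doc| AU_ID_POINTS.fetch(doc, 0) } >= AU_REQUIRED_POINTS
end

au_identity_verified?([:passport, :drivers_license]) # => true  (110 points)
au_identity_verified?([:utility_bill])               # => false (25 points)
```

Centralizing rules like this in one service keeps country-specific logic out of every individual check pipeline.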

Data privacy requirements vary by country too (GDPR in the EU, LGPD in Brazil, etc.). Region Compliance Service also manages these data erasure, portability, and localization requirements. For candidates requesting deletion of their personal data, we use an internal open-source project, DSRHub, to coordinate deletion across backend services.

Conclusion

Offering international checks is only one step in the long-term vision of providing a unified customer and candidate experience that removes all friction for both. This international initiative was no simple undertaking; it involved multi-region application deployments, cross-region traffic routing, and data store strategies. A “TON” of net new infrastructure code was written. We estimate it took 2,000 hours of cross-functional work. Jira indicates it involved 400 stories and 1,350 points 😃. But for Checkr, our customers, and our candidates, we think it was worth it.

Thanks to Anjana Manian, Maria Ramos, Ivan Rylach, Zhuojie Zhou, Scott Agnew, and Jonathan Perichon for their input on this post, and many thanks to the Infra team and everybody else who was involved in this project.
