A Kafka Epic — How Checkr built a more reliable and performant background check platform by migrating to Apache Kafka

Michelle Berry
Checkr Engineering
Jan 29, 2019

In the fall of 2017, a group of Checkr engineers gathered during a hack-day event to research a bottleneck that was causing delays in our background check product. The team’s research led to a proposal: migrate Checkr’s messaging infrastructure from RabbitMQ to Apache Kafka. After some further experiments and validation, Kafka earned a spot on our technical roadmap, and in 2018 we invested heavily in the tooling and support necessary to enable a full migration. What started as a hack-day experiment culminated in a new infrastructure platform with a much higher degree of reliability, observability, and fault tolerance. This is the story of our Kafka journey.

Act I: The bottleneck

To explain the bottleneck that started us on our Kafka journey, I first need to teach you a little bit about how background checks work.

Lifecycle of an example Checkr Background Report

A background check is a consumer report containing one or more screenings, such as a motor vehicle report, criminal search, sex offender search, or employment verification. Each screening is supported by API requests to third party sources to retrieve information about the candidate. Some of these screenings complete quickly, but others can require a court researcher to visit a local courthouse and retrieve paper records in person. After the raw data is received from the source, records are filtered according to compliance and customer rules. At Checkr, the background check is anchored by a large state machine that deterministically controls the state of the report, screenings, and data request objects.

The data we receive from our sources can influence the state of both the screening and the report, and we receive most of these responses asynchronously. Consequently, we need to ensure that only one process accesses the screening and report resources at a time to avoid race conditions.

In 2017, Checkr used RabbitMQ to deliver state machine events. Since multiple asynchronous responses for the same screening or report might be processed by Rabbit’s workers at the same time, we used a Redis-backed internal library to acquire a “lock” on the relevant object. When a worker would begin work on a resource, it would place the resource’s ID into Redis so that no other process could access it until the lock was released. This solution worked, but it came with some inefficiencies; our state machine workers encountered lock contentions semi-frequently. A lock contention would cause a message to be retried until the lock had been released, slowing down overall report completion.

Lock Errors were slowing Checkr’s report completion
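To make the locking scheme concrete, here is a minimal sketch of a Redis-backed lock in Ruby; the class and key names are hypothetical, not Checkr’s internal library.

# Minimal sketch of a Redis-backed resource lock (illustrative only; names are hypothetical)
require 'redis'

class ResourceLock
  LOCK_TTL_SECONDS = 30

  def initialize(redis = Redis.new)
    @redis = redis
  end

  # SET with nx: true only succeeds if the key is absent; ex: expires the lock
  # so a crashed worker cannot hold it forever. Returns true when acquired.
  def acquire(resource_id)
    @redis.set("lock:#{resource_id}", Process.pid, nx: true, ex: LOCK_TTL_SECONDS)
  end

  def release(resource_id)
    @redis.del("lock:#{resource_id}")
  end
end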

Our team explored a couple of options to solve this locking issue with Rabbit. The first idea was to create one queue per report. However, at Checkr’s report volume (> 15 million/year), it would be too cumbersome to create and garbage collect millions of queues.

The second idea was to create a static set of workers that would each work from their own unique queue. If we deterministically routed all events for a given report to the same queue, we could ensure strict ordering of message processing. However, scaling workers and queues up and down with this design posed many challenges.

This second idea is somewhat similar to the native architecture of Kafka. Unlike Rabbit’s architecture, in which messages from a queue are delivered to a pool of workers, Kafka’s topics (queues) are pre-split into partitions. The partition is the basic unit of parallelism within Kafka, so more partitions means more messages can be consumed in parallel. For a given consumer group, only one worker can process messages from a partition at a time, so Kafka’s architecture guarantees that all messages within a partition will be processed in the order they are received.

Kafka’s partitioned design

Kafka also has a partitioning feature which allows you to deterministically route messages to a partition based on a key. If we partition on a screening’s ID, all messages related to that screening are guaranteed to be consumed by one worker in the order they were received, and we avoid lock contentions.
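As a sketch, this is roughly what key-based publishing looks like with ruby-kafka, the client library that Phobos builds on; the broker address and topic name here are made up.

# Illustrative only: key-based routing with ruby-kafka (the library underlying Phobos)
require 'kafka'
require 'json'

kafka = Kafka.new(['kafka-broker:9092'])
screening_id = 'scr_123'

# Every message published with the same partition_key lands on the same partition,
# so a single consumer processes that screening's events strictly in order.
kafka.deliver_message(
  { event: 'screening.updated', screening_id: screening_id }.to_json,
  topic: 'screening-events',
  partition_key: screening_id
)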

Switching from Rabbit to Kafka proved to be enormously successful for solving our performance bottleneck. One particular type of report saw turnaround time improvements of 87%!

Beyond performance, switching to Kafka has given Checkr a higher degree of messaging reliability due to its ‘durable’ design. Messages passing through Rabbit’s broker are ephemeral: once a message is delivered to a consumer, it disappears from the broker. Occasionally we’d drop messages, and it was very hard to debug where and why that was happening. Conversely, Kafka’s messages are saved to disk. The Kafka broker is essentially a log file, and messages are retained for 7 days by default. This design helps ensure that we never lose messages. Furthermore, Kafka consumers can read from any position in the log; they can even replay old messages, which comes in handy when a bug causes erroneous processing.

Act II: Killing the Rabbit

Once we validated that Kafka had solved the performance issues in our state machine processing, we had a decision to make: maintain two messaging infrastructures, or invest in tooling so we could convert entirely to Kafka. We decided we only wanted to support one messaging infrastructure at Checkr.

The first step towards migrating off of RabbitMQ was to get our Kafka tooling up to feature parity. We started with Phobos, an open-source Kafka client library written in Ruby. Our team began contributing to Phobos, and one of our engineers became a maintainer of the project. However, we needed additional tooling and infrastructure in 3 main areas: handling of publish errors, delayed retries for consumer errors, and dead-letter handling.

We designed our Kafka tooling from the ground-up with the following principles in mind:

  • Reliability: Reliability and fault tolerance are baked into Kafka’s core design, but all systems fail. We should have a reliable backup.
  • Observability: We should collect metrics and logs that indicate both when a system is working properly and when errors occur. Engineers should be able to easily find and query for failed messages.
  • Usability: Engineers should be able to interact with failed messages via APIs/UIs, not custom scripts. Configuring retries should be painless.

Based on these principles, Checkr’s platform team developed a client library built on-top of Phobos and a new service called Republishr. Republishr is a gRPC server written in Golang that handles both publisher and consumer errors.

Republishr Architecture

On the publishing side, if a message fails to be delivered to Kafka after a series of in-memory retries, our client library will route it to Republishr over gRPC, where the message is stored in a PostgreSQL database. We built a UI, backed by a gRPC API, where engineers can query for publish errors using topic and time range filters. The UI can also be used to republish a set of messages.

# Example API call to republish publish errors
client.republishPublishErrors(
  Idl::Republishr::RepublishPublishErrorsRequest.new(
    topic: 'test',
    start_time: '2019-01-01T00:00:00',
    end_time: '2019-01-02T00:00:00'
  )
)

On the consumer side, engineers have intuitive configuration options for marking exceptions for a given worker as retryable.

# Example consumer with a retryable exception configured
class MyConsumer
  include Phobos::Checkr::Handler

  RETRYABLE_EXCEPTIONS = [Timeout::Error].freeze

  def consume(payload, metadata)
    # process the message; a raised Timeout::Error will be retried
  end
end

If a non-retryable exception occurs, the message is sent to Republishr and stored in the database. If a retryable exception occurs, the message is first retried in memory using the Phobos library (blocking other message processing on that partition). If the message exhausts its in-memory retries, it is sent to Republishr, where a Redis scheduler queues it to be republished at a later time. The number of in-memory retries, the number of delayed retries, and the exponential backoff are configurable per consumer. As with publish errors, engineers can use a UI to query for consumer errors in the database and re-queue them.
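As an illustration of the kinds of knobs involved, here is a hypothetical consumer with its retry tuning spelled out; these constant names are invented for the sketch and are not Checkr’s actual configuration API.

# Hypothetical retry configuration for a consumer; constant names are illustrative only
class MyRetryingConsumer
  include Phobos::Checkr::Handler

  RETRYABLE_EXCEPTIONS  = [Timeout::Error].freeze
  MAX_IN_MEMORY_RETRIES = 3   # blocking retries before handing off to Republishr
  MAX_DELAYED_RETRIES   = 5   # republish attempts scheduled by the Redis scheduler
  RETRY_BACKOFF_SECONDS = 30  # base delay, growing exponentially per attempt

  def consume(payload, metadata)
    # process the message
  end
end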

# Example API call to republish consumer errors
client.republishConsumeErrors(
  Idl::Republishr::RepublishConsumeErrorsRequest.new(
    topic: 'test',
    consumer_group: 'my-group',
    error_type: 'NoMethodError',
    start_time: '2019-01-01T00:00:00',
    end_time: '2019-01-02T00:00:00'
  )
)

Once we completed this new tooling for a reliable Kafka messaging platform, we needed a formal process to complete the migration off of RabbitMQ. The Platform team initiated the migration on a set of the highest-volume and most business-critical queues. We documented this process and held office hours so product engineering teams could receive support on their queues. In total we migrated over 60 queues. Additional tools that helped make this migration successful were Flagr, which Checkr uses to roll out new features, and Checkr’s integration testing framework.

Migration challenges and messaging anti-patterns

Three interesting challenges came up during the migration that initially seemed like tradeoffs to using Kafka, but ultimately informed our messaging and microservice design philosophy.

Message size:

When Checkr used Rabbit as a message bus, we tended to send large payloads containing multiple data resources between services. Rabbit has no limit on message size, but Kafka imposes a 1MB restriction by default. Increasing the default can affect messaging throughput and performance. In order to start migrating these queues to Kafka, we built a custom message compression feature. However, sending full payloads through a message bus promotes coupling between microservices and conflicts with Martin Fowler’s concept of “smart endpoints and dumb pipes”. Now we aim to send simple event messages that mostly contain object references and are agnostic to downstream consumers.
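For example, a lean event might look something like the sketch below: an identifier and an event type, with consumers fetching any details they need from the owning service. The field and topic names are illustrative.

# Illustrative only: a lean, consumer-agnostic event that carries references, not full resources
require 'kafka'
require 'json'
require 'time'

kafka = Kafka.new(['kafka-broker:9092'])

event = {
  event_type:  'report.completed',
  report_id:   'rpt_123',
  occurred_at: Time.now.utc.iso8601
}

# Well under Kafka's default 1MB limit; downstream consumers look up the report by ID
kafka.deliver_message(event.to_json, topic: 'report-events', partition_key: event[:report_id])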

Message processing time:

We used Rabbit to process some lengthier jobs, e.g. > 2 minutes, but slow messages can cause trouble with Kafka. Because of Kafka’s partitioned design, slow messages block processing of all other messages in the same partition, which lowers throughput. Additionally, unless you are using a client library that supports background heartbeating, a message that exceeds the session timeout will throw an error. As Checkr’s report and screening volume scales, it is important to keep message processing latency down so that we can handle the throughput. Many of our lengthier processes were in fact better suited for background job processing than event consumption.
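One way to keep consumers fast is to hand long-running work to a background job processor and let the consumer return immediately. Here is a sketch using Sidekiq; the class names are illustrative.

# Illustrative only: enqueue slow work into a background job so the partition isn't blocked
require 'sidekiq'
require 'json'

class SlowScreeningJob
  include Sidekiq::Worker

  def perform(screening_id)
    # the multi-minute work (e.g. a lengthy third-party lookup) runs here,
    # outside the Kafka consumer loop
  end
end

class ScreeningConsumer
  include Phobos::Checkr::Handler

  def consume(payload, metadata)
    event = JSON.parse(payload)
    # enqueue and return immediately so the next message in the partition can be processed
    SlowScreeningJob.perform_async(event['screening_id'])
  end
end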

Idempotency:

Idempotency was not top of mind in our Rabbit consumers unless the business logic they contained was particularly sensitive to duplicates. Kafka guarantees at-least-once delivery, meaning that a message may occasionally be delivered more than once. Most commonly, duplicates are processed during deploys, when consumers leave and re-enter their group, which is known as “consumer rebalancing”. Thus, Kafka encouraged us to make more resilient, fault-tolerant design decisions that include idempotency.
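In practice this can be as simple as recording an event ID before acting on it, so a redelivered message becomes a no-op. A hedged sketch follows; the key and field names are illustrative.

# Illustrative only: a consumer that tolerates at-least-once delivery by deduplicating on event ID
require 'json'
require 'redis'

class IdempotentConsumer
  include Phobos::Checkr::Handler

  def consume(payload, metadata)
    event = JSON.parse(payload)

    # SET with nx: true returns false if this event ID was already recorded,
    # so duplicates delivered during a consumer rebalance are silently skipped.
    first_delivery = redis.set("processed:#{event['event_id']}", 1, nx: true, ex: 7 * 24 * 3600)
    return unless first_delivery

    handle(event)
  end

  private

  def redis
    @redis ||= Redis.new
  end

  def handle(event)
    # business logic, safe to run exactly once per event
  end
end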

Epilogue: An event driven future

Although Checkr first considered Kafka as a solution to our state machine processing bottleneck, it now commands a broad influence on the whole engineering org. Product engineering teams are building new features using microservices backed by Kafka events, and our data team has integrated Kafka into their data infrastructure pipeline.

If you’d like to learn more about our Kafka journey or if this sounds like the kind of work you’d like to do, drop us a line! This year, we’re investing heavily in our platform architecture, maturing our microservices ecosystem, and building/supporting open-source tools. We’re hiring on all teams including Platform, Data, DevOps, and Product engineering. You can check out our open listings here.

Thank you to Ziru Zhu, Zhuojie Zhou, Jason Dougherty, and Eric Psalmond, who all contributed to this work and gave feedback on this post.
