Doubling our Volume while Increasing Reliability: Checkr’s State Machine
At Checkr, the core model of our business is the background check, referred to as a report. At first glance, completing a report is not too complex:
In reality, the management of state required to complete a report both correctly and accurately is much more complicated. It looks something more like this:
As illustrated in the example above, processing a report involves integrating with several data providers via independent requests. The report is then updated as each individual response is received. Each data provider requires a custom integration due to non-standardized request and response formats. We initially solved this by implementing independent handlers that parsed each response and updated the report. This approach was fast, simple, and it worked 98% of the time. What happened when it didn’t work, you ask?
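To make that original approach concrete, here is a rough sketch of what those independent handlers might have looked like; the provider names, payload fields, and function names are all hypothetical, not Checkr's actual code:

```python
def handle_county_court_response(report, payload):
    # Provider-specific parsing: every data provider returns a different shape.
    report["county_criminal"] = {
        "records": payload.get("CriminalHistory", []),
        "status": "complete",
    }

def handle_dmv_response(report, payload):
    # Another provider, another ad hoc parser that updates the report directly.
    report["motor_vehicle"] = {
        "violations": payload.get("mvr", {}).get("violations", []),
        "status": "complete",
    }
```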
The truth was, not much. Reports were left stuck in a pending state, and problems were usually fixed by retrying an HTTP request or with small code changes. Since most requests were successful, running a few thousand a day resulted in a manageable number of issues. However, as volume increased, the percentage of errors remained constant, so the absolute number of failures grew with it. It became more difficult to differentiate stuck reports from reports that were truly pending. Looking at reports individually to find problems was completely impractical. Often the only apparent information when investigating an issue was that the request failed or the report was never updated.
A year ago, the absolute number of “stuck” reports we knew about at any time could be 200, and sometimes as high as 1000. We had “stuck report parties” in Eng to get the number under control and fix the issues that caused them. Countless engineering support tickets were filed for stuck reports, and we were often unaware of them until a customer or applicant brought the issue to our attention. We were fighting a losing battle. Something needed to change.
We learned that running a report end-to-end efficiently and understanding its various checkpoints was going to require a better abstraction, one that defined and captured a report’s state over time. We needed a solution that gave us immediate visibility into state flow.
Where did we land? Enter the finite state machine.
The Finite State Machine
Thinking of models as state machines has immediate benefits. One of the main reasons for using a state machine is to help with the design process. Drawing a state machine diagram clearly lays out and defines the creation state, completion states, and intermediate states. It also helps to identify any possible edge cases related to the flow. The state machine diagram for a single data provider integration is shown below.
Naively, it appears the only steps required to build an integration with a data provider are sending the request and updating the corresponding report with the results. Drawing the diagram surfaced a number of missing steps. For example, there are three different cases in which data is received:
- Synchronous (instant) response.
- Asynchronous response via webhooks.
- Asynchronous response via polling.
Instant responses can transition directly to parsing the response, while asynchronous requests require some additional states. Requests waiting for a webhook stay in “wait for response” until the webhook is received, and then transition to “parse response”. Requests that require polling cycle between “wait for response” and “poll for results” until a completed result is returned.
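A minimal, hand-rolled sketch of this state machine in Python, with states named after the diagram; the class and transition table are assumptions for illustration, not Checkr's implementation:

```python
from enum import Enum

class RequestState(Enum):
    SEND_REQUEST = "send_request"
    WAIT_FOR_RESPONSE = "wait_for_response"
    POLL_FOR_RESULTS = "poll_for_results"
    PARSE_RESPONSE = "parse_response"
    COMPLETE = "complete"

# Allowed transitions mirror the diagram: synchronous responses go straight to
# parsing, webhook-based requests wait, and polling cycles until a result is ready.
ALLOWED_TRANSITIONS = {
    RequestState.SEND_REQUEST: {RequestState.PARSE_RESPONSE,        # synchronous response
                                RequestState.WAIT_FOR_RESPONSE},    # asynchronous response
    RequestState.WAIT_FOR_RESPONSE: {RequestState.PARSE_RESPONSE,       # webhook received
                                     RequestState.POLL_FOR_RESULTS},    # poll attempt
    RequestState.POLL_FOR_RESULTS: {RequestState.WAIT_FOR_RESPONSE,     # still pending
                                    RequestState.PARSE_RESPONSE},       # completed result
    RequestState.PARSE_RESPONSE: {RequestState.COMPLETE},
    RequestState.COMPLETE: set(),
}

class ProviderRequest:
    def __init__(self):
        self.state = RequestState.SEND_REQUEST

    def transition_to(self, new_state):
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"invalid transition: {self.state.value} -> {new_state.value}")
        self.state = new_state

request = ProviderRequest()
request.transition_to(RequestState.WAIT_FOR_RESPONSE)   # webhook-based provider
request.transition_to(RequestState.PARSE_RESPONSE)      # webhook arrived
request.transition_to(RequestState.COMPLETE)
```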
Defining “parse response” as its own state is particularly constructive. Prior to updating database records, the raw data is normalized and stored in a separate parsed response object. As shown below, the raw payload is not always clean or organized:
A normalized parsed response simplifies working with similar data across multiple sources. Separating out this step decreased the overall number of bugs and led to a significant increase in code readability.
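As a rough illustration of that separation, two hypothetical providers returning differently shaped payloads can be normalized into one parsed-response format before any database records are written; all field names below are invented:

```python
# Two hypothetical raw payloads with different shapes for the same kind of data.
provider_a_raw = {"CriminalHistory": [{"Charge": "Speeding", "Disp": "Dismissed"}]}
provider_b_raw = {"results": {"offenses": [{"offense_description": "Speeding",
                                            "outcome": "Dismissed"}]}}

def parse_provider_a(raw):
    # Normalize provider A's format into the shared parsed-response shape.
    return [{"charge": r["Charge"], "disposition": r["Disp"]}
            for r in raw.get("CriminalHistory", [])]

def parse_provider_b(raw):
    # Normalize provider B's format into the same shape.
    return [{"charge": r["offense_description"], "disposition": r["outcome"]}
            for r in raw.get("results", {}).get("offenses", [])]

# Downstream code only ever sees the normalized shape, regardless of source.
assert parse_provider_a(provider_a_raw) == parse_provider_b(provider_b_raw)
```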
The state machine diagram highlights relationships between states. For instance, sending a request and polling for results require external network requests while the other states do not. These particular state transitions can be performed by a separate group of workers set up for handling network I/O. Better yet, each integration can have its own group of workers for handling all network communication with a specific data provider. This way, only a small group of workers will be affected if a data provider has downtime.
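A sketch of that routing, with hypothetical queue names and a simple in-memory stand-in for a real job queue:

```python
from collections import defaultdict, deque

# Only "send request" and "poll for results" touch the network, so only those
# transitions are routed to a per-provider network I/O queue.
NETWORK_STATES = {"send_request", "poll_for_results"}
queues = defaultdict(deque)

def enqueue_transition(request_id, provider, target_state):
    if target_state in NETWORK_STATES:
        queue_name = f"network_io.{provider}"   # e.g. "network_io.county_courts"
    else:
        queue_name = "default"
    queues[queue_name].append((request_id, target_state))

enqueue_transition(42, "county_courts", "poll_for_results")
enqueue_transition(43, "dmv", "parse_response")
# If "county_courts" has downtime, only workers on its queue back up.
```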
It’s also clear from the state machine diagram that the “send request”, “wait for response”, and “poll for results” states all require the receipt of a raw payload before a valid transition can be made to “parse response.” A before-transition precondition can be defined to ensure the raw response is present. This type of precondition is incredibly powerful and can be enforced by the state machine itself. Having an extremely rigid set of rules for all transitions can lessen the negative impact of invalid or duplicate state transition events.
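A small sketch of such a precondition, using a hand-rolled guard rather than any particular state machine library:

```python
class GuardedRequest:
    """Illustrative request object whose guard blocks parsing without a payload."""

    def __init__(self):
        self.state = "send_request"
        self.raw_response = None

    def transition_to(self, new_state):
        # Before-transition precondition: a raw payload must exist before parsing.
        if new_state == "parse_response" and self.raw_response is None:
            raise RuntimeError("cannot enter parse_response: no raw response stored")
        self.state = new_state

request = GuardedRequest()
try:
    request.transition_to("parse_response")     # rejected: precondition fails
except RuntimeError as error:
    print(error)

request.raw_response = '{"status": "clear"}'
request.transition_to("parse_response")         # allowed once the payload is present
```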
Certain states are also inherently transient. A request object should never stay in any state except for “wait for response” and “complete” for an extended period of time. If an object stays in a transient state for more than a few minutes, it’s apparent that there is an issue. These issues can be fixed quickly depending on the request object’s current state.
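For illustration, a monitor along these lines could flag requests sitting too long in a transient state; the threshold and data shapes are assumptions for the example:

```python
from datetime import datetime, timedelta, timezone

# Only these states are allowed to persist; everything else is transient.
NON_TRANSIENT_STATES = {"wait_for_response", "complete"}
TRANSIENT_THRESHOLD = timedelta(minutes=5)

def find_stuck_requests(requests, now=None):
    """Return requests that have sat in a transient state for too long."""
    now = now or datetime.now(timezone.utc)
    return [r for r in requests
            if r["state"] not in NON_TRANSIENT_STATES
            and now - r["entered_state_at"] > TRANSIENT_THRESHOLD]
```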
Most importantly, using the state machine for our data provider integrations standardized our process of building them. New engineers joining the team can instantly contribute because each state transition is concretely defined.
The Audit Trail: A State Machine’s Best Friend
The most practical benefit of using a state machine is the ability to log, and later inspect, every state transition.
We store this basic information about each state transition (sketched in code after this list):
- The previous state
- The current state
- The event that induced the transition
- The timestamp
- Any exceptions that occurred during the transition
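A record capturing those fields could look something like this; the shape and names below are assumptions, not Checkr's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class StateTransition:
    from_state: str                  # the previous state
    to_state: str                    # the current state
    event: str                       # the event that induced the transition
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    exception: Optional[str] = None  # any exception raised during the transition

# One entry is appended to the request's transition history on every transition.
entry = StateTransition("send_request", "wait_for_response", event="request_sent")
```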
Let’s apply this to a data provider integration.
State transition logs allow us to audit every data provider integration at an individual request level. Glancing at transition history for a single request instantly shows the request’s previous states, its current state, and any errors. Whenever there’s a problem, the first question asked is “Have you looked at the state transitions yet?”
Logging exceptions together with the corresponding state transition has led to quick fixes for previously hard-to-track or irreproducible issues. Previously, when a request failed, it was difficult to see where the error occurred. Now we can see the exact error message coupled with the state in which it occurred. This is particularly valuable when diagnosing the source of the error. Batch operations can be run on objects that raised the same exception and share the same state. If a data provider goes down and all network requests fail for a day, they can easily be re-queued with one query.
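A sketch of what that batch recovery could look like; the data shapes and the exception name are invented for illustration:

```python
# Hypothetical transition log entries: two requests failed with the same
# exception while sending, one parsed cleanly.
transition_log = [
    {"request_id": 1, "to_state": "send_request", "exception": "ConnectionTimeout"},
    {"request_id": 2, "to_state": "send_request", "exception": "ConnectionTimeout"},
    {"request_id": 3, "to_state": "parse_response", "exception": None},
]

def requests_to_requeue(transitions, exception_name, state):
    """Select requests that raised the same exception in the same state."""
    return [t["request_id"] for t in transitions
            if t["exception"] == exception_name and t["to_state"] == state]

# After a provider outage, every timed-out send can be retried in one batch.
print(requests_to_requeue(transition_log, "ConnectionTimeout", "send_request"))  # [1, 2]
```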
Timestamp logging adds insight into each request’s state transitions. Knowing how long a request remains in a particular state is valuable information. For example, the turnaround time for a request can be determined by the time it takes to transition from “send request” to “parse response”. Timestamps also provide additional context when debugging exceptions.
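As a small worked example, turnaround time falls straight out of the transition timestamps; the history below is invented for illustration:

```python
from datetime import datetime, timezone

history = [
    {"to_state": "send_request",      "created_at": datetime(2016, 5, 2, 9, 0, tzinfo=timezone.utc)},
    {"to_state": "wait_for_response", "created_at": datetime(2016, 5, 2, 9, 0, 5, tzinfo=timezone.utc)},
    {"to_state": "parse_response",    "created_at": datetime(2016, 5, 2, 13, 30, tzinfo=timezone.utc)},
]

def turnaround(transitions):
    # Time elapsed between entering "send_request" and entering "parse_response".
    entered_at = {t["to_state"]: t["created_at"] for t in transitions}
    return entered_at["parse_response"] - entered_at["send_request"]

print(turnaround(history))  # 4:30:00
```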
Consider Using a State Machine
Using a state machine for parts of the report lifecycle has been hugely beneficial to Checkr. Incorporating it into our core logic has significantly increased reliability despite our total volume doubling. Today, the total number of stuck reports stays below 50 regardless of how much we grow. There are no longer any hidden stuck reports, and morale on the engineering team is much higher. We’re aiming to get this number down to zero by the end of the year by fixing lingering exceptions, adding state machine logic to new parts of the report lifecycle, refining our state transitions, and better automating state transition retries.
State machines are awesome, and should be considered when modeling data for web applications. Your web application may already contain several examples of state machines. Take a piece of it, draw a state machine diagram, and see what happens.