The Sounds of Silence: Lessons From an API Outage

Paul Zaich
Checkr Engineering
Jun 29, 2020

This article is a written version of the talk “The Sounds of Silence: Lessons from an 18 hour API outage” presented at RailsConf 2020.2 Couch Edition. You can watch the video here.

Bugs are the constant companion of an engineer. Anytime change is introduced to a complex system, no matter how strict the quality controls and tooling, there’s a chance human error will introduce a bug. Let’s face it: bugs are a reality of software development.

https://imgs.xkcd.com/comics/fixing_problems.png

So if bugs are inevitable, the question is “how do we reduce their impact?” In this article I’ll share the details of an outage at Checkr in 2017, the steps we took as a result of the incident, and some general lessons around metric monitoring that can help you maintain a healthy web application.

The Incident

On October 6th, 2017, Checkr experienced an 18-hour partial API outage during which some customers were unable to create reports using our REST API. At the time the Checkr engineering team consisted of 30 engineers across 6 teams. Request volume had grown significantly over just a few years, and much of the team’s focus was on stabilizing the system to handle the increased load. The team had doubled in the previous year, and much of the knowledge of the system was held informally by its members.

In the Checkr system, a Report is the model that describes a Checkr background check. The report represents the legal document that is delivered to the customer when completed. It also governs many associated screenings responsible for fetching and parsing data. Each screening or search we execute is associated back to the Report.

Customers initiate a background check with Checkr by making an HTTP request to the Checkr reports REST API. Checkr creates the report record synchronously and then completes the report through a series of asynchronous processes. You can read more about the challenges of managing the complexity of the background check process here.
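
As an illustration of that request flow, here is a minimal sketch of a report creation call; the endpoint shape and field names are simplified assumptions rather than an exact copy of Checkr’s public API:

require 'net/http'
require 'json'
require 'uri'

# Illustrative only: the endpoint and fields are simplified assumptions,
# not a precise copy of Checkr's public API.
uri = URI('https://api.checkr.com/v1/reports')
request = Net::HTTP::Post.new(uri)
request.basic_auth(ENV.fetch('CHECKR_API_KEY'), '')
request['Content-Type'] = 'application/json'
request.body = { package: 'driver_pro', candidate_id: 'abc123' }.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# The report record is created synchronously; the screenings that make up
# the background check are completed asynchronously afterwards.
puts response.code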

Timeline

The incident began at 4:30pm on a Friday afternoon, when a script was run to migrate some old screening records from an integer foreign key to a UUID.

Within an hour, by 5:30pm, an on-call engineer on an unrelated team was paged because of a spike of errors on one of our frontend applications.

The application submitted a small percentage of our total volume of reports and did not use the main API used by customers. Upon initial investigation, the responder decided that the error was likely the result of user error. They snoozed the page.

Early on Saturday morning, the same error triggered another page. The on-call engineer began investigating the error and eventually noticed a strange status code being returned by our reports#create endpoint.

The public and private report endpoints were returning 404s for many, but not all, requests. At this point, around 9:30am, it was finally clear that there was a major problem that impacted our APIs, not just the client application.

The initial responder escalated the issue to get all on-call engineers involved. With a smaller engineering team, escalation via Slack worked well. They pinged the #eng-fire channel and the rest of the team started to log in to our VPN. I remember pulling out my laptop in the parking lot at Crissy Field.

The surge in traffic took our VPN down. We needed to go to the office to access the intranet and production servers to investigate.

Finally, we made it into the office around 10am. With full access to request logs, the issue was reproduced within an hour.

A hot fix was then implemented within 15 minutes to patch the immediate issue.

Postmortem

At Checkr we strive for a blameless culture where we learn from our mistakes and make improvements as a result of outages. An important part of that process for us is creating a Postmortem document.

The postmortem document and process have evolved over time as the team has grown, but the goal remains the same: to learn from the incident so that the same mistake does not happen again. The postmortem document captures the root cause, timeline, and action items. Identifying the root cause is an important component of the postmortem, but the follow-up action items are even more important. Here’s what we learned.

Lessons Learned: Corrupted Database Entries

The root cause was exposed by the backfill run at the start of the incident. Referential integrity between reports and their screenings was not enforced, in part because the two were stored in different databases. As a result, we could not add stricter database-level constraints.

The backfill corrupted records in our database; our table relationships assumed that foreign keys would always be present. As a result, some screenings ended up with a nullified reference to their parent report: a null value for report_id.
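
The exact script isn’t reproduced in this article, but a backfill of roughly this shape (model and column names are hypothetical) shows how a single missed guard can nullify that reference:

# Hypothetical sketch: model and column names are invented, and this is not
# the actual script that was run. It swaps the screening's integer foreign
# key for the parent report's UUID.
SomeScreening.find_each do |screening|
  report = Report.find_by(legacy_id: screening.legacy_report_id)
  # When the lookup returns nil, this writes a nil report_id and the
  # screening silently loses the reference to its parent report.
  screening.update_columns(report_id: report&.uuid)
end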

When a new report was run through its ActiveRecord validations, it checked that its associated screenings were valid given the configuration of the report and the information provided about the candidate. Each screening object was either found in the database or built in memory. Because the new, unsaved report had a nil id, the lookup by report_id matched the corrupted screenings whose report_id was also nil. When validation ran on one of those screenings, its foreign key was used to look up its parent report; no report could be found, and an ActiveRecord::RecordNotFound exception was raised. Our API routes handle ActiveRecord::RecordNotFound automatically and return a 404 response.

class Report
  def get_screening(klass)
    # The new report has not been saved yet, so its id is nil and this
    # lookup matches the corrupted screenings whose report_id is also nil.
    klass.find_by(report_id: id)
  end
end

class Screening
  validate :candidate_field_format

  def report
    # Raises ActiveRecord::RecordNotFound when report_id points at a
    # report that does not exist (or is nil).
    Report.find(report_id)
  end

  def candidate_field_format
    validate_field(report.candidate.field)
  end
end

The backfill only impacted two of our screening tables, so some report types continued to be created successfully while others failed. This partial failure turned out to be a contributing factor in our slow incident response.

Lessons Learned: Slow Response Time

The second problem we identified was the 14-hour lag between the start of the incident and an active response.

This timeline can be broken into two buckets: the time to resolution and the time to response. In this incident, the time to response severely impacted the overall duration of the degradation of the report API: 75% of the incident’s duration occurred before an active response was initiated. Every additional hour of downtime multiplied the number of impacted API requests. We asked ourselves what would have alerted us to the issue sooner.

Remember, we were paged within an hour of the start of the incident. But the alert was far removed from the impacted component of the application: the error was in an entirely different repository, the alert pointed to a frontend service, and that service used an internal API that received only a small percentage of our overall report traffic. The alert from Sentry was better than nothing, but it didn’t contain clear, actionable information that a responder could use to make an informed assessment of the problem.

Here’s the most painful part of this incident from a response perspective. We had a monitor set up for the express purpose of detecting an outage in report creation. We captured a statsD event every time a new report was created in the system. This was exactly what we needed!

report.created count never fell below our minimum threshold
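
Capturing that kind of event is a one-liner. Here is a minimal sketch, assuming the dogstatsd-ruby client; the metric name comes from the monitor above, everything else is illustrative:

require 'datadog/statsd'

# Minimal sketch, assuming the dogstatsd-ruby client.
statsd = Datadog::Statsd.new('localhost', 8125)

# Emitted once a report has been successfully persisted.
statsd.increment('report.created')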

The problem was that our alerting rules were far too simplistic to detect more subtle failures. When the monitor was set up, we made the assumption that we’d see close to 100% of reports failing to create. As a result we had set a constant floor of 100 reports created in 30 minutes. If the count dropped below that threshold, an alarm would sound.

The outage did not impact all reports; only certain report configurations were affected. As a result, the report creation metric dropped abnormally low, but it never dipped below our floor value during the incident. To illustrate with made-up numbers: if a healthy 30-minute window sees several hundred reports created and a partial outage cuts that in half, the count still sits comfortably above a floor of 100 and the alarm never fires. The monitor was simply not sensitive enough to detect the issue. This incident highlighted that we needed to improve the overall observability of our critical systems.

Improving observability

By definition, Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In this next section, I’ll explore tools for improving overall observability in your application, how those tools work together, and how to use those tools to craft meaningful alerts when something goes wrong.

There are three key components to building an observability stack. First, you need mechanisms for gathering metrics: measurable events that indicate something about the state of your system. Second, you need monitoring rules that define when a particular metric is green or red. Third, your monitoring rules should be connected to an incident management platform that governs on-call rotations and escalation. Do not use email or Slack as your primary notification method!

Metric Collection

Let’s talk a bit about metric collection. There are three broad categories of metric collection common in web applications, and many stacks and solutions can address one or more of these categories.

One of the most common types of metric collection is the exception tracker. These are services that provide libraries to capture events when exceptions occur in your application. Examples include Sentry, Rollbar, and Airbrake.
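
Getting started with one of these is usually only a few lines of configuration. A minimal sketch, assuming the sentry-ruby gem (the DSN and the rescued method are placeholders):

require 'sentry-ruby'

# Minimal sketch, assuming the sentry-ruby gem; the DSN comes from your
# Sentry project settings.
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
end

begin
  do_something_risky # hypothetical stand-in for application code
rescue StandardError => e
  # Unhandled exceptions are reported automatically by the framework
  # integrations; handled ones can be reported explicitly.
  Sentry.capture_exception(e)
  raise
end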

Another common type of metric collection is Application Performance Monitoring, or APM. APM gives you access to industry-standard metrics across common protocols and stacks. The benefit of APM is that you get a wealth of generic metrics for free. You can easily drill down to see request volume and latency, or investigate a full trace from the application to the database and back.

The final category is real-time custom application metric collection. These are metrics implemented by the engineer to describe a specific business process performed by your application. They are the hardest to define and maintain, but they give you the most direct visibility into the processes you care about.

To visualize the kind of visibility each of these tools gives you, let’s imagine that your application is a black box.

Exception Tracking

Exception tracking will give you targeted insights into hotspots in your application. It will give you a sense of the impact and frequency of an error and a location where the exception is being triggered (via the stack trace). Depending on the size of the application and the engineer’s amount of context, it may or may not be clear what the impact of the exception is. As we saw earlier, exception events eventually triggered a response to the incident, but it wasn’t immediately clear what, if anything, was wrong.

Application Performance Monitoring (APM)

I like to think of APM as a heat map overlaid on your application. It’s possible to zoom in on a specific part of the system, but it also gives you a high-level view of overall health. It’s very useful for identifying system-wide outliers that might indicate a problem.

APM often comes in handy when you haven’t anticipated a specific mode of failure. A recent example occurred for us back in February: a configuration change prevented several services in our stack from authenticating with each other. As we saw some more targeted metrics begin to trend in the wrong direction, the spike in 401s across our applications made it clear that something was wrong with authentication.

Custom Metrics

Custom metrics give you, the engineer, a way of describing the health of a specific feature or component in your application. You can view each of these components as distinct units using the telemetry generated by the metric. As a result, the engineer has health metrics that correlate closely to specific components or systems within the application.

Monitoring your metrics

Once you have metrics, you have the raw data to identify when intervention is needed, but you still need to define rules that determine when something isn’t healthy. What makes a good monitor?

High Fidelity — it needs to be a trusted measure of system health. You don’t want to miss true positives that indicate a problem, but you also don’t want an overly sensitive monitor that constantly alerts on false positives. False positives degrade confidence in the alarm and lead to responder fatigue; when a real alarm is triggered, it’s possible the responder will simply ignore the alert.

Targeted — In most cases, it’s best to define monitors that measure the health of a discrete feature or component. A more narrowly defined monitor will give the responder actionable information that will make a response quicker or more effective.

Leading vs Lagging — Finally, the metric should ideally be a leading indicator rather than a lagging one. For example, at Checkr we also measure report completion as another signal of system health. That metric would not be a good indicator of the health of report creation, however, because reports can take several hours to complete.

Recall that our monitor failed due to a lack of sensitivity. Let’s talk about a few monitor patterns that could have alerted us much sooner during this incident.

Composite Monitors

First let’s talk about composite monitors. The concept is pretty straightforward: tie multiple metrics together as a combined signal. We ultimately added a composite monitor as part of our follow-up actions from this incident. By measuring some additional critical sub-components, we had better visibility into overall system health. We combined measurements of both report creation and individual screening creation to spot issues where a specific screening type was causing report failures.
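
One way to feed a composite like this is to emit a creation metric per screening type alongside the report-level metric and then combine the individual monitors in the monitoring platform. A sketch, again assuming dogstatsd-ruby, with illustrative metric and tag names:

require 'datadog/statsd'

# Illustrative sketch; metric and tag names are made up. Emitting a
# creation event per screening type lets the monitoring platform combine
# per-screening monitors with the report-level monitor into a composite.
statsd = Datadog::Statsd.new('localhost', 8125)

def record_creation_metrics(report, statsd)
  statsd.increment('report.created')
  report.screenings.each do |screening|
    # Tag by screening type, e.g. "type:motor_vehicle_screening"
    statsd.increment('screening.created',
                     tags: ["type:#{screening.class.name.underscore}"])
  end
end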

Rate Monitors

One way to make our monitoring rules more resilient to fluctuations while maintaining sensitivity is to use rates instead of raw values. A rate monitor measures two metrics relative to each other. Absolute metrics do not handle changes in overall volume well; that’s one of the reasons our original monitor failed: it coped with fluctuations by having a very low sensitivity to failure. Measuring the count of reports created against the total number of report creation attempts gives you a metric that easily adapts to spikes and dips in overall request volume.
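
At the instrumentation level that might look like the sketch below (metric names are illustrative); the monitor then alerts on the ratio of created to attempted rather than on either raw count:

# Illustrative sketch: count every attempt as well as every success, so a
# monitor can alert when report.created / report.creation.attempted drops
# below a threshold (for example 0.95) over a rolling window.
def create_report_with_metrics(report_params, statsd)
  statsd.increment('report.creation.attempted')
  report = Report.create!(report_params)
  statsd.increment('report.created')
  report
rescue ActiveRecord::RecordInvalid, ActiveRecord::RecordNotFound
  statsd.increment('report.creation.failed')
  raise
end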

Anomaly Detection

Another option is to apply a more sophisticated algorithm to the raw values. Datadog’s anomaly detection monitor, for example, uses machine learning and statistical models to model your metrics. Anomaly detection gives you the ability to detect outliers in a metric over time: instead of defining rules against absolute values, you apply statistical measures like standard deviation to account for variability and get continuously re-fitted bounds. This type of monitor has the benefit of adjusting as your metric’s patterns change, accounting for seasonality and long-term trends.
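
As a toy illustration of the idea (this is not Datadog’s actual algorithm, just statistically derived bounds from a rolling window):

# Toy illustration: flag a sample that falls more than k standard deviations
# from the mean of the recent window.
def anomalous?(recent_values, current_value, k: 3.0)
  mean = recent_values.sum.to_f / recent_values.size
  variance = recent_values.sum { |v| (v - mean)**2 } / recent_values.size
  std_dev = Math.sqrt(variance)
  (current_value - mean).abs > k * std_dev
end

anomalous?([410, 395, 402, 398, 405], 150) # => true, far outside the band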

In practice, we’ve found these monitors to be overly sensitive (creating false positives) even when allowing for more flexible bounds, and up to this point we have not used anomaly detection as a core piece of our alerting rules.

Building observability into your culture

Let’s talk about how these metrics can work together in a growing business. Let’s pretend that we’ve started our own e-commerce store to support a brand new box business. When you are first starting out, your traffic is very low and you have a small team that understands the full system. It’s likely that when an exception occurs, you will know pretty quickly what went wrong and why. Traffic is inconsistent, so it’s hard to predict how many boxes will be purchased or how many shoppers will be interacting with your store.

As your store grows, it becomes harder to understand how everything works together. You added new features to your product page and recently added coupon codes to checkout. How do you know that your core store features are working? That’s where custom metrics start to shine. You can start by instrumenting higher-volume events like “Add to Cart” first, as in the sketch below.
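
For example (the controller, STATSD client constant, and metric name are all made up for the illustration):

class CartItemsController < ApplicationController
  # Illustrative only: the controller, STATSD client constant, and metric
  # name are invented for this example.
  def create
    current_cart.add_item(params[:product_id])
    STATSD.increment('cart.item_added') # the "Add to Cart" business metric
    redirect_to cart_path
  end
end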

As traffic continues to grow, you can add other critical features as discretely monitored events.

Start small when it comes to implementing custom metrics. Do not try to measure everything. Prioritize parts of your system that are critical for the business:

  • What would I want to know before my customer notices? Target monitoring in places where you are delivering the most value to your customers.
  • What systems are most brittle (due to technical debt or other factors) and at risk of breaking?

As your system expands, introduce more structure. Consider defining explicit tiers of importance for specific services or components and mapping requirements, including monitoring and on-call, to those tiers.

Example service definitions
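
As a rough, purely illustrative example of what such definitions might look like (these are not Checkr’s actual tiers):

# Purely illustrative: service names, tiers, and requirements are invented
# for this example, not Checkr's actual definitions.
SERVICE_TIERS = {
  tier_1: { examples: %w[report-api payments],
            requires: %w[custom_metrics composite_monitors 24x7_oncall paging] },
  tier_2: { examples: %w[admin-dashboard],
            requires: %w[apm exception_tracking business_hours_oncall] },
  tier_3: { examples: %w[internal-tools experiments],
            requires: %w[exception_tracking] }
}.freeze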

Bugs are an inevitable part of the software development cycle. A robust suite of automated and manual testing will reduce the number of bugs deployed to production, but the goal after deployment should be to mitigate the impact of the bugs that remain with a faster, more proactive incident response. Observability is the foundation for maintaining reliable systems and for improving incident response when those systems fail.

Start small. Turn-key tools like Exception Trackers and APM provide a huge amount of value with limited investment.

Measure things that are critical to your application. Don’t try to measure everything.

Iterate on your monitoring rules. Tune them to ensure that alerts are high fidelity and actionable for your responders.

Continually set the quality bar higher as you grow. Keep holding your services to higher observability standards as traffic and complexity grow. New requirements should be added over time and should filter down through the service tiers: a requirement for Tier 1 services should eventually become the norm for lower tiers.

