Atom’s Journey to 99.9% Availability

Tom Cheng · Published in Atom Platform · Jun 10, 2022

A few years ago, I was given responsibility for the health and reliability of Kaplan’s flagship Atom learning platform. In addition to meeting the high expectations of our students, Kaplan’s leadership wanted Atom to be a platform that our B2B education partners and institutional customers (such as medical and nursing schools) could count on to deliver their courses to their students.

Up until then, I had a peripheral, if avid, interest in system availability. Kaplan had been doing productive blameless postmortems for years, and I had diligently collected data from them, flagging patterns in root causes and lessons learned and sharing those with my team. However, I was now being called on to devise a unified strategy for improving the overall health and reliability of the platform, one that the dozen or so teams working on Atom would follow.

In my experience, the hardest part of devising and executing a strategy is making the right decisions about what to focus on (and consequently what not to focus on). Unfortunately, the many resources I consulted — like research papers on system reliability, books like Google’s Site Reliability Engineering, and engineering blogs — provided me with a long list of ideas, tools, projects, and best practices, but didn’t provide much of a playbook on how to plan this work out in the real world.

For example, given limited development resources, how do you choose which action items from your post-mortems to tackle first? How do you choose between implementing resiliency and improving logging? Should you buy that expensive APM package, or would that money be better spent on backup infrastructure for resiliency? Should you rally your engineers around secure coding or better documentation?

Over the last several years, we were able to devise some ways to answer these questions, which I will be sharing in this article. The focus will be on how we think about availability here at Kaplan and how we prioritized availability-related work, and less on the specific things we did to improve our availability. (Google has since released a follow-up, The Site Reliability Workbook, which addresses many of these practical questions and which I would recommend for additional perspective.)

Finding a KPI and Setting Goals

Platform health is a complicated topic and is more than just high availability, but we wanted to have a single metric that we could use to gauge our progress. This would serve as a “north star” metric that the whole organization (from the CEO on down) could use to track and communicate the health of the Atom platform. Furthermore, the Atom team could use it to track the impact (or lack thereof) of any work we do on the health of the platform.

After numerous conversations on the topic with the engineering leadership team, we settled on a variation of the standard availability equation as our Key Performance Indicator (KPI):

% Availability = 1 - (Total Downtime / Total Time)

One challenge to applying this metric to a distributed microservice system like Atom is that the platform is almost never completely “down”, so we need to more clearly define what constitutes “downtime” in a way that reflects the impact to customers and the business. The article How to Define, Measure, and Report IT Service Availability is the best summary I’ve found about applying the availability equation in practice. For Atom, we identified a set of features that were critical to our users — for example, the ability to watch instructional videos and to take quizzes. If any one of these features is unavailable to more than 50% of users for more than 5 minutes, Atom would be considered “down”.

At the time, our availability for the previous 12 months was 98.6%, based on the criteria described earlier. We knew we could and needed to do better — we wanted to provide at least 99% availability to our customers, but our CTO gave us a more ambitious target — to be able to deliver 99.9% availability by the end of the following year.
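To make these targets concrete, it helps to translate availability percentages into a downtime budget. Here is a quick back-of-the-envelope calculation (the helper below is purely illustrative, not part of our tooling):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget(availability_pct, total_minutes=MINUTES_PER_YEAR):
    """Maximum downtime (in minutes) allowed over a period for a given availability target."""
    return (1 - availability_pct / 100) * total_minutes

for target in (98.6, 99.0, 99.9):
    minutes = downtime_budget(target)
    print(f"{target}% availability allows {minutes:,.0f} minutes (~{minutes / 60:.1f} hours) of downtime per year")
```

At a year’s scale, 98.6% availability allows roughly five days of downtime, while 99.9% leaves a budget of under nine hours.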

Prioritization Framework

99.9% seemed like a reasonable goal, but we needed a roadmap for getting (and maintaining) Atom’s availability at that level. As I mentioned in the introduction, we had a huge number of things we knew we could improve, but we needed a way to zoom in on the things that would have the greatest impact. In the absence of any other ideas, I decided to start by simply working backward from our % Availability KPI.

In the Availability Equation, the only true variable is Total Downtime, which is the sum of the durations of each individual outage:

Total Downtime = duration of outage 1 + duration of outage 2 + … + duration of outage n

I prefer to express this in a more compact notation, where n is the number of incidents and dᵢ is the duration of a given outage i:

Total Downtime = d₁ + d₂ + … + dₙ = Σᵢ₌₁ⁿ dᵢ

From this equation, we can see that there are two levers to reducing downtime:

  1. Reduce the number of incidents (n)
  2. Reduce the duration of incidents (dᵢ)
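
As a rough illustration of how the two levers compare (the numbers here are hypothetical, not our actual incident counts), total downtime is approximately the number of incidents times their average duration, so halving either lever has the same effect:

```python
def total_downtime(num_incidents, avg_duration_minutes):
    """Approximate total downtime as n * average incident duration."""
    return num_incidents * avg_duration_minutes

print(total_downtime(12, 120))  # baseline: 12 outages averaging 2 hours -> 1,440 minutes
print(total_downtime(12, 60))   # halve the average duration -> 720 minutes
print(total_downtime(6, 120))   # halve the number of incidents -> 720 minutes
```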

It turns out most of the actions you can take to improve availability will move one lever but not the other. For example, reducing the number of incidents is a matter of coding and architecture (for example, implementing resiliency, secure coding, QA), automation (like automated testing, auto-healing and failover), and processes (in areas such as configuration management and release coordination). Meanwhile, reducing the duration of incidents is about people (having enough to sustainably cover the platform 24x7), training and documentation (simulations, incident playbooks), and tooling (such as alerting, logging and analysis, APM).

One can argue that many of these activities move both levers — for example, automating your deployment pipeline can help prevent outages (by improving deployment reliability) as well as improve outage response time (by enabling faster rollbacks or redeployments). However, your team (hopefully) has more deployments than outages, so the tooling predominantly serves one use case more than the other.

Anatomy Of An Outage

When we visualized the outages from the prior year, the thing that troubled us most wasn’t the number of outages (though that wasn’t a happy number either), but the fact that so many outages lasted more than 2 hours (yellow in the chart below):

Graph showing platform outages from the baseline 12 months. X axis represents the date and y axis represents incident duration. Twelve incidents are listed, with six lasting more than 2 hours, three lasting 90–120 minutes, and the remaining three lasting less than 90 minutes.

We believed that we could do much better. At the very least, we wanted to see if we could get those outages that lasted more than half a day (particularly the one at the top of the graph, which lasted several days) down to a more reasonable level.

(Incidentally, there’s nothing particularly special about the 2-hour threshold; we selected it because it felt like a long time for the system to be down, and we had enough such incidents to analyze for patterns.)

As noted previously, there are a number of different factors that impact incident duration: people (i.e., having enough to sustainably cover the platform 24x7), training and documentation (simulations, incident playbooks), and tooling (such as alerting, logging and analysis, APM).

To determine which areas and initiatives to focus on, we had to examine the lifecycle of each incident:

Diagram representing the incident lifecycle. Key events are: Issue introduced, issue hits production, issue detected, team begins investigating, issue mitigated, and issue resolved.

After an issue is introduced, there may be some time before it manifests in production (for example, a memory leak introduced when buggy code is first merged into the codebase won’t have an immediate impact, even after it’s released to production). Even once a problem starts having an impact on end users, there may be some delay before the engineering team becomes aware of it. We call this the “Detection Time” or “Detection Lag”. As you can imagine, the quality of monitoring and testing on a system has the biggest impact on Detection Time.

Once an engineer has been alerted of an issue, there may be some delay before the engineer acknowledges and starts working on it. We call this the “Response Time” or “Response Lag”. We’ve found this time to be highly variable because it is ultimately a human issue: response times dramatically increase during off-hours, when people are usually away from their computers. Other factors include the depth of coverage (is there a backup in case the on-call engineer is sick?) and the engineers’ dispositions (are they suffering from alert fatigue caused by too many false alarms?).

Finally, the “Resolution Time” — or time it takes an engineer to investigate, analyze, and resolve an issue — depends greatly on the type of issue, but can generally be improved through training, clear procedures, good system observability (telemetry and logs, for example), and automation.
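
Put another way, each outage’s duration is roughly the sum of these three phases. Here is a minimal sketch of how one might compute them from an incident’s timeline (the field names are illustrative, not our actual incident schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """Key timestamps for a single outage (field names are illustrative)."""
    impact_start: datetime   # issue begins affecting users in production
    alert_fired: datetime    # monitoring detects the issue
    ack_time: datetime       # an engineer acknowledges and starts investigating
    resolved: datetime       # service is restored for users

    @property
    def detection_lag(self) -> timedelta:
        return self.alert_fired - self.impact_start

    @property
    def response_lag(self) -> timedelta:
        return self.ack_time - self.alert_fired

    @property
    def resolution_time(self) -> timedelta:
        return self.resolved - self.ack_time

    @property
    def total_duration(self) -> timedelta:
        return self.resolved - self.impact_start
```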

Focus On Outage Duration

When we examined the timelines of the outages from that year, we noticed that the biggest contributors to long downtimes were 1) delays in detection (in other words, high detection lag), and 2) lengthy investigation/analysis periods.

Delays in detection usually indicate a gap in monitoring. Sure enough, after a comprehensive audit of our service monitoring, we discovered that a number of Atom services were not set up in Pingdom (our availability monitoring tool) or New Relic (our application performance monitoring tool). We don’t currently have an automated way to create or delete monitors in New Relic and Pingdom when services are added or removed, so we implemented a quarterly audit (assisted by some scripts that pull data from the Pingdom and New Relic APIs) to ensure that monitors are up to date.
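The audit scripts are conceptually simple: pull the list of existing monitors and diff it against the list of services that should be monitored. A rough sketch of the Pingdom half is below; the service list and environment variable are placeholders, and the endpoint and response shape are based on Pingdom’s public REST API, so they should be verified against the current documentation:

```python
import os
import requests

# Placeholder: in practice this would come from a service registry or deployment manifest.
EXPECTED_SERVICES = {"video-service", "quiz-service", "auth-service"}

def pingdom_check_names():
    """Fetch the names of existing uptime checks from Pingdom.

    Assumes Pingdom's REST API (GET /api/3.1/checks with a bearer token);
    verify the endpoint and response shape against current docs.
    """
    resp = requests.get(
        "https://api.pingdom.com/api/3.1/checks",
        headers={"Authorization": f"Bearer {os.environ['PINGDOM_API_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return {check["name"] for check in resp.json()["checks"]}

if __name__ == "__main__":
    missing = EXPECTED_SERVICES - pingdom_check_names()
    if missing:
        print("Services with no uptime monitor:", ", ".join(sorted(missing)))
```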

Lengthy investigation/analysis periods could be due to issues with training or preparedness, but we found that, in most cases, investigations were delayed because the initial alert was inaccurate (for example, it pointed to a service that was related to the outage, but was not the root cause itself) or due to insufficient telemetry or logging. The introduction of distributed tracing in New Relic made a big difference for us, because it helped us more quickly narrow down the root cause of an issue.

We also worked with all of the teams on improving the quality of logging: writing clearer messages, providing more contextual data in logs, and also removing unnecessary logs. Furthermore, we divided our alerting into two tiers, so a service that fails due to an upstream service throws a different alert than one that fails due to an internal issue. This made it much easier for the team to see the root cause of an outage amidst a stream of alerts.
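One lightweight way to implement this kind of tiering is to classify failures at the call site, so that dependency errors and internal errors produce distinct log lines (and therefore distinct alerts). The sketch below is illustrative only; the service and dependency names are made up, and the alerting rules themselves would live in the monitoring tool:

```python
import logging

logger = logging.getLogger("quiz-service")  # service name is illustrative

class UpstreamError(Exception):
    """The failure originated in a dependency this service calls."""

class InternalError(Exception):
    """The failure originated inside this service."""

def fetch_question(question_id, question_store):
    try:
        payload = question_store.get(question_id)
    except TimeoutError as exc:
        # Tier 1: dependency failure -- the alert points at the upstream service,
        # so responders treat it as a symptom rather than the root cause.
        logger.error("upstream_failure dependency=question-store question_id=%s", question_id)
        raise UpstreamError("question-store timed out") from exc

    if "text" not in payload:
        # Tier 2: internal failure -- this service is the likely root cause.
        logger.error("internal_failure reason=malformed_payload question_id=%s", question_id)
        raise InternalError("malformed question payload")

    return payload
```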

The Result

With changes like the ones mentioned above, among quite a few others, we were able to dramatically drive down our max and median outage duration over the next two years. This is illustrated in the graph below, most notably by the higher proportion of blue circles (outages that took less than 90 minutes to resolve):

Graph showing platform outages from the following two years. X axis represents the date and y axis represents incident duration. Twenty four incidents are listed, with twelve lasting more than 2 hours, three lasting 90–120 minutes, and the remaining nine lasting less than 90 minutes.

We still had far more yellow circles (outages that lasted more than 2 hours) than we would have liked, but the median duration of those outages had improved as well, which indicates that the work we did on monitoring and training also helped in those situations.

Our 90-day availability was up to 99.5%, a dramatic improvement from our baseline of 98.6%, but still short of our goal. Additional work was needed. We noticed that, while our outage durations improved, the number of outages didn’t go down — we were still having about 12 per year, so we decided to pivot our focus to reducing the number of outages (n).

An analysis showed that the majority of outages were caused by code releases, which we felt should have been preventable. This jibes with the observation in Site Reliability Engineering that “The number one source of outages is change.” To address this, we introduced a number of stringent manual checkpoints in our release process — such as pre-release checklists, pre-mortems, and walkthroughs/dry runs.

Those process improvements allowed us to finish the last three months of the year with zero outages while still releasing code to meet critical business objectives. Since then, we have only dipped slightly below 100% availability a few times, and are averaging 99.98% availability for the year.

What’s Next

While we did manage to hit our goal, the controls we added to reduce release-related outages — which included additional manual QA, multiple meetings/real-time conversations, and checklists — carry a heavy cost in time and effort, and consequently in release velocity. We will be exploring ways to automate these manual failsafes so we can have a release process that is both robust and low-effort.

Meanwhile, we will continue to work on improving our incident response, expanding the pool of engineers qualified to respond to outages, while also improving training, monitoring, and tooling so they can resolve things more quickly.

That being said, a Google SRE recently published an analysis showing that, because outages are relatively rare but have a broad range of causes and triggers, it is very difficult to change a system’s average incident duration (MTTR) in a meaningful and consistent way.

Consequently, we will also be taking a closer look at the auto-healing and auto-scaling capabilities of Atom’s various components, to minimize the types of outages that require human intervention in the first place.
