The importance of near misses

Vanguard Tech
Apr 14, 2023 · 10 min read

This blog post is adapted from content originally presented by Christina Yakomin at the Learning from Incidents Conference in Denver in February 2023. Though the conference was themed around incidents, the event this article describes was not an incident for Vanguard; it was an incredibly powerful near miss, which bolstered confidence in our fledgling cloud program and helped propel our migration to the public cloud forward in a meaningful way.

The “incident”

On the morning of Saturday, August 31, 2019, one of the data center facilities in Amazon Web Services’ (AWS) us-east-1 region lost power, and its backup generators failed. At the time, this data center was one of ten facilities that made up one of six availability zones within the us-east-1 region, accounting for about 7.5% of the total servers for that availability zone. The data center lost power around 7:30 a.m. Eastern time, and the backup generators kicked on. But around 9:00 a.m., those backup generators failed, taking down all servers in the building. A few hours after power was restored to the data center, AWS had recovered 99.5% of the affected systems.

A diagram of the relationships between regions, availability zones, and data center facilities.

Understanding the impact

For any customers that had deployed enough redundancy in their systems — spanning multiple availability zones, or even multiple regions — the failure of just one data center should not have had a major impact. There were plenty of compute resources still available in the remaining availability zones and regions to handle the affected workloads. However, frantic news coverage following the event made it clear that some companies weren't operating this way.

Though many large enterprises had made significant progress in adoption of public cloud by 2019, some had focused on a “lift and shift” approach — taking whatever architectures they were using on-premises in their self-managed data centers and re-creating them in the cloud. This may have accelerated their cloud migrations, but it didn’t allow these companies to take advantage of the scalability and regional redundancy that public cloud providers offer, leaving them potentially vulnerable to localized failures like this one.
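As an illustration of the kind of redundancy that protects against a failure like this one, an Auto Scaling group spread across several Availability Zones keeps capacity running when one zone (or one facility within it) goes dark. The following is only a minimal sketch; the group name, launch template, and subnet IDs are hypothetical and not Vanguard's configuration.

```python
import boto3

# Hypothetical example: an Auto Scaling group spread across three
# Availability Zones, so losing one zone (or one facility within it)
# leaves capacity running elsewhere. All names and IDs are placeholders.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier-example",
    LaunchTemplate={"LaunchTemplateName": "web-tier-template", "Version": "$Latest"},
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=6,
    # One subnet per Availability Zone; instances are balanced across them.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    # Replace instances that fail load balancer health checks,
    # not just EC2 status checks.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```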

Investigation at Vanguard

When I came to the office on Tuesday morning after the Labor Day holiday weekend and saw news articles about an AWS outage, I was confused, and a bit concerned. Since the us-east-1 region was the primary AWS region being used by Vanguard at the time, I wasn’t sure why I hadn’t been paged to respond to any incidents over the weekend. My first thought upon reading the news wasn’t, “wow, it’s great we didn’t experience an outage,” but rather, “oh no, do we have a gap in our monitoring strategy?”

Our cloud program was still relatively new and emerging, and I wanted to make sure that we'd covered all of our bases, so a co-worker and I set about investigating any impacts to our key cloud platforms. Sure enough, we had lost several EC2 instances in our AWS environment during the power outage. We analyzed and documented the observed changes in system behavior across three of our production platform components as a result of these EC2 instance failures.
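As a minimal sketch of the kind of check this involved (not our actual tooling), CloudTrail can be queried for EC2 terminations recorded during the outage window; the time bounds below roughly match the event.

```python
import boto3
from datetime import datetime, timezone

# Illustrative sketch: list EC2 terminations recorded by CloudTrail during
# the outage window, to see which instances were lost and replaced.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

paginator = cloudtrail.get_paginator("lookup_events")
pages = paginator.paginate(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "TerminateInstances"}],
    StartTime=datetime(2019, 8, 31, 11, 0, tzinfo=timezone.utc),  # ~7:00 a.m. ET
    EndTime=datetime(2019, 8, 31, 20, 0, tzinfo=timezone.utc),    # ~4:00 p.m. ET
)

for page in pages:
    for event in page["Events"]:
        instance_ids = [
            r["ResourceName"]
            for r in event.get("Resources", [])
            if r.get("ResourceType") == "AWS::EC2::Instance"
        ]
        print(event["EventTime"], instance_ids)
```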

A brief slowdown for our caching system

In 2019, many of our cloud-hosted microservices were leveraging a centralized platform for caching so they wouldn't need to reach back out to our on-premises systems for data retrieval. At 8:44 a.m. on the day of the AWS outage, one of the cache servers in the active cache cluster became unhealthy. This server remained active in the cluster in this unhealthy state for five minutes before it was automatically identified as impaired and removed from the cluster.

Throughout the five-minute period prior to its removal, a portion of requests continued to be routed to the bad cache server, which impacted response times. Between 8:44 a.m. and 8:49 a.m., about 6.8% of requests for externally facing apps and services on the platform that used this cache cluster were "unhealthy," as determined by a combination of response time and status code. Only 0.3% of requests returned HTTP 5XX errors; the remaining unhealthy designations were due exclusively to elevated response times. Once the cache server was no longer active in the cluster, the elevated response times subsided.

The health of requests to the external application platform, including response time percentiles.
Average latency in milliseconds, measured at the ELB in front of the external webservers (time in UTC).
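The "unhealthy" designation above combined status code with response time. A toy version of that classification might look like the sketch below; the 1,000 ms latency threshold is an assumption for illustration, not our actual service-level objective.

```python
# Toy sketch of the health classification described above: a request is
# "unhealthy" if it returned a server error OR exceeded a latency threshold.
# The threshold is a stand-in, not Vanguard's actual objective.
LATENCY_THRESHOLD_MS = 1000

def is_unhealthy(status_code: int, response_time_ms: float) -> bool:
    return status_code >= 500 or response_time_ms > LATENCY_THRESHOLD_MS

def unhealthy_rate(requests: list[tuple[int, float]]) -> float:
    """requests is a list of (status_code, response_time_ms) pairs."""
    if not requests:
        return 0.0
    bad = sum(1 for code, ms in requests if is_unhealthy(code, ms))
    return bad / len(requests)

# Example: two slow-but-successful requests and one 502 out of ten requests.
sample = [(200, 120)] * 7 + [(200, 2400), (200, 1800), (502, 300)]
print(f"{unhealthy_rate(sample):.1%} of requests unhealthy")  # prints 30.0%
```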

After removal from the cache cluster, the virtual server was marked as unhealthy and terminated automatically by AWS, which triggered an Auto Scaling event because the number of running instances had dropped below the desired number. The new instance started up at 9:01 a.m. on a host in one of the remaining functional data centers in the same availability zone, and by 9:15 a.m. the new cache server was healthy and had been added back to the cluster.

The number of “members” in the active cache cluster.
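That replacement shows up as an activity on the Auto Scaling group, which is how this kind of timeline can be reconstructed after the fact. A hedged sketch of such a check follows; the group name is a placeholder.

```python
import boto3

# Illustrative check: review recent Auto Scaling activities to confirm that
# the unhealthy cache server was terminated and a replacement was launched.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

response = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="cache-cluster-example",  # placeholder name
    MaxRecords=20,
)

for activity in response["Activities"]:
    print(activity["StartTime"], activity["StatusCode"], activity["Description"])
```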

Though the cache cluster was operating at reduced capacity for approximately 30 minutes, no adverse impacts to request health were observed after the initial identification of the unhealthy cache server. This indicated we had provisioned sufficient redundancy to handle our typical request load in the case of a lost cache server. Because the response time increase was brief, impacted only a small percentage of requests, and stabilized without any human intervention, our alert thresholds were not breached and no page was sent to our on-call engineers.
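One common way to keep a brief, self-healing blip like this from paging anyone is to require the breach to persist across multiple evaluation periods before an alarm fires. Below is a sketch of that pattern; the metric, dimensions, and threshold are hypothetical rather than our actual alerting configuration.

```python
import boto3

# Hedged sketch: an alarm that only fires when latency stays elevated for
# 3 of the last 4 five-minute periods, so a brief blip that stabilizes on
# its own does not wake anyone up. All names and values are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="external-app-latency-sustained",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/external-web/0123456789abcdef"}],
    Statistic="Average",
    Period=300,               # five-minute evaluation periods
    EvaluationPeriods=4,
    DatapointsToAlarm=3,      # must breach in 3 of the last 4 periods
    Threshold=1.0,            # seconds; illustrative threshold only
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```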

Microservice platform

At the time of this outage, Vanguard's production platform for containerized applications in AWS was a complex system made up of over 60 BOSH-directed virtual machines (VMs) per environment, not including webservers and other utility instances. These VMs spanned three availability zones in the us-east-1 region, including the availability zone that was impacted by the power outage. The platform infrastructure used to host internal applications (those used only by Vanguard employees) lost just one EC2 instance during the outage, while the external platform infrastructure (for applications used by Vanguard's clients) lost five VMs, including one instance for hosting application containers, one router instance, and various other VMs.

One of the benefits of BOSH is that any deviation from the intended configuration will be automatically detected and remediated. However, this process is not immediate. In the case of the cache cluster, where the re-creation of the VM was managed by AWS-native Auto Scaling functionality, about 15 minutes elapsed between the cache server VM becoming unhealthy and its replacement by the AWS Auto Scaling group. With the container platform's BOSH VMs, the replacement process took significantly longer. The screenshot below depicts the CPU utilization of one of the lost VMs in the external environment. The original VM stopped reporting CPU utilization around 8:40 a.m., but its replacement did not spin up until approximately 11 a.m.

It took about two and a half hours for one of the platform’s VMs to be re-created by the BOSH director.
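One rough way to quantify that replacement lag after the fact is to look at the gap between the last metric datapoint reported by the lost VM and the first datapoint reported by its replacement, as in this sketch. The instance IDs are placeholders, and a real analysis would need to correlate BOSH job names with the underlying AWS instance IDs.

```python
import boto3
from datetime import datetime, timezone

# Rough sketch: estimate how long re-creation took by finding the gap between
# the last CPUUtilization datapoint from the lost VM and the first datapoint
# from its replacement. Instance IDs are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def datapoint_times(instance_id):
    """Sorted timestamps of five-minute CPUUtilization datapoints for one instance."""
    points = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime(2019, 8, 31, 10, 0, tzinfo=timezone.utc),
        EndTime=datetime(2019, 8, 31, 18, 0, tzinfo=timezone.utc),
        Period=300,
        Statistics=["Average"],
    )["Datapoints"]
    return sorted(p["Timestamp"] for p in points)

old = datapoint_times("i-0123456789aaaaaaa")  # VM lost in the outage (placeholder ID)
new = datapoint_times("i-0123456789bbbbbbb")  # its re-created replacement (placeholder ID)
if old and new:
    print("approximate replacement gap:", new[0] - old[-1])
```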

This next screenshot depicts the CPU utilization of the BOSH director VM, which was noticeably higher than usual for several hours. This elevated CPU utilization, while still only a small percentage of the VM's overall capacity, is an indication of the BOSH director taking action to re-create unhealthy or missing VMs.

BOSH director's CPU utilization was elevated between 8:45 a.m. and 2:45 p.m. EDT (screenshot depicts times in UTC).

Despite several hours of reduced capacity, the health of the application container platform remained consistent. Once again, there was enough redundancy provisioned in the environment to allow the remaining healthy VMs to handle the traffic load until replacements could be created by the BOSH automation.

Our API Gateway

Both Vanguard's self-hosted, third-party API Gateway and our Identity Provider (IdP) Gateway experienced brief periods of downtime during the power outage. Both gateways were self-hosted, deployed to groups of EC2 instances that we maintained — unlike more fully managed products such as the AWS API Gateway service, which don't provide as much visibility into the underlying infrastructure. It was unclear from our investigation how many of the unhealthy instances we observed were actually in the affected data center and how many were impacted by the status of the other instances with which they interacted. What we were able to clearly identify was that, for 11 minutes, there were no healthy hosts for the IdP Gateway, which prevented requests from ever getting through to the API Gateway.

The number of healthy hosts for the IdP (times in UTC).
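The healthy-host counts in these charts come from load balancer metrics. A sketch of pulling that series follows, assuming the gateway hosts sat behind an Application Load Balancer target group; the dimension values are placeholders.

```python
import boto3
from datetime import datetime, timezone

# Illustrative query for the healthy-host count behind a gateway, assuming an
# Application Load Balancer target group. Dimension values are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HealthyHostCount",
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/idp-gateway/0123456789abcdef"},
        {"Name": "LoadBalancer", "Value": "app/idp-gateway/0123456789abcdef"},
    ],
    StartTime=datetime(2019, 8, 31, 12, 0, tzinfo=timezone.utc),
    EndTime=datetime(2019, 8, 31, 15, 0, tzinfo=timezone.utc),
    Period=60,
    Statistics=["Minimum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```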

The API Gateway also had no healthy hosts available for a few minutes, starting at 9:10 a.m. just as the IdP did, but a healthy API Gateway host became available before service was restored for the IdP. Both the IdP and API Gateways returned to their usual operating capacities less than 20 minutes after the impact began.

The number of healthy hosts for the API Gateway (times in UTC).

The impact of the lack of healthy hosts was brief downtime: about 10 minutes during which no HTTP 200s were returned by our API gateway. Fortunately, this all occurred on the Saturday of a holiday weekend, which was not a high-traffic time for our systems. More importantly, in 2019 our API gateway product hadn't yet been adopted by our most critical services, so 10 minutes of downtime for APIs using the gateway was not grounds for a major incident, or even an alert to our on-call engineers, since the system recovered quickly on its own.

The number of 200 HTTP status codes returned by the API Gateway (times in UTC).

However, the near miss investigation sparked questions about the system's stability. Why would one or two lost hosts bring down the rest of the previously healthy hosts, causing a brief total outage rather than the capacity reduction we had observed with our other systems? It was clear that some combination of additional redundancy, better resilience mechanisms, and changes to the system health checks was required. The teams responsible for these products took on further investigation and remediation after I produced my internal write-up. Now several years removed from this event, after plenty of rigorous testing and system improvements, we have much more confidence in the resilience of our API gateways.
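Part of that follow-up involved revisiting health checks. Purely as an illustration (not the actual change that was made), tightening a target group's health check so an impaired host is ejected quickly, without flapping healthy hosts in and out, might look like this:

```python
import boto3

# Hypothetical illustration of tuning load balancer health checks so an
# impaired gateway host is removed quickly without flapping healthy ones.
# The target group ARN and the values below are placeholders.
elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                   "targetgroup/api-gateway-example/0123456789abcdef",
    HealthCheckPath="/health",        # check application health, not just TCP reachability
    HealthCheckIntervalSeconds=10,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=3,          # require 3 passes before re-adding a host
    UnhealthyThresholdCount=2,        # eject after 2 consecutive failures
)
```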

Publishing our findings

When this near miss occurred a few years ago, we took the opportunity to document what we learned. This was a way for us not only to celebrate the robustness of what we had implemented, but also to acknowledge the architecture patterns that had made it possible, ensuring we continued to follow those practices in the buildout of highly available, cloud-native applications and platforms as we proceeded with our technology modernization efforts.

By publishing an internal write-up with our findings, we provided our executive leadership team and business partners with concrete evidence of the value of our hard work modernizing our systems. For the first time, they were able to see direct value in what we were delivering, and it made them far more confident in the public cloud overall.

Operationalizing “near miss” analysis

Three and a half years later, I still hear folks reference this internal blog post from time to time when teaching others about the cloud, even though our presence in the public cloud doesn’t quite look the same anymore. We now consider those cache servers, that microservice platform, and that API gateway provider to be our “legacy” systems. Given that, why don’t I have a more recent report to share?

The truth is, these near misses can be tough to spot, and even tougher to justify spending significant time analyzing. Not every cloud provider incident is going to make headlines like the AWS outage did. In the public cloud environment, we’re losing compute instances and self-healing and experiencing scaling events all the time. In a way, these are ALL near misses, just like the one I wrote about. Without the redundancy or scalability that we’d implemented, we’d be subjecting our clients to downtime or other adverse effects when these things happen.

Though it’s unlikely that anyone could spare the time and resources to write a report every time their applications auto scale, I do have some suggestions for how to ensure we don’t miss the opportunity to keep learning from non-incidents.

Once in a blue moon, something really exciting and engaging will make news, and I’ll be able to justify a large-scale investigation like the one documented in this blog post. For example, if a major regional outage were to occur for our cloud provider, but we were prepared with automatic failover to another region, it’s clear that this would be a near miss worth celebrating and analyzing to ensure it always goes that well.

In the absence of a major event like that, Site Reliability Engineers or operators (or anyone responsible for maintaining system availability) might want to take a look at near miss trends on a recurring basis — quarterly, for example — to stay informed about how the system is operating today. These engineers could conduct regular “near miss” evaluations to look for occurrences of certain behaviors indicative of near misses and analyze changes and trends in these occurrences over time.

The types of system behavior to look out for include the following (a sketch of one such recurring scan follows the list):

  • An automated or manual failover taking place.
  • A database backup being promoted to primary, or being used to restore data.
  • An instance or service crashing and restarting without incident.
  • A scaling event occurring.
  • A canary deployment being automatically rolled back before widespread client impact is felt.
  • An alert firing early enough to signal that an engineer needs to respond and take action to repair a system before the impact is widespread or severe.
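Here is a rough sketch of what one piece of that recurring scan could look like, counting Auto Scaling activity per group; the group names are placeholders, and a fuller evaluation would also cover failovers, restores, crash-restarts, canary rollbacks, and early-warning alerts.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Rough sketch of one signal in a recurring "near miss" scan: how often each
# Auto Scaling group replaced or scaled instances during the review window.
# Group names are placeholders. AWS retains roughly six weeks of scaling
# activity history, so a full quarterly review would need CloudTrail or
# exported logs instead.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

WINDOW_START = datetime.now(timezone.utc) - timedelta(weeks=6)
GROUPS = ["cache-cluster-example", "external-web-example"]

for group in GROUPS:
    count = 0
    paginator = autoscaling.get_paginator("describe_scaling_activities")
    for page in paginator.paginate(AutoScalingGroupName=group):
        count += sum(1 for a in page["Activities"] if a["StartTime"] >= WINDOW_START)
    print(f"{group}: {count} scaling activities since {WINDOW_START:%Y-%m-%d}")
```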

Just as we analyze incidents after they occur, it is important to spend time analyzing situations in which our systems successfully adapted to failure scenarios. These near misses can be more difficult to detect than outages, but they can tell us a lot about the way our systems are operating.

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 30 million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.

Photo by Kelly Sikkema on Unsplash
