Mitigating an AWS Instance Failure with the Magic of Kubernetes

On Tuesday, AWS experienced some significant issues in us-east-1, the same region where we host the majority of our infrastructure. The first reports were of an S3 outage affecting many popular sites. Of course, that initial outage grew to encompass a large number of AWS services in the region.

Screenshot of the AWS status page part of the way through this incident.

As the afternoon went on and the scope of the issues grew, it became increasingly hard to believe that our systems were unaffected. Although we have several different monitoring solutions in place, I kept manually checking our systems to verify that everything was still functional for us.

Then it happened. Around 2 hours into this incident, one of our instances (a Kubernetes node) became unresponsive. In many systems this would mean the Ops team getting flooded with alerts, and an overall feeling of panic as the team tries to figure out how to redistribute resources. On Tuesday, if this meant needing to spin up a new server in us-east-1, you would have been out of luck for several hours. Realistically, this kind of failure on a day like Tuesday could have resulted in hours of downtime. With Kubernetes, there was never a moment of panic, just a sense of awe watching the automatic mitigation as it happened.

Kubernetes immediately detected what was happening. It created replacement pods on other instances we had running in different availability zones, bringing them back into service as they became available.

All of this happened automatically and without any service disruption. There was zero downtime. If we hadn’t been paying attention, we likely wouldn’t have noticed everything Kubernetes did behind the scenes to keep our systems up and running.

Although Kubernetes would likely find a way to recover from this automatically in most cases, there are a few things you can do to help ensure this happens smoothly.

  1. Distribute your nodes across multiple availability zones. Although there are lots of ways to do this, Kops provides a very simple way to provision multi-AZ, production-ready Kubernetes clusters. Kops can also output Terraform configurations if you already use (or want to use) Terraform for provisioning.
  2. Ensure that your nodes have capacity to handle at least one node failure. Even if the Auto Scaling Group automatically creates a new node instance, the pods from a failed node need somewhere to run before that new node is ready. If you don’t have sufficient capacity, this kind of automatic failover could end up overwhelming your healthy nodes until a new one can spin up. For example, one of our clusters has 4 nodes. We aim to keep utilization on those nodes below 60%, so that if we had to run on 3 nodes, utilization would rise to at most about 80% on those 3 (4 × 60% ÷ 3 = 80%).
  3. Use at least 2 pods per deployment. Without additional pods running on healthy nodes, we would have encountered downtime when this instance failed. It takes time for Kubernetes to create new pods, and if you’ve only got one to begin with, there’s only so much you can do. If you’re using deployments (you should be), adding pods is as simple as setting replicas: 2 in the deployment spec.
  4. Use readiness and liveness probes for everything you can. Kubernetes makes these so simple to set up, and they can be lifesavers. Seriously, here’s all the configuration that’s required to set up a readiness probe on a container (a liveness probe takes the same fields under livenessProbe):
     readinessProbe:
       httpGet:
         path: /monitoring/alive
         port: 3402
       initialDelaySeconds: 10
       timeoutSeconds: 1
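
To show how items 3 and 4 fit together, here is a minimal sketch of a complete Deployment. It is an illustration rather than our actual manifest: the name example-api, the labels, and the image are hypothetical placeholders, and apps/v1 is assumed as the API version (older clusters used a different one). Only the probe path, port, and timings come from the snippet above.

```yaml
# Hypothetical Deployment: name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2                  # at least two pods, so one can keep serving if a node fails
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          image: registry.example.com/example-api:1.0.0   # placeholder image
          ports:
            - containerPort: 3402
          readinessProbe:      # while failing, the pod receives no Service traffic
            httpGet:
              path: /monitoring/alive
              port: 3402
            initialDelaySeconds: 10
            timeoutSeconds: 1
          livenessProbe:       # when failing, the container is restarted
            httpGet:
              path: /monitoring/alive
              port: 3402
            initialDelaySeconds: 10
            timeoutSeconds: 1
```

Pointing both probes at the same endpoint keeps the sketch short. In practice you may want a liveness check that is more forgiving than the readiness check, so that a pod that is only temporarily busy is taken out of rotation rather than restarted.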

When a readiness probe fails, Kubernetes stops routing Service traffic to that pod until the probe passes again. If a liveness probe fails, Kubernetes restarts the container. Beyond providing smoother recovery from failure, readiness probes can provide zero-downtime deploys, or even just prevent you from deploying a broken image. During a rolling update of a deployment, Kubernetes won’t remove existing pods until new pods are successfully responding to readiness probes.
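
How strictly a rollout waits for readiness depends on the Deployment’s update strategy. The fragment below is a sketch of one way to make that explicit. The values are illustrative assumptions, not settings taken from our clusters, and the fragment belongs inside a full Deployment manifest.

```yaml
# Fragment of a Deployment spec (not a complete manifest); values are illustrative.
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired number of Ready pods
      maxSurge: 1         # allow one extra pod while the new version starts up
```

With maxUnavailable set to 0, an old pod is only removed once a replacement passes its readiness probe, so an image that never becomes ready stalls the rollout instead of replacing the working pods.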

At Spire, we’ve been using Kubernetes for a little over 9 months at this point, the last 6 of which were in production. It’s transformed our workflow and provided us with a significantly more reliable product. If you’re considering a move to Kubernetes, I highly recommend it. It’s an incredibly powerful tool that is guaranteed to leave you in awe at least a few times.

A big thanks to everyone who’s contributed to Kubernetes over the years to make magic like this happen.

Edited to include liveness probes.