Kubernetes Rolling Update and Termination Grace Periods

Thomas Césaré-Herriau
Published in Brex Tech Blog · Jun 4, 2020

We recently investigated an incident where asynchronous operations were not fully completed due to a faulty deployment of a service’s event consumer. We realized that neither this deployment’s configuration nor its graceful shutdown behavior was what our engineering teams expected.

Let me start by giving some context on Brex’s infrastructure.

Note: this article assumes some knowledge of Elixir/Erlang and Kubernetes.

Kubernetes, microservices, gRPC and Events

We run a microservices architecture, written mostly in Elixir, deployed and orchestrated by Kubernetes. We built our communication infrastructure on top of two primitives: synchronous RPCs using gRPC, and asynchronous operations through an Events infrastructure backed by Kafka.

We embraced Kubernetes because we knew we would soon have dozens of independent microservices to deploy and operate. After a year-long effort from the production engineering team, our entire infrastructure is defined as Terraform modules and resources in our mono-repo.

Here is a simplified diagram of our architecture:

Simplified diagram of Brex architecture

These decisions introduced complexity that every engineer building product features is exposed to. This recent incident illustrates why every piece of configuration should be thought through carefully, with an understanding of the underlying behavior.

The incident

The root cause of the incident was a bad image in which a database repository referenced the wrong configuration (for those of you familiar with Ecto, an Elixir ORM: the Ecto.Repo was defined with the wrong otp_app, so no configuration could be loaded). Upon bootstrap, the app’s supervision tree would start up, but this database repository would fail to connect to the database and crash, taking the whole app down with it.
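To make the failure mode concrete, here is a minimal sketch of that kind of misconfiguration. The module and application names are hypothetical, not our actual code:

defmodule MyService.Repo do
  # Hypothetical illustration of the bug class described above: the repo
  # looks up its configuration under :other_app, but the database settings
  # live under :my_service, so no configuration can be loaded and the repo
  # crashes when it tries to connect at startup.
  use Ecto.Repo,
    otp_app: :other_app, # should have been :my_service
    adapter: Ecto.Adapters.Postgres
end

# config/config.exs (where the settings actually live)
# config :my_service, MyService.Repo,
#   url: System.get_env("DATABASE_URL")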

The same code was used in a gRPC server and an Event Consumer, and we didn’t notice any visible impact on the RPC Server (otherwise, a bunch of synchronous operations would have started failing loudly). This was because the old ReplicaSet was still running: after being deployed, the new ReplicaSet entered a CrashLoopBackOff state and never became available (Ready), so some of the old pods were not terminated.

However, for the Event Consumer, it appeared that the old ReplicaSet had been terminated even though the new one was in CrashLoopBackOff!

The initial hypothesis was that the Readiness Probe had returned true because it wasn’t set up properly for pull-based services such as the Events Consumer, and that Kubernetes had then terminated the old pods. The Readiness Probe is an endpoint exposed by a service that Kubernetes uses to detect when the service is ready to accept incoming traffic. Pull-based services don’t strictly require this probe, but we rely on it to detect services that cannot start up, like in this case, hence the initial hunch that it had failed to serve its purpose.
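For reference, such a probe is usually just a lightweight HTTP endpoint exposed by the service. Here is a minimal sketch using Plug.Router; the module names and the ready?/0 check are hypothetical, not our actual implementation:

defmodule MyService.HealthRouter do
  use Plug.Router

  plug :match
  plug :dispatch

  # Kubernetes calls this path periodically; the Pod is marked Ready
  # only while it returns a successful status code.
  get "/healthz/ready" do
    # MyService.Health.ready?/0 is a hypothetical check that the app's
    # critical dependencies (database, Kafka, ...) are up.
    if MyService.Health.ready?() do
      send_resp(conn, 200, "ok")
    else
      send_resp(conn, 503, "not ready")
    end
  end

  match _ do
    send_resp(conn, 404, "not found")
  end
end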

As it turned out, this wasn’t the case.

Kubernetes RollingUpdate Deployments

Configuration

After some investigation, we discovered that it wasn’t a faulty readiness check that led to the bad behavior of our deployment. The Readiness Probe had actually worked as expected and never returned true. From Kubernetes’ perspective, the new ReplicaSet had never become ready.

What had happened was a bit more subtle. Let’s look at part of our Events Consumer Deployment configuration. We use Helm charts and templates, and some configuration values were shared between our RPC Server and Events Consumer deployments. It would resolve to the following:

spec:
  revisionHistoryLimit: 1
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 100%
  template:
    spec:
      terminationGracePeriodSeconds: 400

The maxUnavailable and maxSurge values were shared by the RPC Server and the Events Consumer, even though the RPC Server had a replica count of 4. A single pod is enough for our Events Consumer’s needs.

In “English”, please…

For Kubernetes, this means that:

  • The desired state is to have 1 Pod in the ReplicaSet.
  • Any intermediate state can have at most 1 unavailable Pod.
  • Any intermediate state can have at most 2 Pods in total (maxSurge of 100% for a replica count of 1).

In practice

Here is what happens when our new Event Consumer ReplicaSet is deployed, step by step:

  1. The new ReplicaSet is created, scheduling 1 Pod using the new image.
  2. The Pod is scheduled, bringing the current state to 2 Pods: 1 Available (old pod) and 1 Unavailable (new pod).
  3. As the desired state is 1 Pod and it is acceptable to have 1 Unavailable Pod, Kubernetes terminates the old pod, bringing the current state to 1 Unavailable Pod.
  4. The Unavailable Pod continues starting up.

This is actually equivalent to the Recreate Deployment strategy, where the old ReplicaSet is terminated before the new one is scheduled.

And here lies the problem that caused the incident: the old and functioning ReplicaSet would be terminated before the new buggy one would finish starting up. And it never did!

The solution

The solution was to not allow any unavailable pod:

spec:
  revisionHistoryLimit: 1
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 100%
  template:
    spec:
      terminationGracePeriodSeconds: 400

This ensures the new Pod is up and running (Live and Ready) before the old one is terminated. Success!

A note on Graceful Shutdown with Elixir

Kubernetes Termination Period

As you may have noticed in the previous code snippet, the “terminationGracePeriodSeconds” was set to 400. This configuration affects how Kubernetes will handle the termination of a Pod. How that works is pretty simple:

  1. Kubernetes decides to terminate a Pod.
  2. It sends a SIGTERM to the running processes (to each container).
  3. It waits up to terminationGracePeriodSeconds.
  4. If the processes aren’t terminated yet, it sends them a SIGKILL, thus forcing their termination.

This delay allows processes to gracefully shut down. For instance, for an RPC server, that means processing the current operations, replying, and terminating (once Kubernetes triggers the termination of a Pod, it stops routing traffic to it, so no new RPC calls will be received).

So why 400 seconds? This was meant to give some time for Consumers to process the current events and terminate.

The problem

Setting the parameter to 400 seconds relied on a wrong assumption about how graceful shutdown is handled by the Events Consumer.

The assumption was that the consumer would get 400 seconds to handle the events it had already received, and would then terminate.

The problem is, it doesn’t just work that way. Erlang and Elixir applications need to implement graceful shutdown themselves. By design, the Erlang supervision tree enables each process to gracefully terminate as needed. However, all processes must coordinate if they need to delay the shutdown by more than a few seconds (the default shutdown timeout being 5 seconds), which is complex to achieve when an app relies on several libraries (the ORM, the gRPC library, the Kafka client…). We decided to use an alternative solution, k8s_traffic_plug, a library that handles graceful shutdown.

Without going into too much detail (you can find more in this blog post), here is a simplified version of what happens within the Erlang/Elixir application upon receiving a SIGTERM (sketched in code after the list):

  1. The SIGTERM signal is captured by the k8s_traffic_plug implementation of an erl_signal_server handler, a gen_event (a more advanced GenServer) that converts the SIGTERM into the termination of the application’s supervision tree and dependencies.
  2. Instead of terminating immediately, the handler sends itself a delayed `stop` message (the default duration being 20s).
  3. Upon receiving the `stop` message, it calls `init:stop()` which terminates the application’s supervision tree and dependencies.
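As an illustration, here is a minimal sketch of this pattern. This is not the actual k8s_traffic_plug code; the module name and the delay handling are simplified assumptions:

defmodule MyApp.DelayedStopHandler do
  # Sketch of a :gen_event handler swapped in place of the default
  # :erl_signal_handler, so that SIGTERM triggers a delayed :init.stop/0
  # instead of an immediate shutdown.
  @behaviour :gen_event

  @shutdown_delay_ms 20_000

  def swap_in do
    # SIGTERM is delivered to the :erl_signal_server gen_event manager
    # (the OTP default); replace its default handler with this one.
    :os.set_signal(:sigterm, :handle)

    :gen_event.swap_sup_handler(
      :erl_signal_server,
      {:erl_signal_handler, []},
      {__MODULE__, []}
    )
  end

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_event(:sigterm, state) do
    # Don't stop right away: give in-flight work time to drain,
    # then remind ourselves to shut down.
    Process.send_after(self(), :stop, @shutdown_delay_ms)
    {:ok, state}
  end

  def handle_event(_other, state), do: {:ok, state}

  @impl true
  def handle_info(:stop, state) do
    # Gracefully terminates the supervision tree and all applications.
    :init.stop()
    {:ok, state}
  end

  def handle_info(_msg, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}
end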

You may have identified the problem: we had kept the default configuration, which worked well for our RPC servers, so the Elixir application terminates itself after 20 seconds! Kubernetes therefore never waits for the whole terminationGracePeriodSeconds period, as the pod’s process exits before it expires. You can configure this delay differently within your application, but…

There is another problem…

Another assumption defeats the purpose of having such a long termination period in the first place.

In a pull-based model, there is no inbound traffic for Kubernetes to cut off when terminating the Pod. The consumer itself is responsible for ceasing to consume incoming messages. And our Events Consumer library did not implement graceful shutdown, which means that after processing a batch of events, it would fetch another one… and continue until the 20 seconds expired!

This was an opportunity to address this issue and implement graceful shutdown of consumers!
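As a sketch of what that implementation can look like (this is not our actual consumer library; the fetch and handle callbacks are hypothetical), the consumer’s poll loop simply checks a draining flag before fetching the next batch:

defmodule MyApp.DrainingConsumer do
  # Hypothetical poll-loop consumer: once told to drain, it finishes the
  # batch in flight but never fetches a new one, then stops cleanly.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # To be called from the SIGTERM handler before :init.stop/0 runs.
  def drain, do: GenServer.cast(__MODULE__, :drain)

  @impl true
  def init(opts) do
    send(self(), :poll)
    {:ok, %{draining?: false, fetch: Keyword.fetch!(opts, :fetch), handle: Keyword.fetch!(opts, :handle)}}
  end

  @impl true
  def handle_cast(:drain, state), do: {:noreply, %{state | draining?: true}}

  @impl true
  def handle_info(:poll, %{draining?: true} = state) do
    # Draining: do not fetch another batch; stop the consumer instead.
    {:stop, :normal, state}
  end

  def handle_info(:poll, %{fetch: fetch, handle: handle} = state) do
    # Fetch and process one batch, then schedule the next poll.
    fetch.() |> Enum.each(handle)
    send(self(), :poll)
    {:noreply, state}
  end
end

Combined with the delayed-stop handler sketched earlier, the SIGTERM path would call drain/0 first, so the consumer stops pulling new batches well before the grace period expires.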

Wrapping it up

All our Events Consumers’ configurations have been updated to:

spec:
  revisionHistoryLimit: 1
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 100%
  template:
    spec:
      terminationGracePeriodSeconds: 25

This sets terminationGracePeriodSeconds to a value that matches the actual shutdown behavior (the application exits itself after about 20 seconds) instead of suggesting a grace period we don’t really use, and maxUnavailable: 0 ensures a safe rollout in case of a broken new image.

Takeaways

  • Kubernetes abstracts away a lot of complexity, but properly leveraging the platform requires a close understanding of how it functions.
  • Kubernetes is complex, and exposing all the available configurations to a developer building features will lead to cargo-culting and ill-configured services.
  • Graceful shutdown is hard to implement right, as all the pieces of the infrastructure and the software must be properly configured and implemented in harmony.

Thanks for reading!
