Making zero downtime deployments a reality

Why we take a continuous approach and how we minimise impact on end user experience

Published in

Engineering at Cytora

5 min read · Nov 15, 2019

At Cytora we are on a mission to make insurance programmable. Insurance companies can leverage our API platform to get more accurate information about the risks posed to a particular commercial property.

As a platform team, we try to get our services as close to the 12-Factor App ideal as possible. Part of this effort involves running services in a managed Kubernetes cluster (or as Lambdas). All of our services are dockerised from the ground up, which allows us to practice immutable infrastructure and, by extension, improves our automation capabilities.

What is a zero downtime deployment?

Technically speaking a zero downtime deployment is defined as:

A software-update to a service that does not require the service to be taken out of production in order for the new update to be available to the users.

While this is a good first definition, at Cytora we also look at how a service update would impact the end-user experience.

Specifically, this means our working definition of a zero downtime deployment is:

A software-update to a service that does not negatively impact the end-user-experience.

While the meaning of these two statements appears the same at first glance, this change of perspective has a big impact on how we architect our services, how we automate our deployment process and how we handle feature updates.

Managing feature updates and backward compatibility is not an obvious concern, but it has the greatest impact on end users. In particular, if we ship a zero downtime change that breaks the current stable API then, from an end user’s point of view, this is even worse than an update with downtime: instead of simply waiting, they now have to fix their own system as well.

Why is it important?

A whole host of people can be affected by how we publish our APIs, so zero downtime deployment is important for many reasons!

From the platform team’s perspective, a more robust and automated process gives us tighter feedback loops and, as a result, makes us better at continuous delivery. Eliminating the need for a human in the loop forces us to build a faster, more robust and repeatable deployment process.

From a developer standpoint, deployment is now an uninteresting process. There’s nothing special going on, just the robots carrying out the same exact steps every time. Deployment now feels safer, so you don’t think twice about pushing an update. There are enough tests and environments to gate a broken deployment well before it hits users.

From an end-user standpoint, the API should now feel more reliable as new features or bug-fixes now magically appear without any downtime. Additionally, since we are committed to making backwards compatible changes, the APIs will return responses in the format the clients are used to consuming.

Continuous delivery takes a lot of planning and it is as much of a technical problem as it is a cultural one. For continuous delivery to work, we must have a trustworthy deployment system in place with enough tests to give us confidence in our deployment through fast and accurate feedback. We must also think of how to push these updates out without disrupting existing workflows (e.g. designing backwards compatible code and using versioned endpoints).

For the purposes of this post we are going to focus on a small part of the process, namely the Kubernetes objects that are involved in making this happen.

Zero downtime deployments in Kubernetes

At Cytora, we have found that the best strategy for us is to follow these three steps to minimise disruption during a deployment:

1. Create a new revision of a deployment
2. Wait for the new pods to become ready
3. Scale down the old revision

This approach has proven to give us more stability during deployments, and is a good example of how immutable infrastructure can help with automating deployments.
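
In Kubernetes terms, these three steps correspond to a rolling update strategy. As a rough sketch, the relevant fragment of a Deployment spec (using the same values as the full manifest later in this post) can be annotated like this:

```yaml
# Fragment of a Deployment spec, not a complete manifest.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # step 1: start one pod of the new revision on top of the desired replica count
    maxUnavailable: 0  # steps 2 and 3: an old pod is only removed once a new pod is Ready
```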

If we didn’t take this approach, we’d have to somehow get into the running containers and somehow update the running binary (note the somehows here!). Two Kubernetes objects make zero downtime updates possible: the Deployment and the PodDisruptionBudget.

The first thing that we need to do is define a Pod Disruption Budget, which tells Kubernetes how many of a service’s pods may be unavailable at once during voluntary disruptions such as node drains:

apiVersion: policy/v1  # policy/v1beta1 at the time of writing; that version was removed in Kubernetes 1.25
kind: PodDisruptionBudget
metadata:
  name: my-svc-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      service-name: my-svc

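It’s a very simple object: with maxUnavailable: 0, Kubernetes refuses any voluntary eviction (for example during a node drain) of pods labelled service-name: my-svc, whether they belong to the old or the new revision. The rolling update itself is not governed by this budget; that is controlled by the Deployment’s own strategy.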
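
A budget of zero also means that a node hosting these pods cannot be drained until the budget is relaxed. As a sketch of one alternative, assuming the two replicas used in this post, a budget based on minAvailable keeps one pod serving while still letting a drain proceed one pod at a time. The value 1 is an illustrative assumption, not Cytora’s configuration:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-svc-pdb
spec:
  minAvailable: 1            # assumed value: with 2 replicas, at most one pod may be evicted at a time
  selector:
    matchLabels:
      service-name: my-svc   # same label as the Deployment's pods
```

A PodDisruptionBudget accepts either minAvailable or maxUnavailable, not both, so this would replace the budget above rather than sit alongside it.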

Next, we need to configure our deployment object:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-svc
  labels:
    service-name: my-svc
spec:
  revisionHistoryLimit: 1
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      service-name: my-svc
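  template:
    metadata:
      labels:
        service-name: my-svc
    spec:
      # Upper bound on the whole shutdown, including the preStop hook
      terminationGracePeriodSeconds: 60
      containers:
        - name: my-svc
          image: my-svc:latest  # placeholder; the original post does not show the image
          lifecycle:
            preStop:
              exec:
                # Runs before the container receives SIGTERM
                command: [ "/bin/sh", "-c", "sleep 10" ]
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          # Gates traffic: the pod only receives requests while this passes
          readinessProbe:
            httpGet:
              path: /healthcheck
              port: 8080
            failureThreshold: 1
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          # Restarts the container after three consecutive failures
          livenessProbe:
            httpGet:
              path: /healthcheck
              port: 8080
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2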

This is a fairly complex object so let’s unpack it and relate each part back to our requirements.

We have asked for multiple pods (replicas: 2) for each deployment, giving us some redundancy. We then need to know that newly created pods are healthy and can take traffic, and for this we use the readinessProbe and livenessProbe: a pod only receives traffic once its readiness probe succeeds, while a failing liveness probe causes the container to be restarted. At this point we have both the old revision’s pods and the new revision’s pods serving traffic.
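
Readiness matters because traffic normally reaches pods through a Service, and a Service only routes to pods whose readiness probe is passing. The post does not show Cytora’s Service, so the following is a minimal sketch under assumptions: the name my-svc, the ClusterIP type and port 80 are illustrative, while the selector and the http port name come from the Deployment above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc              # assumed name
spec:
  type: ClusterIP           # assumed; internal-only access
  selector:
    service-name: my-svc    # matches pods of both the old and the new revision
  ports:
    - name: http
      port: 80              # assumed Service port
      targetPort: http      # the named container port (8080) in the Deployment
      protocol: TCP
```

Because the selector matches on the label rather than on a revision, old and new pods sit behind the same Service during the rollout, and each pod is added to or removed from its endpoints as its readiness changes.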

The next part of the Deployment is crucial: the pods of the old revision need to terminate gracefully so that they finish serving existing connections and drain cleanly, rather than being taken down abruptly and dropping those connections! For this reason we use the preStop lifecycle hook, which delays shutdown by 10 seconds so the pod can be removed from load balancing before it receives SIGTERM, and terminationGracePeriodSeconds, which tells Kubernetes that shutting down this pod might take up to a minute. The kubelet will not forcibly kill the container until that period has elapsed.

Closing thoughts

We have found that continuous delivery can be challenging in many ways, but it also gives us a lot of room to experiment with different strategies — it is definitely an invitation to improve our tooling and the team’s dynamics.

While the technical aspects of a continuous delivery culture are not that hard to achieve, they are only a small part of the process.

Recommended reading

You can find a lot more useful information about the ideas and technologies behind CI/CD in the links below:

Deployment Pipeline by Martin Fowler — https://martinfowler.com/bliki/DeploymentPipeline.html

Kubernetes deployment hooks — https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

Kubernetes Pod Lifecycle — https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/

Rolling Updates — https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/

Introduction to immutable infrastructure — https://www.oreilly.com/radar/an-introduction-to-immutable-infrastructure/

ConcourseCI — https://concourse-ci.org/

Written by Dionisio Perez-Mavrogenis