On-Demand CI/CD with Gitlab and Kubernetes

Julien Vey
Radio France Engineering
Jun 2, 2020

Radio France is a French public service radio broadcaster. We design, build, and operate websites, mobile applications, APIs, podcast delivery, voice assistant skills, an audio streaming platform… for seven radio stations. All our applications run on a microservice architecture built on top of Kubernetes. Our technology stack includes PHP, React, NodeJS, Svelte, Golang, RabbitMQ, and PostgreSQL.

In this post, we want to share our vision and implementation of what a modern CI/CD toolchain should be.

Culture

The main idea behind our CI/CD toolchain at Radio France is the following:

Don’t try to save money on CI. Developer’s time is more valuable than Gitlab’s.

  • Optimise for speed rather than cost
  • Fail fast, give feedback early
  • Use CI heavily, don’t be afraid of failed jobs

When reading books and blogs about testing and CI best practices, most people will say “Run all your tests locally before pushing any new code to the repository”. We believe the opposite. Running all tests and linters on your local machine can take a lot of time and a lot of CPU and memory, preventing you from doing anything else in the meantime.

An example of a CI/CD pipeline using Gitlab CI at Radio France

We prefer to push new code as soon as possible, start working on something else, and come back to our CI results when they are done. We use the same process for Code Reviews. We push code early, even unfinished work, get feedback early and refactor quickly, before too much unneeded work has been done.

Another aspect of our CI/CD culture at Radio France: we don’t enforce anything on development teams. The team in charge of CI/CD is responsible for providing tools and methods for implementing CI/CD, sharing best practices, and providing advice and counsel. Each team is responsible for its own pipelines; whether they choose to do continuous delivery to production, for instance, is their own call.

Automation

When we first started to re-engineer our CI/CD toolchain, we had been using Gitlab with a few VMs as runners, only for CI. The CD part was done by Jenkins. Developers had to switch from one tool to the other to manage their whole pipeline.

Running Gitlab Runners in Kubernetes

We are currently running our fleet of gitlab runners within Kubernetes. The architecture behind the Kubernetes executor is pretty simple.

  • One or more gitlab-runner pods are permanent. They are in charge of polling the Gitlab API for new jobs to run. Once a job is picked, the runner asks Kubernetes to create a new “CI pod” that runs the actual CI scripts as defined in the job configuration. The gitlab-runner pod itself doesn’t require a lot of memory or CPU, as it only orchestrates API calls (we allocate 64m CPU and 256MB of memory to gitlab-runner).
  • The “CI pod” is the pod where the CI scripts are executed. It’s usually composed of multiple containers: one for the script itself, and one helper that prepares the script environment and fetches the source code. There can also be some “services” containers (for example, a postgres container needed by the CI script). These pods usually require more memory or CPU, but they are short-lived and only exist as long as the CI script executes.

We run multiple instances of gitlab-runner. This allows us to perform upgrades without interruption: gitlab-runner gracefully handles termination when it receives a signal, letting the current jobs finish but not picking up any new ones. Each of our gitlab-runner instances auto-registers with Gitlab when it starts. We also use runner tags to create different pods matching different system resource requirements.
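
To give a concrete idea of what ends up in a CI pod, here is a hedged sketch of a job definition in .gitlab-ci.yml. The image, service, tag names, and commands below are made up for illustration; only the keywords (image, services, tags, script) are standard Gitlab CI.

test:
  stage: test
  image: php:7.4                 # main container running the CI script
  services:
    - postgres:12                # extra "service" container living in the same CI pod
  tags:
    - kubernetes                 # routes the job to runners registered with this tag
  script:
    - composer install
    - vendor/bin/phpunit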

We don’t rely on Gitlab’s native integration with Kubernetes, either for managing deployments or for the runners. However, the gitlab-runner Helm chart does a pretty good job if you want to quickly bootstrap gitlab-runner on Kubernetes.
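
If you go the Helm route, a minimal values file looks roughly like the sketch below. The URL and token are placeholders, and the exact keys can vary between chart versions, so treat this as an assumption rather than a reference.

gitlabUrl: https://gitlab.example.com/      # placeholder: your Gitlab instance URL
runnerRegistrationToken: "REPLACE_ME"       # placeholder: registration token from Gitlab
concurrent: 30                              # how many jobs this runner may run in parallel
rbac:
  create: true                              # the runner needs RBAC rights to create CI pods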

Building our Docker images in Kubernetes with Kaniko

Now that our CI was scalable and resilient, we needed to work on the build of our Docker images. As we run our microservices on Kubernetes, our deployment unit is a Docker image, shipped with everything it needs to go to production.

We wanted our Docker image CI to be as performant and scalable as every other job. That’s why we started to look at rootless Docker builds: being able to build a Docker image without a Docker daemon, within a Kubernetes pod. Kaniko was the obvious choice at that time, but other tools, like img, exist. We didn’t go through an exhaustive comparison; Kaniko just worked for us.

The process of building an image with kaniko is simple:

  • Create a build context from the source code. It’s a tar file that contains everything needed to build the Docker image, including the Dockerfile.
  • Send it to an S3 bucket.
  • Run the kaniko executor binary. We run kaniko within pods, as Kubernetes Jobs, giving it the S3 path to the build context and the registry and image where the final image should be pushed (a sketch of such a Job is shown below).
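
As an illustration of that last step, a Kubernetes Job wrapping the kaniko executor could look roughly like this. The bucket, image names, and job name are placeholders, and registry/S3 credentials are deliberately left out; only the kaniko flags (--context, --dockerfile, --destination) come from kaniko itself.

apiVersion: batch/v1
kind: Job
metadata:
  name: build-my-service                    # hypothetical job name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: kaniko
        image: gcr.io/kaniko-project/executor:latest
        args:
        - --context=s3://my-ci-bucket/contexts/my-service.tar.gz   # build context uploaded in the previous step
        - --dockerfile=Dockerfile
        - --destination=registry.example.com/my-service:1.2.3      # where the final image is pushed
        # Registry and AWS credentials are omitted here; in practice they are provided
        # through a mounted docker config.json and AWS environment variables.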

Running on Spot Instances

As we said previously, our CI/CD philosophy is:

Don’t try to save money on CI. Developer’s time is more valuable than Gitlab’s.

However, if we want to make this possible and make both development teams and finance teams happy, there are solutions to keep a performant CI/CD toolchain while reducing overall cost. One of them is Spot Instances.

Spot Instances (or preemptible VMs on Google Cloud) are really low-cost instances whose price evolves with supply and demand; over several months, it doesn’t fluctuate much. When you want to purchase a Spot Instance, you’ll get it at the current market price if there is enough capacity for the instance type you require.

However, your instance can be terminated by AWS at any time, with a 2-minute termination notice. That makes Spot Instances a perfect fit for short-running workloads with no strict SLA, such as CI jobs. In theory, they can be terminated at any time; in practice, it really doesn’t happen that often.

We use Kops to manage our Kubernetes clusters on AWS. Within kops, we define multiple instance groups, which map directly to autoscaling groups in EC2; one of them targets Spot Instances. The Kubernetes cluster autoscaler handles the autoscaling process based on Kubernetes resource requests. We only need to define the min and max size of our instance group, and the max price we are willing to pay for a Spot Instance.

spec:
  machineType: m5.4xlarge
  maxPrice: "0.50"
  maxSize: 10
  minSize: 0

More recent versions of Kops handle this more gracefully, using mixedInstancesPolicy, where you can define multiple instance types, spot or on-demand, in a single instance group. This article describes this process in more detail.
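
For reference, a mixedInstancesPolicy block in a kops instance group looks roughly like the sketch below. The field names are taken from the kops documentation and may vary slightly between kops versions; the instance types are just examples.

spec:
  mixedInstancesPolicy:
    instances:                       # several instance types increase the chance of getting spot capacity
    - m5.4xlarge
    - m5a.4xlarge
    - m5d.4xlarge
    onDemandBase: 0                  # no guaranteed on-demand capacity
    onDemandAboveBase: 0             # everything above the base is spot
    spotAllocationStrategy: capacity-optimized
  maxSize: 10
  minSize: 0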

Once our Spot Instances are up and running, the last thing we need to do is tell our gitlab runners and kaniko jobs to run on these instances. For gitlab-runner, we configure both a node toleration and a node selector (currently, gitlab-runner doesn’t support affinity).

[runners.kubernetes.node_selector]
  "kops.k8s.io/instancegroup" = "nodes_spot"

[runners.kubernetes.node_tolerations]
  "dedicated=spot" = "NoSchedule"

For Kaniko jobs, we also use a node toleration, but with node affinity this time:

spec:
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: spot
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kops.k8s.io/instancegroup
            operator: In
            values:
            - nodes_spot

Measurement

Consume only what is needed

The main benefit of this CI/CD toolchain is that we only consume what we truly need, on-demand. Since everything runs within Kubernetes, alongside Prometheus, Alertmanager, and Grafana, we benefit from the existing metrics and alerting systems at no extra cost.

Grafana dashboard showing memory consumption for all our CI/CD workloads on a typical workday

Alert if something looks wrong

Another benefit of this integration is the alerting system. We have two main alerts for our CI/CD.

Alert when the number of pending jobs is above X for more than Y minutes.

This usually means something is going wrong with our runners. An example of that is when the cluster-autoscaler is no longer able to create new spot instances, resulting in most CI pods being stuck in ContainerCreating. The number of pending jobs increases rapidly. This is a problem that deserves our attention.

Alert when the rate of gitlab “system” failures is above X for more than Y minutes

gitlab-runner differentiates between two types of failures: “script failures”, which are normal failed CI jobs (our average script failure rate across all our CI jobs is around 15%), and “system failures”, which should not happen: they mean the runner had an internal error. If the number of “system failures” stays low, we should not worry; there are cases where we know this can happen (when a Spot Instance is terminated, the jobs running on this instance that do not complete within the 2-minute termination notice result in a system failure). If the rate is too high, our CI system is not working properly and we should investigate.
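
As a sketch, these two alerts can be expressed as Prometheus alerting rules. The thresholds and the namespace below are placeholders; the first rule approximates “pending jobs” with pending CI pods via kube-state-metrics, and the second relies on the gitlab_runner_failed_jobs_total counter exposed by gitlab-runner’s built-in Prometheus exporter.

groups:
- name: ci-cd
  rules:
  - alert: CIPendingPodsHigh
    # CI pods stuck waiting for a node for too long (placeholder threshold and namespace)
    expr: sum(kube_pod_status_phase{namespace="gitlab-ci", phase="Pending"}) > 20
    for: 10m
    labels:
      severity: warning
  - alert: CIRunnerSystemFailures
    # Rate of "system" failures reported by gitlab-runner
    expr: sum(rate(gitlab_runner_failed_jobs_total{failure_reason="runner_system_failure"}[5m])) > 0.1
    for: 10m
    labels:
      severity: warning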

Sharing

Explaining the constraints around spot instances

Spot Instances are cheap, but they have a drawback: they can disappear at any time. This is even more true when you take into account the cluster autoscaler, which can drain instances when they are underused.

Runners system failures for our CI jobs on a normal day

We usually see this phenomenon twice a day, at lunch and at the end of the day, when people stop pushing new code. Spot Instances that were provisioned to run our CI jobs are no longer heavily used. The cluster autoscaler marks them as “unused” and then terminates most of these instances. The few CI jobs still running on these instances are drained and the jobs are marked as failed. It’s not possible to define a retry mechanism within the runner configuration itself. However, you can write a retry block specific to runner_system_failure in your job configuration. It can also be set as a global default for each project, as shown below.
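
For example, in a project’s .gitlab-ci.yml, the retry policy can be scoped to this specific failure type and applied to every job through the default keyword:

default:
  retry:
    max: 2                        # retry the job up to twice
    when: runner_system_failure   # only when the runner itself failed, not the script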

As an operational team, it’s a trade-off we were aware of and accepted. As a development team, it can be frustrating to see random job failures. We spent time explaining the pros and cons, and why we think these failures are acceptable given the other benefits we gain. And this was accepted by everyone.

Sharing Best-Practices across teams

As we stated previously, we don’t enforce anything on development teams. But to have the “modern” CI/CD toolchain that we promote, people need to be aware of the available tools and processes, and see what other teams have implemented. Making this possible is part of the role of the team in charge of CI/CD, which is composed mainly of people experienced in both operations and development.

This is certainly the aspect of CI/CD at Radio France where we can improve a lot. We recently created a community of practice for CI/CD, joining the other “COPs” about agile, frontend, golang…
