Published in DAZN Engineering

GitHub Self-Hosted Runners on ECS

Docker in Docker in Docker — a new level of inception

CI/CD systems are rarely easy to put in place, and it is hard to satisfy everybody with a single approach. In this post, we will see how we, the DAZN cloud engineering team, designed and built a solution to provide more secure and more performant self-hosted runners for GitHub Actions.

GitHub Actions, GitHub's solution for automating tasks in your repositories, lets you run workers on your own infrastructure.

Some context to get started

I work in the cloud engineering team at DAZN. Last year, we started to migrate our CI/CD pipelines from Drone to GitHub Actions (read more about it here). GitHub Actions offers different kinds of virtual machines for running our pipelines. Here are the configurations available:

  • Linux / Windows runners: 2-core CPU (x86_64), 7 GB of memory, 14 GB of SSD disk space
  • macOS runners: 3-core CPU (x86_64), 14 GB of memory, 14 GB of SSD disk space

To use the different workers, we pay per time unit, which means that every pipeline running inside our GitHub organisation consumes time units. Quite a few times, we had to top up our account before the end of the month to keep our pipelines running. With more than 4000 repositories in our organisation, this can quickly become a bottleneck.

The hosted runners offered by GitHub also do not offer access to the ARM64 CPU architecture. All their runners are on x86_64, most likely because Microsoft Azure doesn’t offer ARM64 VMs for now. If we want to build a Docker container to run on an ARM64 host, we have to emulate the processor using QEMU. With the low amount of resources available on the hosted runners, this VM inception can be extremely slow depending on what we do in our Dockerfile.

Another aspect to take into consideration is our former setup with Drone. Every pipeline was running on a dedicated EC2 instance of type t2.2xlarge (8-core CPU, 32 GB of memory). For some of our projects, the switch to GitHub Actions has been a huge regression performance-wise. However, such a setup for Drone comes at a price. With DAZN growing, we had to run more than 40 EC2 instances during work days to provide enough resources for the teams to build and deploy their projects. One EC2 instance can be used by multiple pipelines, which complicates the workspace and instance clean-up needed to ensure that two consecutive jobs running on the same machine can’t access each other’s data.

To solve the cost and performance issues, we decided to deploy our own self-hosted runners. We, the cloud engineering team, had to create a solution that would:

  • decrease the costs related to our CI;
  • provide more powerful runners without being driven by the tooling (not everything should be fixed by simply scaling resources vertically);
  • provide greater isolation for our pipelines than Drone did;
  • have the least possible impact on the existing workflows of our developers.

Note: this article was written some time ago. In the meantime, GitHub announced a new feature for the hosted runners that allows choosing their size. Read more about it here.

Choosing an infrastructure

Before implementing something new, as part of our research, we looked into the solutions suggested here. The K8S solution wasn’t really an option for us as we have a very limited and specific usage of it within DAZN; we haven’t (yet) jumped into the hype. The solution by Philips wasn’t an option either, as it would bring us back to the Drone situation by using EC2 instances for the runners. However, the approach suggested in the GitHub Actions documentation for ephemeral runners (runners used for one single job) was appealing to us. It solves a security issue that we couldn’t fix with Drone by isolating every job in its own environment, and it simplifies scaling our pools of runners in and out by preventing long-running instances.

Containers seemed to us to be the way to go for our self-hosted runners. GitHub Actions uses one runner for one job. Therefore, we can have one workflow (aka Drone pipeline) with jobs running on multiple agents. Using containers, our idea was to have a fully isolated ephemeral worker for our jobs that developers could use “on demand” based on their needs. Once a job completes, the container instance is killed and removed from the pool along with its data. Using an EC2 instance seemed more complicated, as we would either have to handle the clean-up of a workspace manually or use ephemeral EC2 instances, which would be very expensive in the long run. We also considered micro VMs with Firecracker or QEMU, but the lack of orchestration tools was a deal breaker for us as we didn’t want to build extra software from the start.

Another thing we considered is the fact that we have two kinds of jobs: the ones requiring a Docker daemon and the ones not requiring Docker. A job requiring Docker could be a Docker build, a job using a service or a job using an action running a Docker container.

At DAZN, we use AWS to host our infrastructure. A lot of our containerised services are running in ECS with Fargate or in ECS EC2.

Fargate is a serverless compute engine. In our case, a serverless architecture is a natural fit as we have ephemeral runners. Every time a job starts on a runner, it runs until completion, then the runner dies and our Fargate service takes care of creating a new instance to match the desired number of instances set for our service. However, because Fargate runs on an infrastructure that we don’t manage, the containers can’t be privileged, which prevents us from using these workers for the Docker kind of jobs. We could have used a solution such as Kaniko, but GitHub Actions doesn’t support it out of the box and it would require modifications to the existing pipelines, which is a no-go for us.

Therefore, we decided to move forward with a hybrid infrastructure. Some of our runners are running in ECS EC2, the others in Fargate. As mentioned earlier, we also wanted to provide a way to run pipelines on ARM64. One reason is the new Apple M1: we can offer an experience in the CI/CD similar to what we run locally. The other is cost optimisation: based on AWS data, running on Graviton 2 processors improves the performance of our services by 20~25% and reduces their costs by ~40% compared to a similar setup on x86_64. Since the end of last year, Fargate can be used on hosts using the Graviton 2 processor (read more about it here), which makes it easy for us to provide our Fargate runners for both x86_64 and ARM64. For our ECS cluster, we created different pools of agents, some running on t3.2xlarge instances for the x86_64 CPU architecture and others running on t4g.2xlarge instances for the ARM64 CPU architecture.

In total, our infrastructure has 7 pools of workers. As we also want to reduce our spending on CI/CD and we have different kinds of jobs running in our pipelines, we created 4 pools of runners for our ECS cluster: a spot and a non-spot pool for each CPU architecture. On Fargate, we offer 3 pools: x86_64 spot, x86_64 non-spot and ARM64 non-spot, as Fargate doesn’t offer (yet) ARM64 spot instances.
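The pool layout described above can be enumerated in a few lines. This is only an illustrative sketch: the `Pool` class and the `build_pools` function are hypothetical names, not part of our Terraform code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Pool:
    cluster: str  # "ecs" (EC2-backed) or "fargate"
    arch: str     # "x86_64" or "arm64"
    spot: bool    # True for a spot pool


def build_pools() -> list[Pool]:
    """Enumerate the 7 runner pools described in the article."""
    archs = ("x86_64", "arm64")
    # ECS on EC2: spot and non-spot for both architectures (4 pools).
    pools = [Pool("ecs", arch, spot) for arch, spot in product(archs, (True, False))]
    # Fargate: no ARM64 spot capacity at the time of writing (3 pools).
    pools += [
        Pool("fargate", arch, spot)
        for arch, spot in product(archs, (True, False))
        if not (arch == "arm64" and spot)
    ]
    return pools


for pool in build_pools():
    print(pool)
```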

Still with the costs in mind, we opted for scheduled scaling of our workers. During working hours, we offer up to 100 runners across our pools. Outside of working hours, we reduce the number of runners to a total of 20 across all pools.
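The scheduled scaling rule boils down to a simple function of the current time. The sketch below is an assumption-laden illustration: the working-hours window (08:00 to 19:00) is invented for the example, and the function name is hypothetical; only the totals of 100 and 20 runners come from our setup.

```python
from datetime import datetime

# Assumption: the article does not state the exact office hours.
WORK_START_HOUR = 8
WORK_END_HOUR = 19

PEAK_RUNNERS = 100     # total across all pools during working hours
OFF_PEAK_RUNNERS = 20  # total across all pools at any other time


def target_runner_count(now: datetime) -> int:
    """Return the total number of runners wanted at a given moment."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_office_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return PEAK_RUNNERS if is_weekday and in_office_hours else OFF_PEAK_RUNNERS


print(target_runner_count(datetime.now()))
```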

Figure: Runner pools per cluster (clusters and their pools of workers)

The diagram above shows the configuration of our clusters and their different pools.

Now that we have the architecture explained, let’s see how it has been implemented.

Baking our AMIs and our containers

While creating a GitHub Actions runner baked in Docker for Fargate was an easy task, ECS brings some challenges when it comes to managing the Docker daemon and the host resources.

The container for the runner is shared between Fargate and ECS, with some extra additions for ECS. Our runner Dockerfile includes all the required binaries described in the GitHub Actions documentation and looks like this:

Dockerfile runner

Our runner container is baked for both x86_64 and arm64 CPU architectures. The ECS runner has an extra stage including the Docker client.

When the runner starts, it registers itself with our GitHub organisation in the “fargate” or “ecs” group, using the --ephemeral flag. Every time a runner finishes a job execution, we trap the exit signal and unregister the runner from our organisation. Here is what our entrypoint looks like:

Later in this post, we will see why we are using rsync.
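Our actual entrypoint is a shell script, which is not reproduced here. As a rough illustration of the lifecycle described above (register with --ephemeral, run one job, always unregister), here is a minimal Python sketch. The function names and the injected `execute` callable are hypothetical; the `config.sh` and `run.sh` scripts and their flags are the ones shipped with the GitHub Actions runner. The try/finally block stands in for the shell trap and, unlike a trap, does not cover every termination signal.

```python
from typing import Callable

Command = list[str]


def register_command(org_url: str, token: str, group: str) -> Command:
    """Build the command registering an ephemeral runner in a runner group."""
    return [
        "./config.sh", "--unattended", "--ephemeral",
        "--url", org_url,
        "--token", token,
        "--runnergroup", group,
    ]


def unregister_command(token: str) -> Command:
    """Build the command removing the runner from the organisation."""
    return ["./config.sh", "remove", "--token", token]


def run_ephemeral_runner(
    execute: Callable[[Command], object],
    org_url: str,
    registration_token: str,
    removal_token: str,
    group: str,
) -> None:
    """Register, run a single job, and always unregister afterwards."""
    execute(register_command(org_url, registration_token, group))
    try:
        execute(["./run.sh"])  # returns once the single job has completed
    finally:
        # Runs on success and on failure, like the trap in a shell entrypoint.
        execute(unregister_command(removal_token))
```

In a real container, `execute` could be something like `lambda cmd: subprocess.run(cmd, check=True)`; injecting it keeps the sketch free of side effects.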

For ECS, we had to find a way to run an isolated Docker daemon without running the GitHub Actions runner as the root user or in privileged mode, to prevent malicious actors from interacting with the host. We decided to create an ECS task with 2 containers. One container runs the worker; the second one runs a Docker daemon and is privileged. To give the worker access to our Docker daemon, we share a virtual volume between the daemon and the runner, which grants access to the Docker socket.

Initially, we ran the daemon in rootless mode using RootlessKit. We got the level of isolation we were expecting, but it ended up being extremely slow because of the storage driver and its backend: we couldn’t mount overlay2 in a user namespace because of the version of our Linux kernel and the cgroups version used. For the same reason, we couldn’t have the containers running in their own user namespace either, which limited the isolation that we wanted.

Cgroups, or Control Groups, provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialised behaviour. In our case, we want to use cgroups to isolate the main Docker process for the ECS agent as well as all its future children (here, the Docker instances created and used by our ECS task instances). Every child will receive its own subsystem, giving us the level of isolation we are looking for as well as the system features we need for each child, such as overlay2.
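A quick way to check which cgroups version a host uses is to look for the `cgroup.controllers` file, which only exists at the root of a cgroups 2 (unified) hierarchy. The sketch below is a small illustration with hypothetical function names; it is not part of our AMI tooling.

```python
from pathlib import Path

DEFAULT_CGROUP_ROOT = Path("/sys/fs/cgroup")


def uses_cgroup_v2(cgroup_root: Path = DEFAULT_CGROUP_ROOT) -> bool:
    """Return True when the host mounts the unified cgroups 2 hierarchy."""
    return (cgroup_root / "cgroup.controllers").is_file()


def enabled_controllers(cgroup_root: Path = DEFAULT_CGROUP_ROOT) -> list[str]:
    """List the controllers available at the root of a cgroups 2 hierarchy."""
    controllers_file = cgroup_root / "cgroup.controllers"
    if not controllers_file.is_file():
        return []  # cgroups 1 host, or cgroups not mounted at this path
    return controllers_file.read_text().split()


print(uses_cgroup_v2(), enabled_controllers())
```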

When we started, the ECS-optimised AMI shipped version 4.x of the Linux kernel. We therefore decided to rework our AMI to use Linux kernel 5.10+ and to be able to run Docker with cgroups 2 and systemd managing the containers. We started to experiment with an Ubuntu AMI as ECS agent, but we had to stop because we needed the EFS agent running on the machine and it doesn’t work on Debian-based operating systems. Luckily, at the end of November 2021, AWS announced the availability of a newer version of the Linux kernel for Amazon Linux 2, so we switched back to the optimised AMI. However, even though the kernel version we wanted was available, it came with neither cgroups 2 nor the recent version of systemd required for what we wanted to do. We tried to reconfigure the kernel manually and to compile a newer version of systemd, but our attempts failed. By googling around, we found out about Amazon Linux 2022. Even though it was still in preview, we took the bet of running our ECS agent on this AMI. It comes out of the box with all the features we were looking for, and it allows us to configure our Docker daemon on the host as follows:

Docker daemon configuration

As you can see, we are using overlay2 as the storage driver. For interacting with cgroups 2, we use systemd. By using private for the default-cgroupns-mode, every new container will have its own user namespace in which we can mount overlay2 for the Docker daemon we run as a sidecar container for our worker. We also specify our registry mirror URL to benefit from an internal cache layer for the Docker images. The ECS/CloudWatch team recently implemented support for cgroups 2 in ECS and they now also provide the optimised AMI ECS on Amazon Linux 2, but only for the x86_64 CPU architecture. Therefore, until they provide ARM64, we will keep working with our custom AMI.

Our AMI also installs the CloudWatch agent to collect metrics, the NewRelic agent (as it is our tool for monitoring and alerting) and FluentBit to collect, process and send logs to NewRelic.
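For readers who want to try a similar setup, the settings listed above map to standard `daemon.json` keys. The snippet below is a reconstruction from the description, not a copy of our file; the `build_daemon_config` name and the mirror URL are placeholders.

```python
import json


def build_daemon_config(registry_mirror: str) -> dict:
    """Assemble a daemon.json matching the settings described in the article."""
    return {
        "storage-driver": "overlay2",
        # systemd manages the cgroups on a cgroups 2 host.
        "exec-opts": ["native.cgroupdriver=systemd"],
        "default-cgroupns-mode": "private",
        # Pull-through cache for Docker images (placeholder URL in the demo).
        "registry-mirrors": [registry_mirror],
    }


# Hypothetical mirror URL, for illustration only.
print(json.dumps(build_daemon_config("https://mirror.example.internal"), indent=2))
```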

Now that we have our setup in place for the ECS agent, we can focus on running our Docker daemon for GitHub Actions. The kernel has the unprivileged_userns_clone functionality enabled, allowing us to use overlay2 inside our containers without having to be fully privileged. A non-rootless Docker daemon can then be run on the EC2 instance with the following command:

Docker run command starting the Docker in Docker container

As you can see, we do not need to start the container in fully privileged mode. Instead, we grant the Linux capabilities SYS_ADMIN and NET_ADMIN. A volume is mounted at the path /var/docker-storage. The volume points to a dedicated EBS volume formatted as XFS, which allows our daemon to mount its own overlay2 filesystem. Here is the Dockerfile used to create the dockerd container referenced in the docker run example:
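The options described above translate into a short `docker run` invocation. The sketch below only assembles the argument list; the `dind_run_command` name and the image name are hypothetical, and mounting the host path at the same path inside the container is an assumption.

```python
def dind_run_command(image: str, storage_path: str = "/var/docker-storage") -> list[str]:
    """Build a docker run command granting capabilities instead of --privileged."""
    return [
        "docker", "run",
        "--cap-add", "SYS_ADMIN",
        "--cap-add", "NET_ADMIN",
        # Assumption: the XFS-backed host path is mounted at the same path.
        "--volume", f"{storage_path}:{storage_path}",
        image,
    ]


# Hypothetical image name, for illustration only.
print(" ".join(dind_run_command("example/dockerd:latest")))
```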

Docker in Docker Dockerfile

The script entrypoint.sh contains the command running the Docker daemon:

entrypoint.sh for our Docker in Docker container

We create a temporary folder which is used to store all the data for the daemon. We use a random string in its name to ensure that only one daemon will access this folder. We use trap again to catch the exit signal of the Docker daemon and delete the folder when the task is killed.

We start the Docker daemon with 2 sockets: one TCP socket that we use for configuring the health checks for the container, and one Unix socket created on a shared Docker volume that the GitHub runner can use with the Docker client. The daemon doesn’t allow privilege escalation, and it uses its own directories for the data-root and exec-root folders, which are on an XFS filesystem so that overlay2 can be used. Every new container executed by our embedded daemon is created in its own user namespace, which allows us to run Docker in Docker in Docker (inception!) without slowing down our pipelines. To keep performance at the level of the daemon running inside the ECS agent, we also use our Docker mirror registry as a pull-through cache. However, one issue with this approach is that we need to manually mount a volume in our remote daemon if our steps use Docker containers which interact with the current workspace. This logic doesn’t come out of the box with the current implementation of the runners. To get it fixed, we created an issue in the actions/runner repository: https://github.com/actions/runner/issues/2023.
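Our entrypoint.sh is a shell script; as an illustration of the startup logic described above (random working folder, clean-up on exit, two sockets, dedicated data-root and exec-root), here is a hedged Python sketch. The helper names, socket path, TCP address and mirror URL are assumptions; the dockerd flags themselves are standard ones. The user namespace setup is left out because its exact configuration is not detailed in this post.

```python
import secrets
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def daemon_workdir(storage_root: Path) -> Iterator[Path]:
    """Create a uniquely named folder for one daemon and delete it on exit."""
    base = storage_root / f"dockerd-{secrets.token_hex(8)}"
    base.mkdir(parents=True)
    try:
        yield base
    finally:
        # Equivalent of the trap: remove the data when the task stops.
        shutil.rmtree(base, ignore_errors=True)


def dockerd_command(base: Path, unix_socket: str, tcp_addr: str, mirror: str) -> list[str]:
    """Build the dockerd command line with two sockets and private roots."""
    return [
        "dockerd",
        "--host", f"unix://{unix_socket}",  # shared with the runner container
        "--host", f"tcp://{tcp_addr}",      # used by the container health check
        "--data-root", str(base / "data"),
        "--exec-root", str(base / "exec"),
        "--storage-driver", "overlay2",
        "--no-new-privileges",
        "--registry-mirror", mirror,
    ]
```

A caller would wrap the daemon process in `with daemon_workdir(Path("/var/docker-storage")) as base:` so that the folder disappears whatever the exit path.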

For the logs, we configure the daemon to output its logs to FluentBit using Syslog. The logs are sent to a third sidecar container running FluentBit, which collects the logs from our embedded daemon and sends them to the FluentBit instance running on our ECS instance, which itself forwards the logs to NewRelic.

When pulling and extracting images, our runners are now 10 to 15 seconds faster than the ones offered by GitHub, and they offer twice as many resources. Our service for EC2 creates tasks with 4 vCPU and 16 GB of memory, which allows us to run up to 2 task instances per EC2 instance. For the storage, our AMIs receive an extra 120 GB EBS volume shared between the two task instances. Our Fargate runners use the same amount of resources and can be scaled vertically on request.

Here is a diagram showing our architecture for ECS:

ecs infrastructure

With our runners in place, we can now focus on another important topic: the caching mechanisms.

Persisting data in our infrastructure

You might already know that GitHub Actions offers a caching mechanism for your project dependencies via the actions/cache action. Under the hood, GitHub Actions also caches runtimes that you install with actions like actions/setup-node.

The different layers of cache that you use have limits defined by GitHub. We tried the default cache action with our runners but the performance was really poor. Our runners are running in eu-central-1 and, I think, the storage used by GitHub Actions is somewhere in Azure in the US regions. Therefore, in order to use our own caching rules and to be able to host the data in our infrastructure, we decided to fork the original action and replace the storage backend with an S3 bucket living in our infrastructure next to our runners. To allow downloads from our S3 bucket from GitHub Actions, we decided to use an AWS IAM OIDC provider using “token.actions.githubusercontent.com” to issue JWT tokens. Compared to the original cache action, we have to add the permissions attribute to our job definition to allow the action to use the JWT token (read more about it here).
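To give an idea of what the OIDC setup involves on the AWS side, here is a sketch of an IAM role trust policy that lets workflows from one GitHub organisation assume a role through the OIDC provider. The function name, account ID and organisation name are placeholders, and the exact conditions we use may differ.

```python
import json

OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, org: str) -> dict:
    """Build a trust policy for roles assumed by GitHub Actions through OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Audience set by the official AWS credentials action.
                    "StringEquals": {f"{OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # Restrict to repositories of one organisation.
                    "StringLike": {f"{OIDC_HOST}:sub": f"repo:{org}/*"},
                },
            }
        ],
    }


# Placeholder account ID and organisation name.
print(json.dumps(github_oidc_trust_policy("123456789012", "example-org"), indent=2))
```

On the workflow side, the job needs the `id-token: write` permission so that GitHub issues the JWT token.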

For the runtimes, as we are working with 2 CPU architectures, we decided to create 2 EFS file systems, one for each CPU architecture, both shared between Fargate and ECS instances. Every time a container starts, it retrieves the cache available from the EFS using rsync. During the job execution, the code delivered by GitHub reads and writes files in the folder we synced. Once the runner completes, in the trap of our entrypoint for the worker, we sync the EFS with the latest modifications from the runner. We decided to use rsync in order to avoid direct reads and writes of small files on the EFS, as they perform poorly.

For Docker, as mentioned earlier, we are using our own mirror registry running in Fargate. The downloaded images are persisted in an S3 bucket living in the same account as our clusters.
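The two sync directions can be summarised as follows. This is a simplified sketch: the function names and both paths are hypothetical, and our entrypoint may pass different rsync options.

```python
def rsync_command(source: str, destination: str) -> list[str]:
    """Build an rsync command copying the content of source into destination."""
    # The trailing slash on the source copies its content, not the folder itself.
    return ["rsync", "--archive", source.rstrip("/") + "/", destination.rstrip("/") + "/"]


def restore_cache_command(efs_dir: str, local_dir: str) -> list[str]:
    """Container start: copy the shared runtime cache to local storage."""
    return rsync_command(efs_dir, local_dir)


def persist_cache_command(efs_dir: str, local_dir: str) -> list[str]:
    """Runner exit: push local modifications back to the shared cache."""
    return rsync_command(local_dir, efs_dir)


# Hypothetical mount point and tool cache folder.
print(restore_cache_command("/mnt/efs/toolcache", "/opt/hostedtoolcache"))
print(persist_cache_command("/mnt/efs/toolcache", "/opt/hostedtoolcache"))
```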

Here is a diagram showing how our different cache layers are being used by our runners:

caching system

We now have our self-hosted runners with a caching system similar to what GitHub provides. Our next big step will be the implementation of a custom autoscaler for our workers and our ECS cluster, so that we can adapt the number of instances dynamically based on requests instead of using a fixed number of instances as we currently do.

Automating the releases of our runners

As I mentioned earlier, at DAZN, we try to standardise our approaches. For the computing part, I already mentioned our preference for Fargate. For the infrastructure automation, we usually work with Terraform.

Therefore, we used Terraform to automate the release of our runners. Our code is divided into 6 projects. 1 project manages our ECR repositories. 2 projects manage the S3 bucket for our repo artefacts (mainly the dependencies cache) and the shared EFS file systems for the runtime cache. 1 other project takes care of managing our clusters. It creates our Fargate cluster with capacity providers set to allow both spot and non-spot Fargate instances to be deployed. It also creates the ECS cluster and our different pools of runners. Thanks to the latest versions of Terraform, we make extensive use of dynamic blocks inside our configuration, allowing us to simply edit a map in order to add or remove a pool of workers. Our configuration map for ECS looks like this:

Along with the instance type configuration, our map allows us to define whether the pool is a spot pool or not. We can also define the maximum, minimum and desired number of runners per pool. For our ECS cluster, every pool gets a scheduled scaling aligned with our office hours. At the beginning of the day, from Monday to Friday, our pools scale out to the value defined by max_size. In the evening, the pools are scaled in to the min_size value.
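Our real configuration is a Terraform map, which is not reproduced here. To illustrate the idea of deriving the scheduled scaling from such a map, here is a Python sketch. The pool names, sizes and schedule times are invented for the example; the recurrence strings use the Unix cron syntax accepted by EC2 Auto Scaling scheduled actions.

```python
# Hypothetical pools: only the instance types come from the article.
POOLS = {
    "x86-64-spot": {"instance_type": "t3.2xlarge", "spot": True, "min_size": 1, "max_size": 10},
    "arm64-on-demand": {"instance_type": "t4g.2xlarge", "spot": False, "min_size": 1, "max_size": 5},
}

# Assumed office hours, Monday to Friday.
SCALE_OUT_CRON = "0 7 * * 1-5"
SCALE_IN_CRON = "0 19 * * 1-5"


def scheduled_actions(pools: dict) -> list[dict]:
    """Derive one scale-out and one scale-in action per pool."""
    actions = []
    for name, config in pools.items():
        actions.append(
            {"pool": name, "recurrence": SCALE_OUT_CRON, "desired_capacity": config["max_size"]}
        )
        actions.append(
            {"pool": name, "recurrence": SCALE_IN_CRON, "desired_capacity": config["min_size"]}
        )
    return actions


for action in scheduled_actions(POOLS):
    print(action)
```

Adding or removing a pool then only means editing the map, which is the property we get from dynamic blocks in Terraform.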

The last 2 projects take care of deploying our runners. They use a configuration map similar to the one for our clusters. Both Fargate and ECS runners have scheduled autoscaling that works like the one for our ECS cluster: scale out during working hours from Monday to Friday, scale in to min_size at any other time. To prevent issues with task scheduling, our autoscaler for the runners kicks in with some delay, after our agents have been scaled out. For scaling in, we proceed in the reverse order: we first scale in our runners so that their tasks stop gracefully, then we scale in the ECS agents.

To simplify our infrastructure as code, our runners are sharing a common module taking care of creating the service and task definitions.

As we think our infrastructure might be interesting for people outside of our organisation working with GHA and AWS, we are now looking at how to open-source our solution under the DAZN organisation. If you have feedback or questions about our architecture, feel free to reach out to us in the comments. To keep track of the next news about our runners, follow our DAZN Engineers account on Medium!
