Lights, Camera, (GitHub) Action(s)

Dtsoumas · Published in tech-gwi · Jul 8, 2022
Photo by Richy Great on Unsplash

The story so far

As with most tech companies these days, GWI has had its fair share of CI solutions. In the beginning it was CircleCI, then it was Drone. Over the last quarter, however, one of the DevOps team's goals was to run a POC around our next CI: identify a candidate, implement, evaluate, and repeat until we are happy with the results.

The shortcomings of the current solution

The main idea behind Drone, our current CI solution, was very promising: don't waste your engineers' time learning the ins and outs of yet another CI's rigid configuration (looking at you, Jenkins). Instead, assuming that the people who write the pipelines are either DevOps or have strong DevOps skills, run everything in containers, make the rest of the configuration look Kubernetes-like, and you are good to go.

The following code is a very good example. Anyone even slightly familiar with Docker and Kubernetes can guess what it does.

kind: pipeline
type: kubernetes
name: default
steps:
- name: shell
  image: debian
  volumes:
  - name: shared
    path: /shared
  commands:
  - df -h
volumes:
- name: shared
  claim:
    name: received-data-claim
    read_only: false

Your engineers live and breathe containers. Most of them are comfortable and know their way around Kubernetes. They don’t need to learn how to take advantage of your CI from scratch, as it shares logic and best practices with your cluster setup. In theory, it’s brilliant.

In reality, things didn't work out that well for us. Drone's support for running inside a Kubernetes cluster is not yet stable (we found out the hard way), and our current implementation does not scale across multiple machines. Eventually we would have to rewrite a significant part of our pipelines (more than 130 as we speak), and the team's consensus was that if we have to rewrite everything anyway, we might as well try something closer to our original needs.

Enter GitHub Actions

GitHub Actions (GA) is not new to us. In fact, many of our teams already use it heavily as a complementary solution for shorter jobs (mostly linting and the like). The reasoning behind that decision, if it was ever written down, is lost to us; nowadays each team is responsible for using GA sensibly for things that require quick feedback loops, the only restriction being not to exceed the free minutes GitHub provides each month.

One of our pipelines in action

With all the above in mind, it made sense to try GA as our first candidate, since we would have fewer pipelines to worry about migrating in the long run. Moreover, we wanted our new, shiny solution to have some, if not all, of the following attributes:

  • Scale along with us (we know that down the road our current CI solution will block fast iterations).
  • Have a big community behind it.
  • Be something we can host ourselves: we advocate for ownership where we can, so most SaaS products are not good candidates.
  • Offer some flexibility, like scaling down when we don't use it. We'd also appreciate not having to pay for a license, especially since we are the ones hosting it.

GA ticks most of the boxes above once you take its support for self-hosted runners into consideration, which is just as well, since pitching a switch to GitHub's hosted runners would have been difficult. By using self-hosted runners:

  • We can use cloud services or existing machines that we already pay for.
  • We can customise our hardware, software and security updates, giving a lot of flexibility to the engineers to scale their pipelines.
  • It comes with all the cool features of the original GA offering, meaning CI integration within your PRs at no additional cost. Getting rapid feedback on the state of your pipelines through the GitHub UI, without having to click through to an external tool, may sound small, but it's a big plus for us.

GitHub makes it super easy to send a job to your own runners instead of its hosted ones. Just add `runs-on: self-hosted` to your job, and voilà, this small label is enough to trigger your infra.
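A minimal workflow sketch of what that looks like (the workflow name and build command are placeholders, not our actual pipeline):

name: ci
on:
  pull_request:
jobs:
  build:
    # This label routes the job to our self-hosted runners
    # instead of GitHub's hosted ones.
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      # Placeholder build step; a real pipeline runs its own commands here.
      - run: make test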

The last piece of the puzzle is how to scale those self-hosted runners. GitHub says it supports VMs, but it's up to you to figure out the scaling part. We are on GCP, so autoscaling a GCE instance group sounded like a valid option; however, the community had something better out there for us to use.

Actions runner controller to the rescue

The actions-runner-controller (ARC) is an open source project that lets you run self-hosted runners on top of your Kubernetes cluster. It has many moving parts, but as the name suggests it offers a Kubernetes controller that tracks and scales its own custom resources. If you are familiar with pods and deployments, then those custom resources (Runner, RunnerDeployment and HorizontalRunnerAutoscaler) shouldn't seem strange to you. Combined with Kubernetes RBAC, you can even build simple self-hosted runners as a service. Did I mention it's also a breeze to install?
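The core resource is a RunnerDeployment, which reads much like a regular Deployment. A minimal sketch, with a placeholder organisation name (exact fields can vary between ARC versions):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: default-runners
spec:
  replicas: 2
  template:
    spec:
      # Placeholder; this could also point at a single repository instead.
      organization: your-org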

Putting everything together, our setup is as follows:

  • A user opens a PR that triggers some pipelines.
  • GA creates a new job, and GitHub sends a JSON payload to a webhook endpoint in our cluster.
  • A GitHub webhook server listens for these events and forwards the payload to ARC.
  • ARC checks whether any Runner pods are available to pick up the job. If not, it spins up a new one.

If the Runner resource can’t fit in the existing cluster, fear not — cluster autoscaling plays its part and adds more nodes to support our spike.
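That webhook-driven flow maps onto a HorizontalRunnerAutoscaler. A minimal sketch targeting the RunnerDeployment above, with illustrative replica counts (field names may vary slightly between ARC versions):

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: default-runners-autoscaler
spec:
  scaleTargetRef:
    name: default-runners
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    # Scale up when GitHub sends a workflow_job webhook event,
    # and keep the extra capacity around for a while.
    - githubEvent:
        workflowJob: {}
      duration: "5m"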

Why is ARC so cool? Because it offers a ton of flexibility and functionality out of the box. Because you can combine it with GCP's preemptible or spot instances and get a 60–90% discount on your compute. Because under the hood it's just another cluster, so monitoring and alerting are plug and play for your existing monitoring stack.

What’s the catch?

If you have been in this business for a while, you know there is no perfect solution for everything, and this case is no different. In fact, I want to make this clear: there is no perfect CI. So what should we be aware of in our case?

Not everything is production ready

One red flag is the status of ARC itself. According to the official documentation:

Even though actions-runner-controller is used in production environments, it is still in its early stage of development, hence versioned 0.x.

We don't take statements like that lightly. We did a lot of research on its performance and the community behind it. We read about how other people used it and what they thought of it. We got in touch with other companies that use it and asked for feedback. Even though everyone had positive things to say, we still keep in the back of our minds that we may come across bugs or breaking changes in the future.

A pod’s quality of service

Not a catch per se, but we need to be extra careful with our Kubernetes resource requests and limits. Resource management in a cluster is a very interesting topic, and you need to understand, at least at a high level, how Kubernetes manages resources internally and what it decides to kill when the cluster is struggling for resources.

In our case, the last thing we want is a job being interrupted halfway through because Kubernetes killed the pod, for example due to an OOM event after hitting its memory limit. The solution to this problem is simple, but it requires some good planning. Kubernetes supports a few quality of service (QoS) classes. When it creates a pod, it assigns one of them based on the pod's requests and limits, and it makes scheduling and eviction decisions based on those classes.

For example, if your pod has the following resources:

resources:
  limits:
    memory: "200Mi"
  requests:
    memory: "100Mi"
the pod will be given the Burstable QoS class. The above configuration seems reasonable at first: our pod may need 100Mi for most jobs, but it can use up to twice that if a pipeline needs some more juice. Unfortunately, this also means that under certain circumstances Kubernetes may evict your pod to place it on another node where it fits better.

The above behaviour makes sense for our application workloads if we think about it. In a world where everything can be described by deployments and replicas, Kubernetes can reshuffle its load as it sees fit, based on some simple rules. But our pipelines are not deployments and don't have replicas. If one is force-deleted, it's gone for good.

The simple solution is to aim for the Guaranteed QoS class. For that to happen, your requests and limits need to be equal. That way, Kubernetes guarantees those resources for your workload, so you can at least ensure that you won't have any surprise evictions on those pods.
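In practice this just means declaring identical requests and limits; the values below are purely illustrative:

resources:
  requests:
    cpu: "3"
    memory: "3Gi"
  limits:
    cpu: "3"
    memory: "3Gi"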

However, such strict request/limit configurations mean less flexibility in your pipelines. Give your pods too many resources and you waste them; give them too little and you'll see frequent OOM events. This is where careful planning pays off. Both self-hosted runners and ARC support labels, which means support for multiple Kubernetes worker groups. In our case, we have created 3 node pools (a sketch of how a pipeline targets one of them follows below):

  • Default: it can support most trivial pipelines and small jobs. This is where most pipelines end up. Mem: 3 GB, CPU: 3
  • CPU-optimised: for pipelines where CPU is the bottleneck and which run faster on a beefier machine. Mem: 3 GB, CPU: 6
  • MEM-optimised: for memory-heavy pipelines. Mem: 10+ GB, CPU: 3

The above numbers are for reference, as we don't have real production data yet. However, adjusting your cluster to support different numbers is trivial with Terraform and Atlantis.
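As a rough sketch of the label mechanism, assuming a GKE-style node pool label (the organisation, runner label and pool names are hypothetical, and exact fields can vary between ARC versions):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: cpu-optimised-runners
spec:
  template:
    spec:
      organization: your-org              # placeholder
      labels:
        - cpu-optimised                   # extra runner label for pipelines to target
      nodeSelector:
        cloud.google.com/gke-nodepool: cpu-optimised   # hypothetical pool name

A pipeline then opts into that pool with `runs-on: [self-hosted, cpu-optimised]`; jobs without the extra label land on the default pool.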

Our biggest pain: caching

We knew about this from our Drone-on-Kubernetes experiment last year. When you move your pipelines to run on a cluster, you have to forget about the conveniences of stateful solutions (like Jenkins, for example). Everything that needs to be persisted must be externalised. Unfortunately, this means Docker layer caching is not supported in a native way.

An example of its implications: usually a job builds a service image (using Docker), then runs some tests based on that image, then runs some pre-deployment steps, and finally the actual deployment takes place.

Problem number 1: Docker's own layer caching won't work out of the box. Since the runner pod only lives while the job is running, cached layers cannot be reused.

Problem number 2: each time, your image has to be pulled from the registry. This can get very expensive if the image itself is huge.

For some teams, I expect this to be a real blocker. For us, it's not that bad, since 95% of our services are lightweight and not that big in size. There are various approaches to solving this problem.

There is a very interesting article that discusses the problem in depth and runs performance tests on the different approaches. A follow-up article experiments with the second option (Docker's own GA action), but all solutions have a clear downside for us: they expect deep knowledge of how Dockerfiles can be fine-tuned to take advantage of layer caching. With 60+ engineers and hundreds of Dockerfiles lying around in our repositories, any refactoring of existing code would be very expensive. Of course, if a team wants to speed up its build process and take care of caching, that is something we can support; however, until GA improves its caching capabilities, the extra cost of running our pipelines on stateless machines is something we can afford (and in certain cases we don't mind jumping in and improving things).
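For teams that do want to opt in, a minimal sketch of the registry-backed variant of that approach, using Docker's official build-push action (the image and cache references are placeholders):

name: build
on:
  pull_request:
jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - name: Build with registry-backed layer cache
        uses: docker/build-push-action@v3
        with:
          push: false
          tags: registry.example.com/team/service:ci      # placeholder image
          # Reuse layers cached in the registry between otherwise stateless runs.
          cache-from: type=registry,ref=registry.example.com/team/service:buildcache
          cache-to: type=registry,ref=registry.example.com/team/service:buildcache,mode=max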

The aftermath

So far, other than the infra, we have converted and migrated 2 of our biggest pipelines to GA, and switched most of the smaller ones that were using the free runners over to our self-hosted ones. In the case of the migrations, both pipelines ended up with 50–80% less code, which wasn't a surprise for us, since Drone doesn't support shareable pipelines (only through Jsonnet, which makes everything even more complex). In terms of performance, our QA pipeline saw a 400% improvement, which wasn't a surprise either, since scaling isn't easily supported in our current solution.

Overall, we are very happy with how everything fits together: it suits most of our needs, it scales, it has no hidden costs, and it comes with a rich community and good momentum. Caching may be an issue for certain use cases, but it isn't for us yet, and in any case we would face the same problem with any other Kubernetes-native solution.

The current verdict of GWI's DevOps team:

Photo by Katya Ross on Unsplash
