Save cost by running GitHub Actions on Spot Instances inside an autoscaled EKS Cluster

Shivam Agarwal
Engineering @ Chargebee
9 min read · Jul 8, 2022

Introduction

GitHub Actions is a very useful tool for implementing developer workflows such as CI/CD (Continuous Integration and Continuous Delivery) pipelines for your application. By default, GitHub Actions jobs are run in the cloud, on machines that are hosted and managed by GitHub.

Self-hosted Runners

However, sometimes you may want to run your GitHub Actions jobs on your own machines. One reason could be that GitHub-hosted machines do not have the minimum hardware resources required to run your app. For such cases, GitHub gives you the option to use self-hosted runners. As the name suggests, self-hosted runners are machines hosted by you that are capable of running GitHub Actions jobs. However, setting them up requires provisioning and configuring virtual machine instances (Problem A).

Kubernetes Cluster

Containerising apps and deploying them inside a Kubernetes cluster is also very common these days. If you already have a Kubernetes cluster, it makes more sense to run self-hosted runners on top of it. Also, self-hosted runners should be able to automatically scale up/down based on the demand for running tests. For example, if two developers push code to the same branch a few seconds apart, the second developer should not have to wait for the first developer’s workflows to finish before their own workflows start. Rather, a second runner should be scaled up automatically to serve the second developer, and it should scale down automatically when there are no further workflow jobs to run. The ability to scale up and down based on demand enables you to run workflows in parallel in a cost-efficient manner (Problem B).

Spot Instances

Some cloud providers like AWS offer Spot Instances. Amazon EC2 Spot Instances are spare compute capacity in the AWS cloud, available at steep discounts (up to 90%) compared to On-Demand prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications. However, the catch is that AWS can reclaim a Spot Instance at any time, giving you only a two-minute notification before it does so. This is called a Spot Interruption. Using Spot Instances reduces cost significantly, but you need a way to manage these interruptions (Problem C).

In this article, the intent is to provide you with a solution for running autoscaled self-hosted runners inside a Kubernetes cluster using Spot Instances. But before that, let us compare the costs incurred in running GitHub-hosted runners and self-hosted runners on AWS EKS.

Cost Comparison

Let’s say that we need to execute 20 jobs per day and each job takes 1 hour to run. GitHub-hosted runners have the following hardware specifications for a Linux-based machine:

  • 2-core CPU
  • 7 GB of RAM
  • 14 GB of SSD space

We will use a t3.large instance for calculating the cost of a self-hosted runner. It has the following specifications:

  • 2-core CPU
  • 8 GB of RAM

GitHub-hosted Runners

Cost of running a Linux-based GitHub-hosted runner — $0.48/hr

Total time for running all jobs each day — 20 * 1 = 20 hrs/day

Total cost per month — 0.48 * 20 * 30 = $288 / month

GitHub Pricing Calculator

Self-hosted Runners

Cost of running 1 EKS cluster — $73/month

Cost of running 1 t3.large spot instance with 14 GB storage ~ $23.40 / month

Total cost ~ 73 + 23.40 = $96.40 / month

AWS Pricing Calculator

Clearly, for this example, self-hosted runners are less expensive than GitHub-hosted runners. GitHub-hosted runners are priced based on the duration for which a job is actually running. Self-hosted runners on AWS EKS, on the other hand, are priced based on the duration for which the host machine is running. This has an interesting consequence: with self-hosted runners, we might be able to run more jobs at the same cost, depending on job length and hardware requirements. For example, suppose a certain job requires 0.5 CPU cores and 2 Gi of RAM. One t3.large can then run at least three such jobs in parallel without any extra cost, whereas GitHub-hosted runners would charge for each of those jobs separately.

Autoscaling

Autoscaling the runners based on demand is important for reducing costs. Autoscaling works at two levels.

Node Level Auto-scaling

Nodes will be auto-scaled based on the resource requirements of the pods. This helps reduce costs, as idle nodes are scaled down automatically and new nodes are created only when there is a need for them. It can be implemented using the Kubernetes Cluster Autoscaler or the Karpenter open-source project.

Pod Level Auto-scaling

Pods will be autoscaled to run on different nodes. If more jobs are queued by GitHub Actions, more pods are created to run them. As jobs complete and the queue empties, these pods are scaled down automatically. This is implemented using the HorizontalRunnerAutoscaler resource of actions-runner-controller.

Architecture

Architecture Diagram

We will use actions-runner-controller (ARC) with AWS EKS to solve problems A and B. ARC is an open-source project that operates and manages self-hosted runners for GitHub Actions on your Kubernetes cluster.

The architecture comprises a single Kubernetes cluster with two namespaces — one for the controller and the other for the runners. AWS ECR contains the custom images that will be used to create the runners. Using custom images is useful if your application or jobs need additional dependencies to run.

Also, note that there are four types of resources involved:

  • Runner — this pod will actually listen to GitHub for any pending jobs and subsequently run them
  • RunnerDeployment — this helps us in managing sets of runners so that we do not have to manage them individually
  • HorizontalRunnerAutoscaler — this helps the RunnerDeployment to scale up/down Runner pods. Autoscaling can be driven from a webhook event or pull based metrics, based on how the HorizontalRunnerAutoscaler is configured.
  • ActionRunnerController — this is responsible for registering the runners with GitHub and managing the other tasks needed for everything to run smoothly.

We will also need to store certain secrets as Kubernetes Secrets. By default, these are stored unencrypted. We will use an AWS Key Management Service (KMS) key to provide envelope encryption of the Kubernetes secrets.

Handling Spot Interruption

AWS only notifies you 2 minutes before it terminates a Spot Instance. This might not be enough to gracefully terminate runners that are executing moderately big jobs (which can easily take minutes to complete). Hence, there is not much you can do to handle jobs that get interrupted by a Spot Interruption. The best that can be done is to create a script that restarts any job that was interrupted.

However, if we follow these practices while setting up the cluster, we can minimize Spot Interruptions to a great extent.

  • Regularly scale down the nodes where the runner pods are scheduled when there are no jobs running.
  • Configure the EKS node group with as many different instance types as we can, preferably from different markets (instance classes, availability-zones)
  • Set the node group to use “capacity-optimized allocation strategy” to ensure that the most-likely-to-survive instance type is picked every time there’s a scale-up. This is enabled by default if we use managed node group to create the cluster.

Let’s get started with the implementation.

STEP 1: Install utilities

Visit this link for detailed instructions for installing the utilities.

A — Install and configure AWS CLI

B — Install kubectl Utility

C — Install eksctl Utility

STEP 2: Set Up AWS EKS Cluster

A — Create AWS KMS key

1 — Execute the following to create it

aws kms create-key --description "ActionRunnerKey" --region us-east-1

2 — To view the key,

aws kms list-keys

Copy the keyARN of the key that you just created.

B — Create a cluster of spot instances

1 — Copy the following configuration into a file called cluster_config.yaml. Use the keyARN that you copied in the previous step. You can also find it in the AWS Console.
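The exact configuration will depend on your environment; a minimal sketch (the cluster name, region, instance types, and node-group sizes below are assumptions) could look like this:

# cluster_config.yaml: illustrative eksctl configuration; adjust names,
# region, sizes, and instance types to your needs
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-east-1

# Envelope-encrypt Kubernetes Secrets with the KMS key created earlier
secretsEncryption:
  keyARN: <keyARN copied in the previous step>

managedNodeGroups:
  - name: spot-runner-nodes
    # Spot capacity with several instance types reduces the chance of interruption
    spot: true
    instanceTypes: ["t3.large", "t3a.large", "m5.large"]
    minSize: 1
    maxSize: 5
    desiredCapacity: 1
    volumeSize: 14

Using a managed node group with multiple instance types also gives us the capacity-optimized allocation strategy mentioned earlier.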

2 — Execute the following command to create the cluster

eksctl create cluster -f cluster_config.yaml

STEP 3: Set Up Action-Runner-Controller

Before you set up ARC, if you want to autoscale nodes, you can use the Kubernetes Cluster Autoscaler or Karpenter. To know more, you can see Autoscaling in AWS EKS.

A — Install cert-manager.

Follow the steps mentioned in this link for installing it using kubectl.

B — Install the custom resource definitions and actions-runner-controller.

This can be done using kubectl or Helm. It will create the actions-runner-system namespace in your Kubernetes cluster and deploy the required resources.

1 — Download the yaml file using the following command

curl -L -o actions-runner-controller.yaml https://github.com/actions-runner-controller/actions-runner-controller/releases/download/v0.21.0/actions-runner-controller.yaml

2 — Now, deploy this. With kubectl, it can be done with the following command:

kubectl create -f actions-runner-controller.yaml

C — Set Up Authentication with GitHub API.

You can use PAT-based authentication or GitHub App-based authentication to authenticate the runners with GitHub. For the purpose of this tutorial, we will use PAT-based authentication.

1 — Log in to a GitHub account that has admin privileges for the repository, and create a personal access token with the appropriate scopes listed below:

Required Scopes for Repository Runners

  • repo (Full control)

Required Scopes for Organization Runners

  • repo (Full control)
  • admin:org (Full control)
  • admin:public_key (read:public_key)
  • admin:repo_hook (read:repo_hook)
  • admin:org_hook (Full control)
  • notifications (Full control)
  • workflow (Full control)

Required Scopes for Enterprise Runners

  • admin:enterprise (manage_runners:enterprise)

For the purpose of this tutorial, we will deploy the self-hosted runner at the repository level.

2 — Once you have created the appropriate token, deploy it as a secret to your Kubernetes cluster that you are going to deploy the solution on:

kubectl create secret generic controller-manager \
-n actions-runner-system \
--from-literal=github_token=${GITHUB_TOKEN}

D — Deploy Runners on EKS

1 — To create the runners in a custom namespace, first create the namespace using: kubectl create namespace action-runner-runners

2 — To launch an autoscaled self-hosted runner, you need to create a manifest file that includes the RunnerDeployment resource. In this file, we have mentioned ubuntu:latest as the image that will be used to create the runners. However, you can also use a custom image.
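A minimal sketch of such a manifest (the repository name, runner label, and resource requests below are assumptions) could look like this:

# runner-deployment.yaml: illustrative RunnerDeployment; replace the
# repository with your own and swap in a custom ECR image if needed
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment
  namespace: action-runner-runners
spec:
  template:
    spec:
      repository: your-org/your-repo   # repository the runners register with
      image: ubuntu:latest             # image used in this article; a custom image works too
      labels:
        - large                        # referenced later in the runs-on field
      resources:
        requests:
          cpu: "500m"
          memory: "2Gi"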

3 — Apply the created manifest file to your Kubernetes in the specified namespace: kubectl --namespace action-runner-runners apply -f runner-deployment.yaml

4 — Then create another manifest file that includes the HorizontalRunnerAutoscaler resource as follows.
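A minimal sketch of such a manifest (the replica limits and thresholds below are assumptions) could look like this:

# horizontal-runner-autoscaler.yaml: illustrative HorizontalRunnerAutoscaler
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler
  namespace: action-runner-runners
spec:
  scaleTargetRef:
    name: example-runner-deployment    # the RunnerDeployment created above
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"    # scale up when more than 75% of runners are busy
      scaleDownThreshold: "0.25"  # scale down when less than 25% are busy
      scaleUpFactor: "2"          # double the replica count on scale-up
      scaleDownFactor: "0.5"      # halve the replica count on scale-down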

We are using PercentageRunnersBusy metric to autoscale the pods. However, there are additional metrics, offered by ARC, to autoscale the pods. Visit this link to know more.

5 — Apply the created manifest file to your Kubernetes cluster.

kubectl --namespace action-runner-runners apply -f horizontal-runner-autoscaler.yaml

6 — The runner you created should now be registered with your repository. To check, open your repository on GitHub and go to Settings -> Actions -> Runners. You should see a runner listed with the `self-hosted` tag.

7 — Configure GitHub Actions workflows to use the self-hosted runner

To make a GitHub Actions job use a self-hosted runner, set the runs-on field in your workflow file to the labels that you specified in the RunnerDeployment resource file. Note that GitHub attaches the label self-hosted to self-hosted runners by default.

runs-on: [self-hosted, large]
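Putting it together, a minimal workflow sketch (the workflow name, trigger, and steps are placeholders) would be:

# .github/workflows/ci.yaml: illustrative workflow targeting the self-hosted runners
name: CI
on: [push]

jobs:
  build:
    # "self-hosted" is attached by GitHub; "large" comes from the RunnerDeployment labels
    runs-on: [self-hosted, large]
    steps:
      - uses: actions/checkout@v3
      - run: echo "Running on a self-hosted EKS runner"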

Now, GitHub Actions should run on your self-hosted runner in AWS EKS.

Tear Down

You now know how to set up self-hosted runners using AWS EKS.

To free up all the AWS resources, you can use the following command:

eksctl delete cluster --name my-cluster --region us-east-1

To de-register the runner from your GitHub repository, open your repository on GitHub and go to Settings -> Actions -> Runners. You can now delete the runner to de-register it.

For comments or feedback, you can get in touch with me over LinkedIn.

This was an internship project, mentored by Priya Sebastian.

If you are interested in our work and want to solve complex problems in SaaS products, platform & cloud infrastructure engineering — we are hiring!
