Optimizing costs with GitHub Actions and AWS Fargate Spot

Dickson Armijos
Published in Building La Haus
12 min read · Dec 19, 2023

The La Haus Platform team embarked on a substantial project to make our CI (Continuous Integration) workflows more efficient, with a particular emphasis on reducing the time our automated unit tests take to run, which we considered far too long. This article walks through the unit testing workflows specifically, where we achieved a 60% reduction in execution time and a reduction of more than 90% in our monthly costs.

AWS Fargate plays a pivotal role in our solution. This serverless service, managed by AWS, runs containers without the need to manage the underlying infrastructure. Just as EC2 offers different purchasing options (On-Demand, Spot, etc.), Fargate offers Fargate Spot, a cost-effective option providing up to a 70% discount on the standard Fargate price for Amazon Elastic Container Service (Amazon ECS) tasks that tolerate interruptions. Tasks on Fargate Spot may be interrupted, but AWS provides a two-minute notice when it reclaims capacity. This makes the service ideal for fault-tolerant workloads, offering significant cost optimization.
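To make the two-minute notice concrete: on a Spot interruption, the running task receives a SIGTERM signal, so an interruption-tolerant workload can trap it and wind down cleanly. The following is a minimal sketch of that pattern, not our runner code; the names `handle_sigterm`, `install_handler`, and `run_jobs` are hypothetical.

```python
import signal
import threading

# Set once the task has been asked to stop (for example, by a Spot interruption).
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request; the main loop decides how to wind down."""
    stop_requested.set()


def install_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_jobs(jobs):
    """Run callables in order, stopping early once a stop was requested.

    Returns the jobs that were NOT run, so a caller could re-queue them.
    """
    remaining = list(jobs)
    while remaining and not stop_requested.is_set():
        job = remaining.pop(0)
        job()
    return remaining
```

A process would call `install_handler()` once at startup and then hand its work to `run_jobs`; whatever is returned is the work that still needs a new home.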

There are various tools for creating CI/CD workflows, and in our case, we use GitHub Actions, a service provided by GitHub that enables the automation of custom workflows for tasks such as running tests and deploying applications. A crucial component of GitHub Actions is the self-hosted runners. Unlike GitHub’s cloud-hosted runners, which are virtual machines managed by GitHub, self-hosted runners are machines that you own and manage locally or in the cloud. They can be physical machines, virtual machines, or containers, providing flexibility to customize the execution environment according to the specific needs of your workflows.

Context

One of the crucial challenges was the prolonged execution times in unit testing workflows. This situation created a significant bottleneck for developers creating Pull Requests (PRs), preventing their merge into the main branch until all unit tests ran successfully. As a result, the service workflow experienced delays and efficiency issues, impacting the smooth integration of code changes through PRs.

In the worst-case scenario, the workflow took up to 25 minutes to execute, while in the best-case scenario, it took 15 minutes. Both the duration and its variability affected development efficiency. To assess the impact, let’s delve into some metrics at our disposal. Our Platform team has diligently worked to expose this information, including DORA metrics (DevOps Research and Assessment), aiding us in quantifying the problem.

In image 1, we observe a total of 5.63 PR commits per day. This value needs to be multiplied by the number of open PRs in the month of April, as unit tests are triggered with each push event from a PR. This results in a total of 309 executions of our unit tests in April.

Image 1: Dashboard with GitHub repository metrics to analyze pain points.

Considering the best-case scenario, where the workflow takes only 15 minutes, and factoring in the monthly execution count (309) of our unit tests, developers would spend a total of 77.25 hours per month waiting for these tests to run. This provides a clear perspective on the monthly time burden developers face when bringing their changes into production environments.
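The arithmetic behind that figure can be checked with a few lines of Python. The execution count and the 15 and 25 minute durations come from the text above; the helper name `monthly_hours` is ours for illustration.

```python
EXECUTIONS_PER_MONTH = 309   # unit test runs in the month
BEST_CASE_MINUTES = 15       # fastest observed workflow duration
WORST_CASE_MINUTES = 25      # slowest observed workflow duration


def monthly_hours(executions, minutes_per_run):
    """Total hours spent per month on runs of the given duration."""
    return executions * minutes_per_run / 60


best_case_hours = monthly_hours(EXECUTIONS_PER_MONTH, BEST_CASE_MINUTES)
worst_case_hours = monthly_hours(EXECUTIONS_PER_MONTH, WORST_CASE_MINUTES)

print(f"best case:  {best_case_hours:.2f} hours/month")
print(f"worst case: {worst_case_hours:.2f} hours/month")
```

If every run hit the 25-minute worst case instead, the same calculation gives 128.75 hours per month.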

Solution

The La Haus Platform team dedicated itself to developing a comprehensive solution to address challenges impacting agility in the development and implementation of new functionalities in the system. This solution evolved over time, but we can break it down into three fundamental stages.

Image 2: Cost diagram for the execution of unit testing workflows.

In the first stage, we migrated our workflows from running tests sequentially to running them in parallel using matrices. This approach drastically reduced execution times, significantly improving process efficiency.

The second stage marked a move towards experimenting with Fargate Spot instances on AWS. This initiative not only allowed us to cut costs but also complemented the time reduction achieved in the first stage.

In the final stage, we implemented a solution leveraging Fargate Spot instances to keep costs low without compromising the execution times of unit tests. This approach provided us with an optimal balance between operational efficiency and cost-effectiveness.

Optimization with the SemaphoreCI Matrix (May 2022 — August 2022)

In this initial stage of our quest for efficiency, we faced the challenge of drastically reducing the execution times of our unit testing workflows. The migration of our workflows from sequential to parallel execution using SemaphoreCI’s matrix functionality marked a significant milestone. We managed to reduce the execution times from 15–25 minutes to 6–8 minutes, achieving a 60% improvement.

We can use the month of May as an example to get an idea of what we were paying monthly on SemaphoreCI for the execution of our unit tests. This setup involved running tests sequentially, resulting in extended execution times. While this approach worked well, challenges with execution time and inefficiency became apparent as the development scale increased.

In June 2022, we decided to explore SemaphoreCI’s job matrix to address the challenge of prolonged execution times. This change brought a significant improvement, reducing execution times to a range of 6 to 8 minutes. However, this efficiency benefit came with an expected increase in monthly costs: the first month of tests cost $327, and costs rose further in July and August. The transition to executing our unit tests in a matrix increased the number of jobs, moving from one job per pipeline to 30 jobs per pipeline for each unit test execution. This resulted in higher resource consumption, as 30 standard-sized instances ran simultaneously.
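One common way to spread a test suite across a job matrix is to give each job an index and let it pick its own slice of the test files. The sketch below illustrates that idea only; it is an assumption about how such a split can work, not the exact mechanism we used, and `shard` and the file names are hypothetical.

```python
def shard(test_files, job_index, total_jobs):
    """Return the test files that one matrix job should run."""
    if not 0 <= job_index < total_jobs:
        raise ValueError("job_index must be in [0, total_jobs)")
    # Sort first so every job computes the same assignment independently,
    # then take every total_jobs-th file starting at this job's index.
    return sorted(test_files)[job_index::total_jobs]


# Example: 95 hypothetical test files spread over 30 parallel jobs.
files = [f"test_{i:03d}.py" for i in range(95)]
shards = [shard(files, i, 30) for i in range(30)]
```

Because every job derives its slice from the same sorted list, no coordination between jobs is needed, and shard sizes differ by at most one file.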

Image 3: Migration diagram from the execution of sequential workflows to a workflow in a job matrix.

While this approach was efficient in terms of execution times, it logically incurred increased costs due to the quantity of parallel jobs. This highlights the economic and operational challenges we faced in attempting to enhance our CI/CD workflows. Optimization required not only reducing execution times but also finding an efficient balance between costs and resources.

Initial Integration of Fargate and GitHub Actions (September 2022 — December 2022)

In the next stage of our quest for workflow efficiency, we decided to implement a solution that involved running our workflows on Fargate Spot instances. The core functionality is based on a reusable workflow that creates GitHub Actions runners to be used by other workflows. As depicted in image 4, the reusable workflow utilizes Terraform to create the necessary resources in the ECS cluster. We chose this solution for its flexibility and ease of implementation.

Image 4: Diagram of the first solution implemented with Fargate Spot.

However, as the demand for runners increased exponentially, reaching up to 100–150 workflows running in parallel, scalability and performance challenges began to emerge. Image 5 reflects these challenges and represents a crucial point in our ongoing assessment of solutions aimed at enhancing operational efficiency in high-intensity execution environments.

Image 5: Scalability errors presented in GitHub Actions workflows.

An essential aspect of this solution is how we manage to create GitHub Actions runners of different sizes and images. Fargate instances have specific resource compatibility, and based on this, we’ve established the following instance groups.

  • nano: 0.5 vCPU, 1GB RAM
  • micro: 1 vCPU, 2GB RAM
  • small: 1 vCPU, 4GB RAM
  • medium: 2 vCPU, 4GB RAM
  • large: 4 vCPU, 8GB RAM
  • xlarge: 4 vCPU, 12GB RAM

These instance groups have a many-to-many relationship with our images stored in AWS ECR: any size can be combined with any image. This enables us to customize our runners by equipping them with various pre-installed tools/services, which shortens our workflows. There's no need for steps that install heavyweight tools, which can take a long time, or the specific versions that we need to test. At this point, you can let your imagination soar.
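The instance groups listed above map directly onto the `cpu` and `memory` fields of an ECS task definition, which are expressed as strings in CPU units (1 vCPU = 1024 units) and MiB. A minimal sketch of that mapping, where `RUNNER_SIZES` and `task_size` are illustrative names rather than part of our module:

```python
# Size name -> (vCPU, memory in GB), as listed in this article.
RUNNER_SIZES = {
    "nano": (0.5, 1),
    "micro": (1, 2),
    "small": (1, 4),
    "medium": (2, 4),
    "large": (4, 8),
    "xlarge": (4, 12),
}


def task_size(name):
    """Translate a size name into ECS task-definition cpu/memory strings."""
    vcpu, memory_gb = RUNNER_SIZES[name]
    return {
        "cpu": str(int(vcpu * 1024)),   # CPU units
        "memory": str(memory_gb * 1024),  # MiB
    }


for size in RUNNER_SIZES:
    print(size, task_size(size))
```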

Image 6: Structure of the Docker images used in Fargate instances.

As depicted in image 6, we have a base image containing our fundamental tools, including the GitHub Actions agent. From this base image, we can create customized images for different teams. This groundwork is crucial, as many of the elements implemented here are reused in our subsequent phase.

Adoption of Fargate Spot Instances (January 2023 — Present)

In this phase, it became evident that continuing to leverage Fargate Spot instances was the right path, although the implemented solution (referring to the stage 2 implementation) posed occasional challenges. Meanwhile, we had been testing an alternative solution for other workloads, which was proving highly effective. The key distinction was that this alternative utilized EC2 Spot instances instead of Fargate Spot.

The Terraform module philips-labs/terraform-aws-github-runner is a project that performs quite well. We use this project for several of our workflows, and it functions seamlessly. This module simplifies the creation of all the necessary infrastructure for leveraging EC2 Spot instances. It operates by establishing an application on GitHub that directs all workflow-related events to an AWS Lambda. This Lambda processes specific events, queues them and dispatches them, ultimately creating an EC2 Spot instance. Having comprehended the functionality of this solution, we opted to fork and modify the module to tailor it to our requirements, successfully deploying Fargate Spot instances with the same agility demonstrated with EC2 instances.
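The event flow described above (GitHub App webhook, Lambda, queue, runner creation) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the module's actual code: it filters GitHub `workflow_job` webhook events whose `action` is `queued`, and it uses an in-memory `queue.Queue` as a stand-in for the real message queue. `REQUIRED_LABELS`, `should_queue`, and `handle_webhook` are hypothetical names.

```python
import json
import queue

# Hypothetical: only jobs that request these labels get a runner.
REQUIRED_LABELS = frozenset({"self-hosted"})


def should_queue(event_name, payload, required_labels=REQUIRED_LABELS):
    """Decide whether a webhook delivery should trigger a new runner."""
    if event_name != "workflow_job":
        return False
    if payload.get("action") != "queued":
        return False
    labels = set(payload.get("workflow_job", {}).get("labels", []))
    return required_labels <= labels


def handle_webhook(event_name, body, jobs):
    """Parse a webhook body and enqueue a scale-up request when relevant."""
    payload = json.loads(body)
    if not should_queue(event_name, payload):
        return False
    job = payload["workflow_job"]
    jobs.put({"job_id": job["id"], "labels": job["labels"]})
    return True
```

A downstream consumer would read from the queue and start one runner per message; in our fork, that step launches a Fargate Spot task instead of an EC2 Spot instance.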

📝 Note

You may need to refer to the Terraform module architecture diagram for more context.

Image 7: Diagram of the solution implemented by refactoring the Terraform module.

This implementation provided us with the ability to create custom runners for our teams with different images and sizes (see image 7). This aspect is crucial, as it directly contributes to cost reduction. By configuring custom instances, we can opt for smaller resources in workflows that do not require extensive capabilities. In our case, we recognized that running unit tests did not require 2 vCPUs and 7 GB of RAM, which is the standard GitHub Actions instance. Therefore, we decided to create smaller instances that wouldn’t impact workflow performance while keeping costs low.
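To see why smaller runners matter, compare the hourly price of a runner shaped like the standard GitHub Actions instance with our `small` size. This sketch uses the Fargate Spot rates quoted in the Cost Analysis section of this article ($0.014577 per vCPU-hour and $0.00160066 per GB-hour); those rates change over time, so treat the result as illustrative.

```python
# Fargate Spot rates at the time of writing (they fluctuate).
VCPU_PER_HOUR = 0.014577
GB_PER_HOUR = 0.00160066


def hourly_cost(vcpu, memory_gb):
    """Hourly Fargate Spot compute cost for one task of the given size."""
    return vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR


standard_like = hourly_cost(2, 7)  # same shape as a standard GitHub runner
small = hourly_cost(1, 4)          # the "small" size from our instance groups

print(f"2 vCPU / 7 GB: ${standard_like:.5f} per hour")
print(f"1 vCPU / 4 GB: ${small:.5f} per hour")
print(f"small costs {small / standard_like:.0%} of the standard-like size")
```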

At this point, it’s crucial to mention an additional key factor in keeping costs low in this implementation. These runners typically need to download many packages (pip install, npm install, bundle install, etc.) from the Internet, incurring data transfer costs that should not be overlooked. If we deploy all our runners within private subnets, we must factor in the cost of transferring data through our NAT Gateway resource. A straightforward alternative is deploying the runners within public subnets. While this still incurs a cost, it is lower than the expense associated with a NAT Gateway, which is particularly beneficial when dealing with substantial data transfers. It’s important to emphasize that, for security reasons and to avoid triggering alerts with the security team, we need to attach a security group that blocks all incoming traffic to the Fargate instances, since there is no need for other services to reach them.
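As a rough illustration of this network setup, the sketch below builds the arguments for an ECS `RunTask` call that places a task on Fargate Spot in public subnets with a public IP. It is an assumption about how such a launch can be expressed, not our Lambda code; the function name and all identifiers passed to it are hypothetical placeholders. The security group is expected to have no inbound rules, which blocks all incoming traffic.

```python
def run_task_kwargs(cluster, task_definition, public_subnets, security_group):
    """Build keyword arguments for an ECS RunTask request on Fargate Spot."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # Using a capacity provider strategy means launchType is omitted.
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1},
        ],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(public_subnets),
                # A group without inbound rules: nothing can reach the runner.
                "securityGroups": [security_group],
                # Public IP so the task reaches the Internet without a NAT Gateway.
                "assignPublicIp": "ENABLED",
            }
        },
    }


# Placeholder identifiers, for illustration only.
kwargs = run_task_kwargs(
    cluster="example-runners",
    task_definition="example-runner-small",
    public_subnets=["subnet-aaaa", "subnet-bbbb"],
    security_group="sg-no-ingress",
)
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("ecs").run_task(**kwargs)`.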

This solution brought significant economic benefits. To provide an overview, we can compare the average costs of the months of June and July (the months when the job matrix was implemented) from the first stage against the months of the third stage (see image 2), resulting in a savings of 97.49%. As we have observed the stability of the solution, we have been migrating new workflows to this solution, so this percentage may even be slightly higher.

⚠️ Warning

While this solution brings many benefits, there is a crucial consideration to keep in mind. Fargate instances do not support running containers with Docker-in-Docker (DinD) capability. This implies that commands such as docker build cannot be executed, and as a result, any workflow involving image construction will not be compatible with this solution.

Results

Cost Analysis

To better understand financial efficiency, we compared the estimated costs of running the same scenario on SemaphoreCI, GitHub Actions (in their cloud), and Fargate Spot instances. Let’s take the previous example and attempt to analyze the costs, considering an average of 5.65 PR commits per day and a total of 55 PRs created throughout the month of May. In the implemented solution, we have an average of 30 jobs running in parallel to execute all our unit tests, and we need to use this value for our calculations.

It’s important to note that the cost of Fargate Spot changes constantly; in this case, we are using values taken at the time of writing this article ($0.014577 per vCPU per hour, $0.00160066 per GB per hour), while for GitHub Actions and SemaphoreCI we use the prices listed in their documentation at the time of writing. The cost difference can be estimated with a short Python script.

⚠️ Warning

The following calculations assume 100 GB of internet-facing data transfer per month; you may have to update the GB transferred depending on your case.
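A minimal sketch of such a calculation follows. The Fargate Spot rates, PR figures, job count, and 100 GB of data transfer come from this article. The per-minute rates for GitHub Actions ($0.008) and SemaphoreCI ($0.0075) and the $0.09 per GB data transfer price are assumptions, chosen because they are consistent with the totals in the table below; check current pricing before reusing them.

```python
import math

# Workload, from the article.
PR_COMMITS_PER_DAY = 5.65
PRS_PER_MONTH = 55
PARALLEL_JOBS = 30
MINUTES_PER_JOB = 5
VCPUS, MEMORY_GB = 2, 7

# Fargate Spot rates at the time of writing, from the article.
FARGATE_VCPU_PER_HOUR = 0.014577
FARGATE_GB_PER_HOUR = 0.00160066
DATA_TRANSFER_GB = 100

# Assumed rates (not stated in the article).
GITHUB_PER_MINUTE = 0.008
SEMAPHORE_PER_MINUTE = 0.0075
DATA_TRANSFER_PER_GB = 0.09


def job_executions():
    """Monthly job runs: commits per PR x PRs x jobs per test run."""
    return math.floor(PR_COMMITS_PER_DAY * PRS_PER_MONTH * PARALLEL_JOBS)


def monthly_costs():
    """Estimated monthly cost in USD for each platform."""
    minutes = job_executions() * MINUTES_PER_JOB
    hours = minutes / 60
    fargate_compute = hours * (
        VCPUS * FARGATE_VCPU_PER_HOUR + MEMORY_GB * FARGATE_GB_PER_HOUR
    )
    return {
        "GitHub Actions (Standard)": minutes * GITHUB_PER_MINUTE,
        "SemaphoreCI (e1-standard-2)": minutes * SEMAPHORE_PER_MINUTE,
        "Fargate Spot (Custom)": fargate_compute
        + DATA_TRANSFER_GB * DATA_TRANSFER_PER_GB,
    }


costs = monthly_costs()
for platform, cost in costs.items():
    print(f"{platform:<28} ${cost:,.2f} USD  ({job_executions()} executions)")
```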

Output:

| Platform                    | Monthly Cost | No. of Executions | Time per Execution | Resources Used   |
|-----------------------------|--------------|-------------------|--------------------|------------------|
| GitHub Actions (Standard)   | $372.88 USD  | 9322              | 5 min              | 2 vCPU, 7 GB RAM |
| SemaphoreCI (e1-standard-2) | $349.57 USD  | 9322              | 5 min              | 2 vCPU, 4 GB RAM |
| Fargate Spot (Custom)       | $40.35 USD   | 9322              | 5 min              | 2 vCPU, 7 GB RAM |

The total monthly cost of Fargate Spot is $40.35 USD, a reduction of about 89.2% compared to GitHub Actions (Standard instance) at $372.88 USD, and of about 88.5% compared to SemaphoreCI (e1-standard-2 instance) at $349.57 USD. It's important to note that the actual cost of Fargate Spot may vary depending on how much internet-facing data transfer you need. These percentages are based on a Fargate instance size comparable to the standard instances of GitHub and SemaphoreCI. Opting for smaller Fargate instances could increase the savings even further.

Time Savings

Image 8: GitHub repository metrics dashboard to analyze implementation benefits.

Let’s take a comprehensive look at the results gained over time (May 2022 — October 2023). The PR Cycle Time by Day graph illustrates the evolution of the time it takes to merge PRs into the main branch over time. There is a noticeable trend of decreasing cycle time, indicating an improvement in process efficiency.

A significant decrease is evident in June and July 2022, followed by sporadic peaks. Starting in September 2022, however, continuous improvement can be observed. It is crucial to consider external factors that may influence merge times. Some days exhibit exceptionally long cycles, which could indicate potential areas for improvement.

Reducing the PR Cycle Time by Day not only signifies improved process efficiency but also translates into quantifiable benefits. This includes an increased number of deployments and faster deployment speeds, fostering agility and minimizing waiting periods for developers. These enhancements not only boost overall productivity, but also have the potential to contribute to cost savings through efficient resource utilization.

Recently, we have been implementing further optimizations to reduce the execution time of our tests. If you’re interested, you may want to take a look at this article.

Conclusions

  • The transition from sequential to parallel workflow execution using SemaphoreCI matrices resulted in a significant 60% improvement in execution times. This optimization substantially enhanced operational efficiency, ensuring a faster integration of code changes through PRs. The same solution could be translated to GitHub Actions while maintaining low execution times.
  • The implementation of Fargate Spot instances, coupled with meticulous optimization, resulted in a remarkable cost reduction, plummeting from an average of $1,500 per month to less than $50 (around a 97.4% reduction). This achievement was realized by optimizing Fargate Spot resources and customizing instances according to workflow needs. This substantial cost-saving measure aligns with the overarching goal of achieving maximum efficiency with minimal expenditure, prompting the migration of more workflows to this solution.
  • The adoption of Fargate Spot instances, combined with GitHub Actions, introduced a new level of flexibility. Custom runners with different sizes and images were implemented, allowing teams to precisely adapt their environments. This not only optimized resource usage but also ensured that workflows aligned seamlessly with team-specific requirements, contributing to cost reduction by allocating smaller instances to workflows that do not need the resources of a standard runner.
  • Visualizing the reduction in PR Cycle Time provides a clear and measurable representation of the impact on development speed. A consistent decrease suggests that the implemented solution yields positive results, indicating increased efficiency. This not only accelerates the delivery of new functionalities but also contributes to a more agile and streamlined development process.
  • The final adopted architecture leverages the open-source project terraform-aws-github-runner, which was adapted to generate GitHub Actions runners on-demand in Fargate Spot tasks but has certain limitations when running workflows that require DinD.
