Implement Seekable OCI for our GitHub Actions workloads

Dickson Armijos
Published in Building La Haus
8 min read · Nov 24, 2023

In the world of container-based deployments, optimizing startup times is crucial for efficient and agile application delivery. In our case, we significantly improved task startup times in our Amazon Elastic Container Service (ECS) cluster by implementing Seekable OCI (SOCI), an Amazon Web Services (AWS) technology that enables deferred loading of container images. The impact SOCI had on both speed and costs far exceeded our initial expectations: container launches became up to 3 times faster, and costs dropped by more than 24%.

SOCI is a technology developed by the AWS team that lets applications scale faster by starting containers without first downloading the entire image. Rather than pulling the full image to start a container, SOCI extracts only the specific metadata needed, seeking selectively through the image and avoiding the transfer of unnecessary data. This “lazy loading” approach yields significant savings in container startup times (see more).

Context

On the Platform team at La Haus, we have built an ECS cluster designed to launch Fargate Spot tasks that deploy GitHub self-hosted runner instances. This configuration suits a variety of use cases, including private database backups, resource-intensive tests, and tasks that need to communicate with services on private networks. The flexibility of the ECS cluster combined with Fargate Spot enables secure, customizable runtime environments for different teams and their specific needs. With custom container images, our solution integrates GitHub Actions seamlessly into workflows, ensuring optimal performance for everything from resource-intensive testing to runtime environments with pre-installed tools.

It’s important to note that the Docker images we use vary in size, with the smallest being around 1.7 GB and the largest reaching approximately 2.9 GB. This variability in image sizes can impact task start and deployment times in the ECS cluster.

SOCI Dependencies

It’s important to highlight that for SOCI to function correctly, certain dependencies are required, including nerdctl.

We chose nerdctl because Docker does not store the compressed image on disk (open issue); it compresses layers only during the push process. This means we don’t have access to the compressed layers needed to build the SOCI index until after the push. nerdctl, which is compatible with SOCI, lets us overcome this limitation and make the most of SOCI's capabilities.

Here is an example of how we can build the SOCI index of an image, assuming that you have all the necessary dependencies:
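The original commands are not embedded in this version of the post; the following is a sketch of the typical flow using nerdctl together with the soci CLI from the soci-snapshotter project, where the image reference and credentials are placeholders:

```shell
# Pull the image through containerd so the compressed layers are kept on disk
sudo nerdctl pull <registry>/<image>:<tag>

# Build the SOCI index (a ztoc per large layer plus an index manifest)
sudo soci create <registry>/<image>:<tag>

# Push the SOCI index to the registry, alongside the image itself
sudo soci push --user <username>:<password> <registry>/<image>:<tag>
```

With the index published next to the image, the SOCI snapshotter on the host can lazily load layers at container start instead of waiting for a full pull.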

To create an index of your images, you can update your CI workflows by adding the commands shown above or implementing this solution created by the AWS team that generates image indices with Lambda functions. This way, you can avoid modifying all your workflows.

In our case, we have a GitHub Actions workflow that builds all our images, so we are going to update our workflows by adding the steps shown above.

Data Analysis

To analyze startup times, we use the following Python script, which collects the creation time and execution start time of our ECS tasks and stores the results in an SQLite database. This lets us gather data before and after the SOCI implementation and then compare the startup times.
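The collection script itself is not embedded in this version of the post; a minimal sketch of the approach might look like the following, where the cluster name, helper names, and table schema are illustrative rather than the author's actual code:

```python
import sqlite3


def record_tasks(conn, tasks):
    """Store each task's creation/start timestamps and startup delay in SQLite."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               family TEXT, created_at REAL, started_at REAL,
               startup_seconds REAL)"""
    )
    for t in tasks:
        created, started = t["createdAt"], t["startedAt"]
        conn.execute(
            "INSERT INTO tasks VALUES (?, ?, ?, ?)",
            (t["group"], created, started, started - created),
        )
    conn.commit()


def fetch_stopped_tasks(cluster):
    """Fetch stopped tasks from ECS (requires AWS credentials)."""
    import boto3  # imported lazily so the rest of the script runs without AWS

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
    if not arns:
        return []
    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
    return [
        {
            "group": t["group"],  # e.g. "family:sh-runner-small-teamS-ruby"
            "createdAt": t["createdAt"].timestamp(),
            "startedAt": t["startedAt"].timestamp(),
        }
        for t in tasks
    ]


if __name__ == "__main__":
    conn = sqlite3.connect("ecs_data_v1.sqlite")
    record_tasks(conn, fetch_stopped_tasks("sh-runners"))
```

The `createdAt`/`startedAt` fields come from the ECS `DescribeTasks` API; their difference is the startup time we care about.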

The analysis script uses two datasets to compare the implementation without SOCI (ecs_data_v1.sqlite) and with SOCI (ecs_data_v2.sqlite), grouped by the task families we have in our cluster. Now we are ready to analyze our implementation.
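Since the comparison script is not embedded here either, a sketch of the idea follows; it assumes each database has a `tasks` table with `family` and `startup_seconds` columns (these names are illustrative, not necessarily the author's schema):

```python
import sqlite3


def avg_startup_by_family(path):
    """Average startup seconds per task family from one SQLite dataset."""
    conn = sqlite3.connect(path)
    rows = conn.execute(
        "SELECT family, AVG(startup_seconds) FROM tasks GROUP BY family"
    ).fetchall()
    conn.close()
    return dict(rows)


def compare(before_path, after_path):
    """Percent startup-time improvement per family between two datasets."""
    before = avg_startup_by_family(before_path)
    after = avg_startup_by_family(after_path)
    return {
        family: round(100 * (before[family] - after[family]) / before[family], 1)
        for family in before
        if family in after
    }


# e.g. compare("ecs_data_v1.sqlite", "ecs_data_v2.sqlite")
```

Each value in the result is the kind of per-family improvement percentage discussed in the next section.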

Comparison of Results

All the tasks we run on our ECS cluster are assigned different Task Definitions, which allows us to create different shapes and flavors for our GitHub Actions workloads. The Task Definition names are formed as follows:

sh-runner-<INSTANCE_TYPE>-<IMAGE_NAME>

Where:

  • <INSTANCE_TYPE> indicates the size of the Fargate instance (nano, micro, small, etc.)
  • <IMAGE_NAME> is the name of the Docker image used (base, teamS-ruby, etc.).

Image 1. The Task Definitions we use in the ECS cluster; each has its own instance size and image.

For example, for sh-runner-small-teamS-ruby we can see that it uses a small instance of Fargate with the teamS-ruby image.

We have defined a standard for the size of the Fargate instances that we will use in our tasks:

  • nano: 0.5 vCPU, 1GB RAM
  • micro: 1 vCPU, 2GB RAM
  • small: 1 vCPU, 4GB RAM
  • medium: 2 vCPU, 4GB RAM
  • large: 2 vCPU, 6GB RAM

Likewise, our tasks use different types of Docker images:

  • base: base Linux image of 1.5GB
  • teamS-ruby: Ruby image of 2.9GB
  • teamS-nodejs: NodeJS image of 1.8GB

This nomenclature allows us to quickly identify the resources and types of images assigned to each Task Definition.
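As a quick illustration of the naming scheme (this helper is ours for the example, not part of the cluster code), the components can be split programmatically:

```python
def parse_task_definition(name):
    """Split 'sh-runner-<INSTANCE_TYPE>-<IMAGE_NAME>' into its two parts.

    Only the first hyphen after the prefix separates the fields, because
    the image name itself may contain hyphens (e.g. teamS-ruby).
    """
    prefix = "sh-runner-"
    if not name.startswith(prefix):
        raise ValueError(f"unexpected task definition name: {name}")
    instance_type, _, image_name = name[len(prefix):].partition("-")
    return instance_type, image_name


# parse_task_definition("sh-runner-small-teamS-ruby") -> ("small", "teamS-ruby")
```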

Image 2. Analysis of the results of the task startup times in the ECS cluster before using SOCI and after using SOCI.

The image above was generated with our Python script (show_metrics.py) to visualize the implementation results. Here is the analysis for each Task Definition:

For sh-runner-micro-base we see that it uses a micro instance of Fargate (1 vCPU and 2GB RAM) with a base image of 1.5GB. The 64% optimization in startup time demonstrates how quickly SOCI can extract only the necessary metadata even from medium-sized images on small instances.

In sh-runner-nano-base we have an even smaller nano instance (0.5 vCPU and 1GB RAM) but with the same base image. Despite having fewer resources, SOCI achieved a 61% reduction in startup time.

For sh-runner-small-base, a slightly more powerful small instance was used (1 vCPU and 4GB RAM), but with the same base image. The benefit of SOCI was similar to the "micro" case, with a 66% improvement.

In sh-runner-small-teamS-nodejs we see the impact of SOCI on a small instance with a heavier image of 1.8GB. Still, an acceleration of 67% was obtained.

Finally, in sh-runner-small-teamS-ruby we have the combination of a small instance with the heaviest image of 2.9GB. Despite this, SOCI produced the greatest optimization, reducing the average time by 72%.

Performance 📈

On average, SOCI accelerated task startup by approximately 3 times, with improvements ranging from 61% to 72% across the evaluated Task Definitions. Note that for images with small layers, the difference in execution times may be minor. If this is the case, we can follow SOCI’s suggestion:

We skip building ztocs for smaller layers (controlled by --soci-min-layer-size in nerdctl push) because small layers don’t benefit much from lazy loading.

This demonstrates the benefits that deferred loading of container images offers through techniques like Seekable OCI for serverless computing environments like Fargate Spot. The ability to extract and load only the necessary metadata to start a container, rather than downloading the entire image, results in much faster starts and a more efficient use of computational resources.

Costs 🤑

Well, it’s time to talk about 💰. It is difficult to predict exactly how much this implementation will save, given the unpredictable nature of our workloads, the variety of instance types we use, and the implicit savings from developers no longer waiting on slow starts. Still, we can get a quick estimate by selecting a specific GitHub Actions workflow and analyzing it.

We can start with the definition shown on the AWS website:

AWS Fargate pricing is calculated based on the vCPU, memory, Operating Systems, CPU Architecture, and storage resources used from the time you start to download your container image until the Amazon ECS Task terminates

which means that if a task's image is very heavy and takes around 5 minutes to pull, we pay for those 5 minutes, and this is where SOCI can help us reduce costs.

Let’s do a quick exercise based on an average of the number of executions we have in one of our workflows that uses the Task Definitions called sh-runner-small-teamS-ruby. This workflow runs about 150 times a month, running a matrix of 30 jobs/tasks in parallel that take approximately 4 minutes each, plus an additional 2 minutes for task start (see image 2).

Considering that the Fargate small instance has 1 vCPU and 4GB of RAM, and using AWS rates of $0.04048 per vCPU-hour and $0.004445 per GB-hour (Linux/X86), the cost per hour for this configuration is $0.05826.

Writing a lot of numbers is a bit tedious to read, so let’s write some Python code to make it easier to understand; it can also be useful if you have a case similar to ours and want an idea of what you could save with this implementation:
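The original script is not embedded in this version of the post; the sketch below reconstructs the calculation from the figures already stated (150 runs × 30 jobs, 4-minute jobs, 2-minute startup reduced ~72% by SOCI). Because the author's exact measured with-SOCI startup time isn't shown, the middle figure may differ from the output below by a cent of rounding:

```python
# Hypothetical reconstruction of the cost comparison; the job counts and
# timings come from the article, the ~72% startup reduction from Image 2.
VCPU_RATE_HOUR = 0.04048       # USD per vCPU-hour (Linux/x86, on-demand)
GB_RATE_HOUR = 0.004445        # USD per GB-hour
VCPUS, MEMORY_GB = 1, 4        # Fargate "small" instance

RUNS_PER_MONTH = 150
JOBS_PER_RUN = 30
JOB_MINUTES = 4
STARTUP_MINUTES = 2
SOCI_STARTUP_REDUCTION = 0.72  # average improvement for this task family

# ~ $0.05826 per task-hour for 1 vCPU + 4 GB
rate_per_hour = VCPUS * VCPU_RATE_HOUR + MEMORY_GB * GB_RATE_HOUR


def monthly_cost(task_minutes):
    tasks = RUNS_PER_MONTH * JOBS_PER_RUN  # 4500 tasks per month
    return tasks * (task_minutes / 60) * rate_per_hour


without_soci = monthly_cost(JOB_MINUTES + STARTUP_MINUTES)
with_soci = monthly_cost(
    JOB_MINUTES + STARTUP_MINUTES * (1 - SOCI_STARTUP_REDUCTION)
)

print(f"Total monthly cost without SOCI: ${without_soci:.2f}")
print(f"Total monthly cost with SOCI: ${with_soci:.2f}")
print(f"Estimated monthly savings: ${without_soci - with_soci:.2f}")
```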

Output:

Total monthly cost without SOCI: $26.22
Total monthly cost with SOCI: $19.93
Estimated monthly savings: $6.29

In these results, we assume that we are working with the price of Fargate on-demand instances, but our actual workload uses Spot Instances, so real-world savings may vary. However, this is a reasonable estimate of potential cost optimizations with SOCI.

Conclusion

  • As demonstrated in the analysis, the implementation of Seekable OCI in our ECS cluster resulted in significant improvements in task startup times, with reductions ranging from 61% to 72% for the different evaluated definitions. While we are not currently leveraging horizontal scaling in our ECS cluster, these optimizations in startup times could facilitate faster horizontal scaling events for services running on ECS.
  • Regarding the impact on costs, an accurate analysis is complex in our case since our Fargate Spot workloads run on demand and task volumes can fluctuate significantly from day to day. Even so, we showed that for a single workflow executing around 4,500 tasks/jobs per month, the monthly cost drops from about $26 to about $20 with the new implementation, a saving of roughly 24% from just one GitHub Actions workflow. Now imagine it for all the workloads you are running on your cluster 😉
  • We are currently evaluating the same implementation for all the workloads running in our EKS clusters in order to speed up the automatic scaling of our services; there, we will probably have to deal with custom AMIs.

We will continue to explore ways to further improve our CI/CD processes by leveraging innovative technologies. However, these results demonstrate that achieving much more agile and cost-effective deployments in Fargate Spot with the use of SOCI is already possible.

In an upcoming article, we will analyze in detail how we have configured GitHub Self-Hosted Runners on Fargate Spot instances as part of our strategy to effectively run CI at lower costs.
