How We Reduced Our Autoscaling GitLab Runner Costs by 58%

Fahri Noer Firmansyah · Published in mamitech · 12 min read · Sep 28, 2022

At Mamikos we use GitLab Runner for continuous integration (unit tests, API tests, and so on). Every merge request must pass its pipeline, which is run by our GitLab Runner, before the code can be merged into the destination branch; this ensures our code is correct and behaves as expected. Beyond merge requests, GitLab Runner is also the backbone of our scheduled pipelines for feature tests and code coverage measurement.

That makes GitLab Runner important, and we use it heavily during development. GitLab Runner offers an autoscaling feature for the instances that run pipeline jobs. When we first implemented it, we used Cloud Provider D servers as our autoscaling instances, even though there were stability issues.

Then we found alternatives, including AWS EC2-Spot, that might cost less than Cloud Provider D and perform better. So here's our journey.

The Background

Every journey has a reason why it started.

In this section we'll talk about why we moved away from Cloud Provider D, the effort we put into alternatives such as AWS Fargate, and our expectations for AWS EC2-Spot.

Old Runner

We had been using Cloud Provider D for quite a long time, and as you know from the explanation above, we use GitLab Runner heavily. So, for context, here is approximately our monthly cost to run our development workload on the runner:

It cost us about 154 USD/month.

On top of that cost, there was also a stability issue: sometimes instances couldn't be scaled down properly after exceeding the idle timeout, and the stuck instances wasted money:

As you can see, the message above shows instances that were unable to scale down.
Here's why: the GitLab Runner Docker Machine executor uses docker-machine to control autoscaling instances on cloud platforms. Docker has already deprecated that tool, and because the Docker Machine executor still depends on it for autoscaling, the GitLab team decided to fork it and continue fixing critical bugs. Unfortunately, that effort focuses only on the GCP, AWS, and Microsoft Azure drivers, which is why this kind of stability issue on Cloud Provider D was never resolved.
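For reference, autoscaling with the Docker Machine executor is driven by the runner's config.toml, and the idle timeout mentioned above lives there too. A minimal sketch with illustrative values (the driver name and numbers are placeholders, not our real settings):

```toml
# /etc/gitlab-runner/config.toml (illustrative values only)
concurrent = 10

[[runners]]
  name = "autoscale-runner"
  url = "https://gitlab.example.com/"   # hypothetical GitLab URL
  token = "RUNNER_TOKEN"                # placeholder registration token
  executor = "docker+machine"

  [runners.docker]
    image = "docker:stable"

  [runners.machine]
    IdleCount = 1                        # machines kept warm while idle
    IdleTime = 600                       # seconds before an idle machine is removed
    MaxBuilds = 20                       # recycle a machine after this many jobs
    MachineDriver = "your-cloud-driver"  # docker-machine driver for the cloud provider
    MachineName = "runner-%s"
    MachineOptions = ["engine-install-url=https://get.docker.com"]
```

When scale-down fails, it is this docker-machine layer that gets stuck removing the instance, which is what the graph above captures.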

On the cloud administration side, there was another issue we only found out about later: SSH keys weren't deleted properly, leaving hundreds of unused SSH keys behind on our account:

There were 61 pages of SSH keys, most of them left behind by the autoscale runner.

Why are we moving?

Now that you know about the old runner and its issues, let's talk about the reasons for moving:

  1. Cost-saving opportunity
    Cloud Provider D servers were still working for us, but we found alternatives offering the same service at a lower cost, so we decided to take a shot at them.
  2. Better performance
    Besides saving cost, we hoped these alternatives would be more stable and perform better than our old runner.

What moves have we taken?

This is why we said "alternatives": our journey to AWS EC2-Spot actually began with trying to implement AWS Fargate as our autoscale runner.

AWS Fargate basically worked for us, but it didn't go as far as we expected. There were conditions and performance issues that convinced us we couldn't use AWS Fargate for our autoscaling instances:

  1. Not reusable
    A Fargate container is provisioned per task, and the GitLab AWS Fargate driver translates one job into one task, which is time-consuming: every time a new job in the pipeline runs, we have to wait at least 1.5 minutes for provisioning.
  2. Less performance
    A Fargate container needs more time to run a job than our previous runner instances, at least 1.5 times slower.

On top of that, implementing AWS Fargate as an autoscaling runner isn't technically easy, because Fargate uses task definitions to describe the containers it runs. We had to translate our job configuration from the gitlab-ci configuration file into AWS Fargate task definitions, including the database container needed for unit testing, the networking between containers, and so on. We also had to customize our tester image to have GitLab Runner installed, which the driver calls the ci-coordinator. In the end, we gave up on AWS Fargate and said goodbye.
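To give a feel for that wiring, the Fargate setup runs through GitLab Runner's custom executor, with the Fargate driver binary handling each stage of a job. A rough sketch of the runner-manager side, with placeholder paths rather than our exact setup:

```toml
# config.toml on the runner manager (placeholder paths, illustrative only)
[[runners]]
  name = "fargate-runner"
  url = "https://gitlab.example.com/"   # hypothetical GitLab URL
  token = "RUNNER_TOKEN"
  executor = "custom"

  [runners.custom]
    # The same fargate driver binary handles every stage of a job,
    # creating and tearing down one ECS task per job.
    config_exec  = "/opt/gitlab-runner/fargate"
    config_args  = ["--config", "/etc/gitlab-runner/fargate.toml", "custom", "config"]
    prepare_exec = "/opt/gitlab-runner/fargate"
    prepare_args = ["--config", "/etc/gitlab-runner/fargate.toml", "custom", "prepare"]
    run_exec     = "/opt/gitlab-runner/fargate"
    run_args     = ["--config", "/etc/gitlab-runner/fargate.toml", "custom", "run"]
    cleanup_exec = "/opt/gitlab-runner/fargate"
    cleanup_args = ["--config", "/etc/gitlab-runner/fargate.toml", "custom", "cleanup"]
```

Roughly speaking, the fargate.toml referenced here then points at the ECS cluster, subnet, security group, and task definition, and the task definition is where the tester (ci-coordinator) image and the supporting database container have to be described.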

Our expectations for AWS EC2-Spot

After giving up on AWS Fargate for our autoscaling instances, we tried the idea of using AWS EC2-Spot, with the expectations below:

  1. Save cost
    Spot capacity is offered at up to 90% off on-demand prices, so once again we expected a low cost. Because of the nature of the service, it can provide EC2 instances cheaply.
  2. Better runner performance
    a. More stability using AWS and the GitLab AWS EC2 driver: as mentioned before, the AWS driver is still maintained by the GitLab team, so bugs get fixed.
    b. Shorter running time than AWS Fargate: at the very least, with AWS EC2-Spot we expected shorter execution times and instances that can be reused like a normal runner (see the configuration sketch after this list).
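In practice this is still the Docker Machine executor, just pointed at the amazonec2 driver with spot options turned on. A minimal sketch with illustrative values (the region, IDs, instance type, and spot price are assumptions, not our real configuration):

```toml
# config.toml excerpt for spot-backed autoscaling (illustrative values)
[[runners]]
  name = "ec2-spot-runner"
  url = "https://gitlab.example.com/"    # hypothetical GitLab URL
  token = "RUNNER_TOKEN"
  executor = "docker+machine"

  [runners.machine]
    IdleCount = 0
    IdleTime = 300
    MachineDriver = "amazonec2"
    MachineName = "runner-spot-%s"
    MachineOptions = [
      "amazonec2-region=ap-southeast-1",        # assumed region
      "amazonec2-vpc-id=vpc-xxxxxxxx",
      "amazonec2-subnet-id=subnet-xxxxxxxx",
      "amazonec2-instance-type=c5.large",       # instance type per runner class
      "amazonec2-request-spot-instance=true",   # ask for spot capacity
      "amazonec2-spot-price=0.05",              # max price we are willing to pay
      "amazonec2-security-group=gitlab-runner"
    ]
```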

The Journey

Yes, literally: the journey of implementing it and making it cost-effective is what we're going to talk about.

How it started

At first, we just used the default/minimum configuration. We didn't really understand the architecture at this point, but by solving problem after problem and digging deeper into the documentation we eventually did. Below is what the architecture looked like in the beginning:

In the beginning, we placed the GitLab Runner manager on an on-demand EC2 instance inside AWS, because of the data transfer out costs that we 'thought' would otherwise become a big part of our spend.

As you might know, AWS charges for traffic that leaves its network, and that was a concern for us; avoiding it would save more cost. For artifacts, we still used the GitLab server's local storage because that's the default configuration. For the cache, we used an AWS S3 bucket, the same way we used Cloud Provider D's object storage for cache storage before.
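The cache part is a small runner-level setting. A rough sketch, with a hypothetical bucket name and an assumed region:

```toml
# config.toml excerpt: shared build cache on S3 (placeholder values)
[[runners]]
  # ... executor and machine settings as above ...
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "mamikos-runner-cache"   # hypothetical bucket name
      BucketLocation = "ap-southeast-1"     # assumed region
      # credentials come from an instance IAM role or AccessKey/SecretKey
```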

After we implemented and tested it, it turned out that the thing we were worried about, and thought we had avoided by placing the GitLab Runner manager inside AWS, was still being charged: data transfer out costs.

Problem
There's a data transfer out cost from AWS EC2 to the GitLab server.

It turned out that the data transfer cost came from the runners uploading artifacts to the GitLab server, which sits outside AWS. So we tried a solution that uses AWS S3 as artifact storage.

Then…

As mentioned above, we moved our artifact storage from the GitLab server's local storage to AWS S3 to avoid data transfer out costs.

Solution
Move artifact storage to S3 (with Direct Upload), so it should look like the one below.

The picture above shows the architecture we were expecting when moving artifacts to AWS S3 with the direct upload option. At the time we didn't understand the architecture well, so data transfer costs still occurred.

We also tried other solutions: an AWS S3 Gateway Endpoint provisioned in our VPC, and the AWS S3 Regional Endpoint that is available per region. Neither worked.

Problem
It turns out the upload isn't direct at all. Some of the data transfer out costs simply moved to AWS S3; below is the actual architecture.

Even after we migrated artifacts to AWS S3 (with Direct Upload), the runners still uploaded artifact parts to the GitLab server, specifically to the GitLab Workhorse service, which processes the upload.
As you can see in the picture, the cost isn't only from the runners to the GitLab server but also from the GitLab server to AWS S3, because GitLab Workhorse uses a multipart upload method. The multipart upload incurs data transfer out costs because it needs to download the original object in order to upload the parts.

Once we knew the GitLab server also downloads objects from AWS S3 for the multipart upload, we tried denying the GetObject operation for the GitLab server. That didn't work either: runners became unable to upload artifacts.

The odd thing is that, instead of adding more cost, the migration just split the costs up: some of the data transfer out moved to AWS S3.

Another Approach

After the solutions we had thought of didn't work, we reached out to the AWS people who support us. One of their suggestions was to use a private subnet for the runner instances so the data can't leave the AWS network.

Now the architecture should look like the picture above, and then a problem occurred.

Problem
Can’t connect to the internet for pulling images and uploading artifacts.

After we put the runner instances in a private subnet, they couldn't reach the internet or the GitLab server to upload artifacts; again, this came down to our lack of architecture knowledge.
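For context, pointing the autoscaled machines at a private subnet is only a couple of driver options; presumably the missing piece was that a private subnet also needs a NAT gateway (or VPC endpoints) in front of it to reach anything outside. A sketch of the relevant options, with placeholder IDs:

```toml
# config.toml excerpt (inside [[runners]]): spot machines in a private subnet
[runners.machine]
  MachineDriver = "amazonec2"
  MachineOptions = [
    "amazonec2-vpc-id=vpc-xxxxxxxx",
    "amazonec2-subnet-id=subnet-xxxxxxxx",    # private subnet (placeholder ID)
    "amazonec2-use-private-address=true",     # talk to machines over private IPs
    "amazonec2-private-address-only=true",    # don't attach a public IP
    "amazonec2-request-spot-instance=true"
  ]
```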

Now let’s move to the final solution, don’t give up!

The Final Fix We Used…

The last solution we tried doesn't actually involve the infrastructure at all: we treated the cause of the data transfer out, the artifacts themselves.

Basically, we went back to the early architecture but reduced the artifacts generated by our CI jobs, so there is much less data to move around. This is done in the .gitlab-ci.yml file of each repository that uses the runner.
Once applied, our data transfer out cost dropped significantly and we became confident about using it in production.
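The article doesn't show the exact changes, but as an illustration, trimming artifacts in .gitlab-ci.yml usually means narrowing artifacts:paths to only what later stages need and letting artifacts expire quickly. A hypothetical example (the job name, script, and paths are made up):

```yaml
# .gitlab-ci.yml excerpt (hypothetical job, for illustration only)
unit-tests:
  stage: test
  script:
    - ./run-tests.sh
  artifacts:
    # Keep only the report downstream jobs actually consume,
    # instead of uploading the whole build directory.
    paths:
      - coverage/coverage.xml
    expire_in: 1 day       # drop artifacts quickly once they are no longer needed
    when: on_success       # skip uploads for failed runs if they aren't needed
```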

Shoutout to our CTO who’s come up and executed the final solution. 🙌

The Performance

With that journey behind us, let's talk about the performance of the system. Does it meet our expectations? Is it worth the effort?

Stability

Let’s see the stability of the system here.

Stuck On Removing
As you know, with Cloud Provider D servers we often experienced issues where servers couldn't be scaled down and kept running as zombies; we can see that on the stuck-on-removing graph in our Grafana.

The graph covers the last 7 days before it was captured and shows a flat line, meaning we didn't get any stuck-on-removing instances in that period. So the system can be considered stable. 🎉

Undeleted SSH Keys
As mentioned earlier, with Cloud Provider D we also found an issue with SSH key administration: SSH keys weren't deleted, probably after servers were forcefully terminated because they were stuck on removing.

Now the SSH key count matches the currently active instances in docker-machine, so the driver is doing a great job with SSH key administration.

Execution time

Good execution time is important for our efficiency, because every merge request has to go through the pipeline before it can move to the next development stage.

The table above compares the three systems we've tried: Cloud Provider D servers, AWS Fargate, and AWS EC2-Spot. As expected, AWS EC2-Spot is faster than AWS Fargate; the difference is about 1 hour for running all of the jobs within the pipelines, which is a green flag for us. 👌🏻

The execution times were recorded at different moments for each system, so there could be workload differences among them. Still, the data is useful for comparing results.

The Cost

Here is what we've been waiting for. After covering performance, stability, and the effort involved, it's time to reveal how much the system costs. But before that, you probably want to know the specifications and capacity of the runners we provide to support GitLab CI, right?

Specification

In our system we use different classes for different workloads; each class has a different instance specification, so heavier jobs can be assigned to a more powerful instance class (see the routing sketch after the class list below).

Class #4 was introduced in the EC2-Spot era; with its price and billing scheme, we can now afford it.
Class #5 is used for the shared runner, the type of runner that can be used by all of the projects across our GitLab.
Class #6 is used for another service that needs an independent runner.
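The article doesn't show how a job ends up on a given class, but the usual GitLab mechanism for this is runner tags: each runner class registers with its own tag, and jobs request it in .gitlab-ci.yml. A hypothetical example (the tag names and scripts are made up, not our real ones):

```yaml
# .gitlab-ci.yml excerpt (hypothetical tag names)
api-tests:
  stage: test
  tags:
    - runner-class-4   # route this heavier job to the more powerful class
  script:
    - ./run-api-tests.sh

lint:
  stage: test
  tags:
    - runner-class-5   # lighter job on the shared runner class
  script:
    - ./run-lint.sh
```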

Capacity

The table shows the components we use in the system: the runner manager, the classes with the maximum autoscaling instance count for each class, and the storage systems we use. In the end, it shows how much capacity we get for the cost we spend.

For the maximum autoscaling instance count per class, there's a big difference between the previous system and AWS EC2-Spot: we can serve far more instances on AWS EC2-Spot, thanks to its price and billing scheme.

Class #5, which previously ran as just 1 instance all the time, can now autoscale up to a maximum of 12 instances. Class #4, introduced in the AWS EC2-Spot era, is available all the time with 2 instances. The other classes got capacity upgrades of up to 3 times compared with the previous system.
The graph below shows how much capacity we were using at one point in time:

It's better than the previous system. 🎉

How much?

Now it's time to reveal the cost. Who's excited?

Yes, we now save about 58% of our cost while supporting much greater capacity! After all the effort and time we put in, we finally get our runner instances at a cheaper cost with AWS EC2-Spot. 🎉

The data is from July 2022.

That is all possible because the AWS EC2-Spot billing scheme is different from Cloud Provider D's, which bills for a minimum of 1 hour. With the old scheme, even if we used an instance for only 1 minute we had to pay for a full hour, which is bad for autoscaling, considering most of our jobs run in the minute range.

With the AWS EC2-Spot billing scheme we're billed per second, with a minimum of 60 seconds for each new instance. That's also why we're able to provide more capacity for our autoscaling system: it stays cheap even with more instances running simultaneously and frequent scaling up and down. You can check the AWS page for more information about the pricing here.
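To make that concrete with made-up numbers (not our actual rates): a 3-minute job on an instance billed hourly at 0.04 USD/hour still costs the full 0.04 USD, while the same 3 minutes billed per second costs roughly 3/60 × 0.04 ≈ 0.002 USD, about 20 times less; spot discounts on top of the on-demand price widen that gap further.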

Thanks!

Finally, thanks to the Mamikos DevOps team, our CTO, and the Engineering Managers who helped us with this project. Hopefully this article can motivate others, and feel free to give us feedback! Thank you! 🙌🏻

--

Fahri Noer Firmansyah

Excited about rockets & spacecraft; DevOps is what I do day to day | DevOps Engineer @Mamikos