Deploying Serverless GitLab Runners on AWS Fargate with Terraform

A complete setup of secure and scalable serverless GitLab runners on AWS Fargate via Terraform IaC

Diagram of GitLab Runner using ECS on Fargate

As GitLab becomes widely used in GovTech, we need a way to manage our CI/CD runner fleet securely and at scale. A serverless setup on AWS Fargate is attractive because it simplifies infrastructure management and scales with little effort.

In this post, I will share a setup for a fleet of serverless GitLab runners on AWS Fargate managed via Terraform for ease of reproducibility in any environment.

Background

GitLab has a guide on Autoscaling GitLab CI on AWS Fargate, but it hosts the runner manager on an EC2 instance, so it is not completely serverless. Others have shared a full ECS on Fargate setup for both managers and workers, such as Serverless GitLab CI/CD on AWS Fargate by Daniel Coutinho de Miranda and A serverless approach for GitLab integration on AWS by Damiano Giorgi.

However, these examples rely heavily on manual configuration and support only one manager profile per EC2 instance or Fargate task.

The main motivations for the setup described in this post are:

  • Use Terraform Infra-as-Code for a more secure, reproducible and configurable setup.
  • Support multiple runner manager profiles in one Fargate instance.

Architecture Overview

Architecture diagram for a complete setup of GitLab runner using ECS on AWS Fargate

Key elements:

  • An ECS Service on AWS Fargate hosts the manager service. This service can register multiple runner managers.
  • Each runner manager registers itself as a runner with GitLab when a new ECS Task is created under this ECS Service.
  • When a job is triggered in GitLab, GitLab assigns it to an appropriate runner manager, which then creates a new worker ECS task in a specific subnet and security group to perform the job.
  • Each worker task has a predefined role and container image.

Deployment Process

Step 1: Build and publish container image for managers

Code: https://github.com/GovTechSG/fargate-gitlab-runner

  1. Set the IMAGE_NAME, IMAGE_TAG, AWS_ACCOUNT_ID, and AWS_REGION environment variables as desired. IMAGE_NAME should start with the ECR domain ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com so that the image can be published to ECR later.
  2. Use docker-compose build to build the image.
    Alternatively, run: docker build -t ${IMAGE_NAME}:${IMAGE_TAG} .
  3. Log in to ECR:
    aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
  4. Publish to ECR using docker-compose push or docker push ${IMAGE_NAME}:${IMAGE_TAG}
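
To make the naming rule from step 1 concrete, here is a minimal Python sketch that builds an image reference with the ECR domain prefix and checks whether an existing IMAGE_NAME carries it. The helper names (ecr_registry, image_ref, is_ecr_image) and the sample account ID, region and repository are hypothetical and are not part of the project's code.

```python
import re


def ecr_registry(account_id: str, region: str) -> str:
    """Return the private ECR registry domain for an account and region."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account ID must be 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def image_ref(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build the full '<registry>/<repository>:<tag>' reference."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


def is_ecr_image(image_name: str, account_id: str, region: str) -> bool:
    """Check that IMAGE_NAME starts with the expected ECR domain."""
    return image_name.startswith(ecr_registry(account_id, region) + "/")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(image_ref("123456789012", "ap-southeast-1", "fargate-gitlab-runner", "latest"))
```

A check like this can run before docker push to catch an IMAGE_NAME that points at the wrong registry.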

Note that this single image spawns multiple managers using profiles from the MANAGERS_CONFIGS environment variable.

If you use podman instead of docker, install podman-compose as well.

Step 2: Build and publish container image for worker tasks

Sample images are available at https://github.com/GovTechSG/fargate-gitlab-runner-worker.

  1. Update Dockerfile to install new tools as necessary.
  2. Build and publish to ECR using the same steps as for the manager image above.

Step 3: Set up additional resources required by the ECS Service

The following resources must be set up, either manually or via another set of Terraform code (recommended):

  1. Secret for GitLab Token: Obtain a token for runners from GitLab and create a new secret in either AWS Secrets Manager or Systems Manager Parameter Store. Take note of the KMS key used.
  2. VPC, subnets and security groups required for both managers and workers. Note that SSH communication via port 22 needs to be allowed between the managers and the workers in both network ACLs and security groups.
  3. Workers’ ECS Task roles with appropriate policies for the worker tasks to access AWS services.

Step 4: Deploy ECS Service using Terragrunt

Code: https://github.com/GovTechSG/fargate-gitlab-runner-terraform

  1. Copy environments/sample-dev to a new environment folder environments/<env>, replacing <env> with your desired environment name.
  2. Update your environment settings in environments/<env>/env_inputs.hcl.
  3. If there’s no existing ECS Cluster for either managers or workers, create new one(s) following the samples at environments/<env>/ecs-cluster-for-managers or environments/<env>/ecs-cluster-for-workers.
  4. Update variable values in environments/<env>/ecs-fargate-gitlab-runner-service/terragrunt.hcl matching your environment.
  5. Use cd environments/<env>/ecs-fargate-gitlab-runner-service && terragrunt apply to deploy the ECS Service.
  6. After a few minutes, review ECS Service logs, and check GitLab to verify that new runner(s) have been registered. The number of runners should be equal to manager_instance_count * length(keys(managers_configs)).
  7. Test the new GitLab runner(s) with your own job or a simple script like this:
    test:
      tags:
        # these should match all the tags set in the manager configs or a subset (note that a subset may mean other non-Fargate runners can pick up the job, depending on your setup)
        - dev
        - tool1
      script:
        - echo "It works!"
        - for i in $(seq 1 30); do echo "."; sleep 1; done

If all is successful, you should get an output similar to this:

Output of a successful run

If you prefer to use Terraform instead of Terragrunt for a quick test:

- Create a terraform.tfvars file in the folder terraform_modules/ecs-fargate-gitlab-runner-service with all variable values found in environments/<env>/ecs-fargate-gitlab-runner-service/terragrunt.hcl.

- Copy the content of versions.tf and provider.tf from environments/terragrunt.hcl into their respective files in terraform_modules/ecs-fargate-gitlab-runner-service.

- Finally, run terraform apply.

Resources Created by Terraform

These are the main resources created by the Terraform module:

  • an ECS Service for the manager container
  • its container definition with required environment variables
  • its IAM role that allows running and stopping the new ECS Tasks using the worker ECS Task definition(s)
  • worker ECS Task definition(s), one for each manager profile in managers_configs

What happened?

During Registration:

  1. Once ECS Service is deployed by Terraform, it creates a task to host the manager container image.
  2. When this container is up, it reads its environment variables, most notably MANAGERS_CONFIGS, and registers each manager as a runner with GitLab. It then generates the full GitLab runner config.toml file as well as each runner's fargate_worker.toml file, which contains the worker's security and network settings and its task definition.

You should see this in GitLab’s list of runners, 1 runner for each key in MANAGERS_CONFIGS (assuming manager_instance_count is 1).
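
As a quick sanity check of the runner count, the formula manager_instance_count * length(keys(managers_configs)) can be sketched in Python. This sketch assumes MANAGERS_CONFIGS is passed as a JSON object keyed by profile name; that encoding and the profile names below are assumptions for illustration, not taken from the project's code.

```python
import json


def expected_runner_count(managers_configs_json: str, manager_instance_count: int = 1) -> int:
    """Return how many runners should appear in GitLab after registration.

    Assumes MANAGERS_CONFIGS is a JSON object with one key per manager
    profile (an assumption made for this sketch).
    """
    profiles = json.loads(managers_configs_json)
    if not isinstance(profiles, dict):
        raise ValueError("expected a JSON object keyed by manager profile name")
    # Each manager task registers one runner per profile.
    return manager_instance_count * len(profiles)


if __name__ == "__main__":
    # Hypothetical profiles; real profiles carry network, role and image settings.
    sample = '{"dev-tool1": {}, "dev-tool2": {}}'
    print(expected_runner_count(sample, manager_instance_count=1))
```

If the number of runners shown in GitLab differs from this value, the manager's ECS Service logs are the first place to look.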

Running job:

  1. Once a job whose tags match a subset of the tag list of those runners is triggered, GitLab sends the job to the appropriate runner manager.
  2. The runner manager then starts a new worker ECS task using the task definition that has been passed in MANAGERS_CONFIGS, adding its SSH_PUBLIC_KEY as an environment variable in the process.
  3. The worker ECS task starts the sshd process with the manager's SSH_PUBLIC_KEY added to its authorized_keys.
  4. Finally, the manager runs the job in the worker ECS task via ssh. The job output and status are shared back to GitLab as usual.
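
The tag matching in step 1 can be illustrated with a small Python sketch: a runner is eligible when the job's tags are a subset of the runner's tags. The runner names and tag lists are hypothetical, and the sketch deliberately ignores other GitLab settings such as how untagged jobs are handled.

```python
def runner_can_take(job_tags: list[str], runner_tags: list[str]) -> bool:
    """A runner is eligible when it carries every tag the job asks for."""
    return set(job_tags) <= set(runner_tags)


def eligible_runners(job_tags: list[str], runners: dict[str, list[str]]) -> list[str]:
    """Return the names of all runners whose tags cover the job's tags."""
    return sorted(
        name for name, tags in runners.items() if runner_can_take(job_tags, tags)
    )


if __name__ == "__main__":
    # Hypothetical fleet: one Fargate runner and one non-Fargate runner.
    fleet = {
        "fargate-dev-tool1": ["dev", "tool1"],
        "shared-ec2": ["dev"],
    }
    print(eligible_runners(["dev", "tool1"], fleet))
    print(eligible_runners(["dev"], fleet))
```

This is why using only a subset of a runner's tags in a job can let other, non-Fargate runners pick it up: the fewer tags a job requests, the more runners qualify.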

Troubleshooting

The following are some issues that can occur:

Unable to start Fargate Task due to No Container Instances were found in your cluster error

  • Check your ECS Cluster for workers and make sure the Default capacity provider strategy is set to FARGATE.

Manager unable to connect to ECS to start a task

  • If the managers are hosted in private subnets, create VPC endpoints for ECS and ECR and make sure the managers can access them.

Manager unable to connect to worker ECS task via ssh

  • Make sure your worker container image has openssh installed and SSH_PUBLIC_KEY is added to the right user's ~/.ssh/authorized_keys.
  • Check that the subnets and security groups of both managers and workers allow traffic on port 22. Use the VPC Reachability Analyzer to confirm.
  • If the error is signature algorithm ssh-rsa not in PubkeyAcceptedAlgorithms, enable ssh-rsa by adding this to the worker container image: RUN echo "PubkeyAcceptedKeyTypes +ssh-rsa" >> /etc/ssh/sshd_config
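
Besides the VPC Reachability Analyzer, a plain TCP connection test can show whether port 22 is open from the managers' network. The following Python sketch assumes Python is available on a host or container in the managers' subnet; the function name can_reach is hypothetical. It checks only that a TCP connection can be opened, not that SSH authentication succeeds.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (hypothetical worker IP): can_reach("10.0.1.23", 22)
```

A False result points to security groups, network ACLs or routing. A True result followed by an ssh failure points to the key or sshd configuration instead.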

Worker ECS task has no credentials to access AWS

  • Share the variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI with the SSH session by adding this option to the command that starts sshd:
    -o "SetEnv=AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=\"$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI\""
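
To see why this variable matters: on ECS, AWS SDKs and the AWS CLI obtain task role credentials by requesting the path in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI from the local credentials endpoint at 169.254.170.2. If the SSH session does not inherit the variable, nothing can build that URL. A minimal Python sketch of the lookup follows; the function name credentials_url and the sample path are hypothetical.

```python
import os
from typing import Mapping, Optional

# Link-local endpoint that serves ECS task role credentials.
ECS_CREDENTIALS_HOST = "http://169.254.170.2"


def credentials_url(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Build the task credentials URL, or return None if the variable is missing."""
    env = os.environ if env is None else env
    relative_uri = env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        # Typical symptom inside an SSH session that did not inherit the variable.
        return None
    return ECS_CREDENTIALS_HOST + relative_uri


if __name__ == "__main__":
    print(credentials_url() or "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set")
```

Running a check like this inside a job is a quick way to confirm that the SetEnv option took effect.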

Known Limitation:

Conclusion

We have set up a complete set of serverless runners for GitLab using AWS Fargate. With Terraform, the setup can be configured easily for different environments.

In addition, we have full freedom to create runners for a wide variety of needs: an array of worker images with different tools installed, multiple worker ECS clusters using different spot instance strategies to optimize costs, custom worker roles to access cross-account resources, …

The possibilities are boundless. It’s up to you to configure and experiment.

Thank you for reading. Do comment below to share your thoughts.

The main project for this article is hosted on GitHub.

Credits to my colleagues for sharing resources and proofreading.

Quy Tang