Buildkite event driven agents on AWS and Kubernetes

Michael Vollman
Published in Beam Benefits · Mar 28, 2022 · 9 min read
The science of kite building

The Site Reliability Engineering team at Beam Dental migrated our Continuous Integration (CI) system to Buildkite midway through 2020 to reduce our monthly CI costs and gain control over our CI infrastructure. We chose to deploy our agents as Kubernetes Job resources running on an EKS cluster and chronicled some of our decision making in a previous post. Initially this worked very well for us (and there was much rejoicing), but over time we saw our cost savings diminish as the complexity of our environment increased. We decided to start 2022 with a hard look at our architecture and how we could once again reduce cost and remove some unintentional complexity.

Pools and Pools of Agents

The first iteration of our Buildkite EKS cluster involved pools of agents per queue waiting to accept jobs. Each queue was deployed as a separate Auto Scaling Group (ASG) so the queues could be scaled independently. We tightly managed how many agents of each queue we ran per node and scaled the cluster based on the percentage of busy agents for a given queue. This worked well for us and went largely untouched for more than a year, but some aspects of it were less than ideal.

To start with, adding a new queue was painful. It involved adding a new worker group to the Terraform code we used to deploy our EKS cluster, as well as additional Terraform changes for the autoscaling policies specific to the new queue. It was in no way self-service for our development teams and wasn’t any fun for the SRE team either.

Having a pool of agents was sub-optimal as well. At any given time you could be dedicating resources to idle agents until the cluster recognized the need to scale in. And the Buildkite scheduler was not aware of the placement of our agents (nor should it be). When a job was ready to be scheduled it would be assigned to the next available agent without any knowledge of the underlying node that agent was running on or how busy it may be.

Finally, troubleshooting the agents was cumbersome. It was difficult to align the pods that were running with the tasks they were responsible for. Maintenance activities like updating agents required waiting for the running agents to either hit their idle timeout or finish their assigned task and exit before they would pick up changes. It just felt like there had to be a better way.

Event Driven Agents

During the initial migration to Buildkite we came across the --acquire-job option and thought, wouldn’t that be a fun way to schedule our agents? Sadly, we couldn’t come up with a way to use this feature that didn’t take a jackhammer to the scope of the project. The missing piece for us that unlocked this capability was Buildkite’s Amazon EventBridge integration. We discovered the EventBridge integration in our quest to track agent wait time as a metric, which just so happened to be the example use case for it. This became the gateway to our event-driven approach to scheduling agents.

With our AWS account already receiving real-time events from Buildkite, we were able to add an EventBridge rule to send all of our Job Scheduled events to an SQS queue with a Lambda trigger. Our EventBridge rule:

{ "detail-type": ["Job Scheduled"] }

The Lambda processes the Job Scheduled events, which contain the Buildkite job ID, and launches ephemeral Kubernetes jobs with the --acquire-job <job-id> option set. Now our agents are launched on demand by a simple push of code to GitHub, and the number of agents running directly reflects the number of concurrent steps in flight.
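
For illustration, a stripped-down version of the Kubernetes Job the Lambda creates might look something like the sketch below; the names, agent image, and secret here are assumptions, and our real template carries a lot more configuration:

apiVersion: batch/v1
kind: Job
metadata:
  name: buildkite-agent-<job-id>       # hypothetical name; see the naming section below
spec:
  backoffLimit: 0                      # the agent either acquires the job or exits
  ttlSecondsAfterFinished: 600         # clean up finished agent jobs after 10 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: agent
          image: buildkite/agent:3     # we actually run a custom agent image
          args: ["start", "--acquire-job", "<job-id>"]   # job ID from the Job Scheduled event
          env:
            - name: BUILDKITE_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: buildkite-agent-token            # assumed secret name
                  key: token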

The agents are scheduled by the Kubernetes scheduler based on their resource requirements, allowing us to increase our agent-per-node density. Amazingly, we have not seen an increase in wait time compared to having agents pre-provisioned. An event-driven agent can be running a job after just 7s of waiting. Using this approach to agent scheduling has also unlocked new abilities to dynamically configure our agents based on step metadata, which we’ll discuss in the next section.

Dynamic Agent Configuration

Now that each of our agents is destined to perform one specific Buildkite step (and only that step), the agents.queue setting has almost no meaning. Without the --acquire-job option, the queue, along with any other Agent Targeting Rules defined in the agents section of a given step, was used to select an available agent. With the --acquire-job option set, an agent can only process the step with the matching job ID, even if it does not match the targeting rules. Since the Agent Targeting Rules are part of the Buildkite event, but no longer needed for scheduling, we can use them in new and exciting ways.

We wanted all of the Lambda’s configuration to live outside of the Lambda code itself so we could make changes to our agents and metadata without changing the Lambda. Our scheduling Lambda relies on DynamoDB and S3. We store our agent job configuration in S3 as a Jinja template. The Lambda downloads the job template from S3, renders it using the step metadata passed in via the Job Scheduled event, and then deploys the job to the cluster. And we use a DynamoDB table to store the default values of any metadata fields our job templates support.
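
To make that concrete, here is a hedged sketch of what part of the Jinja-templated job spec stored in S3 could look like. The variable names mirror the metadata fields shown in the next example, but the exact structure of our real template, the image repository, and the mapping of git_mirror onto the agent’s experiment flag are assumptions:

# Excerpt of a hypothetical agent job template rendered by the Lambda
containers:
  - name: agent
    image: "our-registry/buildkite-agent:{{ agent_tag }}"   # assumed image repository
    resources:
      requests:
        cpu: "{{ cpu }}"
        memory: "{{ memory }}"
    args:
      - start
      - --acquire-job
      - "{{ job_id }}"
      {% if git_mirror == "true" %}
      - --experiment=git-mirrors
      {% endif %}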

What this unlocks for us is the ability to configure our Buildkite agents from within our Buildkite pipelines.

steps:
  - label: ":cloud: Run E2E Tests"
    agents:
      agent_tag: "v1.0.1"
      cpu: "4"
      memory: "4G"
      git_mirror: "false"
    command: .buildkite/steps/run_e2e_tests.sh

In the above example step, we are configuring our Buildkite agent Kubernetes job to request 4 CPUs and 4G of memory from the Kubernetes scheduler. We are also setting the image tag of the agent container to v1.0.1 and disabling the git-mirror Buildkite experiment for this step. And these are just a few examples of how we might configure an agent. We can also pass timeout values, TTLs for the job after completion, which repository to pull the agent image from, and any other agent configuration we want to template into the job. Oh, the possibilities…

DynamoDB Queues

Having the flexibility to configure your agent on the fly is great, but if you have multiple steps that all require the same customization, you’re not going to want to continually define the same agents block. To solve this, we created a DynamoDB table of “queues” where each row contains the metadata values for a given queue. There is a default queue that has one column per supported metadata field and a default value for that field. Then there are additional rows for each queue we define, each with one or more columns of values to use for the metadata variables when that queue is requested. So, we can have a queue named e2e that is defined as:

queue=e2e,agent_tag=v1.0.1,cpu=4,memory=4g,git_mirror=False

And then our step from before is now configured as:

steps:
  - label: ":cloud: Run E2E Tests"
    agents:
      queue: "e2e"
    command: .buildkite/steps/run_e2e_tests.sh

Adding a new queue is as easy as adding a row to DynamoDB (assuming what you are configuring is already supported in the template). Our in-house CLI tool provides commands to add and update these queues, including validation of the data being passed in. The CLI now provides the self-service queue creation we lacked in the past.

AutoScaling

Our autoscaling was complex in our old environment, with a collection of CloudWatch alarms, autoscaling policies, SNS topics, and Lambdas used to perform scaling actions and notify us about them. We used a Lambda to run a custom job that would gracefully scale in a node. The Lambda would taint the node with NoSchedule and then wait for all agents to finish their work. Once all the agents bound to the node exited, the node would be removed.

Now that our agents are no longer bound to a specific node and only run when there is work to be done, we can simply rely on the cluster-autoscaler. The cluster-autoscaler is capable of both scaling up when there are not enough resources for scheduling and gracefully scaling down nodes by using PreferNoSchedule and NoSchedule taints. We annotate all of our agents with:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Setting safe-to-evict to "false" prevents the cluster-autoscaler from removing a node while there are still Buildkite agent pods running on it. Out of the box, the cluster-autoscaler has been fully capable of handling all of our scaling needs.

To avoid having jobs wait while the cluster scales up, we implemented N+1-style overprovisioning. We create a k8s.gcr.io/pause pod at the lowest scheduling priority that consumes one node’s worth of resources. When we start to schedule agents onto the last node, they take priority over this pod. The agents are able to run, while the overprovisioning pod goes into a Pending state, triggering the cluster-autoscaler to add more nodes. It’s simplicity itself and it’s lovely.
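
A minimal sketch of that overprovisioning setup, assuming a dedicated negative-priority PriorityClass and a single pause replica sized to roughly one node (the names and resource numbers below are assumptions):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning                 # assumed name
value: -10                               # below the default priority of 0, so agents preempt it
globalDefault: false
description: "Placeholder capacity that real workloads can preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-pause           # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-pause
  template:
    metadata:
      labels:
        app: overprovisioning-pause
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.5
          resources:
            requests:
              cpu: "7"                   # roughly one node's worth of CPU (assumed)
              memory: "28G"              # roughly one node's worth of memory (assumed)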

Concurrency and Agent names

Since we can now render our agents dynamically based on event data from Buildkite, we can name our agents using the pipeline name, job ID, and timestamp. When we look at the agents running (or just completed) in our Kubernetes cluster, we can see which pipeline and step each one was assigned to. It’s glorious to behold. I’ve put kubectl get pods in a loop and just watched the pipeline steps dance across the screen while I work on other things. Troubleshooting agents is already so much easier without a pile of agents that all share the same prefix and a random string.

Adding the timestamp was critical to get the event-driven architecture working with Buildkite concurrency_groups. For steps with concurrency_group defined, the Job Scheduled event fires when the job is initially scheduled, even if there is no remaining concurrency in the group. The step then goes into a “waiting for concurrency” state, and the agent that gets scheduled for the step exits immediately because Buildkite tells the agent that the step it is assigned to is not ready to be processed. When using --acquire-job, if the job is done, cancelled, waiting on concurrency, or otherwise un-runnable, the agent exits immediately because there is nothing to do.

When there is concurrency available in the concurrency group, Buildkite fires (or re-fires) another Job Scheduled event identical to the original one. Before we added the timestamp to the job name, the second event would create another Kubernetes Job with the same name. Since the job was already defined and in a Completed state, the agent would not get scheduled. We had to add a timestamp suffix to the name to make sure these were treated as separate events.

Since the Buildkite events are identical, we used the SQS timestamp in the event to make them unique based on when the event was received. It’s working very well so far.
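
For illustration, the rendered job name might be templated roughly like this (the variable names are assumptions; the important part is the received-timestamp suffix that keeps re-fired events distinct):

metadata:
  name: "{{ pipeline_slug }}-{{ job_id }}-{{ sqs_timestamp }}"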

Next Steps

Sadly, Kubernetes is not yet aware of all of our resources. We do not use Docker-in-Docker; instead we rely on the host’s Docker daemon, shared between all agents on the same node. When our steps rely on plugins like docker or docker-compose, those steps get scheduled in containers that the Kubernetes scheduler is not aware of. As a workaround, we have the agent jobs request the amount of CPU and memory that the Docker container(s) will need. The containers have affinity with the host since they share its Docker daemon, and we can set limits on the Docker containers so they stay within the bounds of what the scheduler has allocated.
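
For example, a step that uses the docker-compose plugin might over-request on the agent side so the scheduler accounts for the containers the plugin will launch; the plugin version and service name below are assumptions:

steps:
  - label: ":docker: Run Integration Tests"
    agents:
      cpu: "4"        # sized for the agent plus the containers the plugin starts
      memory: "8G"
    plugins:
      - docker-compose#v3.9.0:           # assumed plugin version
          run: app                       # assumed service name in docker-compose.yml
    command: .buildkite/steps/run_integration_tests.sh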

In the future, it would be nice to convert these docker and docker-compose plugins to a single Kubernetes plugin that schedules these tasks as Kubernetes jobs the scheduler is aware of. I’ve got a working POC to replace docker-compose that is based on the EmbarkStudios/k8s-buildkite-plugin. The POC schedules all the containers in a single pod and injects a wait script from the plugin so the main application container waits for all the other containers to be ready before running the BUILDKITE_COMMAND. The same solution should work for the docker plugin as well, I imagine. I would love to spend some more time on this, but it is a lower priority for now.

We are also considering what our backup agent scheduling strategy might be if one of the new services required to schedule agents has an outage. We now rely on Buildkite delivering events, AWS EventBridge, SQS, Lambda, S3, and DynamoDB. If any one of those services is down, our agents will not get scheduled. In our old architecture, we only relied on our EKS cluster’s availability and the agents’ ability to connect to the Buildkite API. How much engineering goes into an alternative strategy will likely be motivated by how painful our first outage is. Is it enough to just halt CI until the outage is resolved, or do we need a “break glass in case of emergency” mechanism? I’m sure we’ll find out soon enough.
