Scaling Continuous Integration with Azure DevOps

Flavien Berwick

For the past few months, I have been in charge of developing the Cloud and SRE practices for my client. These practices are guided by DevSecOps methodologies, automation in particular, and Continuous Integration is one of their concrete implementations.

Concrete scales - Photo by David Bartus on Pexels

My client uses Azure as its main Cloud provider and Azure DevOps (ADO) as its software forge. Some developers were already working with Microsoft-hosted agents, but cost was becoming a problem as our CI usage increased.

Microsoft-hosted agents are the equivalent of GitLab or GitHub runners: programs that run CI jobs on a machine.

We also wanted CI jobs to be able to reach endpoints hosted behind our AD-authenticated VPN, which requires adding numerous complex steps to our pipelines when using a Microsoft-hosted agent.

In large organizations, you also get only one chance a year to contract the right resources for the year to come. The number of contracted Microsoft-hosted agents is part of a fixed annual contract, which was a second, less technical, problem preventing us from scaling with this solution.

Meanwhile, we could pay on-demand, all year long, for any other resource in Azure. Now that we had a dedicated team able to maintain organization-wide services, I decided we would go self-hosted.

Choosing the right solution

There are plenty of solutions to install Azure DevOps agents.

Single VM

  • Pros: easy maintainability.
  • Cons: can run only one job at a time. Multiplying VMs results in maintenance difficulties.

Docker image agents (on a VM)

  • Pros: fair maintainability, can run a fixed number of multiple agents per VM, semi-standardized administration.
  • Cons: internal teams don’t know Docker.

Kubernetes

  • Pros: scales quickly depending on usage, standardized administration.
  • Cons: initial setup time (cluster), dependency on a new technology (internal teams don’t know Kubernetes), node auto-scaling has to be configured separately.

VMSS (Virtual Machine Scale Sets)

  • Pros: scales depending on usage, can scale to 0 when unused (e.g., at night), internal teams know how to manage VMs.
  • Cons: initial setup time.

When I say “internal teams know how to manage VMs”, it means both our SysOps and SecOps personnel know how to manage and secure them. It also means they are able to intervene in case of a problem.

Cost reduction strategy and pool classes

To improve software maintainability and raise the security level of our services, I have implemented multiple controls through CI pipelines. Each commit now triggers a dozen parallel CI jobs that need to run fast to avoid impacting developer velocity.
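
As a purely illustrative sketch (the job names and commands are made up, not our actual pipeline), this kind of fan-out is simply a set of independent jobs declared side by side, which Azure Pipelines runs in parallel as long as agents are available:

```yaml
trigger:
  - main

# Each job runs in parallel on its own agent.
jobs:
  - job: lint
    pool:
      vmImage: ubuntu-latest  # Microsoft-hosted agent, our starting point
    steps:
      - script: make lint

  - job: unit_tests
    pool:
      vmImage: ubuntu-latest
    steps:
      - script: make test

  - job: sast
    pool:
      vmImage: ubuntu-latest
    steps:
      - script: make sast
```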

Having chosen VMSS, here is a price comparison for our use case:

  • 100x Microsoft-hosted agents: fixed 5382.2$/month (current expense)
  • VMSS 100x D2ads v5 running 10h/day on-demand: up to 4641.0$/month
  • VMSS 100x D2ads v5 with annual saving plan: fixed 7668.0$/month
  • VMSS 100x D2ads v5 running 10h/day spot instance: up to 651.0$/month

D2ads v5 machines (2 vCPUs, 8 GB RAM, 75 GB temporary storage) were chosen to match the capabilities of Microsoft-hosted agents (2 vCPUs, 7 GB RAM, 10 GB temporary storage, Dv2-series family).

It would be perfect if all our CI jobs could tolerate failure and retry, so we could use spot instances all the time. But some jobs require stability, for instance when deploying IaC to production.

So we defined two pool classes:

  • pool-linux-spot: for jobs that can tolerate failure;
  • pool-linux-stable: for jobs requiring stable compute and temporary storage.
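
In a pipeline, targeting one class or the other is then just a matter of naming the right pool at the job level. Here is a minimal sketch assuming the two pool names above (the jobs and commands are illustrative):

```yaml
jobs:
  # Stateless, retry-friendly work goes to the spot pool.
  - job: tests
    pool:
      name: pool-linux-spot
    steps:
      - script: make test

  # Production IaC deployment needs stable compute, so it targets the on-demand pool.
  - job: deploy_production
    dependsOn: tests
    pool:
      name: pool-linux-stable
    steps:
      - script: terraform apply -auto-approve
```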

Around 95% of our pipelines are stateless and can tolerate failure (linters, tests, scanners, SAST tools, etc.), so we decided to set 9 out of 10 VMs as spot instances. Given the low spot prices, and to make pipelines faster, we chose F4s v2 instances for the spot pool.

  • 1x VMSS 90x F4s v2 running 10h/day spot: up to 939.6$/month
  • 1x VMSS 10x D2ads v5 running 10h/day on-demand: up to 464.1$/month

Considering that VMSS can scale to 0 and that production deployments happen at most a few times a day, we are talking about a maximum of 1403.7$/month (939.6$ + 464.1$) instead of 5382.2$, which amounts to a 383% cost optimization.

A 383% cost optimization?! Yes, but… I didn’t include ADO self-hosted machine licenses, because we already own multiple Visual Studio Enterprise licenses, which include them. If you don’t have them, that would translate to an additional 1816.5$/month, lowering the optimization ratio to 167%.

How to do it, and with which parameters

In Azure, create a VMSS with the desired machine characteristics. Use uniform orchestration and set scaling to manual with an instance count of 0: the Azure DevOps agent pool extension will manage it for us.
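
With the Azure CLI, that creation step might look roughly like this (resource group, names and sizes are placeholders; adjust to your own sizing):

```bash
# Sketch only: a VMSS for the stable pool, uniform orchestration, manual scaling,
# no load balancer, starting at 0 instances (the agent pool extension scales it).
az vmss create \
  --resource-group rg-ci-agents \
  --name vmss-ado-agents-stable \
  --image Ubuntu2204 \
  --vm-sku Standard_D2ads_v5 \
  --instance-count 0 \
  --orchestration-mode Uniform \
  --upgrade-policy-mode manual \
  --disable-overprovision \
  --single-placement-group false \
  --load-balancer "" \
  --authentication-type SSH \
  --generate-ssh-keys

# For the spot pool, the same command takes spot-specific options, e.g.:
#   --vm-sku Standard_F4s_v2 --priority Spot --eviction-policy Delete
```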

At the time of writing, Ubuntu 24.04 is unsupported because the agent extension relies on the Python imp module, which was removed in Python 3.12. Use Ubuntu 22.04 at most.

Go to Azure DevOps > Settings > Agent pools and create a new agent pool backed by your scale set. This requires a service connection (with a service principal) bound to the subscription on which your VMSS is running. Azure DevOps will then install the agent on your VMSS instances.

  • For spot agents: set the minimum number of agents to 1 (it’s cheap and important for a quick dev workflow). Do not tear down virtual machines after every use, to avoid VM spin-up time. Enable the daily agent maintenance job.
  • For stable agents: set the minimum number of agents to 0 and enable automatic tear down.
  • For all agents: set a delay of 15 minutes before deleting excess idle agents.

Our specific use case involves connecting our VMs to our VPN. For both easy reproducibility (maintenance) and reliability (fail-safe), we had to design a custom boot script that configures new machines to connect to our VPN and installs the appropriate CI dependencies, using the VMSS Custom Script Extension for Linux.

Our infrastructure includes a VPN gateway that routes the VMSS network to our internal network, so this part was not complex.

To host this bash script, we need to set up a Storage Account and upload it there. Install the “Custom Script for Linux” extension and point it to your bootstrapping script. The script may install PowerShell for Linux and the Azure CLI, for instance.
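
As a rough idea, a bootstrapping script for Ubuntu 22.04 could look like the following (our real script also handles the VPN enrollment, which is specific to our environment and omitted here):

```bash
#!/usr/bin/env bash
# Illustrative bootstrap for Ubuntu 22.04 agents; adapt the dependencies to your needs.
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

apt-get update
apt-get install -y git curl unzip jq   # common build tooling

# Azure CLI (official install script)
curl -sL https://aka.ms/InstallAzureCLIDeb | bash

# PowerShell for Linux, from the Microsoft package repository
curl -sSLO https://packages.microsoft.com/config/ubuntu/22.04/packages-microsoft-prod.deb
dpkg -i packages-microsoft-prod.deb
apt-get update
apt-get install -y powershell
rm -f packages-microsoft-prod.deb
```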

Agents/VMs are automatically added to our pool
Auto scaling in action, increasing or decreasing the number of VMs in the pool
Auto scaling in action in Azure, increasing or decreasing the number of VMs in the VMSS

Is it worth it?

Yes, if you have competent people in your organization. And luckily, VMSS are pretty stable.

However, consider that some adjustments might be necessary in your developers’ pipelines, such as changing the Python version used, because only one is available on the self-hosted machine (although you could avoid that with container-based jobs). You will also need to update your Linux bootstrapping script to install the desired dependencies (e.g., PowerShell for Linux, the Azure CLI, etc.).

Configuring job cleanup is also important, as an agent does not isolate job resources across runs. Removing the VM at the end of each run is an option, but it is slow to restart. Another option is to set the workspace.clean instruction to “all”, but developers will need to update their pipelines’ code. Finally, always enable the daily agent maintenance job.
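
For reference, here is what that cleanup looks like at the job level in a pipeline (a minimal sketch with a placeholder step):

```yaml
jobs:
  - job: build
    pool:
      name: pool-linux-spot
    workspace:
      clean: all  # wipe the agent's work directory before this job runs
    steps:
      - script: make build
```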

I would recommend monitoring pipeline usage over time and adjusting the maximum number of agents to spin up accordingly.

Anyway, dividing your bill by nearly 4 while barely impacting developer velocity is always a good thing!

