Custom Azure VM Scale Sets with Terraform and Cloud Init

Heyko Oelrichs
Published in Microsoft Azure · May 7, 2021 · 10 min read

I recently had to deploy a Virtual Machine Scale Set (VMSS) in Azure to serve as a self-hosted agent pool for Azure DevOps. I decided to use Terraform to implement that, and even though it should not be a big deal at all, I came across a couple of challenges I would like to share with you in this blog post.

Before we dig deeper into the details, let us start with a short overview of the involved components:

  • Virtual Machine Scale Sets (VMSS) are a powerful service in Microsoft Azure that lets “[..] you create and manage a group of load balanced VMs. The number of VM instances can automatically increase or decrease in response to demand or a defined schedule. [..]”(see here).
  • Terraform is “[..] an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. [..]”, created by HashiCorp.
  • Azure DevOps (ADO) is a suite of tools and services, such as Repos, Pipelines, Boards and others, that help teams collaborate, develop, build and test.
  • Azure Pipeline Agents (or build agents) are used to build your code, execute scripts or deploy your software. An agent is computing infrastructure running Windows, macOS or Linux with the agent software installed, and it runs one job at a time.

Last but not least, the self-hosted agent pool itself: why do I need one?

  • By default, Azure Pipelines uses Microsoft-hosted build agents. These build agents are hosted and maintained by Microsoft and can be used for a wide variety of scenarios. But some scenarios require specific configurations that can only be achieved with self-hosted agents: for example, agents hosted on-premises (the agent then needs only outbound Internet access), or locked-down, private deployments in virtual networks where certain services are not accessible via the public internet. Other scenarios require specific capabilities for the build process, such as VM SKUs that provide GPU capabilities.

These are typical scenarios where self-hosted build agents can help. In the next chapter we will take a deeper look into how to deploy and configure VMSS in Azure using Terraform.

Deploy VM Scale Sets using Terraform

The first challenge I came across when trying to deploy a VMSS via Terraform was that the Terraform resource azurerm_virtual_machine_scale_set, which was used to deploy VMSS in Azure in the past, has been superseded by the azurerm_linux_virtual_machine_scale_set resource for Linux-based VMSS and the azurerm_windows_virtual_machine_scale_set resource for Windows-based VMSS. The original resource still exists and will continue to be available throughout the 2.x releases of the AzureRM Terraform provider; however, the syntax is slightly different and new features will no longer be added to it. I would therefore recommend using the new, OS-specific Terraform resources instead.

The further deployment of our Linux-based VMSS via Terraform is pretty straightforward — here is my Terraform definition:
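A minimal sketch of what that definition can look like; the resource names, the subnet reference, the password source and the cloud-init file path are placeholders for illustration:

```hcl
# Sketch of a Linux VMSS for self-hosted agents. References like
# azurerm_resource_group.rg, azurerm_subnet.agents and
# random_password.admin are placeholders.
resource "azurerm_linux_virtual_machine_scale_set" "agents" {
  name                = "vmss-ado-agents"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  sku                 = "Standard_F8s_v2"
  instances           = 2

  # Align with what Azure DevOps expects when managing the VMSS.
  overprovision          = false
  single_placement_group = false

  # Break-glass credentials; the password is generated and kept in Key Vault.
  admin_username                  = "adminuser"
  admin_password                  = random_password.admin.result
  disable_password_authentication = false

  # Inject the Cloud Init configuration (must be base64-encoded).
  custom_data = base64encode(data.local_file.cloud_init.content)

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadOnly" # required for ephemeral OS disks
    storage_account_type = "Standard_LRS"
    diff_disk_settings {
      option = "Local" # ephemeral OS disk on local VM storage
    }
  }

  network_interface {
    name    = "nic"
    primary = true
    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.agents.id
    }
  }

  # Managed boot diagnostics (no custom storage account).
  boot_diagnostics {
    storage_account_uri = null
  }
}
```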

This will result in a Linux-based VMSS using Standard_F8s_v2 VMs with Ubuntu Server 20.04 LTS. I would like to highlight a couple of things configured above:

  • overprovision and single_placement_group are set to false to align with what Azure DevOps sets when managing the VMSS.
  • admin_username and admin_password are used in addition to SSH keys to have a break-glass option for troubleshooting. The password itself is auto-generated and stored in Azure Key Vault; it is not supposed to be used at all. Configuring username and password required me to set disable_password_authentication to false.
  • custom_data in combination with Terraform’s local_file data source helps us inject a Cloud Init configuration file into our VMSS instances. We will go through the details of Cloud Init in the next chapter of this blog post.
  • option in diff_disk_settings in the os_disk block is set to Local to use ephemeral OS disks. Ephemeral OS disks are created on the local virtual machine storage and not saved to remote Azure Storage, which gives us lower latency and fast reset and reimage.
  • boot_diagnostics with storage_account_uri set to null enables Azure’s “Boot diagnostics” feature using a managed storage account. The serial console is not supported in this configuration, but it provides us with a serial log of our VMSS instances. We will see the value of this further down in this blog post.
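The supporting pieces referenced in the list above can be sketched like this; all names and the file path are placeholders:

```hcl
# Auto-generated break-glass password, stored in Key Vault.
resource "random_password" "admin" {
  length  = 32
  special = true
}

resource "azurerm_key_vault_secret" "admin_password" {
  name         = "vmss-admin-password"
  value        = random_password.admin.result
  key_vault_id = azurerm_key_vault.kv.id # placeholder reference
}

# Read the Cloud Init configuration from disk so it can be
# passed to the VMSS via custom_data.
data "local_file" "cloud_init" {
  filename = "${path.module}/cloud-init.conf"
}
```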

And that is already pretty much it. I have now set up a new Azure Resource Group containing an Azure Key Vault, a Virtual Machine Scale Set and a Virtual Network.

Resources in Azure

Managing a VMSS via Azure DevOps

In my case, my newly deployed VMSS is supposed to be used as a self-hosted agent pool for Azure Pipelines. This was a tricky beast to set up and configure in the past, but the Azure DevOps team did an amazing job making this a much smoother experience. To use my VMSS as an agent pool managed by Azure DevOps, I just have to add it via the Azure DevOps portal.

  • Go to Project Settings -> Pipelines -> Agent pools
  • Click on Add pool and select New
  • Select Azure virtual machine scale set as the Pool type

This will bring up a dialog where we can easily select our VMSS:

Use an existing VMSS as agent pool for Azure DevOps

What you can see in the screenshot above is how easy it is to use an existing VMSS in Azure as a self-hosted agent pool for Azure DevOps (ADO). ADO lets you configure a couple of things, like the maximum number of nodes, how many nodes should continuously run and wait for new jobs, and the time before ADO scales the agent pool down.

I usually do not set “Grant access permission to all pipelines” under “Pipeline Permissions” for any of my configurations in Azure DevOps, to have more granular control. This requires you to grant access to the agent pool when a pipeline tries to use it for the first time. Here’s an example of one of my pipelines using the new agent pool for the first time:
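As a sketch, a pipeline targeting the new pool can look like this; the pool name "vmss-agent-pool" is a placeholder for whatever name you chose in the dialog above:

```yaml
# azure-pipelines.yml (sketch)
pool: vmss-agent-pool # name of the VMSS-backed agent pool

steps:
  - script: |
      terraform version
      az version
    displayName: 'Verify preinstalled tooling'
```

The first run of such a pipeline triggers the permission prompt described above, since the pool was not granted to all pipelines.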

That was the easy part; ADO can now use this VMSS as an agent pool to run build jobs. But what happens when you need specific tools installed or customizations made to the image used (in our case Ubuntu 20.04 LTS)? This is where Cloud Init comes into play.

Customize Linux using Cloud Init

There are several ways to configure and customize VMs in Azure. The most obvious and well-known ones are using a CustomScriptExtension to run scripts at startup time, or using a custom image, for example built with the Azure Image Builder (which is built on HashiCorp Packer).

While the CustomScriptExtension is a viable option, in my experience it is an error-prone way of doing it. And the downside of a custom image is the custom image itself; I prefer building on top of existing images. This is what Cloud Init can help us with.

“Cloud-init is the industry standard multi-distribution method for cross-platform cloud instance initialization. It is supported across all major public cloud providers, provisioning systems for private cloud infrastructure, and bare-metal installations.” (cloud-init documentation)

For my setup the requirements are pretty clear. I need:

  • a customizable set of installed packages (like azure-cli, unzip, terraform, powershell and others)
  • packages from various sources including 3rd party repositories like Microsoft, HashiCorp and others
  • build on-demand based on the latest Ubuntu 18.04 image (or newer) available in the Azure Marketplace

With these requirements in mind, I’ve put together the following cloud-init.conf file:
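A shortened sketch of that file, reconstructed from the log output further down; the GPG signing keys are elided and the repository definitions are illustrative:

```yaml
#cloud-config
apt:
  sources:
    hashicorp:
      source: "deb [arch=amd64] https://apt.releases.hashicorp.com bionic main"
      # key: the repository's GPG public key goes here (elided)
    azure-cli:
      source: "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli $RELEASE main"
      # key: elided
    microsoft-prod:
      source: "deb [arch=amd64] https://packages.microsoft.com/ubuntu/18.04/prod $RELEASE main"
      # key: elided

packages:
  - npm
  - unzip
  - terraform
  - azure-cli
  - powershell
```

cloud-init substitutes $RELEASE with the codename of the running Ubuntu release (e.g. focal) when it registers the apt sources.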

What you can see in my cloud-init.conf are two main sections. The first section, apt, contains custom configuration settings for apt: in my case, three additional repositories including their public keys. These third-party repositories are added as sources to my VM’s apt configuration and will from now on be used when installing packages via apt:

[30.763488] cloud-init[1773]: Get:32 https://apt.releases.hashicorp.com bionic/main amd64 Packages [22.8 kB]
[30.836116] cloud-init[1773]: Get:33 https://packages.microsoft.com/repos/azure-cli bionic/main amd64 Packages [13.4 kB]
[30.908373] cloud-init[1773]: Get:34 https://packages.microsoft.com/ubuntu/18.04/prod bionic/main amd64 Packages [179 kB]

The second section is called packages. This is basically a list of additional packages I need installed in my VMSS instances, in my example npm, terraform, azure-cli and powershell. These are tools that our build agents need to do their work. Cloud Init takes care of installing them using the additional sources provided above:

[51.829131] cloud-init[3200]: Preparing to unpack .../074-azure-cli_2.22.1-1~bionic_all.deb ...
[51.831957] cloud-init[3200]: Unpacking azure-cli (2.22.1-1~bionic) ...

We can easily extend the list of packages in our cloud-init.conf file later, apply the configuration again via Terraform, and the new packages and tools will be available in our build agents.

Troubleshooting

The setup we have so far is pretty much intended to be treated as a “black box” that should not require too much maintenance. We’re using an up-to-date marketplace image, we’re installing packages at boot time of each individual instance, and our VMSS instances do not have a public IP address that can be accessed or attacked. On top of that, our VMSS is managed by Azure DevOps. Nevertheless, there might be situations where things go wrong and you need to learn more about what’s happening inside your VMSS instances.

Let’s assume that our build process complains that a package isn’t available even though it should be installed:

##[error]Azure CLI 2.x is not installed on this machine.
##[error]Script failed with error: Error: Unable to locate executable file: 'pwsh'. Please verify either the file path exists or the file can be found within a directory specified by the PATH environment variable. Also check the file mode to verify the file is executable.

When you go back up to our cloud-init.conf file, you’ll see that azure-cli as well as pwsh (aka powershell) should be installed:
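The relevant excerpt of the packages list from the cloud-init.conf sketched earlier:

```yaml
packages:
  - azure-cli
  - powershell
```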

To check if that has really happened to our build agent, we can check the serial log of our VMSS instance. The “Initialize job” task in our pipeline tells us which agent was used:

The agent (aka VMSS instance) should still exist in our VMSS in Azure:

Select the VM, go to Support + troubleshooting and click on Boot diagnostics. Within Boot diagnostics you’ll see Screenshot (which shows the current state of the console) and Serial log. Serial log is the more interesting one for us for now. Within the Serial log you’ll find all details about the cloud init process and which packages were installed:

cloud-init[1566]: Get:39 https://packages.microsoft.com/repos/azure-cli focal InRelease [10.4 kB]
cloud-init[1566]: Ign:40 https://packages.microsoft.com/ubuntu/18.04/prod focal InRelease
cloud-init[1566]: Err:41 https://packages.microsoft.com/ubuntu/18.04/prod focal Release
cloud-init[1566]: 404 Not Found [IP: 13.81.215.193 443]
cloud-init[1566]: Get:42 https://packages.microsoft.com/repos/azure-cli focal/main amd64 Packages [5339 B]

We can see that I did something wrong: focal (which translates to 20.04) is not available in the 18.04 repository. The installation of powershell therefore failed:

cloud-init[2789]: No apt package "powershell", but there is a snap with that name.
cloud-init[2789]: Try "snap install powershell"
cloud-init[2789]: E: Unable to locate package powershell

Another common issue I ran into is that the /var/lib/apt/lists/lock file is already in use:

cloud-init[1597]: Reading package lists...
cloud-init[1597]: E: Could not get lock /var/lib/apt/lists/lock. It is held by process 1679 (apt)
cloud-init[1597]: E: Unable to lock directory /var/lib/apt/lists/
cloud-init[1597]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu1~20.04.2 running

This results in required packages not being installed.

Challenges

Even though the solution I have outlined here is viable and hopefully useful for several scenarios, there are a couple of challenges I would like to call out.

Terraform reverts changes ADO makes to the managed VMSS

While setting this up, I noticed that Terraform reverts a couple of changes made by the Azure DevOps agent each time it runs. That means that whenever I change anything in my infrastructure, for example the Cloud Init file to install new packages, Terraform reverts changes that were made in the meantime by ADO. This includes, for example, the VMSS extension used to install the ADO agent.

Terraform plan for a VMSS managed by Azure DevOps

To work around that, I’ve added a lifecycle and ignore_changes section to my definition in Terraform:
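A sketch of that addition; the exact list of attributes to ignore is an assumption — inspect the terraform plan output to see which attributes ADO actually modifies in your setup:

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "agents" {
  # ... configuration as before ...

  # Ignore attributes that Azure DevOps changes when managing the
  # scale set: the instance count, the agent extension it installs,
  # and the tags it sets. Adjust this list to match your plan output.
  lifecycle {
    ignore_changes = [
      instances,
      extension,
      tags,
    ]
  }
}
```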

lifecycle { ignore_changes = [] } in Terraform

This takes care of ignoring changes made by ADO, so that Terraform plan/apply does not override them and break the relationship between ADO and our VMSS. If you would like to learn more about the lifecycle meta-argument, visit the Terraform documentation.

ADO agent becomes available before Cloud Init is done

The second challenge was the customization of my build agents via Cloud Init itself. As you have seen above, Cloud Init is used to specialize my VMSS instances by installing additional packages like Terraform, PowerShell, Azure CLI and others. Given that the ADO agent is installed as a VMSS extension, it happened to me that Cloud Init was not done yet when the agent became available for Azure Pipelines and started to run jobs. This led to situations where my agents did not have the required packages installed:

Build Agent without Azure CLI and PowerShell installed (yet)

The best solution I’ve found for this was posted in September 2020 here:

bootcmd:
  - mkdir -p /etc/systemd/system/walinuxagent.service.d
  - echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/walinuxagent.service.d/override.conf
  - sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
  - systemctl daemon-reload

This delays the start of the walinuxagent service, and with it the ADO agent extension, until the Cloud Init run has completed.
