Easy Terraforming of an Azure Batch Service with an Auto-Scaling Pool

Itay Podhajcer
Published in Microsoft Azure
3 min read · Jan 13, 2021

Azure Batch service, a managed job scheduling and compute allocation solution, lets us run jobs on pools of virtual machines of our choice. This means that if certain jobs require high memory or GPU processing, for example, the instance type used by the pool that runs that type of job can be adjusted to fit our needs.

The available instance types are the same ones offered for virtual machines on Azure, divided into series, each with resources and underlying hardware matched to specific task types (such as machine learning, for example).

To optimize consumption of cloud resources, it would be ideal to have virtual machines allocated to a pool only while there are jobs to process, and to deallocate those resources when there are none, so there is no charge for resources sitting idle.

In this article, we will use Terraform to deploy an Azure Batch service with an auto-scaling pool. Such a pool has the automatic allocation and deallocation of virtual machines built in, driven by a formula we define in code and the pool evaluates, which determines when nodes are allocated and deallocated.

The Example

A complete example of a Terraform script that deploys a Batch service, the storage account it requires, and an auto-scaling pool can be found in this GitHub repository:

The Script

For this article, we will be deploying the following resources:

  • A resource group which will include all the resources.
  • A storage account which is required by the batch service.
  • An Azure Batch service.
  • An auto-scaling pool for the batch service.

We will start by defining local variables for the deployment name and the region we will be deploying to:
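A minimal sketch of such locals (the deployment name and region here are placeholders, not the repository's actual values):

```hcl
locals {
  name     = "batchexample" # hypothetical deployment name
  location = "eastus"       # region to deploy to
}
```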

Next, we will create a resource group to hold all the resources we will be creating:
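A sketch of the resource group, assuming locals named `name` and `location` were defined earlier (the naming convention is illustrative):

```hcl
resource "azurerm_resource_group" "this" {
  name     = "rg-${local.name}"
  location = local.location
}
```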

Then the storage account that will be associated with the batch service:

Note that we are using Terraform’s random string provider to create a name for the storage account.
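A sketch of both pieces, with the random suffix keeping the storage account name globally unique (the exact lengths and prefixes are assumptions):

```hcl
# Random suffix for the globally unique storage account name
resource "random_string" "storage" {
  length  = 8
  lower   = true
  upper   = false # storage account names must be lowercase
  special = false
}

resource "azurerm_storage_account" "this" {
  name                     = "st${random_string.storage.result}"
  resource_group_name      = azurerm_resource_group.this.name
  location                 = azurerm_resource_group.this.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```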

Once we have the storage account, we can define the batch service as follows:
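A sketch of the Batch account linked to the storage account above (the name is a placeholder; Batch account names must be lowercase letters and digits):

```hcl
resource "azurerm_batch_account" "this" {
  name                 = "ba${local.name}"
  resource_group_name  = azurerm_resource_group.this.name
  location             = azurerm_resource_group.this.location
  pool_allocation_mode = "BatchService"
  storage_account_id   = azurerm_storage_account.this.id

  # Note: recent azurerm provider versions also require
  # storage_account_authentication_mode = "StorageKeys"
  # when storage_account_id is set.
}
```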

And then, the last piece: a Windows Server 2019 based auto-scaling pool with a formula that allows a node count between zero and a maximum of four nodes, evaluated once every five minutes:
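A sketch of what such a pool might look like. The VM size, image SKU, and scaling formula are assumptions based on the description (the formula is adapted from Azure's documented pending-tasks pattern, scaling between zero and four dedicated nodes):

```hcl
resource "azurerm_batch_pool" "this" {
  name                = "pool-${local.name}"
  resource_group_name = azurerm_resource_group.this.name
  account_name        = azurerm_batch_account.this.name
  vm_size             = "Standard_A1_v2"
  node_agent_sku_id   = "batch.node.windows amd64"

  storage_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2019-datacenter-core-smalldisk"
    version   = "latest"
  }

  # Evaluated every five minutes (the minimum allowed interval)
  auto_scale {
    evaluation_interval = "PT5M"
    formula             = <<-EOF
      startingNumberOfVMs = 0;
      maxNumberofVMs = 4;
      pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
      pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
      $TargetDedicatedNodes = min(maxNumberofVMs, pendingTaskSamples);
    EOF
  }

  # Runs on every newly allocated node as the auto-generated,
  # non-admin, task-scoped user
  start_task {
    command_line     = "cmd /c echo hello"
    wait_for_success = true

    user_identity {
      auto_user {
        elevation_level = "NonAdmin"
        scope           = "Task"
      }
    }
  }
}
```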

Note that we also defined a start task with a simple echo command, which runs each time a node is allocated, using the auto-generated non-admin, task-scoped user.

Run a Job

Once our deployment script has been executed using terraform apply, we can test the batch service by creating a new job in the portal that uses the pool created by the script, and then adding a simple task with an echo command. This lets us watch the automatic scaling add a node, execute the task, and remove the node once the task is complete.
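If you prefer the CLI over the portal, the same test can be sketched with the Azure CLI (the account, resource group, and pool names below are placeholders matching the example naming above):

```shell
# Log in to the Batch account
az batch account login --name babatchexample --resource-group rg-batchexample

# Create a job targeting the auto-scaling pool
az batch job create --id test-job --pool-id pool-batchexample

# Add a simple echo task; a node should be allocated on the next
# auto-scale evaluation, run the task, then be deallocated
az batch task create --job-id test-job --task-id test-task \
  --command-line "cmd /c echo hello"
```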

Note that it might take a few minutes for a node to be created, due to the evaluation interval we defined (five minutes, the minimum allowed).

Conclusion

The deployment discussed in this article is simple, but for jobs that run plain commands, possibly with associated resource files, it should prove more than enough. More elaborate deployments might include application packages, multi-step batches, and code-based job creation (with .NET, for example).
