Azure Batch: Auto-Deploying Batch Applications

A no-nonsense walkthrough on deploying applications to Azure Batch Pools

Dhruv B
7 min read · Aug 13, 2019

One of my current projects involves Azure Batch, which we use to run an ML application. When I joined this team, deploying the application was a manual process: we had to run a script, manually set the version, and repeat this every time we needed a push to our dev or prod environments. One of my tasks, among others, was to automate this process. I've mostly gotten it done, but not without some work. There isn't much Python-centric documentation available, so perhaps this will help others on their journey.

This documentation follows the Python API.

My goal was to automate deployment, and to do this, we need to look at Batch clients. Microsoft splits the API structure into two different clients:

  • Management: Manage Batch accounts, storage, and account keys.
  • Service: Manage compute nodes, pools, tasks, and jobs.

Requirements

We need a couple of Azure Resources to get this to work.

  1. Azure Batch Account
  2. Azure Storage Account

Here I'm assuming you have both already; if not, Microsoft's language-agnostic documentation walks you through creating them.

From these resources (plus Azure Active Directory), we need the following pieces of information:

  • Tenant ID
  • Subscription ID
  • Batch Resource ID
  • Batch Management URL: https://management.core.windows.net/
  • Batch Resource URL: https://batch.core.windows.net/
  • Authority URL: https://login.microsoftonline.com/
  • Batch Account URL
  • Batch Pool Name
  • Batch Application Name
  • Resource Group Name
  • Active Directory Secret: To authenticate with AD. This is just a secret created in a key vault.

Flow

Here's how the basic flow looks: three simple steps, which map to the three implementation sections below (creating the clients, preparing the binaries, and deploying them).

Implementation

Medium's code formatting isn't the best, so I've included the code as copy-able snippets at the end of the article to maintain readability.

I have a config.json file set up that maps the required variables above, which I then read into an object. This lets me easily access the variables without cluttering the code. It looks something like this:

An example of the config file I use for easy access
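
The key names below are illustrative, and any consistent naming works as long as it covers the values listed above. The Active Directory secret itself lives in the key vault, as noted above, rather than in this file.

{
    "tenant_id": "<TenantID>",
    "subscription_id": "<SubscriptionID>",
    "batch_resource_id": "<BatchResourceID>",
    "batch_management_url": "https://management.core.windows.net/",
    "batch_resource_url": "https://batch.core.windows.net/",
    "authority_url": "https://login.microsoftonline.com/",
    "batch_account_url": "https://<BatchAccountName>.<Region>.batch.azure.com",
    "batch_pool_name": "<BatchPoolName>",
    "batch_application_name": "<BatchApplicationName>",
    "resource_group_name": "<ResourceGroupName>"
}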

Creating the clients

Before the clients are instantiated, we need to create tokens for the management and service clients. These require an authentication context object. There are other ways of acquiring tokens (manually, for example), but to automate the process, we'll need the acquire_token_with_client_credentials() method.

Create the authentication context, and acquire the two tokens for the service and management clients.
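
A minimal sketch of this step, using the adal library. CLIENT_ID here is the Azure AD application (client) ID of the service principal, and the other uppercase names are values read from the config above; all of the variable names are illustrative.

from adal import AuthenticationContext

# Authority URL plus tenant ID, e.g. https://login.microsoftonline.com/<TenantID>
auth_context = AuthenticationContext(AUTHORITY_URL + TENANT_ID)

# Token for the Batch service (https://batch.core.windows.net/)
service_token = auth_context.acquire_token_with_client_credentials(
    BATCH_RESOURCE_URL, CLIENT_ID, AD_SECRET)

# Token for Azure management (https://management.core.windows.net/)
mgmt_token = auth_context.acquire_token_with_client_credentials(
    BATCH_MANAGEMENT_URL, CLIENT_ID, AD_SECRET)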

Now we can create the batch clients.
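
A sketch of the client creation, continuing from the tokens above. The AADTokenCredentials wrapper comes from msrestazure; again, the variable names are my own.

from azure.batch import BatchServiceClient
from azure.mgmt.batch import BatchManagementClient
from msrestazure.azure_active_directory import AADTokenCredentials

# Wrap the raw ADAL tokens in credential objects the SDK clients accept
service_creds = AADTokenCredentials(service_token, CLIENT_ID)
mgmt_creds = AADTokenCredentials(mgmt_token, CLIENT_ID)

# The management client takes the subscription ID...
mgmt_client = BatchManagementClient(mgmt_creds, SUBSCRIPTION_ID)

# ...while the service client takes the Batch account URL
# (newer azure-batch releases use batch_url= instead of base_url=)
service_client = BatchServiceClient(service_creds, base_url=BATCH_ACCOUNT_URL)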

Note that the different clients need different parameters.

Preparing the Binaries

To prepare the application, all we need to do is zip it up. There are plenty of resources on zipping files in Python, so I'm not covering that here; we have a custom function that zips the files in the format our compute node expects.

Keep in mind how your application is started once it's on the compute node. You'll have to structure the zip file so that your app can be called properly.
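
Our zipping function is tied to our app's layout, but a generic sketch (zip_app is a hypothetical helper, not part of any Azure SDK) could look like this:

import zipfile
from pathlib import Path

def zip_app(source_dir, zip_path):
    # Zip the application directory, keeping paths relative to its root
    # so the task command line can find the entry point.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in Path(source_dir).rglob('*'):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir))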

Deploying the Binaries

To deploy the binaries, we have a five-step process, all done within one method:

0. Figure out what version you want to push

If this is always set manually, you can skip this step. To help automate it, however, you will need to grab the currently set default version. I haven't found a way to get this specific value from the APIs, so I resorted to running natsort over the list of application versions that you can get from the API (see the _get_latest_app_version function in the code list at the end).

Natsort doesn't work for all version formats. I had trouble sorting between, for example, 1.1.0_test and 1.1.0.test; the latter is the more easily understood format, so keep this in mind and make sure all your application versions follow the same format. I wrote my own mini parsing function for the cases where natsort returns an unexpected or unparseable version.

1. Push the app package reference

We must let Azure Batch know that we're going to push some new binaries. To do this, we use the application_package.create() method on the management client. The response includes a storage_url, from which we can pull:

  • file_name: The name of the blob we should push
  • container: The blob container we need to push to
  • sas_token: A SAS token we can use to push to the container

Creating the application resource and getting the required storage details is covered by the 'Create the Application Resource' snippet in the code list.

2. Push the binaries

We can now use the details we got above to push the binaries where Azure Batch wants us to.

I’m using a storage account key here because I’m creating a BlockBlobService client, but you can also use the SAS token provided by Azure.

3. Activate the application

Provided you have a version number to deploy, you can now set the newly uploaded application package as the default version that the pool will use.

First, activate the application package (the 'Activate Application' snippet in the code list).

Then set it as the default version (the 'Set Application as Default' snippet).

allow_updates will allow you to overwrite the binaries for this application version.

4. Reboot the compute node

To get the latest binaries onto the nodes, we need to reboot the compute node(s). To find them, first call the compute_node.list() method on the Batch service client (the 'Get List of Nodes in the Pool' snippet in the code list).

Then, you will need to get each compute node's ID. We have only one node in our pool, so it's easy in our case; your code will differ if you have more than one node:

Get the node IDs for all the compute nodes in your pool
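
A minimal sketch of this, reusing the service client from earlier (variable names are mine):

# List the nodes in the pool and collect their IDs
nodes = service_client.compute_node.list(<BatchPoolName>)
node_ids = [node.id for node in nodes]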

Now you can reboot individual nodes with compute_node.reboot() (the 'Reboot Compute Node' snippet in the code list).

If you're worried about rebooting nodes while they're running a task or are otherwise unavailable, you can wait for them to become ready first:

_wait_for_pool() is a simple custom function that waits for the node to become 'idle'.
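
Ours is tied to our setup, but a sketch of that kind of wait might look like the following; the polling interval and timeout here are arbitrary choices.

import time
import azure.batch.models as batchmodels

def _wait_for_pool(batch_service_client, pool_id, timeout_minutes=15):
    # Poll the pool until every node reports the 'idle' state.
    deadline = time.time() + timeout_minutes * 60
    while time.time() < deadline:
        nodes = list(batch_service_client.compute_node.list(pool_id))
        if nodes and all(node.state == batchmodels.ComputeNodeState.idle
                         for node in nodes):
            return
        time.sleep(10)
    raise TimeoutError('Pool {} did not become idle in time'.format(pool_id))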

You can get the state of a compute node with this method:
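
For example, a sketch using the placeholders from the code list:

# Fetch a single node and inspect its current state ('idle', 'running', 'rebooting', ...)
node = service_client.compute_node.get(<BatchPoolName>, <ComputeNodeID>)
print(node.state)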

You can also resize the pool to include more nodes, and specify what the compute nodes should do with the tasks they're currently running (the 'Resize a Pool' snippet in the code list).

That's pretty much it. The pool reboot will take some time, so you might want to add some logic to wait for it to finish its start task.

As a bonus, you can tie this automation into your CI/CD platform and auto-deploy code during nightly or PR builds. For our nightly builds, we just deploy, test, and revert versions. For PRs, we deploy and keep the new state if the tests pass.

Full Copy-able Code List

Get the latest application version

from natsort import natsorted

def _get_latest_app_version(batch_client, app_id):
    """
    Given the name of the batch application, returns the latest version in use.
    Note: The assumption here is that the latest version is the one set as the default version. Natsort will sort
    the versions regardless of most formats, and return the ordered list.
    :param batch_client: the Batch service client
    :param app_id: the name of the application
    :return: the latest version present in the list of application packages for the application
    """
    # application.get() returns an ApplicationSummary, whose 'versions' attribute
    # lists every package version uploaded for this application
    app_summary = batch_client.application.get(app_id)
    logger.debug(natsorted(app_summary.versions))
    return natsorted(app_summary.versions)[-1]

Create the Application Resource

app_object = mgt_client.application_package.create(
    <ResourceGroupName>,
    <BatchAccountName>,
    <BatchApplicationName>,
    <VersionToDeploy>)

# storage_url has the form https://<account>.blob.core.windows.net/<container>/<blob>?<sas>
items = app_object.storage_url.split('/')
sas_token = items[-1].split('?')[1]
file_name = items[-1].split('?')[0]
container = items[3]

Upload Application Binaries

import azure.storage.blob as azureblob

blob_client = azureblob.BlockBlobService(
    <StorageAccountName>, <StorageAccountKey>)

# Use the data from the previous code block here.
blob_client.create_blob_from_path(
    <ContainerName>,
    <BlobName>,
    <PathToZipFile>)

Activate Application

mgt_client.application_package.activate(
    <ResourceGroupName>,
    <BatchAccountName>,
    <ApplicationName>,
    <VersionToActivate>,
    format='zip')

Set Application as Default

mgt_client.application.update(
    <ResourceGroupName>,
    <BatchAccountName>,
    <ApplicationName>,
    parameters={
        'default_version': <VersionToSetDefault>,
        'allow_updates': True,
        'display_name': <VersionToSetDefault>
    })

Get List of Nodes in the Pool

batch_service_client.compute_node.list(<BatchPoolName>)

Reboot Compute Node

batch_service_client.compute_node.reboot(
    <BatchPoolName>, <ComputeNodeID>)

Resize a Pool

import time
from datetime import timedelta

import azure.batch.models as batchmodels


def _resize_pool(batch_service_client, pool_id, pool_vm_count=1,
                 node_deallocation_option='taskcompletion', resize_timeout=15):
    """
    Resizes a given pool to the given number of nodes.

    :param batch_service_client: The Batch service client
    :param pool_id: The ID of the pool to resize
    :param pool_vm_count: The number of nodes to resize to
    :param node_deallocation_option: What to do with the tasks that may be currently running on the nodes
    :param resize_timeout: How long to wait, in minutes, before throwing an exception when resizing
    """
    resize_settings = batchmodels.PoolResizeParameter(
        target_dedicated_nodes=pool_vm_count,
        resize_timeout=timedelta(minutes=resize_timeout),
        node_deallocation_option=node_deallocation_option)
    try:
        old_pool = batch_service_client.pool.get(pool_id)
        batch_service_client.pool.resize(pool_id, pool_resize_parameter=resize_settings)
        time.sleep(2)
        # _wait_pool_allocation and _print_batch_exception are custom helpers:
        # one waits for the pool's allocation state to settle, the other logs Batch errors
        _wait_pool_allocation(old_pool, batch_service_client, resize_timeout)
    except batchmodels.BatchErrorException as err:
        _print_batch_exception(err)
        raise

If anyone has a better solution, improvements or comments, I welcome them. Hope this was helpful.
