How we use AWS Batch at Zendesk to Build All The Machine Learning Models

In the first part of this epic trilogy, Wai Chee Yau discussed why we chose AWS Batch as the solution to our problem: scaling the building of machine learning models for our product Content Cues so that we can run 50k jobs at once.

In this instalment, I’ll be describing in more detail the exact technical solution we designed, covering:

  • our deployment pipeline
  • how we submit AWS Batch jobs, and
  • configuration options

Hopefully this will give some insight into how our team works with AWS Batch, covering both our development workflow and our engineering architecture, as well as our experience with AWS Batch so far.

Laying the groundwork

To start with, we had a Python library which contained all of the magic imbued by our machine learning researchers for building Content Cues models. The library read from and wrote to S3 buckets, so it would work in precisely the same way when invoked from AWS Batch.

The next goal was not only to get the code working with AWS Batch, but also to create a pipeline so that freshly merged code would automatically be deployed to AWS Batch in our test environment, and subsequent Content Cues jobs would run with the new code.

Step 1: Containerise

Our model building Python code ran wild and free as a command line application, but in order to be able to run it with AWS Batch, it had to be encapsulated into a self-contained Docker image with all of its dependencies. In doing so, we also gained the benefits of (and peace of mind from) ensuring our code was being delivered in an immutable fashion.

One wrinkle was that one such dependency was an external text embedding service. Keeping the same architecture and calling the service from our container in AWS Batch would have added significant networking complexity, so our solution was to bake the text embedder into the image. This allowed us to simplify our architecture and keep all of our model building in a truly self-contained image.

Happily, the remainder of the containerisation work was straightforward. Now we had our image which held all of the ingredients it needed to build a Content Cues model.

Step 2: Define

The next step was to define for AWS Batch what “a Content Cues job” meant. In AWS Batch, this is done by creating a job definition, i.e. a JSON file specifying various job configuration parameters, including:

  • AWS Elastic Container Registry image (and version) for the job container
  • command to run inside the Docker container
  • parameters which may be passed to the command
  • IAM role the job runs as, which controls which resources (e.g. S3 buckets) the job has access to
  • resource configuration (number of vCPUs reserved for the container, container memory limit)

Since we required a family of very similar job definitions across our various environments and regions, we used Jinja to template the job definition and avoid egregious duplication.
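To make that concrete, here is a minimal Python sketch of rendering a single Jinja template into a per-environment job definition. The field names come from the AWS Batch job definition schema; the template variables, the command and the parameter name are hypothetical rather than our production values.

from jinja2 import Template
import json

# Hypothetical template: the real one has more fields, but the idea is the
# same, one template rendered once per environment and region.
JOB_DEFINITION_TEMPLATE = """
{
  "jobDefinitionName": "content-cues-{{ environment }}-{{ region }}",
  "type": "container",
  "containerProperties": {
    "image": "{{ ecr_repo }}:{{ image_version }}",
    "command": ["python", "-m", "content_cues.build", "Ref::account_id"],
    "jobRoleArn": "{{ job_role_arn }}",
    "vcpus": {{ vcpus }},
    "memory": {{ memory_mb }}
  }
}
"""

def render_job_definition(environment, region, ecr_repo, image_version,
                          job_role_arn, vcpus=2, memory_mb=4096):
    """Render the Jinja template into a job definition dict for one env/region."""
    rendered = Template(JOB_DEFINITION_TEMPLATE).render(
        environment=environment,
        region=region,
        ecr_repo=ecr_repo,
        image_version=image_version,
        job_role_arn=job_role_arn,
        vcpus=vcpus,
        memory_mb=memory_mb,
    )
    return json.loads(rendered)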

Step 3: Automate

Once we had figured out how we would deploy new code and had it working by manually kicking it along, it was time to automate it.

At Zendesk, we use Google Cloud Build to automatically build Docker images for a repository upon merge to master and tag them with the corresponding release version.

From that point, we added a subsequent stage in our deployment pipeline to upload the image to Amazon ECR and update the AWS Batch job definition to point to the new version of the Docker image.
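Under the hood, “updating the job definition” means registering a new revision that points at the freshly pushed image, since AWS Batch job definitions are immutable. The following boto3 sketch illustrates that step; it is not our pipeline’s exact code, and the names are hypothetical.

import boto3

def point_job_definition_at_new_image(job_definition_name, new_image_uri, region):
    """Register a new revision of a job definition that uses the newly
    pushed ECR image. Job definitions are immutable, so an 'update' is
    really a new revision."""
    batch = boto3.client("batch", region_name=region)

    # Grab the latest active revision so its settings carry over unchanged.
    revisions = batch.describe_job_definitions(
        jobDefinitionName=job_definition_name, status="ACTIVE"
    )["jobDefinitions"]
    latest = max(revisions, key=lambda jd: jd["revision"])

    container_properties = dict(latest["containerProperties"])
    container_properties["image"] = new_image_uri

    response = batch.register_job_definition(
        jobDefinitionName=job_definition_name,
        type="container",
        parameters=latest.get("parameters", {}),
        containerProperties=container_properties,
    )
    return response["revision"]

Job submissions that reference the definition by name alone (without a revision number) automatically pick up the latest active revision, so newly submitted jobs run with the new image.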

So, after not too much engineering effort, we were all set to run batches of Content Cues model building jobs in a scalable fashion.

Scalable job submission

Having defined “a Content Cues job” in AWS Batch, the natural next question is: how do we actually trigger one?

The easiest way to do so is using the AWS CLI to submit a single job, using a command such as the following:

aws batch submit-job \
--job-name <job-name> \
--job-queue <job-queue> \
--job-definition <job-definition> \
--parameters <parameters> \
--region <region>

where job-name is a name given to the job by the job submitter. This command returns a job-id, which is a unique string generated by AWS that can be used to query the status of the job.

Invoking an AWS CLI incantation is fine for running jobs ad hoc, but we needed a reliable solution which:

  • could scale to run a daily load of 50k jobs at once
  • made it easy to track job submission failures
  • and preferably didn’t require installing the AWS CLI

So, the next step was to build a simple Scala service to act as an interface to AWS Batch (using the AWS SDK for Java), so that we could trigger a build with an HTTP request. We chose Scala for its concurrency awesomeness and our team’s existing familiarity with the language.

This service also empowers us to easily monitor the statuses of submitted jobs, and if that excites your curiosity, you will have to stay tuned for the final thrilling instalment of this trilogy!

Finally, to automate these calls and trigger 50k jobs daily, we chose Airflow as a scheduler to kick off the builds once our input data has landed in our S3 features bucket. We combined this with AWS Batch’s Array Job functionality, which lets us trigger up to 10k jobs with a single call to AWS Batch.
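Our service is written in Scala against the AWS SDK for Java, but the underlying API call looks much the same in any SDK. Below is a minimal Python/boto3 sketch of submitting an array job; the job name, queue, definition and parameters are hypothetical.

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# One array job fans out into up to 10,000 child jobs. Each child receives
# an AWS_BATCH_JOB_ARRAY_INDEX environment variable which it can use to
# pick its slice of the input data.
response = batch.submit_job(
    jobName="content-cues-daily-build",          # hypothetical job name
    jobQueue="content-cues-medium-priority",     # hypothetical queue
    jobDefinition="content-cues-job-definition", # hypothetical definition
    arrayProperties={"size": 10000},             # Array Jobs cap out at 10k children
    parameters={"run_date": "2019-01-01"},       # hypothetical parameter
)

print(response["jobId"])  # parent job id; children are "<jobId>:<index>"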

Tweaking the knobs and dials

AWS Batch provides a plethora of configuration options to support various use cases, so we spent some time playing around with them to see what worked for us.

Since we aimed to reduce costs to a reasonable level, it was important to be able to accurately monitor the costs incurred by our daily model building run. To find out how we did that, see Derrick’s upcoming post!

Queue priority

Queues can be configured to point to different compute environments. AWS suggests an example queue configuration for achieving cost-efficient job prioritisation: a high priority queue running jobs on on-demand instances, and a low priority queue using spot instances.

In practice, we’ve found queue priority useful for ensuring that jobs triggered ad hoc actually get run. However, running each queue on a mixture of spot and on-demand instances has proved cheap enough that we haven’t needed to analyse or adjust each queue’s compute resources any further. Consequently we have three queue priorities (low, medium and high), but every queue runs on the same compute environment.
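For reference, a queue’s priority is just an integer chosen when the queue is created, and queues with higher values are scheduled first on a shared compute environment. A boto3 sketch of the setup described above, with hypothetical names and a placeholder compute environment ARN:

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Hypothetical compute environment ARN shared by all three queues.
COMPUTE_ENV = "arn:aws:batch:us-east-1:123456789012:compute-environment/content-cues-compute-env"

# Three queues with different priorities, all pointing at the same compute
# environment. Higher priority values are preferred when scheduling.
for queue_name, priority in [("content-cues-low", 1),
                             ("content-cues-medium", 5),
                             ("content-cues-high", 10)]:
    batch.create_job_queue(
        jobQueueName=queue_name,  # hypothetical queue names
        state="ENABLED",
        priority=priority,
        computeEnvironmentOrder=[{
            "order": 1,
            "computeEnvironment": COMPUTE_ENV,
        }],
    )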

Resource configuration overrides

Not all jobs are created equal. Some of our jobs have to process data many orders of magnitude larger than others, so some care is required when defining each job’s resource requirements.

We found that the simple approach of establishing three tiers of jobs, corresponding to small, medium and large accounts, kept costs reasonable enough that we didn’t need a more sophisticated resource allocation algorithm. Each job tier was configured with its own CPU and memory allocation, selected after rigorous load testing.
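The per-tier resources can be expressed either as separate job definitions or as container overrides at submission time. Here is a sketch of the override approach with boto3; the tier numbers below are purely illustrative, not our load-tested values.

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Illustrative tiers only; the real values came out of load testing.
RESOURCE_TIERS = {
    "small":  {"vcpus": 1, "memory": 2048},   # memory is in MiB
    "medium": {"vcpus": 2, "memory": 8192},
    "large":  {"vcpus": 4, "memory": 30720},
}

def submit_tiered_job(account_id, tier):
    """Submit a model-building job with CPU and memory chosen by account tier."""
    return batch.submit_job(
        jobName=f"content-cues-{account_id}",
        jobQueue="content-cues-medium-priority",     # hypothetical queue
        jobDefinition="content-cues-job-definition", # hypothetical definition
        parameters={"account_id": str(account_id)},  # hypothetical parameter
        containerOverrides=RESOURCE_TIERS[tier],     # overrides vcpus and memory
    )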

And when things go wrong…

As is often the case in life (particularly in lives where 50k things are triggered at once), sometimes things don’t go according to plan. In such cases, AWS Batch provides some help for finding out where and why jobs have failed:

  • application logs for all jobs are available in CloudWatch
  • a “Status reason” when containers fail, which in practice we’ve found useful if and only if the container has run out of memory; both can be pulled programmatically, as sketched below
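A minimal boto3 sketch of pulling those details for a given job id (the /aws/batch/job log group is the AWS Batch default):

import boto3

batch = boto3.client("batch", region_name="us-east-1")

def why_did_it_fail(job_id):
    """Print the status reason and CloudWatch log stream for a given job."""
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    print("status:       ", job["status"])
    print("status reason:", job.get("statusReason", "<none>"))
    container = job.get("container", {})
    print("exit code:    ", container.get("exitCode"))
    # Application logs land in the /aws/batch/job log group in CloudWatch,
    # under this stream name.
    print("log stream:   ", container.get("logStreamName"))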

By adding enough detail to our logging to identify the input data used in any given job, we’ve been able to reproduce all of our application errors.

We’ve also encountered some more cryptic behaviour which was harder to explain. For example, we found that very occasionally, AWS Batch was seemingly retrying jobs which had already succeeded, and it took a genius brainwave to discover that this was due to spot instances being reclaimed by AWS. We now ensure that our jobs are idempotent so that they’re resilient to such scenarios.

Conclusion

After a few months of engineering work*, we were able to take our machine learning model building code and transfer it to AWS Batch, allowing us to build many tens of thousands of models with a simple trigger.

In this post we’ve aimed to shed some light on the technical details of how our AWS Batch solution works: our deployment pipeline, how we submit jobs, and the way we’ve configured things in AWS.

The not-to-be-missed conclusion to this series by Derrick Cheng will feature monitoring of jobs, costs and also load testing!

* with thanks to Soon-Ee Cheah, Wai Chee Yau, Eric Pak, Derrick Cheng and Ai-Lien Tran-Cong