Zendesk ML Model Building Pipeline on AWS Batch: Monitoring and Load Testing

A couple of months ago, we started this trilogy to share our journey with AWS Batch as we scaled the model building pipeline for a Machine Learning (ML) product: Content Cues at Zendesk.

The series started with Wai Chee Yau’s exploration of AWS Batch and why we chose it among many options. It was followed by a second post from Dana Ma, which dived into the technical design, such as how we built the deployment pipeline and configured job submissions. If you haven’t read them yet, I highly suggest you do!

In this episode, I’d like to discuss:

  • building the monitoring pipeline for AWS Batch
  • load testing, performance tuning, cost optimisation and some tips we’ve learnt along the way.

Monitoring

Why do we need monitoring?

In order to maintain the freshness of Content Cues models, we need to rebuild them on a daily basis, which means ~50k models to build in one go.

As is always the case, things can go wrong in different ways, especially with so many jobs being run. Hence we need a reliable way to debug and troubleshoot failures.

Moreover, job state change events (more details in a second) carry interesting insights that can help us better understand our model building performance and algorithmic efficiency. Hence we need a way to collect these statistics.

In fact, some tools do exist for inspecting AWS Batch jobs, but they don’t really meet all our needs. For example:

  • AWS Batch Dashboard

The dashboard provides an intuitive view of what is going on over different job queues and job states, with the ability to drill down to individual jobs.

However, the UI is quite limited and doesn’t allow us to filter or query jobs by parameters (e.g. account id). More importantly, AWS Batch only keeps job history for around 24 hours; any jobs older than that disappear from the UI, which becomes a real blocker when we want to debug failed jobs that ran more than a day ago.

Another option is the AWS command line tool:

  • AWS CLI describe-jobs
aws batch describe-jobs --jobs <job-id> [<job-id> ...] --region <region>

It can be a quick way to inspect submitted jobs, but it is not ideal for monitoring at scale, when tens of thousands of model builds are running.

Build a monitoring pipeline

Therefore, we decided to build our own monitoring pipeline that could:

  • keep a longer job history
  • reflect job state changes in a timely manner
  • support filtering and sorting on historical jobs

A submitted job goes through a few state changes, and each time its state changes, AWS Batch sends an event to CloudWatch. A job state change event looks like:

{
  "version": "0",
  "id": "c8f9c4b5-76e5-d76a-f980-7011e206042b",
  "detail-type": "Batch Job State Change",
  "source": "aws.batch",
  "time": "2018-12-15T17:56:03Z",
  "region": "us-east-1",
  "detail": {
    "jobName": "content_cues_job",
    "jobId": "c8f9c4b5-76e5-d76a-f980-7011e206042b",
    "jobQueue": "arn:aws:batch:us-east-1:acc_id:job-queue/high",
    "status": "RUNNABLE",
    "attempts": [],
    "createdAt": 1508781340401,
    "retryStrategy": {
      "attempts": 1
    },
    "dependsOn": [],
    "jobDefinition": "arn:aws:batch:us-east-1:acc_id:job-definition/content_cues:1",
    "parameters": {},
    "container": {
      "image": "content_cues",
      "vcpus": 2,
      "memory": 2000,
      "command": ["python", "./build_me_content_cues.py"],
      ...
    }
  }
}

Although the log retention period is configurable in CloudWatch, it is difficult to process the events and search through them there. Our goal is to persist the events into our own data store and make them queryable from the UI.

Monitoring Pipeline for AWS Batch Jobs

As illustrated, the workflow in the monitoring pipeline is as follows:

  • the Airflow scheduler triggers our daily model building
  • the Scala service acts as an AWS Batch client that submits jobs
  • AWS Batch runs jobs and sends events to CloudWatch
  • a CloudWatch rule is created to capture AWS Batch events and propagate them to SNS (see the wiring sketch after this walkthrough)
  • an SQS queue subscribes to the SNS topic

The benefit of using SNS is that you can have multiple SNS subscribers reacting to the same events if needed.

  • the Scala batch service now acts as an SQS consumer that reads and transforms events to a desired format

In particular, each job generates at least 5 state change events. With 50k jobs run per day, the approximate throughput for the system is 250k events per day (50k jobs * 5 state changes), a load profile easily supported by SQS.

  • the Scala batch service persists job data into an Aurora DB

Records are maintained at the job level rather than the event level for faster querying. In addition to the standard fields, some useful values can be derived, e.g. the build time for individual job attempts.

  • jobs are surfaced to the front-end UI

Finally, we created a UI table for the jobs, with fields like job name, state, timestamps, account etc. It enables us to easily filter the job history by a given set of parameters and pick up their log streams.
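For concreteness, here is a minimal sketch of the event wiring described above, written in Python with boto3 (our batch service itself is in Scala): a CloudWatch Events rule matching Batch job state changes, an SNS topic as its target, and an SQS queue subscribed to that topic. The names are illustrative and the IAM/access policies are omitted.

import json
import boto3

events = boto3.client("events", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Rule that matches every AWS Batch job state change event
events.put_rule(
    Name="batch-job-state-change",
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
    }),
    State="ENABLED",
)

# Fan the events out through an SNS topic so multiple consumers can subscribe
topic_arn = sns.create_topic(Name="batch-job-events")["TopicArn"]
events.put_targets(
    Rule="batch-job-state-change",
    Targets=[{"Id": "batch-events-to-sns", "Arn": topic_arn}],
)

# The SQS queue our batch service consumes from
queue_url = sqs.create_queue(QueueName="batch-job-events")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# Not shown: the SNS topic policy that allows CloudWatch Events to publish,
# and the SQS queue policy that allows the topic to deliver messages.

In practice this wiring would typically live in an infrastructure-as-code template rather than an ad-hoc script, but the moving parts are the same.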


The above wraps up our complete pipeline, from model building and model serving to job monitoring. Now it is time to exercise the whole infrastructure.


Load Testing and Performance Tuning

To simulate the potential load, we conducted our load testing by submitting ~50k jobs to AWS Batch and letting them run through the pipeline. The aim was to find the right settings for completing all builds without errors and within a reasonable time.

Load testing

To start with, we focused on the two variables that largely determine a job’s completion time: the number of tickets to process (we cluster tickets into topics; given the same algorithm, the more tickets, the longer a job runs) and the resource allocation for the containerised jobs.

We profiled our data to understand the distribution of ticket quantities, and found a threshold that most accounts fell under. We then ran builds for these normal accounts with different resource configurations to estimate the appropriate settings (i.e. vCPU and memory) for the majority.
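As a rough illustration, the split might look like the sketch below. The sample numbers and the 95th percentile cut-off are made up; the real threshold came from profiling our full dataset.

import numpy as np

# Hypothetical sample of tickets per account; in reality this comes from
# profiling the full dataset.
ticket_counts = np.array([120, 950, 4_300, 87_000, 310, 2_800, 15_600])

threshold = np.percentile(ticket_counts, 95)  # the exact percentile is illustrative
normal_accounts = ticket_counts[ticket_counts <= threshold]
large_accounts = ticket_counts[ticket_counts > threshold]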

Finally, we moved on to include the large accounts and identified what resources were required to finish all of them.

Lessons Learned

  • Dynamic resource allocation

Assign different resources based on job size rather than using a one-size-fits-all allocation; otherwise you’ll find yourself over-provisioning most of the time just to handle a small chunk of oversized jobs. Sizing dynamically greatly reduces wastage, e.g. it achieves much better CPU utilization (see the sketch after the charts below).

CPU utilization with static resource allocation
CPU utilization with dynamic resource allocation based on job size
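A minimal sketch of what this can look like at job submission time, assuming hypothetical queue and job definition names and purely illustrative sizing tiers (the real cut-offs come from the profiling described earlier):

import boto3

batch = boto3.client("batch", region_name="us-east-1")

def resources_for(ticket_count):
    # Illustrative tiers only; derive the real cut-offs from your own data profile
    if ticket_count <= 50_000:
        return {"vcpus": 2, "memory": 4_000}   # memory in MiB
    return {"vcpus": 4, "memory": 16_000}

def submit_build(account_id, ticket_count):
    sizing = resources_for(ticket_count)
    return batch.submit_job(
        jobName=f"content_cues_{account_id}",
        jobQueue="content-cues-queue",          # hypothetical queue name
        jobDefinition="content_cues:1",
        containerOverrides={
            "vcpus": sizing["vcpus"],
            "memory": sizing["memory"],
            "command": ["python", "./build_me_content_cues.py"],
        },
    )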
  • Understand the data you’re dealing with

Sometimes a tradeoff between performance and accuracy is necessary, e.g. sampling data so that model building stays computationally efficient yet still representative.

  • Pick the right instance type for compute environments

AWS Batch runs containerised jobs on EC2 instances. That means a job needs to fit into a single instance. For example, a C5.xlarge instance (4 vCPUs, 8GB memory) is not suitable for jobs requiring 10GB of memory.

It is worth noting that some overhead on top of a job’s memory consumption should also be allowed for; otherwise jobs can still get stuck in the RUNNABLE state due to insufficient memory.

Alternatively, you can choose the optimal instance type setting provided by AWS Batch, so that it automatically determines which instances to use based on the resource requirements of the jobs sitting in the job queue. The downside is that you won’t know upfront which instance will be picked; if you’ve set min vCPUs to 1 and Batch selects a giant instance to run your jobs, you will end up with a long-running giant instance.
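For reference, a managed compute environment with pinned instance types might be created along these lines. This is only a sketch; the names, roles, subnets and security groups are placeholders.

import boto3

batch = boto3.client("batch", region_name="us-east-1")

batch.create_compute_environment(
    computeEnvironmentName="content-cues-on-demand",
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,          # scale down to zero when the queue is empty
        "maxvCpus": 256,
        "desiredvCpus": 0,
        # Pin instance types that comfortably fit your biggest job...
        "instanceTypes": ["r5.xlarge", "r5.2xlarge"],
        # ...or use ["optimal"] and let Batch choose, at the cost of predictability.
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholders
        "securityGroupIds": ["sg-cccc3333"],                 # placeholder
        "instanceRole": "ecsInstanceRole",
    },
    serviceRole="AWSBatchServiceRole",
)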

  • Isolate compute resources for different products

Here, products are the data products that leverage AWS Batch for scaling batch computing jobs. It is a good idea to create separate job queues and compute environments for each new product, to prevent head-of-line blocking when another product submits a large number of jobs.

Having separate compute environments also provides nice isolation to ensure that we can scale/size the cluster accordingly for individual products.

  • Configure AMI (Amazon Machine Image) storage

AWS Batch uses the Amazon ECS-optimized AMI to bootstrap instances. By default, a 30GB volume is attached, with 8GB for the OS and 22GB for Docker images, metadata etc. Depending on your requirements, the default setting may not be enough, and you may want to increase the AMI storage to avoid Docker getting stuck at launch time.
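One way to do this is to override the Docker data volume size via a launch template, assuming your Batch setup supports launch templates for compute environments (otherwise a custom AMI does the job). Note that /dev/xvdcz as the Docker volume device is an assumption based on the ECS-optimized Amazon Linux AMI of this era; check your AMI’s block device layout first.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_launch_template(
    LaunchTemplateName="content-cues-batch-storage",   # hypothetical name
    LaunchTemplateData={
        "BlockDeviceMappings": [
            {
                # Assumption: Docker data volume device on the ECS-optimized Amazon Linux AMI
                "DeviceName": "/dev/xvdcz",
                "Ebs": {"VolumeSize": 100, "VolumeType": "gp2", "Encrypted": True},
            }
        ]
    },
)
# The template can then be referenced from the compute environment's
# computeResources.launchTemplate setting.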

  • Cost optimization 💰

At Zendesk, we used CloudHealth to monitor cloud bills. It supports all the major cloud providers e.g. AWS, Azure and GCP.

To isolate our spending from other Zendesk products, we tagged the AWS resources related to our model building pipeline so that we could filter cost by those tags. Our load tests revealed that we spent on average $3 to build 1,000 ML models on AWS Batch, and we were really happy with that cost!

Ok, what have we learnt that could reduce cost?

AWS Batch’s cost comes from its associated compute resources, e.g. EC2 instances. It is generally a good idea to employ cheap EC2 spot instances whenever possible, namely by pointing job queues at a spot compute environment first and using an on-demand environment as a fallback.
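Concretely, that ordering is just the computeEnvironmentOrder on the job queue, sketched here with hypothetical compute environment names:

import boto3

batch = boto3.client("batch", region_name="us-east-1")

batch.create_job_queue(
    jobQueueName="content-cues-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        # Batch tries the spot environment first...
        {"order": 1, "computeEnvironment": "content-cues-spot"},
        # ...and falls back to the on-demand environment when spot capacity isn't available.
        {"order": 2, "computeEnvironment": "content-cues-on-demand"},
    ],
)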

Is that always the case?

It depends. The spot plus on-demand compute environment setup works particularly well for jobs that complete within minutes (e.g. 5~20 minutes in our case), because the faster a job finishes, the lower the chance that its spot instance will be terminated or interrupted by EC2.

On the other hand, long running jobs (e.g. hours) are more likely to be interrupted on spot instances, leading to multiple job retries and possibly preventing jobs from completing at all. In this situation, we decided that the potential savings of spot instances weren’t worth it and continued to run on-demand to gain greater consistency in our model building, as it powers a feature we want content moderators to fully trust and to drive their daily workflow from.


Conclusion and Future Work

This blog post concludes the series sharing our experience of building machine learning models on AWS Batch at Zendesk. We are proud to have a model building pipeline working in production that refreshes thousands of Content Cues models every day.

In the future, we’d like to explore opportunities to automate more of our AWS Batch workflow, such as a one-click solution that provisions everything required (e.g. job queue, compute resources, monitoring capability etc.) to more easily onboard a new ML product.


Thanks to my teammates Wai Chee Yau, Dana Ma, Soon-Ee Cheah, and Ai-Lien Tran-Cong, whose awesome work made this blog post possible!

Thanks to Wai Chee Yau, Dana Ma, Jason Smale, Nick Chieng, and Ryan Seddon for reviewing this post and providing valuable feedback!