The Dawn of Zendesk’s Machine Learning Model Building Platform with AWS Batch

When we worked on Content Cues, one of Zendesk’s machine learning products, we encountered the scalability challenge of having to build up to 50K machine learning (ML) models daily. Yes, 50K models. Yes, that’s a lot! Looking at the data was initially nerve-wracking.

This article focuses on the new model building platform we designed and built for Content Cues, which has been running on AWS Batch in production for a few months. From conception to implementation, the process has been a challenging yet rewarding experience for us, and we would like to share our journey with you.

This is the first of a 3-part series, covering how we evaluated different technology options (AWS Batch, AWS SageMaker, Kubernetes, EMR Hadoop/Spark) and ultimately decided on AWS Batch.

Content Cues

Content Cues combines multiple ML models using natural language processing and deep learning techniques to summarise customer support tickets into meaningful topics. The algorithms behind Content Cues are so fascinating that they deserve a blog series or set of academic papers. Perhaps my data scientist friends (Soon-Ee Cheah and Ai-Lien Tran-Cong) will share more details in the future.

Building The Content Cues Model

Before exploring different compute platform choices, it’s important to understand how a single Content Cues model is built. Content Cues models are built on a per-customer basis (i.e. 1 model per customer) and the process of building a model can be simplified as below:

  • Aggregate last 2 months of support tickets for a customer
  • Apply ML algorithms (written in Python) on the tickets to summarise them
  • Generate a JSON file representing those summaries
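
To make the steps above concrete, here is a minimal Python sketch of building a single customer’s model; the bucket names, S3 key layout and the `summarise_tickets` function are illustrative placeholders, not our production code.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket names, for illustration only.
FEATURES_BUCKET = "content-cues-features"
MODELS_BUCKET = "content-cues-models"


def summarise_tickets(tickets):
    """Placeholder for the real NLP/deep learning summarisation pipeline."""
    return {"topics": []}


def build_model(account_id: str) -> None:
    # 1. Fetch the aggregated support tickets (training data) for this customer.
    obj = s3.get_object(Bucket=FEATURES_BUCKET, Key=f"tickets/{account_id}.json")
    tickets = json.loads(obj["Body"].read())

    # 2. Apply the ML algorithms to summarise the tickets into topics.
    topics = summarise_tickets(tickets)

    # 3. Upload the resulting JSON model to the models bucket.
    s3.put_object(
        Bucket=MODELS_BUCKET,
        Key=f"models/{account_id}.json",
        Body=json.dumps(topics),
    )
```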

We utilised our existing Hadoop infrastructure to aggregate the tickets (also known as training data) and stored them in an S3 bucket. This offline approach is used for most of our prediction models, which don’t require online processing.

The Requirements

  • Scalability, scalability, scalability!
  • Support on-demand and recurrent building of models
  • Fault tolerance and job retry support
  • Flexibility to train on CPU or GPU instances
  • Good debugging and monitoring capabilities

Introducing AWS Batch

AWS Batch is a fully managed service for executing batch workloads. Jobs are executed inside Docker containers, which AWS Batch orchestrates via the Elastic Container Service (ECS). AWS Batch monitors the ongoing jobs and job queues, and can auto-scale cluster capacity depending on workload. AWS Batch consists of:

  • compute environment: a cluster of ECS container instances that launch into a specified VPC and subnet. We can define different instance types (CPU/GPU) for different compute environments.
  • job definition: defines a type of job. It specifies the Docker image, resource requirements, environment variables, number of retries, etc.
  • job queue: the work queue to which jobs are submitted. It supports priority levels from 0–10, and different types of jobs can be submitted to the same queue.
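
As a rough illustration of how these pieces are defined with boto3 (the image name, resource sizes and retry count below are assumptions, not our production values), a job definition for the model building container could be registered like this:

```python
import boto3

batch = boto3.client("batch")

# Register a job definition describing how a model building job runs.
# Image name, resources and retry count are hypothetical.
batch.register_job_definition(
    jobDefinitionName="content-cues-model-builder",
    type="container",
    containerProperties={
        "image": "example.registry/content-cues-model-builder:latest",
        "vcpus": 4,
        "memory": 8192,  # MiB
        "command": ["python", "build_model.py"],
        "environment": [{"name": "FEATURES_BUCKET", "value": "content-cues-features"}],
    },
    retryStrategy={"attempts": 3},
)
```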

Using AWS Batch to Build Models

This is how we used AWS Batch in production:

(Diagram: AWS Batch Model Building Platform)

Building a model entailed:

  • Submission of model building job(s) to AWS Batch
  • AWS Batch launching the Docker container for the model building job
  • The model building container command running the model building code (Python) to retrieve the training data from the S3 features bucket and generate the model. The Docker image and command to use are defined in a job definition. Failed jobs are retried by AWS Batch up to the defined retry limit.
  • The model building container uploading the model to the S3 models bucket. The new object creation triggers an S3 event, which is published to an SNS topic and then propagated to an SQS queue.
  • The model serving service subscribing to the SQS queue, ensuring it is notified whenever new models are created.
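
Step one of the flow above, submitting a model building job for a single customer, boils down to a `submit_job` call; the queue, job definition and environment variable names here are illustrative placeholders:

```python
import boto3

batch = boto3.client("batch")


def submit_model_build(account_id: str) -> str:
    """Submit a model building job for one customer and return the AWS Batch job id."""
    response = batch.submit_job(
        jobName=f"build-model-{account_id}",
        jobQueue="model-building-queue",            # hypothetical queue name
        jobDefinition="content-cues-model-builder",  # job definition registered earlier
        containerOverrides={
            # Tell the container which customer's training data to pull from S3.
            "environment": [{"name": "ACCOUNT_ID", "value": account_id}],
        },
    )
    return response["jobId"]
```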

Cost Optimisation Through a Mixture of Spot and On-Demand Instances

Pro Tip: If you’d like to save costs, you can use spot instances: spare EC2 instances offered by AWS at a discounted price. However, those instances may be terminated if a higher-priority user requires the resource (priority is determined via bidding or request type), which can cause jobs to be stuck in the queue when no spot instances are available. To overcome this, link the job queue to a spot compute environment as the first choice, then configure a similar on-demand compute environment as a fallback option for the queue.
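
Here’s a hedged boto3 sketch of that setup (subnets, IAM roles, instance types and sizes are placeholders): create a spot and an on-demand compute environment, then attach both to the job queue with the spot environment first in the order.

```python
import boto3

batch = boto3.client("batch")

# Settings shared by both compute environments (all values are placeholders).
common = {
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5", "c5"],
    "subnets": ["subnet-xxxxxxxx"],
    "securityGroupIds": ["sg-xxxxxxxx"],
    "instanceRole": "ecsInstanceRole",
}

# Spot compute environment: cheaper, but instances may be reclaimed.
batch.create_compute_environment(
    computeEnvironmentName="model-building-spot",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "bidPercentage": 60,
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",  # placeholder
        **common,
    },
    serviceRole="AWSBatchServiceRole",  # placeholder
)

# On-demand compute environment: the fallback when spot capacity dries up.
batch.create_compute_environment(
    computeEnvironmentName="model-building-ondemand",
    type="MANAGED",
    computeResources={"type": "EC2", **common},
    serviceRole="AWSBatchServiceRole",
)

# The queue tries the spot environment first (order 1), then on-demand (order 2).
batch.create_job_queue(
    jobQueueName="model-building-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "model-building-spot"},
        {"order": 2, "computeEnvironment": "model-building-ondemand"},
    ],
)
```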

Other Options Considered

Besides AWS Batch, we considered the following options:

  • AWS SageMaker: a managed service from AWS designed for training and serving ML models. It uses a similar concept to AWS Batch, with the training data located on S3 and the generated models written to an S3 bucket. SageMaker was not selected as there appeared to be various limits imposed on the number of concurrent training jobs allowed. By contrast, AWS Batch allows you to define as many vCPUs as required for your compute environments.
(Table: SageMaker limits as of 4th Aug 2018)
  • Kubernetes: an open-source system for container orchestration. AWS provides a managed version of Kubernetes via EKS. For our needs, the lack of built-in support for specifying job dependencies and priorities is a significant limitation of Kubernetes Jobs.
  • EMR Hadoop/Spark: distributed data processing frameworks on AWS. For map reduce (MR) frameworks such as Hadoop, all models in a batch need to be re-run if building any single model fails, resulting in a lot of wasted resources. Also, tuning individual models can be difficult given they are executed in a batch by the MR job, as opposed to the queue-based approach adopted by AWS Batch.

Why We Chose AWS Batch

  • Out-of-the-box autoscaling of compute resources based on workload
  • Out-of-the-box support for job priority, retries and job dependencies (see the sketch after this list)
  • The queue-based approach is a good fit for our use case of building models both on a recurring schedule and on demand
  • Supports CPU and GPU workloads. The ability to use GPUs for building ML models is critical, as GPUs can speed up the training process significantly, especially for deep learning models.
  • Flexibility to define custom Amazon Machine Images (AMI) for the compute environment
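
As an example of the job dependency support mentioned above, a downstream job (a hypothetical post-processing step, not something from our actual pipeline) can be submitted with `dependsOn` so that AWS Batch only starts it once the model building job has finished:

```python
import boto3

batch = boto3.client("batch")

# Submit the model building job first.
build = batch.submit_job(
    jobName="build-model-12345",
    jobQueue="model-building-queue",
    jobDefinition="content-cues-model-builder",
)

# A hypothetical follow-up job that only runs after the build job completes.
batch.submit_job(
    jobName="post-process-12345",
    jobQueue="model-building-queue",
    jobDefinition="content-cues-post-processor",  # hypothetical job definition
    dependsOn=[{"jobId": build["jobId"]}],
)
```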

Searching For Gremlins

There are so many things that can go wrong in building a new product. At Zendesk, we strive to explore the problem space early, well before the product is generally available. I conducted a series of experiments and proofs of concept to explore the capabilities of AWS Batch. A Docker image with a simplified version of the model building code was used for the tests, with the following scenarios:

  • Launched tens of thousands of jobs simultaneously to validate the ability to scale the compute environments depending on load. The platform should use spot instances primarily, only falling back to on-demand instances if no spot instances are available or the spot cluster has reached its maximum CPU utilisation.
(Chart: CPU and memory usage of the Spot and On-Demand compute environments during the job run, showing that both clusters were utilised.)
  • Tested the ability to handle failure scenarios by running jobs expected to fail (e.g. due to running out of memory or errors in code) and ensuring that failure notifications via SNS email work.
  • Tested the ability to assign job priority by flooding the low-priority queue with a number of jobs and ensuring that a high-priority job is run immediately upon submission (see the sketch after this list).
(Chart: Build time when different numbers of stubbed jobs were submitted at the same time.)
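
To illustrate the priority scenario above: with a low-priority and a high-priority queue sharing the same compute environments, a sketch of the flood test could look like the following (queue names, job definition and job counts are made up for illustration).

```python
import boto3

batch = boto3.client("batch")

# Flood the low-priority queue with stubbed jobs...
for i in range(1000):
    batch.submit_job(
        jobName=f"stub-job-{i}",
        jobQueue="model-building-low-priority",  # hypothetical queue created with priority=1
        jobDefinition="stubbed-model-builder",   # simplified model building image
    )

# ...then submit a single job to the high-priority queue and check that it
# is picked up ahead of the backlog.
batch.submit_job(
    jobName="urgent-model-build",
    jobQueue="model-building-high-priority",     # hypothetical queue created with priority=10
    jobDefinition="stubbed-model-builder",
)
```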

AWS Batch did well in all scenarios, and has continued to perform as expected.

Summary

Through our prototyping and experimentation, we identified AWS Batch as our preferred choice for our model building platform. Its easy setup and out-of-the-box features made it attractive for running bulk processing tasks with Docker containers.

In parts 2 and 3 of our series, my respected teammates Dana Ma and Derrick Cheng will walk through our model building platform implementation, along with the tips and tricks we learnt.

Spoiler alert: through rigorous optimisations, we managed to reduce the compute cost of building 1000 models on AWS Batch to less than the price of a cup of coffee! Stay tuned ;)

I would like to thank my colleagues Eric Pak, Soon-Ee Cheah and Jonathon Belotti for helping out in various stages of the proof-of-concept.