The Dawn of Zendesk’s Machine Learning Model Building Platform with AWS Batch
When we worked on Content Cues, one of Zendesk’s machine learning products, we encountered the scalability challenge of having to build up to 50K machine learning (ML) models daily. Yes, 50K models. Yes, that’s a lot! Looking at those numbers was initially nerve-wracking.
This article focuses on the new model building platform we designed and built for Content Cues, which has been running on AWS Batch in production for a few months. From conception to implementation, the process has been a challenging yet rewarding experience for us, and we would like to share our journey with you.
Content Cues combines multiple ML models using natural language processing and deep learning techniques to summarise customer support tickets into meaningful topics. The algorithms behind Content Cues are so fascinating that they deserve a blog series or set of academic papers. Perhaps my data scientist friends (Soon-Ee Cheah and Ai-Lien Tran-Cong) will share more details in the future.
Building The Content Cues Model
Before exploring different compute platform choices, it’s important to understand how a single Content Cues model is built. Content Cues models are built on a per-customer basis (i.e. 1 model per customer) and the process of building a model can be simplified as below:
- Aggregate last 2 months of support tickets for a customer
- Apply ML algorithms (written in Python) on the tickets to summarise them
- Generate a JSON file representing those summaries
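The three steps above can be sketched in plain Python. This is a toy illustration, not the actual Content Cues code: the function names, the ticket schema, and the frequency-based "summarisation" stand-in for the real NLP/deep learning models are all assumptions for the sake of the example.

```python
import json
from collections import Counter
from datetime import datetime, timedelta


def aggregate_recent_tickets(tickets, now, months=2):
    """Step 1: keep only tickets created in the last `months` months (~30 days each)."""
    cutoff = now - timedelta(days=30 * months)
    return [t for t in tickets if t["created_at"] >= cutoff]


def summarise_tickets(tickets, top_n=3):
    """Step 2: toy stand-in for the ML summarisation, surfacing the most common subjects."""
    counts = Counter(t["subject"] for t in tickets)
    return [{"topic": subject, "ticket_count": count}
            for subject, count in counts.most_common(top_n)]


def build_model(tickets, now):
    """Step 3: produce the per-customer JSON artifact."""
    recent = aggregate_recent_tickets(tickets, now)
    return json.dumps({"topics": summarise_tickets(recent)})
```

In production the JSON artifact would be uploaded to S3 rather than returned, but the shape of the pipeline per customer is the same: filter, summarise, serialise.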
We utilised our existing Hadoop infrastructure to aggregate the tickets (also known as training data), and stored them in an S3 bucket. This offline processing approach was used for most of our prediction models that don’t require online processing. With the training data sorted, we needed a compute platform for building the models, with the following requirements:
- Scalability, scalability, scalability!
- Support for on-demand and recurring model builds
- Fault tolerance and job retry support
- Flexibility to train on CPU or GPU instances
- Good debugging and monitoring capabilities
Introducing AWS Batch
AWS Batch is a fully managed service for executing batch workloads. Jobs run inside Docker containers, orchestrated by Elastic Container Service (ECS). AWS Batch monitors the ongoing jobs and job queues, and can auto-scale cluster capacity depending on workload. AWS Batch consists of:
- compute environment: a cluster of ECS container instances that launch into a specified VPC and subnet. We can define different instance types (CPU/GPU) for different compute environments.
- job definition: defines a type of job. It specifies the Docker image, resource requirements, environment variables, number of retries, etc.
- job queue: the work queue that jobs are submitted to. It supports priority levels from 0–10. Different types of jobs can be submitted to the same queue.
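As a concrete sketch of the job definition piece, the parameters below are what boto3’s `batch.register_job_definition(**params)` would receive. The image name, resource sizes, and environment variable are illustrative, not our production values.

```python
def model_build_job_definition(image, vcpus=2, memory_mib=4096, retries=3):
    """Build the parameters for an AWS Batch job definition: the Docker image,
    resource requirements, environment variables, and retry count.
    These would be passed to boto3's batch.register_job_definition(**params)."""
    return {
        "jobDefinitionName": "content-cues-model-build",  # illustrative name
        "type": "container",
        "containerProperties": {
            "image": image,
            "vcpus": vcpus,
            "memory": memory_mib,
            "environment": [
                {"name": "MODELS_BUCKET", "value": "example-models-bucket"},
            ],
        },
        "retryStrategy": {"attempts": retries},
    }
```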
Using AWS Batch to Build Models
This is how we used AWS Batch in production. Building a model entailed:
- Submission of model building job(s) to AWS Batch
- AWS Batch launching the Docker container for the model building job
- Model building container command running the model building code (Python) to retrieve the training data from the S3 features bucket, and generating the model. The Docker image and command to use are defined in a job definition. Failed jobs are retried by AWS Batch up to the defined retry limit.
- Model building container uploading the model to the S3 models bucket. The new object creation triggers an S3 event, which is published to an SNS topic and then propagated to SQS.
- Model serving service subscribing to the SQS queue, ensuring it is notified whenever new models are created.
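The last hop of that flow can be sketched as the parsing the model serving service would do on each SQS message. Because the S3 event travels via SNS, it arrives double-wrapped: the SQS body is an SNS envelope whose `Message` field is the JSON-encoded S3 event. The bucket/key layout below is illustrative.

```python
import json


def model_keys_from_sqs_message(sqs_body):
    """Extract newly created model object keys from an SQS message carrying an
    S3 event forwarded through SNS (S3 -> SNS -> SQS, as in the flow above)."""
    sns_envelope = json.loads(sqs_body)           # outer SNS notification
    s3_event = json.loads(sns_envelope["Message"])  # inner S3 event payload
    return [
        record["s3"]["object"]["key"]
        for record in s3_event.get("Records", [])
        if record.get("eventName", "").startswith("ObjectCreated")
    ]
```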
Pro Tip: If you’d like to save costs, you can use spot instances: spare EC2 instances offered by AWS at a discounted price. However, those instances may be terminated if a higher-priority user requires the resource (priority is determined via bidding or request type), which can leave jobs stuck in the queue when no spot instances are available. To overcome this, link the job queue to a spot compute environment as the first choice, then configure a similar on-demand compute environment as a fallback option for the queue.
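That spot-first-with-fallback wiring is expressed through the queue’s compute environment order. The parameters below are what boto3’s `batch.create_job_queue(**params)` would receive; the queue name and ARNs are placeholders.

```python
def spot_first_job_queue(spot_env_arn, on_demand_env_arn):
    """Build job queue parameters that try the spot compute environment first
    and fall back to an equivalent on-demand environment, as described above.
    These would be passed to boto3's batch.create_job_queue(**params)."""
    return {
        "jobQueueName": "model-build-queue",  # illustrative name
        "priority": 1,
        "computeEnvironmentOrder": [
            # Lower "order" is tried first: cheap spot capacity...
            {"order": 1, "computeEnvironment": spot_env_arn},
            # ...then on-demand, so jobs never sit stuck when spot is unavailable.
            {"order": 2, "computeEnvironment": on_demand_env_arn},
        ],
    }
```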
Other Options Considered
Besides AWS Batch, we considered the following options:
- AWS SageMaker: a managed service from AWS designed for training and serving ML models. It uses a similar concept to AWS Batch: training data located in S3, with generated models written back to an S3 bucket. SageMaker was not selected as there appeared to be various limits imposed on the number of concurrent training jobs allowed. By contrast, AWS Batch allows you to define as many vCPUs as required for your compute environments.
- Kubernetes: an open-source system for container orchestration. AWS provides a managed version of Kubernetes via EKS. For our needs, the lack of built-in support for specifying job dependencies and priorities is a significant limitation of Kubernetes (Jobs).
- EMR Hadoop/Spark: distributed data processing frameworks on AWS. For map reduce (MR) frameworks such as Hadoop, all models in a batch need to be re-run when building any single model fails, wasting a lot of resources. Tuning individual models is also difficult given they are executed as one batch by the MR job, as opposed to the queue-based approach adopted by AWS Batch.
Why We Chose AWS Batch
- Out of the box autoscaling of the compute resource based on workload
- Out of the box support for job priority, retries and job dependencies
- The queue-based approach is a good fit for our use case of building recurring and on-demand jobs
- Supports CPU and GPU workloads. The ability to use GPUs for building ML models is critical, as GPUs can speed up the training process significantly, especially for deep learning models.
- Flexibility to define custom Amazon Machine Images (AMI) for the compute environment
Searching For Gremlins
There are so many things that can go wrong in building a new product. At Zendesk, we strive to explore the problem space early, well before the product is generally available. I conducted a series of experiments and proofs of concept to explore the capabilities of AWS Batch. A Docker image with a simplified version of the model building code was used for the tests, covering the following scenarios:
- Launched tens of thousands of jobs simultaneously to validate the ability of the compute environments to scale with load. The queue should use spot instances primarily, falling back to on-demand only when no spot instances are available or the spot cluster has reached its maximum CPU allocation.
- Handling of failure scenarios. This was tested by running jobs expected to fail, e.g. due to running out of memory or errors in code, and ensuring that failure notifications via SNS email work.
- Assignment of job priority. This was tested by flooding the low priority queue with jobs and ensuring that a high priority job runs immediately upon submission.
AWS Batch did well in all scenarios, and has continued to perform as expected.
Through our prototyping and experimentation, we identified AWS Batch as our preferred choice for our model building platform. Its easy setup and out-of-the-box support made it attractive for running bulk processing tasks with Docker containers.
Spoiler alert: through rigorous optimisations, we managed to reduce the compute cost of building 1000 models on AWS Batch to less than the price of a cup of coffee! Stay tuned ;)