slsML: Towards elastic ML infrastructure on AWS Lambda

Problem statement

Adjusting ResNet50 training to run on AWS Lambda

  1. Data-parallel: different nodes train the entire model on different portions of the data, and the results are then aggregated.
  2. Model-parallel: different nodes train different portions of the model, using the same or different data.
  1. Calculate gradients (based on the previous model and current labeled input)
  2. Update model weights
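The data-parallel approach and the two training steps listed above can be illustrated with a minimal sketch. It is a toy, not the ResNet50 code: it assumes a one-parameter linear model, and the helper names (`gradient`, `split`, `train_step`) are hypothetical. Each "worker" computes a gradient on its own shard, the gradients are averaged, and the weight is updated once per iteration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a toy model y ~ w * x


def gradient(w: float, shard: Sequence[Sample]) -> float:
    """Step 1: mean squared-error gradient over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data: Sequence[Sample], workers: int) -> List[Sequence[Sample]]:
    """Data-parallel split: every worker gets a different portion of the data."""
    return [data[i::workers] for i in range(workers)]


def train_step(w: float, shards: List[Sequence[Sample]], lr: float) -> float:
    # In a real deployment each gradient would be computed on a separate node.
    grads = [gradient(w, shard) for shard in shards]
    sizes = [len(shard) for shard in shards]
    # Aggregate: average the per-worker gradients, weighted by shard size.
    avg = sum(g * n for g, n in zip(grads, sizes)) / sum(sizes)
    # Step 2: update the model weight.
    return w - lr * avg


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3
w = 0.0
for _ in range(200):
    w = train_step(w, split(data, workers=4), lr=0.01)
print(w)
```

Because the size-weighted average of the shard gradients equals the gradient over the full batch, the result does not depend on how many workers the data is split across.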
Figure 1: Data Flow of Distributed Training on Lambda
Figure 2: Control Flow of Distributed Training on Lambda
  1. The workflow is triggered when a (new version of the) model file is uploaded to S3. The upload generates an SNS message (ModelUpdated topic), which in turn triggers the WorkLauncher function.
  2. WorkLauncher triggers a set of Worker functions by publishing one SNS message per worker to the Work topic (Lambda is configured to consume one message per function invocation).
  3. Each worker reads the model and the training data from S3, computes gradients, and writes them back to S3, partitioned per reducer (so that each reducer can easily access its portion of the gradients). Depending on the number of images in the training set, this can take from a few seconds to several minutes (note that a Lambda invocation is limited to 15 minutes, recently increased from 5 minutes). On completion, each worker notifies ReduceLauncher via the SNS topic WorkDone.
  4. ReduceLauncher keeps track of finished workers by maintaining a counter in DynamoDB, and once all workers are done it triggers the reducers by sending the desired number of SNS messages to the Reduce topic.
  5. Reducers read the model and their respective parts of the gradients from S3, compute new weights, write them back to S3, and report completion via the ReduceDone topic.
  6. Triggered via SNS, Merger reads the updated weights and uploads the new model, which triggers WorkLauncher again. Depending on the desired number of iterations (maintained in DynamoDB), WorkLauncher either launches more Workers (step 2), or exits.
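Steps 2 to 4 could be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: the S3 key layout, the table name, the `done` attribute, and the function names are all hypothetical. The `publish` and `update_item` calls follow the keyword arguments of the boto3 SNS and DynamoDB clients, but the clients are passed in as parameters so that the sketch can be dry-run with the in-memory stand-ins at the bottom.

```python
import json
from typing import Any


def gradient_key(iteration: int, reducer: int, worker: int) -> str:
    """Hypothetical S3 key layout: gradients grouped under a per-reducer prefix."""
    return f"gradients/iter={iteration}/reducer={reducer}/worker={worker}.npy"


def launch(sns: Any, topic_arn: str, count: int, iteration: int) -> None:
    """Fan out: publish one SNS message per function to be invoked."""
    for index in range(count):
        sns.publish(TopicArn=topic_arn,
                    Message=json.dumps({"iteration": iteration, "index": index}))


def on_work_done(dynamodb: Any, sns: Any, table: str, reduce_topic_arn: str,
                 iteration: int, workers: int, reducers: int) -> bool:
    """ReduceLauncher: atomically count finished workers, launch reducers on the last."""
    resp = dynamodb.update_item(
        TableName=table,
        Key={"iteration": {"N": str(iteration)}},
        UpdateExpression="ADD done :one",  # atomic increment
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    done = int(resp["Attributes"]["done"]["N"])
    if done == workers:
        launch(sns, reduce_topic_arn, reducers, iteration)
        return True
    return False


# In-memory stand-ins, only for a dry run without AWS.
class FakeSNS:
    def __init__(self) -> None:
        self.messages = []

    def publish(self, TopicArn: str, Message: str) -> None:
        self.messages.append((TopicArn, json.loads(Message)))


class FakeDynamoDB:
    def __init__(self) -> None:
        self.counters = {}

    def update_item(self, TableName: str, Key: dict, **kwargs: Any) -> dict:
        k = (TableName, Key["iteration"]["N"])
        self.counters[k] = self.counters.get(k, 0) + 1
        return {"Attributes": {"done": {"N": str(self.counters[k])}}}


sns, db = FakeSNS(), FakeDynamoDB()
fired = [on_work_done(db, sns, "slsml-progress", "arn:reduce", 0,
                      workers=3, reducers=2) for _ in range(3)]
print(fired, len(sns.messages))
```

An atomic increment lets any number of concurrent ReduceLauncher invocations share one counter without locking. Standard SNS topics deliver at least once, so a production version would likely also need to guard against counting the same worker twice.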

Working around Lambda runtime limits

Package size

Disk space


S3 as distributed shared memory (DSM)

Observability of complex serverless workflows


Figure 3: Lambda-based distributed training dataflow in Epsagon

Mitigating stragglers

Summary of experiments

Figure 4: Performance Evaluation. Note that the X axis is logarithmic.


  • Time: The graph shows that the time (dotted line) indeed decreases roughly as 1/x, although the decrease slows at around 32 workers and stops at about 256 workers. There are two main reasons for this behavior. First, the overhead of maintaining the distributed pipeline and exchanging intermediate results between iterations increases with the number of workers. Second, to ensure training convergence, workers need to exchange weight updates in batches no larger than the predefined mini-batch (8K or 32K in our case). Therefore, as concurrency increases, the amount of work each worker can do in an iteration decreases. Hence, we need relatively more iterations, and the per-iteration overhead becomes more significant.
  • Cost: The cost (blue and green lines) indeed remains flat up to a certain point (32 workers for the blue line, 128 for the green one), and then begins to climb. There appear to be two main reasons for this. First, due to the mini-batch size limit (explained above), the overhead of running the pipeline is higher, which also increases the cost of the corresponding Lambda functions (they spend more time waiting). Interestingly, though, the main reason for the extra cost is the cost of S3 calls, which becomes very significant as the number of workers/reducers increases (note that traffic within a region is free, so the only cost is for API calls).
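The two trends above can be reproduced qualitatively with a toy model. Everything in this sketch is an assumption made for illustration: the dataset size, the per-image time, the overhead constants, the memory size, and the prices are placeholders (check current AWS pricing), and `estimate` is a hypothetical helper rather than the model behind Figure 4. It assumes each worker writes one gradient object per reducer and each reducer reads one object per worker, so S3 requests per iteration grow as workers × reducers.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    seconds: float      # wall-clock time for one pass over the data
    lambda_usd: float   # Lambda compute charge (workers only)
    s3_usd: float       # S3 request charge (gradient exchange only)


def estimate(workers: int, reducers: int, *,
             images: int = 1_000_000,            # assumed dataset size
             mini_batch: int = 8192,             # the "8K" mini-batch
             sec_per_image: float = 0.5,         # assumed compute time
             fixed_overhead: float = 5.0,        # assumed seconds per iteration
             per_worker_overhead: float = 0.05,  # assumed seconds per worker
             memory_gb: float = 3.0,             # assumed function size
             usd_per_gb_second: float = 0.0000166667,  # placeholder price
             usd_per_put: float = 0.005 / 1000,        # placeholder price
             usd_per_get: float = 0.0004 / 1000        # placeholder price
             ) -> Estimate:
    # The mini-batch caps the work per iteration, so the iteration count
    # does not shrink as workers are added.
    iterations = math.ceil(images / mini_batch)
    compute = (mini_batch / workers) * sec_per_image          # shrinks as 1/x
    overhead = fixed_overhead + per_worker_overhead * workers  # grows with x
    iter_seconds = compute + overhead
    seconds = iterations * iter_seconds
    # Workers are billed for the whole iteration, including time spent waiting.
    lambda_usd = (iterations * workers * iter_seconds
                  * memory_gb * usd_per_gb_second)
    # One PUT per (worker, reducer) pair and one GET per (reducer, worker) pair.
    transfers = workers * reducers
    s3_usd = iterations * transfers * (usd_per_put + usd_per_get)
    return Estimate(seconds, lambda_usd, s3_usd)


for n in (1, 8, 32, 128, 256, 512):
    e = estimate(workers=n, reducers=n)
    print(f"{n:4d} workers: {e.seconds:9.0f} s, "
          f"Lambda ${e.lambda_usd:.2f}, S3 ${e.s3_usd:.2f}")
```

With these placeholder constants the time first falls roughly as 1/x and then flattens once the per-worker overhead dominates, while the S3 request charge grows quadratically when the number of reducers scales with the number of workers.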

Comparison to a VM-based solution





Cloud Expert

Alex Glikson
