Experiments with AWS Lambda and Amazon Sagemaker Endpoint

Vikas Rai
Thomson Reuters Labs
5 min read · Nov 4, 2022

Authors: Vikas Rai, Kofi Wang, Carolyn Liang, Krish Balakrishnan

When it comes to deploying machine learning models on AWS, users typically have two options: AWS Lambda and Amazon Sagemaker endpoints. The two can be compared along many dimensions, such as cost, size, and deployment cycle. In this article, we'll compare their latency in production, explain where it comes from, and provide methods and techniques to improve it.

To better illustrate the difference between Lambda and Sagemaker, we’ll use one of the use cases in Thomson Reuters (TR) Labs as an example. We have two models deployed in both services and will use them to test the performance in a production environment. Our deployment workflow is simple:

  1. Package the models as docker images.
  2. Push them to Amazon Elastic Container Registry (ECR).
  3. Deploy services from these images.
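
As a rough illustration of step 3, the sketch below uses boto3 to create a Lambda function and a Sagemaker endpoint from a container image that has already been pushed to ECR. The image URI, IAM role, names, and instance settings are placeholders, not our actual configuration.

```python
import boto3

REGION = "us-east-1"  # placeholder region
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest"  # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/my-execution-role"               # placeholder

# --- Option A: deploy the image as a Lambda function ---
lambda_client = boto3.client("lambda", region_name=REGION)
lambda_client.create_function(
    FunctionName="my-model-fn",
    PackageType="Image",
    Code={"ImageUri": IMAGE_URI},
    Role=ROLE_ARN,
    MemorySize=3008,   # MB; depends on model size
    Timeout=300,       # seconds
)

# --- Option B: deploy the same image behind a Sagemaker real-time endpoint ---
sm = boto3.client("sagemaker", region_name=REGION)
sm.create_model(
    ModelName="my-model",
    PrimaryContainer={"Image": IMAGE_URI},
    ExecutionRoleArn=ROLE_ARN,
)
sm.create_endpoint_config(
    EndpointConfigName="my-model-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="my-model-endpoint",
    EndpointConfigName="my-model-config",
)
```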

Next, let’s talk about what we observed when running the model on these services.

Lambda cold starts:

When the Lambda service receives a request to run a function via the Lambda API (Application Programming Interface), it first prepares an execution environment. During this step, the service downloads the function's code, which is stored in an internal S3 bucket (or in Amazon ECR if the function uses container packaging), and creates an environment with the specified memory, runtime, and configuration. Once this is complete, Lambda runs the initialization code outside the event handler before finally running the handler code. This process of creating the environment from scratch is called a "cold start".

Once execution completes, the execution environment is frozen. The Lambda service retains the execution environment for a non-deterministic period to improve resource management and performance. During this time, if another request arrives for the same function, the service may reuse the environment.

This second request typically finishes more quickly since the execution environment already exists, and it is unnecessary to download the code and run the initialization code. This ability to reuse the environment is called a “warm start.”
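
To make the cold/warm distinction concrete, here is a minimal sketch of what a container-based Lambda handler for a model might look like. The model path and framework (joblib here) are illustrative assumptions; the point is that everything at module scope runs only during a cold start, while warm invocations reuse the already-loaded model.

```python
import json
import time

_start = time.time()
# Heavy work belongs here, outside the handler: importing the ML framework and
# loading model weights baked into the container image. This only runs when a
# new execution environment is created (a cold start).
import joblib                              # assumption: a scikit-learn style model
MODEL = joblib.load("/opt/ml/model.joblib")  # placeholder path inside the image
INIT_SECONDS = time.time() - _start


def handler(event, context):
    # Runs on every invocation; on a warm start, MODEL is already in memory.
    features = json.loads(event["body"])["features"]
    prediction = MODEL.predict([features]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction, "init_seconds": INIT_SECONDS}),
    }
```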

How to improve cold/warm starts:

There are many ways to dramatically improve cold starts, such as optimizing your code, model size, architecture, and dependencies, or lazy-loading heavy components.

Depending on the number of concurrent requests, the function may need to scale up or down. Under high concurrency, multiple Lambda instances will be initialized, each with its own cold start. In such situations, services like CloudWatch can help monitor and analyze traffic spikes and estimate the Lambda function's potential load. For example, we can use historical CloudWatch data to predict the expected load at a given time of day and provision concurrency for that period in advance, rather than relying only on Lambda's built-in auto-scaling.
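
As a sketch of that last idea, provisioned concurrency can be scheduled ahead of a predicted peak with Application Auto Scaling. The function name, alias, capacities, and cron expression below are illustrative.

```python
import boto3

app_scaling = boto3.client("application-autoscaling")
resource_id = "function:my-model-fn:live"   # placeholder function name and alias

# Register the Lambda alias as a scalable target for provisioned concurrency.
app_scaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=1,
    MaxCapacity=20,
)

# Pre-warm capacity ahead of the daily peak predicted from CloudWatch history.
app_scaling.put_scheduled_action(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    ScheduledActionName="prewarm-before-peak",
    Schedule="cron(0 13 * * ? *)",          # 13:00 UTC daily, illustrative
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 20},
)
```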

To get an idea of the impact of a cold start on execution time, the charts below show the response time distribution, along with the initialization and invocation time distributions, for the first and second calls to one of our use cases running behind a Lambda endpoint.

As the latency breakdown charts show, the first call to the Lambda endpoint requires an additional 12.34 seconds for initialization, which is caused by a “cold start”. Other than the cold start, the latency for both calls is very close.
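
A simple way to reproduce this comparison yourself is to invoke the function twice in a row and compare response times (the exact initialization time also appears in the "Init Duration" field of the function's REPORT line in CloudWatch Logs). The function name and payload below are placeholders.

```python
import json
import time
import boto3

lambda_client = boto3.client("lambda")


def timed_invoke(function_name, payload):
    # Synchronously invoke the function and measure wall-clock response time.
    start = time.time()
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return time.time() - start, json.loads(response["Payload"].read())


payload = {"body": json.dumps({"features": [0.1, 0.2]})}
for label in ("first call (likely cold)", "second call (warm)"):
    elapsed, _ = timed_invoke("my-model-fn", payload)
    print(f"{label}: {elapsed:.2f}s")
```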

Sagemaker endpoint model latency:

Unlike the Lambda endpoint, the Sagemaker endpoint provides a real-time service, so there is no "cold start": the endpoint is running 24/7. The charts below show the model latency and overhead latency for different inference jobs; the overhead latency is negligible (<10 milliseconds).
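
For reference, the two components shown in these charts correspond to the ModelLatency and OverheadLatency metrics (reported in microseconds) that Sagemaker publishes to CloudWatch for each endpoint variant. A sketch of pulling them with boto3, with placeholder endpoint and variant names:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")


def latency_stats(metric_name, endpoint_name, variant_name="AllTraffic"):
    # Fetch the last hour of the given latency metric for one endpoint variant.
    result = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    return result["Datapoints"]


for metric in ("ModelLatency", "OverheadLatency"):
    print(metric, latency_stats(metric, "my-model-endpoint"))
```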

The performance of the Sagemaker endpoint can be improved in two ways:

  1. Add more instances.
  2. Change the instance type.

These two approaches serve different purposes. If the service experiences high traffic, a single endpoint instance will not be able to handle the volume, model latency will increase, and eventually the endpoint will start responding to some requests with errors. In this case, users can increase the number of instances to handle the higher concurrency. Sagemaker provides auto-scaling configurations that let users set thresholds for scaling the number of instances up or down.
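
A minimal sketch of such an auto-scaling configuration, using Application Auto Scaling with a target-tracking policy on invocations per instance; the endpoint name, variant, capacities, and threshold are illustrative:

```python
import boto3

app_scaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-model-endpoint/variant/AllTraffic"  # placeholder

# Register the endpoint variant's instance count as a scalable target.
app_scaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out when average invocations per instance exceed the target value.
app_scaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance per minute, illustrative
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```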

Users can also choose an instance type with higher computing power to reduce inference time, thus cutting model latency on a single request. In the experiment below, we ran five different models on two instance types: ml.m5.xlarge and ml.m5.12xlarge. The former provides 4 vCPUs and 16 GiB of memory, while the latter provides 48 vCPUs and 192 GiB of memory. Each model was run for over 100 requests to obtain a robust result.
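
The benchmarking loop for this kind of comparison can be as simple as the sketch below: send 100+ requests to each endpoint and summarize the latency distribution. Endpoint names, payload, and content type are placeholders.

```python
import json
import statistics
import time
import boto3

runtime = boto3.client("sagemaker-runtime")


def benchmark(endpoint_name, payload, n_requests=100):
    # Time each synchronous invocation and report simple summary statistics.
    body = json.dumps(payload).encode("utf-8")
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=body,
        )
        latencies.append(time.time() - start)
    latencies.sort()
    return {
        "mean_s": statistics.mean(latencies),
        "p50_s": latencies[len(latencies) // 2],
        "p95_s": latencies[int(len(latencies) * 0.95)],
    }


for endpoint in ("my-model-m5-xlarge", "my-model-m5-12xlarge"):
    print(endpoint, benchmark(endpoint, {"features": [0.1, 0.2]}))
```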

The charts below show the latency of the two Sagemaker endpoints.

As the charts show, the endpoint with the ml.m5.12xlarge instance is roughly 35% faster than the one with ml.m5.xlarge. Since our model does not support parallel computation and is memory intensive, the difference is mainly due to the larger memory of the ml.m5.12xlarge instance. Different types of machine learning models may require different hardware configurations to speed up, so users need to choose an instance type that balances cost and efficiency.

Conclusion:

Both Sagemaker and Lambda endpoints have their pros and cons. Deciding on the endpoint type for a given use case requires a fair bit of understanding and analysis of the business needs, such as processing time, processing type, payload size, request distribution, image size, and cost.

  • AWS Lambda is a serverless computing service; its cost depends on duration (from when the code begins executing until it returns or terminates) and the memory allocated for execution.
  • Amazon Sagemaker charges a fixed hourly cost per instance, independent of execution time or the memory allocated for execution.
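
As a back-of-the-envelope illustration of how differently these two pricing models behave, the sketch below compares a month of Lambda usage against one always-on Sagemaker instance. All prices and workload numbers are illustrative placeholders; check the current AWS pricing pages for real figures.

```python
# Illustrative prices only; look up current regional pricing before relying on this.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667     # illustrative
LAMBDA_PRICE_PER_REQUEST = 0.20 / 1_000_000   # illustrative
SAGEMAKER_INSTANCE_PRICE_PER_HOUR = 0.23      # illustrative, ml.m5.xlarge-class

# Illustrative workload assumptions.
requests_per_month = 500_000
avg_duration_s = 2.0
memory_gb = 3.0

# Lambda: pay per request and per GB-second of execution.
lambda_cost = requests_per_month * (
    avg_duration_s * memory_gb * LAMBDA_PRICE_PER_GB_SECOND + LAMBDA_PRICE_PER_REQUEST
)

# Sagemaker: pay per instance-hour, regardless of how many requests arrive.
sagemaker_cost = 24 * 30 * SAGEMAKER_INSTANCE_PRICE_PER_HOUR

print(f"Lambda   : ${lambda_cost:,.2f} / month (pay per request duration)")
print(f"SageMaker: ${sagemaker_cost:,.2f} / month (pay per instance hour)")
```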

In our next article, we will cover methods and techniques that help work around certain limitations of both endpoint types, along with an in-depth analysis of how to select the one best suited to a specific use case.

Thanks for reading!
