What to expect when you use AWS Lambda
Deploying machine learning solutions often introduces unforeseen challenges. When model outcomes are served as AWS Lambda endpoints, the solution fits naturally into a service-oriented architecture. AWS Lambda is Amazon's serverless computing service: it removes the need to configure servers, operating systems, scaling, and so on. In our trials using Lambda endpoints for an ML model inference pipeline, we noticed widely varying response times for the same inference logic. This affected both cost and user experience, so we set out to understand how the Lambda service executes and how it handles requests.
Lambda life cycle:
Let us look at the Lambda lifecycle to understand where the latency comes from and how Lambda handles requests:
When the Lambda service receives a request, the service first prepares an execution environment:
- Download code: Pull the code from an S3 bucket, or the image from Amazon ECR if the function uses container packaging
- Create environment: Create an environment with the specified memory, runtime, and configuration
- Run initialization code: Run the code outside the Lambda handler, e.g., creating clients for other AWS services or loading the model (see the sketch below)
- Execute handler code: Execute the handler; the service is then ready to respond to requests
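To make the split between initialization code and handler code concrete, here is a minimal sketch of an inference handler; the model path and the predict call are hypothetical placeholders, not our actual code. Everything at module level runs once per execution environment (during the cold start), while the handler body runs on every request:

```python
import json
import pickle

# Initialization code: runs once per execution environment (cold start).
# Loading the model here means warm requests skip this cost entirely.
with open("/opt/model/model.pkl", "rb") as f:  # hypothetical model path
    MODEL = pickle.load(f)

def lambda_handler(event, context):
    # Handler code: runs on every invocation.
    features = json.loads(event["body"])["features"]
    prediction = MODEL.predict([features]).tolist()  # hypothetical model API
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```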
Cold start: The time taken by the service to prepare the function and set up the environment.
A cold start adds latency to the request, but AWS does not charge for this time.
Lambda execution environments handle one request at a time. After the invocation has ended, the execution environment is retained for a period. If another request arrives during this period, the environment is reused to handle the subsequent request. The requests handled during this time are warm requests.
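Because the environment (and anything initialized at module level) survives between invocations, you can verify this reuse yourself with a module-level flag. A minimal sketch:

```python
import time

COLD_START = True       # module level: set once per execution environment
INIT_TIME = time.time()

def lambda_handler(event, context):
    global COLD_START
    start_type = "cold" if COLD_START else "warm"
    COLD_START = False
    # Warm requests log "warm" and a non-zero environment age.
    print(f"{start_type} start, environment age: {time.time() - INIT_TIME:.1f}s")
    return {"statusCode": 200}
```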
Cold starts can be a bigger problem in scenarios like:
- Large image/code size (time to download)
- Large model size (time to load and initialize the model)
- Large memory requirements
Lambda Invocation Patterns:
Let’s look at how Lambda handles parallel requests. If requests arrive simultaneously, the Lambda service scales up and creates multiple execution environments. Because each environment is set up independently, a cold start happens for the first invocation in every environment. So, each simultaneous request experiences a full cold start.
For example, if API Gateway invokes Lambda six times simultaneously, Lambda creates six execution environments, and the total duration of each invocation includes a cold start.
Depending on how many execution environments are available, a subsequent request is either picked up by an existing environment or served by a newly created one.
So, if API Gateway invokes Lambda six times sequentially with a delay between invocations, existing execution environments are reused whenever the previous invocation has completed.
Note: Normally, only the first requests have added latency due to cold starts, but because of the invocation pattern described above, requests in between can also incur it.
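You can reproduce both patterns with a small boto3 script; the function name below is a placeholder. Firing six invocations in parallel forces six environments (six cold starts), while running them one after another lets a warm environment absorb the later ones:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "ml-inference-endpoint"  # placeholder name

def timed_invoke(_=None):
    start = time.time()
    lambda_client.invoke(FunctionName=FUNCTION_NAME, Payload=b"{}")
    return time.time() - start

# Six simultaneous requests: each lands in its own new environment.
with ThreadPoolExecutor(max_workers=6) as pool:
    print("parallel:  ", [f"{t:.2f}s" for t in pool.map(timed_invoke, range(6))])

# Six sequential requests: the same warm environment is reused.
print("sequential:", [f"{timed_invoke():.2f}s" for _ in range(6)])
```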
Experimental Observations:
We created a Python script to invoke the Lambda function at different time intervals. The objective was to observe how response times vary with the gap between requests.
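A simplified version of that script is sketched below; the function name and the exact set of gaps are illustrative:

```python
import time

import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "ml-inference-endpoint"  # placeholder name

# Illustrative gaps (in seconds) between consecutive invocations.
for gap in (0, 30, 60, 120, 300, 600):
    time.sleep(gap)
    start = time.time()
    lambda_client.invoke(FunctionName=FUNCTION_NAME, Payload=b"{}")
    print(f"gap: {gap:>4}s  response time: {time.time() - start:.2f}s")
```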
We observed that requests were billed for durations of roughly:
- 90 s (cold start requests)
- 6–7 s with an init duration (semi-cold start requests)
- 65 ms (warm requests; no added latency, billed duration and duration are almost identical)
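These figures come from the REPORT line that Lambda writes to CloudWatch Logs after every invocation; the Init Duration field appears only when initialization ran. The line below is illustrative (values roughly matching our semi-cold case), not copied from our logs:

```
REPORT RequestId: 00000000-0000-0000-0000-000000000000  Duration: 6500.21 ms  Billed Duration: 6501 ms  Memory Size: 2048 MB  Max Memory Used: 1650 MB  Init Duration: 4200.00 ms
```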
Depending on how long the Lambda service has been idle since the last request, it starts to deallocate resources: first it kills the execution environment (undoing the "create environment" step above), then it removes the downloaded code (undoing the "download code" step).
For semi-cold requests, the Lambda service needs to rerun the initialization code (in the case of container Lambda images, the image is still present, and it is rerun). This time is reported as the init duration.
After some testing, we found that in our setup an idle execution environment lived for about 5 minutes; beyond that, it was no longer kept warm and the function had to be initialized again. So, requests arriving within 5 minutes of the previous one were warm, and requests after a 5-minute gap incurred an init duration. (AWS does not document or guarantee this window, so treat it as an observation, not a contract.)
Proposed Solutions:
Based on our observations, we identified two sets of approaches:
1. Improving cold start response time:
- Reduce container size: use quantization techniques to shrink the model, and keep only minimal dependencies and code in the image.
2. Avoiding cold start latency altogether:
- Reserved and provisioned concurrency
- Function warmers: keep the Lambda warm by periodically sending dummy requests (see the sketch below)
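A warmer is typically a scheduled trigger (for example, an Amazon EventBridge rule firing every few minutes, inside the ~5-minute idle window we observed) plus a short-circuit in the handler so warming pings skip the real inference. A minimal sketch of the handler side; the "warmer" payload key is our own convention, not a Lambda feature:

```python
def handle_inference(event):
    # Placeholder for the real model inference logic.
    return {"statusCode": 200, "body": "prediction"}

def lambda_handler(event, context):
    # Warming ping: the scheduled rule sends a payload like {"warmer": true};
    # returning early keeps the execution environment alive without
    # paying for a full inference run.
    if isinstance(event, dict) and event.get("warmer"):
        return {"statusCode": 200, "body": "warmed"}
    return handle_inference(event)
```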
Conclusion:
To conclude, we set out to understand Lambda's seemingly unpredictable response times and the Lambda service execution because of their impact on cost and user experience. In this post we described how the service handles requests and shared our observations.
Among the proposed solutions, we implemented the model optimization and reduced the container size; a teammate of ours implemented function warmers, as his requirement was to keep the functions always available.
As a key takeaway from this post: always build your package with minimal dependencies, and use quantization, to get the best response times.
Thanks for reading!