How We Decreased AWS Lambda Function Average Duration

Published in

Insider Engineering

5 min read · Jun 6, 2022

Abstract

In this blog post, we examine a problem we experienced in Lambda functions that serve 2 different business purposes. We have been working with AWS to identify the root cause. Our main goal is to keep our systems from being affected by this problem. To that end, we propose a simple method: by switching the AWS region immediately, without a deployment, the same function can run in a different region without any problems.

To give some background on the running service: the App sends messages using a serverless architecture. One Lambda function takes a payload, prepares it in the format the third party expects, and sends the request.

In addition, a different Lambda function reads a payload from Kinesis and writes it to a DB in our internal system.

The problem appears suddenly and can last a long time. We examined it and evaluated the results. At the end of the blog post, we propose a method to keep systems from being affected by it.

Problem

As explained in the abstract, at certain times the average duration increases and the number of invocations decreases in the Lambda function that sends messages. This can delay the delivery of pending messages and increase delivery latency. We opened a case with AWS on the subject, but so far no solution to our problem has been found.

As the number of invocations decreased, push messages that were ready to be sent piled up in SQS while waiting to be consumed.

As Picture-1 shows, the average duration increases at 2 different points.

Picture 1 — Increased Lambda Function Average Duration

The root cause is still unclear, as not all Lambda functions show the same problem.

Picture 2 — Increased Lambda Function Average Duration (Different Lambda Function)

Picture-2 shows the statistics of a Lambda function that is not connected to the third party. There are occasional alarms here as well, although the problem does not always occur.

Investigation Results

In this section, we describe our detailed research on the problem. Since there are no logs in X-Ray or CloudWatch that identify the problem, our communication with AWS Support on this issue continues.

To produce a quick solution, we broke the problem down into 2 main topics:

Main Topics:

The Push API returns a late response
Examining whether this problem occurs in a different region

The Push API returns a late response

One of the tools we use to monitor the Lambda function confirmed that the Push API does not introduce any delay.

Examining whether this problem occurs in a different region

For this topic, we created the same structure in different regions (us-east-2, us-west-2, us-west-1).

us-east-2: Ohio

us-west-2: Oregon

us-west-1: N. California

We tested the same structure with the same number of requests and observed no problems.

Picture 3 — Lambda Function in Different Regions

As seen in Picture-3, the average duration stays at the level we expect.

Solutions and Suggestions

We have implemented a method to change regions quickly so that messages are not delayed. The goal is that no message exceeds its delivery SLA. The underlying problem on the AWS side is still being investigated with AWS Support. If valuable results come out of that, I will share them in a different blog post.

How does the proposed method work?

Picture 4 — App Structure

As you can see in Picture-4, the app sends the message payload to SQS, which triggers the Lambda function. The function prepares the payload in the format the Third-Party App expects and sends the request.

As shown in the previous section, when the problem described above occurs on the Lambda side, we can switch to a different region to prevent latency.
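To make this flow concrete, here is a minimal sketch of an SQS-triggered handler. The message fields (`device_token`, `title`, `text`), the third-party payload shape, and the `send_request` callable are assumptions made for illustration; they are not the actual service's schema or client. The only part taken from AWS itself is that an SQS-triggered Lambda receives its messages under `event["Records"]`, each with a string `body`.

```python
import json
from typing import Any, Callable, Dict


def prepare_payload(message: Dict[str, Any]) -> Dict[str, Any]:
    """Map an internal message to a hypothetical third-party format."""
    return {
        "to": message["device_token"],
        "notification": {"title": message["title"], "body": message["text"]},
    }


def make_handler(send_request: Callable[[Dict[str, Any]], Any]):
    """Build a Lambda-style handler around an injected sender.

    Injecting `send_request` keeps the sketch runnable without network
    access; a real function would call the third party's HTTP API here.
    """

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, int]:
        sent = 0
        # SQS delivers a batch of records; each body is a JSON string.
        for record in event.get("Records", []):
            message = json.loads(record["body"])
            send_request(prepare_payload(message))
            sent += 1
        return {"sent": sent}

    return handler
```

Passing the sender in as a parameter is only a design choice for this sketch, so that the payload preparation can be exercised in isolation.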

Picture 5 — Suggested Approach Schema

We made improvements so that if a Lambda function is not invoked for 2 minutes, or if it returns an error, the same structure works in a different region.

According to the rule defined in CloudWatch, if there is an anomaly on the Lambda side or the average number of messages waiting in SQS increases, the system automatically changes the region and starts sending requests to another region.
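The decision behind such a rule can be sketched as a small pure function. In a real setup this evaluation would be done by CloudWatch alarms (for example on the Lambda `Duration` metric and the SQS `ApproximateNumberOfMessagesVisible` metric) rather than by application code; the `Thresholds` type, the function name, and any threshold values are assumptions for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits; real values depend on the workload and SLA."""
    max_avg_duration_ms: float
    max_avg_queue_depth: float


def should_switch_region(
    durations_ms: Sequence[float],
    queue_depths: Sequence[float],
    thresholds: Thresholds,
) -> bool:
    """Return True when either signal breaches its threshold.

    durations_ms: recent Lambda duration samples in milliseconds.
    queue_depths: recent counts of messages waiting in SQS.
    An empty window is treated as "no breach" for that signal.
    """
    duration_breached = (
        bool(durations_ms)
        and mean(durations_ms) > thresholds.max_avg_duration_ms
    )
    backlog_breached = (
        bool(queue_depths)
        and mean(queue_depths) > thresholds.max_avg_queue_depth
    )
    return duration_breached or backlog_breached
```

The two signals are combined with a logical OR, matching the rule as described: either a duration anomaly or a growing backlog is enough to trigger the switch.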

Let’s assume the regions are A, B, and C:

Suppose the App is sending its requests to region A at time T1. If the problem defined above occurs unexpectedly, a CloudWatch alarm is triggered, and one of the other 2 backup regions is chosen at random and written to Redis.

On the app side, the AWS SDK takes its region value directly from Redis. When CloudWatch raises an alarm, the system waits for a certain threshold and then the app starts sending traffic directly to the other region, so messages do not accumulate in SQS and the problem is solved.

As I mentioned above, this approach does not aim to solve the underlying Lambda problem on the AWS side; it is suggested as a fallback method.
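The region selection step could look roughly like the sketch below. The key name `active_aws_region`, the function names, and the `KeyValueStore` protocol are assumptions for illustration. The store only needs a `set(name, value)` method, which a redis-py client also provides, so an in-memory stand-in can be used to try it out.

```python
import random
from typing import Optional, Protocol, Sequence

REGION_KEY = "active_aws_region"  # hypothetical Redis key name


class KeyValueStore(Protocol):
    """Minimal interface the sketch needs from Redis."""

    def set(self, name: str, value: str) -> object: ...


def pick_fallback_region(
    current: str,
    regions: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one of the backup regions at random, never the current one."""
    candidates = [region for region in regions if region != current]
    if not candidates:
        raise ValueError("no backup region available")
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


def on_alarm(
    store: KeyValueStore,
    current: str,
    regions: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Handle an alarm: choose a backup region and publish it to the store."""
    new_region = pick_fallback_region(current, regions, rng)
    store.set(REGION_KEY, new_region)
    return new_region
```

With regions A, B, and C and the app currently on A, `on_alarm` writes either B or C under the key, and the app picks up that value on its next read.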
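On the app side, reading the region could be sketched as follows. The key name, the default region, and the `resolve_region` helper are assumptions, not the actual implementation. The store only needs a `get(name)` method; a redis-py client returns `bytes` by default, so the sketch decodes the value before use.

```python
from typing import Protocol, Union

REGION_KEY = "active_aws_region"  # hypothetical Redis key name
DEFAULT_REGION = "us-west-2"      # hypothetical default region


class KeyValueStore(Protocol):
    """Minimal interface the sketch needs from Redis."""

    def get(self, name: str) -> Union[bytes, str, None]: ...


def resolve_region(store: KeyValueStore, default: str = DEFAULT_REGION) -> str:
    """Return the region the app should currently send traffic to.

    Falls back to `default` when the key is missing or empty, so the app
    keeps working even if no alarm has ever written a region.
    """
    raw = store.get(REGION_KEY)
    if raw is None:
        return default
    region = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return region.strip() or default
```

The resolved value would then be passed to the AWS SDK when the client is created, for example `boto3.client("sqs", region_name=resolve_region(store))`. Because the region is read at runtime, changing the value in Redis redirects traffic without a deployment.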

Conclusion

This blog post examined a problem we encountered on the Lambda side. As detailed in the abstract, the problem appeared in Lambda functions that do 2 different jobs, and the purpose of this post is to suggest a fallback method so that systems are not affected by it. Since the problem originates from AWS, we proposed and tested a method that keeps our systems unaffected. With this simple method, you can change the AWS region immediately, without a deployment, and run the same structure in a different region as a fallback.