AWS API Gateway — Ways to handle Request Timeout

Pravin Tripathi
6 min read · Oct 25, 2021


I came across a problem where the API request timeout is capped by the vendor (AWS in my case) and cannot be increased beyond the maximum value of 30 seconds. Multiple batches run daily by different API consumers experienced timeouts for some requests, amounting to 3–4% of the total requests sent.


Application Background:

Service Architecture

The Node.js application is based on the microservice pattern and is deployed as containers on Kubernetes in AWS. It started as an MVP, and its purpose is to accept orders from external consumer applications. It also sends webhook notifications to the consumer applications.

What exactly is the problem with the service?

Initially, the service didn't show any problems, but as more consumer applications started using it, high CPU usage was observed. After investigation, the problem was traced to the webhook notification code: it checks for messages from RabbitMQ inefficiently. If a consumer application's webhook is down, the notification code, instead of delaying the message, keeps reprocessing it repeatedly (😱). This inefficiency resulted in high CPU utilization and, indirectly, delayed responses for some of the requests sent in bulk.
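
To illustrate, here is a minimal sketch of the kind of consumer logic that produces this hot loop (the queue name, the sendWebhook helper, and the nack-based retry are my assumptions for illustration, not the actual service code, and I am using Java only to match the snippet later in this post). When the webhook call fails, the message is nacked with requeue=true, so RabbitMQ redelivers it immediately and the consumer keeps spinning on the same message:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HotLoopNotifier {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker host

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("webhook.notifications", true, false, false, null);

        channel.basicConsume("webhook.notifications", false, (tag, delivery) -> {
            // hypothetical HTTP call to the consumer application's webhook
            boolean delivered = sendWebhook(delivery.getBody());
            if (delivered) {
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } else {
                // requeue=true puts the message straight back on the queue, so RabbitMQ
                // redelivers it immediately and the consumer spins on it, burning CPU
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
            }
        }, tag -> { });
    }

    private static boolean sendWebhook(byte[] payload) {
        return false; // stand-in: pretend the consumer's webhook endpoint is down
    }
}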

Each request returns a unique identifier, and if that response is lost, for example due to a timeout, the consumer has no other means of knowing it. The submitted details, however, are still saved in the database.

Well, you might say the service was not designed properly, and I agree. Since this service was developed a few years ago as an MVP, and the number of API consumers keeps growing, the new challenge is how to make this API request reliable.

Let's discuss solutions to the above problem in two scenarios. Effort and time are taken into account while discussing them.

Scenario A: The team that developed the application is available, has access to the service code, and can fix the problem with some effort.

Let's list the possible solutions. Here is what I am currently thinking:

Solution 1:

Increasing the request timeout at the server will not help because the service sits behind the API gateway. I cannot increase the timeout at the API gateway integration, as the maximum value is 30 seconds and the vendor doesn't allow raising it.

I can increase the number of server instances to load-balance the traffic.

It initially looks good, but it still doesn't guarantee the problem won't occur. If the services of 2–3 consumers with a large number of orders to notify are down, the problem is still there.

Solution 2:

Add a queue and allow the consumer application to retrieve the details later using the identifier.

It looks scalable but adds extra complexity to the service design, and the problem remains if the consumer is down.

Solution 3:

Allow the consumer app to retry by sending the same request again.

It is not feasible in this case, as the request can't be sent twice due to a service design constraint: certain field values in the request need to be unique.

Solution 4:

Extract notification code into an independent service that can scale based on load.

It appears to be a straightforward strategy. Given that the application is event-driven, it is a viable solution.

😟 However, it requires splitting the service into two separate sub-services, which increases the effort for the team.

Solution 5:

Fix the notification processing code so that it delays the message instead of processing it repeatedly.

RabbitMQ addresses this with a plugin: by installing the delayed message exchange plugin and using the appropriate exchange type, messages can be scheduled for later delivery. It could fix the problem with minimal effort.
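
As a rough sketch of the fix (the exchange and queue names and the 30-second back-off are assumptions, and again Java is used only to match the snippet later in this post): with the rabbitmq_delayed_message_exchange plugin installed, the exchange is declared with the x-delayed-message type, and a per-message x-delay header tells RabbitMQ how long to hold the message before routing it to the queue.

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.HashMap;
import java.util.Map;

public class DelayedNotificationPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // The plugin adds the "x-delayed-message" exchange type;
            // "x-delayed-type" says how to route once the delay expires.
            Map<String, Object> exchangeArgs = new HashMap<>();
            exchangeArgs.put("x-delayed-type", "direct");
            channel.exchangeDeclare("webhook.delayed", "x-delayed-message", true, false, exchangeArgs);

            channel.queueDeclare("webhook.notifications", true, false, false, null);
            channel.queueBind("webhook.notifications", "webhook.delayed", "notify");

            // Instead of reprocessing a failed notification immediately, republish it
            // with an "x-delay" header (milliseconds); RabbitMQ holds the message and
            // routes it to the queue only after the delay expires.
            Map<String, Object> headers = new HashMap<>();
            headers.put("x-delay", 30000); // assumed 30-second back-off
            AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                    .headers(headers)
                    .build();
            channel.basicPublish("webhook.delayed", "notify", props,
                    "{\"orderId\":\"123\"}".getBytes());
        }
    }
}

So instead of nacking a failed notification straight back onto the queue, the service would republish it to this exchange with a delay.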


🙂 I don't know whether this is the most efficient approach or not. But this solution requires only 2–3 lines of changes in the service code and installing a plugin in RabbitMQ, which shouldn't take more than a day to test and ship to production. It currently looks like the perfect solution for my problem.

Scenario B: Let's assume the above application is a legacy app. The development team has found that the effort required to make the above changes is too high, and the team cannot invest that much time.

It is not an uncommon scenario: a project has lots of services, and some of them are older and written in technology the team has less experience with.
Here is what I am currently thinking:

Solution 1:

Re-implement the service using a modern technology stack.

❌ Too risky.

Learning from history: multiple teams have tried this approach but failed to deliver the service on time. This activity alone requires a dedicated team to handle the migration. Starting it without a proper strategy could degrade the reputation of the company/product in the market and give the upper hand to competitors.

Solution 2:

Create a new API method (DELETE /v1/service/order/{id}) that deletes an order by identifier, using the combination of the client key and the unique id from the request. This new method can purge the record.
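
As a rough sketch (the controller, the X-Client-Key header, and the repository method are assumptions for illustration, not the actual service contract), this purge endpoint could look like the following:

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RestController;

interface OrderRepository { // hypothetical data-access layer
    boolean deleteByIdAndClientKey(String id, String clientKey);
}

@RestController
public class OrderPurgeController {

    private final OrderRepository orders;

    public OrderPurgeController(OrderRepository orders) {
        this.orders = orders;
    }

    @DeleteMapping("/v1/service/order/{id}")
    public ResponseEntity<Void> purgeOrder(@PathVariable("id") String id,
                                           @RequestHeader("X-Client-Key") String clientKey) {
        // Only the combination of client key + unique id may purge the record
        // left behind by a timed-out request.
        boolean removed = orders.deleteByIdAndClientKey(id, clientKey);
        return removed ? ResponseEntity.noContent().build() : ResponseEntity.notFound().build();
    }
}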

Create another new HTTP method with a service contract similar to the original API. Consumers send the same data to this new API (POST /v2/service/order), which acts as a proxy and forwards the request to the original API (POST /v1/service/order) so the order is created. For timed-out requests, the proxy simply purges the record using DELETE /v1/service/order/{id}.

Solution 2 diagram

This pattern ensures that if the request times out, the record is purged, leaving no trace, while the record for a completed request is persisted.
It allows the consumer service to retry the request after some time, since it is a batch job that can enqueue failed requests for a later retry.

The implementation of this new API method could look like the following (restTemplate, v1BaseUrl, and order.getId() are illustrative names; the controller's fields and imports are omitted for brevity):

@PostMapping("/v2/service/order")
public ResponseEntity<Order> createOrder(@RequestBody Order order) {
    try {
        // Forward the same payload to the original API: POST /v1/service/order
        ResponseEntity<Order> response =
                restTemplate.postForEntity(v1BaseUrl + "/v1/service/order", order, Order.class);
        // Request completed in time: return the success response to the consumer
        return response;
    } catch (RestClientException e) {
        // Timed out or failed: call DELETE /v1/service/order/{id} so the trace of the
        // data is removed from the DB, then send a failure response so the consumer can retry
        restTemplate.delete(v1BaseUrl + "/v1/service/order/{id}", order.getId());
        return ResponseEntity.status(HttpStatus.GATEWAY_TIMEOUT).build();
    }
}

Why is this solution perfect?

Given the situation, the above solution appears to work. It requires adding two methods: one for deleting an order, and another acting as a proxy that, on failure, deletes the record so the consumer can retry using the already implemented methods.

What is the downside of this approach?

It adds extra pressure on the DB. But given that the problem occurs when a consumer's service is down, it is manageable. It is unlikely that all the consumers will be down at the same time.

✔ I am marking this as an acceptable solution, but not a perfect one (I am still looking for a better one… 😋). It allows the team to allocate a few people to the problem, and the effort is small since the implementation only requires creating new resources instead of modifying the existing implementation. It is comparatively safe to proceed.

The End.


Pravin Tripathi

Software Engineer | I like books on Psychology, Personal development and Technology | website: https://pravin.dev/