How to overcome API Gateway timeouts using WebSocket

Maarten Thoelen
Published in HatchSoftware · 11 min read · May 1, 2021

A how-to guide on integrating WebSocket into a Serverless Framework based AWS Lambda backend.

A while ago, one of our clients asked us to integrate a report generation module into their dashboard platform. The initial setup was simple. The client application requests a report to be generated by providing the necessary parameters. On the API layer a query is executed to fetch the necessary data, some transformation logic is applied and the result is written to a file. This file is stored in an S3 bucket and the download URL is provided back to the client application.

Problem

All these steps were performed in a synchronous fashion because the reports were small and the logic was straightforward. Generating a report only took a couple of seconds.
However, over time the amount of data began to grow and the reports became more complex (aggregation of multiple sources, more complex calculations, report templating, …), causing the generation time to increase drastically. This was no issue for the lambda function that performs all the logic, as its runtime can go up to 15 minutes. However, the API Gateway in front of it has a much stricter integration timeout limit of at most 29 seconds.

Potential solutions

It was clear that a synchronous request-response approach was no longer an option for these long-running report jobs. But what other options did we have? I’ll briefly list them below.

Short polling

In this scenario the client requests data from the server. If the data is not available, the server sends an ‘empty’ response. When the client receives the response, it requests the data again from the server immediately or after a predefined delay. These request-response cycles go on until the data is available on the server and sent back to the client.

In our use case, the client would request the report generation and the server would return a report id. The client would then use this report id to request the download URL from the server, repeating the request until the report generation is finished and it receives the download URL.
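To make this concrete, here is a minimal client-side sketch of such a polling loop; the ‘/reports’ endpoints, the ‘reportId’ field and the 2-second delay are purely hypothetical, not part of the actual project.

// Hypothetical short polling flow: start the report, then poll until the download URL exists.
async function pollForReport(reportParameters: { delay: number; amountOfRows: number }): Promise<string> {
  // The server starts the generation and immediately returns a report id.
  const { reportId } = await fetch('/reports', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(reportParameters),
  }).then((response) => response.json());

  // Keep requesting the download URL until the server has it.
  while (true) {
    const { downloadUrl } = await fetch(`/reports/${reportId}`).then((response) => response.json());
    if (downloadUrl) {
      return downloadUrl;
    }
    // Wait a predefined delay before the next request-response cycle.
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}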

The cost and complexity of implementing a polling mechanism are low, but there is a big disadvantage: this technique wastes a lot of resources. For every cycle a new connection is established, a request is sent, a query for new data is performed, a response (usually with no new data to offer) is generated and sent back, the connection is closed again and any resources are cleaned up. Setting up a connection is the most expensive operation in this process, as it involves many actors (firewalls, load balancers, …) and, if you are using HTTPS (which you should), an expensive TLS handshake.

Long polling

This is an improved version of the short polling scenario explained above. In this scenario the server does not immediately return a response but waits until the data is available or a timeout is near. Depending on the situation, the response may or may not include the data. When the client receives a response without data, it sends a new request (immediately or after a predefined delay). If the response contains data, the process stops.

In our use case the client would request the report generation and the server would stall its answer until the report is generated or the response time nears the 29-second integration timeout.

Because the number of connections and requests is lower compared to short polling, long polling is clearly the better choice of the two. However, the disadvantages above still partly apply. On top of that, long polling is more demanding on the server, which has to keep connections open while it waits.

Server-Sent Events (SSE)

Server-Sent Events is a server push technology enabling a client to receive automatic updates from a server over an HTTP connection. After the connection is established the server can send events to the client until the client closes the connection.

Applied to our use case, the client would send an HTTP request to start the report generation, and the server would respond with an event stream URL the client can use to set up an event source. The client would then start listening on the event source in order to get a push update whenever the report download URL is available.
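As a rough illustration (not the project’s actual code), this is what that could look like in the browser; the ‘/reports’ endpoint and the shape of the event payload are assumptions.

// Hypothetical SSE flow: request the report, then listen on the returned event stream.
async function generateReportWithSse(reportParameters: { delay: number; amountOfRows: number }): Promise<void> {
  // Start the report generation; the server responds with an event stream URL.
  const { eventStreamUrl } = await fetch('/reports', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(reportParameters),
  }).then((response) => response.json());

  // Listen for the push update that carries the report download URL.
  const eventSource = new EventSource(eventStreamUrl);
  eventSource.onmessage = (event: MessageEvent) => {
    const { downloadUrl } = JSON.parse(event.data);
    console.log(`Report ready: ${downloadUrl}`);
    eventSource.close();
  };
}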

Server-Sent Events is an elegant technique that has an auto-reconnect mechanism in case a connection is lost, but it also has a few drawbacks you should be aware of:

  • It only allows for unilateral communication (server → client), so the client cannot use this mechanism to send any information to the server
  • It does not natively support binary types
  • It is not natively supported by Internet Explorer

WebSocket

The WebSocket protocol provides full-duplex communication channels over a single TCP connection. After the initial handshake, it allows for two-way data transfer between client and server, with lower overhead than the polling alternatives.

Applied to our use case, the client would first set up a WebSocket connection to the server. Once the connection is established, it sends a request for the report to be generated. The server, in turn, generates the report and, only when the generation finishes, sends back the report download URL as a response over the same connection.

Unlike SSE, WebSocket unfortunately does not have an auto-reconnect mechanism. For this you either need to write some code yourself or rely on one of the many libraries out there that can do it for you.

As you can see, there are multiple options to avoid API Gateway integration timeouts. Every option has its pros and cons, and picking the right one for your use case will depend on a number of factors (cost, complexity, need for bidirectional and/or realtime communication, browser support, existing infrastructure, …).

For our use case, we decided to go for WebSocket because it was the best fit (bidirectional communication, low network impact, supported by all browsers, … ) and it was easy to integrate with the existing tech stack of the project (API Gateway - Lambda - Serverless Framework).

Below you can find the different steps we performed to move from our original REST API based setup to a WebSocket setup. I’ll explain them by means of a sample project.

Original setup

You can find the slimmed down version of our original setup over here. For simplicity’s sake, I left out the parts that handle advanced connection management (connection loss, …) and authorisation.

The API

The sample API is a Node.js / TypeScript project that I created using the Serverless CLI.

It has one API Gateway endpoint that is backed by a lambda function.

The function code accepts a ReportDefinition request. Based on the request parameters, it waits for the specified delay before generating a dummy report with the specified number of rows. When finished, it stores the report in an S3 bucket and returns a signed URL as the response. This URL can be used by the client to download the generated report directly from S3.
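To give an idea of what this handler does, here is a simplified sketch; the ‘REPORT_BUCKET’ environment variable, the CSV format and the exact names are assumptions, not the project’s literal code.

// A minimal sketch of the report generation handler behind the API Gateway endpoint.
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import * as AWS from 'aws-sdk';

const s3 = new AWS.S3();

interface ReportDefinition {
  delay: number;        // simulated generation time in milliseconds
  amountOfRows: number; // number of dummy rows to write
}

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const definition: ReportDefinition = JSON.parse(event.body || '{}');

  // Simulate the long-running generation step.
  await new Promise((resolve) => setTimeout(resolve, definition.delay));

  // Build a dummy CSV report with the requested number of rows and store it on S3.
  const rows = Array.from({ length: definition.amountOfRows }, (_, i) => `row-${i}`).join('\n');
  const key = `report-${Date.now()}.csv`;
  await s3.putObject({ Bucket: process.env.REPORT_BUCKET!, Key: key, Body: rows }).promise();

  // Return a time-limited signed URL the client can use to download the report directly from S3.
  const downloadUrl = s3.getSignedUrl('getObject', {
    Bucket: process.env.REPORT_BUCKET!,
    Key: key,
    Expires: 300,
  });

  return { statusCode: 200, body: JSON.stringify({ downloadUrl }) };
};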

Don’t forget to check the CloudFormation ‘s3.yml’ and ‘roles.yml’ files in the resources folder. The first file contains the necessary configuration to create a secured S3 bucket for storing the report. The second file contains the configuration of the role that is assigned to the report generation function so that it can store the report in the S3 bucket and generate the signed download URL.

The Serverless CLI makes it very easy to deploy this entire setup to AWS using the command below

sls deploy

When the deployment is done, you can find the HTTP endpoint in the console.

The client

The sample client is an Angular application that I created using the Angular CLI. It only has one page, the Report page.

In this page there is a button that triggers the report generation function.

This function will call the API endpoint, passing in the Report Definition (delay + amount of rows).
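A simplified sketch of that component is shown below; the endpoint placeholder, field names and error message are assumptions, not the project’s literal code.

// A minimal sketch of the Report page component calling the REST endpoint.
import { Component } from '@angular/core';
import { HttpClient } from '@angular/common/http';

interface ReportParameters {
  delay: number;
  amountOfRows: number;
}

@Component({
  selector: 'app-report',
  templateUrl: './report.component.html',
})
export class ReportComponent {
  downloadUrl?: string;
  errorMessage?: string;

  private reportParameters: ReportParameters = { delay: 5000, amountOfRows: 500 };
  // Placeholder: use the HTTP endpoint from the 'sls deploy' output.
  private readonly apiUrl = 'https://<api-id>.execute-api.<region>.amazonaws.com/dev/report';

  constructor(private http: HttpClient) {}

  generateReport(): void {
    this.http.post<{ downloadUrl: string }>(this.apiUrl, this.reportParameters).subscribe({
      next: (response) => (this.downloadUrl = response.downloadUrl),
      error: () => (this.errorMessage = 'Report generation failed'),
    });
  }
}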

The problem

Let’s start with a delay of 5 seconds.

private reportParameters: ReportParameters = {delay: 5000, amountOfRows: 500};

When using a report generation delay of 5 seconds, our API call returns successfully after 5 seconds with a report download URL.

Report generation using REST API with a delay of 5 seconds

But what happens when we increase the report generation delay to 40 seconds?

private reportParameters: ReportParameters = {delay: 40000, amountOfRows: 500};

With a generation delay of 40 seconds we get an error after about 29 seconds.

Report generation using REST API with a delay of 40 seconds

If we check the network tab of our browser’s inspection window, we can see an error response after 29.20 seconds.

This error is caused by the Integration timeout setting on our API Gateway. Unfortunately, this 29-second timeout is a hard limit; when we try to increase it in the AWS console, we get an error message.

AWS Console Integration timeout setting

When we look at the API Gateway limits in the AWS documentation, we can also see that the Integration timeout cannot be increased.

API Gateway Integration timeout limit

At this point it is clear that we have reached the limits of the synchronous approach using a REST API, so let’s change the code to use WebSocket.

Migration to WebSocket

The API

Add a function to handle WebSocket requests

The first thing we need to add to our API project is a new Lambda function that will handle all WebSocket requests. In our ‘serverless.yml’ file we attach a handler and a role to the function and link it to four different websocket events, each handling a specific route (a configuration sketch follows the route overview below).

  • $connect: this route key is a fixed route key that is used whenever a client opens a WebSocket connection
  • $disconnect: this route key is a fixed route key that is used whenever a client closes a WebSocket connection
  • generateReport: this route key is used whenever a client sends a WebSocket request with ‘generateReport’ as the action field in the request data
  • $default: this route key is a fallback key for all actions that are not a $connect, $disconnect or the custom ‘generateReport’ action specified above
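Here is a sketch of what this could look like in ‘serverless.yml’; the handler path and role name are assumptions.

functions:
  websocketHandler:
    handler: src/handlers/websocket.handler   # assumed handler path
    role: WebsocketHandlerRole                # assumed role defined in roles.yml
    events:
      - websocket:
          route: $connect
      - websocket:
          route: $disconnect
      - websocket:
          route: generateReport
      - websocket:
          route: $default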

Let’s see what the TypeScript handler code looks like; a sketch follows the route overview below.

One of the first things we do is grab the ‘routeKey’ from the request context. Based on this key we will perform different actions:

  • $connect: When a connection is opened, we store the connection id in a DynamoDB table. This is done to keep track of all open connections so that we can use them at a later stage to send data to the connected clients. We don’t want to keep these connections forever, so we set a TTL of 1 hour on the DynamoDB item.
  • $disconnect: When a connection is closed, we are removing the connection id from the DynamoDB table.
  • generateReport: When this action is executed, we invoke another Lambda Function that will take care of the report generation (see below). We pass an event that contains the connection id and report definition. Because this other function can run for a long time, we invoke it with the ‘Event’ invocation type. This way the process will not wait for the function to end and we can send the response to the client immediately.
    In a more advanced setup, you would also write some logic here to store additional information (e.g. client id, report generation id, …) in order to handle the case of potential connection loss.
  • $default: For all actions that are different from the ones above, we send a response to the client telling it that the action is not supported.
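A condensed sketch of that handler follows; the table name, environment variables, TTL attribute name and payload shape are assumptions.

// A sketch of the WebSocket handler covering the four routes described above.
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import * as AWS from 'aws-sdk';

const dynamoDb = new AWS.DynamoDB.DocumentClient();
const lambda = new AWS.Lambda();
const CONNECTIONS_TABLE = process.env.CONNECTIONS_TABLE!; // assumed table name via env var

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const { routeKey, connectionId } = event.requestContext;

  switch (routeKey) {
    case '$connect':
      // Store the connection id with a TTL of one hour so stale connections expire automatically.
      await dynamoDb
        .put({
          TableName: CONNECTIONS_TABLE,
          Item: { connectionId, ttl: Math.floor(Date.now() / 1000) + 3600 },
        })
        .promise();
      return { statusCode: 200, body: 'Connected' };

    case '$disconnect':
      // Remove the connection id when the client closes the connection.
      await dynamoDb.delete({ TableName: CONNECTIONS_TABLE, Key: { connectionId } }).promise();
      return { statusCode: 200, body: 'Disconnected' };

    case 'generateReport':
      // Fire-and-forget invocation ('Event') so we can respond before the report is ready.
      await lambda
        .invoke({
          FunctionName: process.env.GENERATE_REPORT_FUNCTION!, // assumed function name via env var
          InvocationType: 'Event',
          Payload: JSON.stringify({
            connectionId,
            reportDefinition: JSON.parse(event.body || '{}').data,
          }),
        })
        .promise();
      return { statusCode: 200, body: 'Report generation started' };

    default:
      // Signal that the action is not supported.
      return { statusCode: 400, body: `Unsupported route: ${routeKey}` };
  }
};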

Add a function for report generation

We add a second function for the report generation to the ‘serverless.yml’ file.
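A sketch of what this entry could look like; the handler path, role name and timeout value are assumptions (the function has no events because it is only invoked by the WebSocket handler).

functions:
  generateReport:
    handler: src/handlers/generateReport.handler  # assumed handler path
    role: GenerateReportRole                      # assumed role with S3 and execute-api permissions
    timeout: 900                                  # allow the full 15-minute Lambda runtime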

This is what the function’s code looks like.

We grab the connection id and report definition from the event. The report definition is used to generate the report. When done, we use the connection id to send the report download URL to the client using the API Gateway management API.
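Below is a condensed sketch of that function; the event shape, environment variables and CSV output are assumptions.

// A sketch of the asynchronous report generation function.
import * as AWS from 'aws-sdk';

const s3 = new AWS.S3();

interface GenerateReportEvent {
  connectionId: string;
  reportDefinition: { delay: number; amountOfRows: number };
}

export const handler = async (event: GenerateReportEvent): Promise<void> => {
  const { connectionId, reportDefinition } = event;

  // Generate the report exactly as in the REST setup (simulated here by the delay).
  await new Promise((resolve) => setTimeout(resolve, reportDefinition.delay));
  const rows = Array.from({ length: reportDefinition.amountOfRows }, (_, i) => `row-${i}`).join('\n');
  const key = `report-${Date.now()}.csv`;
  await s3.putObject({ Bucket: process.env.REPORT_BUCKET!, Key: key, Body: rows }).promise();

  const downloadUrl = s3.getSignedUrl('getObject', {
    Bucket: process.env.REPORT_BUCKET!,
    Key: key,
    Expires: 300,
  });

  // Push the download URL back to the client over its WebSocket connection.
  const managementApi = new AWS.ApiGatewayManagementApi({
    endpoint: process.env.WEBSOCKET_API_ENDPOINT!, // e.g. '<api-id>.execute-api.<region>.amazonaws.com/dev'
  });
  await managementApi
    .postToConnection({ ConnectionId: connectionId, Data: JSON.stringify({ downloadUrl }) })
    .promise();
};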

Add a DynamoDB table to store connection ids

Because we are storing the connection ids of all active connections we need to set up a DynamoDB table. This is done by referencing the CloudFormation template file below from the ‘serverless.yml’ file in the root of the API project.
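A sketch of that template follows; the table and attribute names are assumptions.

# Assumed CloudFormation resource for the connections table, referenced from serverless.yml
Resources:
  WebsocketConnectionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: websocket-connections
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: connectionId
          AttributeType: S
      KeySchema:
        - AttributeName: connectionId
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true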

Now that we’ve made all the necessary API changes, we need to update our infrastructure on AWS by running the command below.

sls deploy

This time, next to the HTTP endpoint, we can also find the WebSocket endpoint in the console output.

The client

First we add a button to generate the report using WebSocket.

The function that is triggered by clicking this button works as follows (a sketch is shown after the description).

It creates a WebSocket connection to the endpoint from the deployment output. Once the connection is open, it sends a ‘generateReport’ action together with the report parameters.

When the report download URL is received as a data message, it is used to show the download link in the UI. When an error occurs it is also shown in the UI. In both cases the WebSocket connection is closed afterwards.
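A simplified sketch of that method, continuing the component sketch from earlier (so this.reportParameters, this.downloadUrl and this.errorMessage are assumed fields); the WebSocket endpoint placeholder and message shape are assumptions.

// Assumed method on the ReportComponent sketched earlier.
generateReportUsingWebsocket(): void {
  // Placeholder: use the WebSocket endpoint from the 'sls deploy' output.
  const socket = new WebSocket('wss://<api-id>.execute-api.<region>.amazonaws.com/dev');

  socket.onopen = () => {
    // The 'action' field routes the message to the generateReport handler.
    socket.send(JSON.stringify({ action: 'generateReport', data: this.reportParameters }));
  };

  socket.onmessage = (message: MessageEvent) => {
    // The report generation function pushes the download URL once the report is ready.
    this.downloadUrl = JSON.parse(message.data).downloadUrl;
    socket.close();
  };

  socket.onerror = () => {
    this.errorMessage = 'Report generation failed';
    socket.close();
  };
}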

The Result

Now let’s try to generate a report with a delay of 40 seconds using WebSocket.

private reportParameters: ReportParameters = {delay: 40000, amountOfRows: 500};
Report generation using WebSocket with a delay of 40 seconds

As expected, the report generation takes about 40 seconds and we successfully receive the report download URL.

You can find the complete code of the final setup over here.

Conclusion

As shown, it doesn’t take a lot of effort to switch from a synchronous REST API setup to an asynchronous WebSocket setup to overcome the API Gateway integration timeout of 29 seconds. But even when you are not hitting this timeout limit, WebSocket can be a great fit for many use cases (social feeds, location-based apps, multiplayer games, …). And on top of that, it can be set up on AWS in no time using the Serverless Framework.

Thanks for reading!

Maarten
