WebSocket API: What does it mean that $disconnect is a best-effort event? How can I count active connections?

Jaewoo Ahn
5 min read · Apr 25, 2020


Amazon API Gateway’s WebSocket API provides the $connect and $disconnect routes to handle the initial upgrade (handshake) request and the disconnect event. $connect is synchronous: the connection is established only after $connect completes successfully. $disconnect is different. Here is what the $disconnect route documentation says:

The $disconnect route is executed after the connection is closed.

The connection can be closed by the server or by the client. As the connection is already closed when it is executed, $disconnect is a best-effort event. API Gateway will try its best to deliver the $disconnect event to your integration, but it cannot guarantee delivery.

What does it mean that $disconnect is a best-effort event? Does it mean $disconnect might not be triggered at all when a client disconnects? Many people worry about this when they use $disconnect to clean up resources (e.g. deleting the connectionId from DynamoDB or decrementing a connection count).

To help you understand, here is a diagram showing what happens when a client is disconnected.

Before moving on, you have to understand that every box in the diagram *could* be a separate, distributed component. If everything happened within a single server, this would be very easy to solve.

Client and Server: How is the connection closed?

In WebSocket, there are multiple ways to close a connection. The server can close it because of the idle timeout, the maximum connection duration, a deleteConnection request, or server maintenance. The client can close it too. There are also “unclean” closes, where no close frame is sent (e.g. the client crashed or the network failed).

When there is an explicit close frame (from either the client or the server), $disconnect is triggered immediately. If there is no close frame, the server doesn’t know the client is gone until the connection hits the idle timeout, so $disconnect may not fire until 10 minutes later.

Server to $disconnect: Can it fail to deliver?

YES. Even though the server triggers the event, it may not be delivered to the $disconnect route. We make our best effort, but there is no 100% guarantee in a distributed system. However, the failure rate in this phase is extremely low (as a reminder, the API Gateway SLA promises at least 99.95% availability).

$disconnect to Lambda (or your integration endpoint)

You configure $disconnect to invoke your integration; let’s say it’s Lambda. API Gateway calls Lambda’s endpoint. Hold on, your function is not invoked yet. The request may not even reach Lambda’s endpoint because of a network failure.

You may consider using SQS for more reliable delivery: instead of calling Lambda directly, API Gateway puts the event on an SQS queue, and the queue triggers Lambda. Still, the call from API Gateway to SQS can fail.

Lambda to your Lambda function

Lambda itself also consists of several distributed components. It is not 100% guaranteed to execute your function.

Additionally, if your function suffers a cold start longer than 29 seconds, the integration request times out on API Gateway. Or, if your concurrent execution limit is too low, Lambda throttles the integration request.

Within your Lambda function

We don’t know what you’re doing within your Lambda function. If you’re trying to remove the connectionId from DynamoDB, that can fail too. Or your Lambda function could have a bug. Developers rarely suspect their own code.

Trust me, this is where I have seen the most failures. Rather than worrying about the event not being delivered, make sure your logic handles it correctly when it is delivered.

Keep your eyes on 410 Gone

Do not rely on $disconnect alone. We recommend implementing “eventual consistency” with the @connections API (postToConnection, deleteConnection, and getConnection). When you get a 410 from the @connections API, remove the connection.
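Here is a minimal sketch of that cleanup in Python. It assumes `api_client` is a boto3 `apigatewaymanagementapi` client and `table` is a boto3 DynamoDB `Table` resource; the function name and the `connectionId` key name are made up for illustration, so adjust them to your own schema.

```python
def send_or_forget(api_client, table, connection_id, payload):
    """Post a message; if the connection is gone, delete its stored record.

    Returns True if API Gateway accepted the message, False if the
    connection was stale and has been cleaned up.
    """
    try:
        api_client.post_to_connection(ConnectionId=connection_id, Data=payload)
        return True
    except api_client.exceptions.GoneException:
        # HTTP 410 Gone: API Gateway no longer knows this connection, so the
        # $disconnect cleanup was missed or has not run yet. Clean up here.
        # "connectionId" is an assumed partition key name.
        table.delete_item(Key={"connectionId": connection_id})
        return False
```

Because deleting an item that no longer exists is harmless, it does not matter whether $disconnect or this function gets there first. Note that the boto3 client has to be created with your API's stage URL as its `endpoint_url`.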

Counting active connections

People commonly ask why there is no active connections metric in the WebSocket API. People also feel that it is hard and expensive to count active connections on their own.

So do we. It is very hard and expensive to count active connections in real time across distributed systems. We will keep looking into whether we can provide it, but for now the disadvantages outweigh the benefits.

Please think again: do you really need to track the active connection count? Is it worth the effort and resources? If you still have a reason (hopefully more than just displaying the number on a dashboard to show it to your executives), here are some ideas I can provide.

The simplest solution is to count the records in your connections storage. However, as you have more connections, count(*) becomes more and more expensive. Simple is not best in this case.
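To make the cost concrete, here is a rough sketch of what count(*) looks like if the storage is DynamoDB (an assumption; `table` is a boto3 `Table` resource and the function name is hypothetical). Even with `Select="COUNT"`, a Scan still reads every item and has to be paged through.

```python
def count_connections(table):
    """Count all items by paging through a Scan with Select='COUNT'."""
    total = 0
    kwargs = {"Select": "COUNT"}
    while True:
        page = table.scan(**kwargs)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages: the whole table has been read.
            return total
        # Continue the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

The number of pages, and the read capacity consumed, grows with the size of the table, which is exactly the problem.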

You can consider implementing a counter: increment it on $connect and decrement it on $disconnect. But hey, you just read about $disconnect. The counter is only as reliable as your $disconnect handling. If you already use $disconnect for other purposes, it is worth adding a counter alongside it. Otherwise, I wouldn’t recommend using $disconnect just for the counter.
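A counter like this could look roughly as follows, again assuming DynamoDB and a boto3 `Table` resource. The item layout (`connectionId` key, a reserved `__counter__` item, an `activeCount` attribute) and the function names are all hypothetical. The sketch only touches the counter when a connection record was really added or removed, so a repeated event does not count twice.

```python
COUNTER_KEY = {"connectionId": "__counter__"}  # hypothetical reserved item


def _bump(table, delta):
    # ADD is applied atomically on the DynamoDB side.
    table.update_item(
        Key=COUNTER_KEY,
        UpdateExpression="ADD #n :delta",
        ExpressionAttributeNames={"#n": "activeCount"},
        ExpressionAttributeValues={":delta": delta},
    )


def on_connect(table, connection_id):
    resp = table.put_item(
        Item={"connectionId": connection_id}, ReturnValues="ALL_OLD"
    )
    if "Attributes" not in resp:  # no previous record, so this is new
        _bump(table, 1)


def on_disconnect(table, connection_id):
    resp = table.delete_item(
        Key={"connectionId": connection_id}, ReturnValues="ALL_OLD"
    )
    if "Attributes" in resp:  # a record was actually removed
        _bump(table, -1)
```

The record write and the counter update are two separate calls, so a failure between them can still leave the counter off by one. You could call `on_disconnect` from your 410 Gone handling as well, so that missed $disconnect events are eventually reflected in the count.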

If you’re not looking for real-time numbers, you can leverage access logging. Periodically query the $connect and $disconnect route logs with CloudWatch Logs Insights and increment or decrement the counter accordingly. Similarly, you can use metrics instead, but route-level metrics (which you need for $disconnect) are available in the paid metrics only. This approach has the advantage that you don’t need to consider failures between $disconnect and your integration, or within your integration.
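The access log approach might be sketched like this. It assumes your access log format is JSON and includes a `routeKey` field (the field name depends entirely on the log format you configured), and that `logs_client` is a boto3 `logs` client. The function name and the `n` alias are made up.

```python
import time

# Assumes each access log entry is JSON with a "routeKey" field.
QUERY = (
    'filter routeKey in ["$connect", "$disconnect"] '
    "| stats count(*) as n by routeKey"
)


def net_connection_delta(logs_client, log_group, start_epoch, end_epoch,
                         poll_seconds=1.0):
    """Return (#$connect - #$disconnect) logged in the given time window."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,  # epoch seconds
        endTime=end_epoch,
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries are asynchronous, so poll until finished.
    while True:
        resp = logs_client.get_query_results(queryId=query_id)
        if resp["status"] == "Complete":
            break
        if resp["status"] not in ("Scheduled", "Running"):
            raise RuntimeError("query ended with status " + resp["status"])
        time.sleep(poll_seconds)

    counts = {}
    for row in resp["results"]:
        # Each row is a list of {"field": ..., "value": ...} cells.
        cells = {cell["field"]: cell["value"] for cell in row}
        counts[cells["routeKey"]] = int(cells["n"])
    return counts.get("$connect", 0) - counts.get("$disconnect", 0)
```

You would run this on a schedule over consecutive, non-overlapping time windows and add each result to a stored counter.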

Alternatively, you can build your own health check. For example, require every client to report its health once per minute. Whenever you receive a health check message from a client, increment the counter for that minute, then use the last completed minute’s counter as the active connection count. However, this becomes more expensive if you want a shorter interval, such as one second. It’s a trade-off between accuracy and cost.
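As a rough illustration of the per-minute bucketing, here is an in-memory sketch. The class and method names are hypothetical, and it tracks a set of connection IDs per minute rather than a plain counter (my own variation, so a client that reports twice in one minute is counted once). In a real deployment the buckets would have to live in a shared store, since Lambda invocations do not share memory.

```python
from collections import defaultdict


class HeartbeatCounter:
    """Counts distinct connections that reported in each one-minute bucket."""

    def __init__(self):
        # minute index (epoch seconds // 60) -> set of connection ids
        self._seen = defaultdict(set)

    def record(self, connection_id, now_epoch):
        """Register a health check message received at now_epoch seconds."""
        self._seen[int(now_epoch // 60)].add(connection_id)

    def active_connections(self, now_epoch):
        """Return the distinct count for the last completed minute."""
        last_complete = int(now_epoch // 60) - 1
        return len(self._seen.get(last_complete, ()))
```

For example:

```python
counter = HeartbeatCounter()
counter.record("a", 0)
counter.record("a", 30)   # same client again, same minute
counter.record("b", 59)
counter.record("c", 61)   # belongs to the next minute
print(counter.active_connections(65))   # 2
```

Old buckets are never pruned here, so a long-running version would also need to drop minutes it no longer reports on.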

If you have a better idea, feel free to share!
