Introducing Pubby — our custom WebSockets solution
How DAZN is broadcasting messages to millions of users using WebSockets
In March 2019, we wrote about whether AWS was ready to provide serverless WebSockets at scale, and concluded that “an off-the-shelf AWS solution won’t be suitable for our needs”, at least not yet. We’ve been running our “custom solution capable of handling millions of users worldwide” for over 10 months and it’s attracted a lot of interest. I’m proud to announce that we’re sharing the tried-and-tested architecture of this service: Pubby.
What does Pubby do?
Pubby is a highly scalable WebSockets service, designed for broadcasting messages to millions of users. Users can subscribe to any number of rooms and receive real-time updates, and can also request a snapshot of the messages previously broadcast to a room. It uses many AWS managed services to achieve this, combining them into a globally available solution. Like most DAZN services, it follows a serverless-first approach.
Why WebSockets?
WebSockets are efficient for real-time applications as users don’t need to poll for updates. A single TCP connection can be shared for many purposes — which is fantastic when we’re transmitting many small messages. Traditionally, each real-time service would expose an HTTP endpoint and expect it to be polled, which doesn’t scale well once you have more than a few services. Making repeated HTTP requests can cause stuttering while playing live videos, especially on CPU-restricted devices such as TVs. WebSockets are also truly real-time, as there’s no waiting for a set interval between requests.
Why don’t more services use WebSockets?
WebSocket connections are persistent, which means a single connection is usually maintained for the entire user session. This results in some connections lasting for many hours, in comparison to just a few milliseconds for HTTP services. Therefore, WebSocket services are more vulnerable during periods of instability. A single server could have hundreds of thousands of open connections, and if the server goes down, every one of those users has to reconnect. This risks triggering a thundering herd of reconnections, knocking your servers over like dominoes.
In short, WebSocket-based services are difficult to manage. Many businesses settle for simple HTTP-based APIs or offload the responsibility onto third-parties, such as Pusher. We love a challenge here at DAZN (come join us!), so when we discovered that AWS wasn’t ready to provide serverless WebSockets at scale, we decided to build our own solution.
Interface Specification
We decided to build this as a generic solution. We wanted to allow any service to use a single, shared WebSocket connection to broadcast messages in real-time. Therefore, the interfaces were our first design decisions to make — how do clients subscribe to rooms for updates, and how do services send messages to Pubby for broadcasting?
Clients — subscribing and receiving messages
Our front-end clients need to open a WebSocket connection, subscribe to any number of rooms, and handle any messages they receive.
Services — sending messages
Communication between services is simple when using HTTP, and Pubby is no exception.
Initial Architecture Designs
After completing our interface specifications, we moved on to the architecture.
Clients — subscribing and receiving messages
We couldn’t handle the WebSocket connections in a serverless manner quite yet, so we opted to use AWS managed services with an ECS cluster. When a client subscribes with dump: true, the task fetches the existing messages for the given room from DynamoDB and uses Redis to cache the result.
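The snapshot lookup described above is a cache-aside read. The following is a minimal sketch of that pattern only: the class name, the in-memory dictionary standing in for Redis, and the `load_from_store` callable standing in for a DynamoDB query are all hypothetical, and cache expiry is left out because the article does not describe it.

```python
from typing import Callable, Dict, List


class SnapshotCache:
    """Cache-aside lookup for room snapshots (illustrative only).

    The dict stands in for Redis and `load_from_store` stands in for a
    DynamoDB query; neither is Pubby's real code.
    """

    def __init__(self, load_from_store: Callable[[str], List[str]]) -> None:
        self._load_from_store = load_from_store
        self._cache: Dict[str, List[str]] = {}
        self.store_reads = 0  # counts how often the backing store was hit

    def snapshot(self, room: str) -> List[str]:
        cached = self._cache.get(room)
        if cached is not None:
            # Cache hit: return a copy so callers cannot mutate the cache.
            return list(cached)
        # Cache miss: read from the backing store, then remember the result.
        self.store_reads += 1
        messages = self._load_from_store(room)
        self._cache[room] = list(messages)
        return list(messages)
```

With this shape, a burst of clients subscribing to the same room with dump: true would cost one read of the backing store rather than one per client.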
Services — sending messages
For broadcasting messages, we were able to use serverless options with ease. Pubby exposes a REST API via an API Gateway. Services make a simple HTTP call, and Pubby then deals with persisting the message in DynamoDB and publishing it to the ECS tasks. Messages are broadcast to all regions via SNS, then a Lambda function in each region publishes them to Redis. When an ECS task receives a message from Redis, it sends the message to every connection subscribed to the matching room.
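The last step, fanning a message out to every connection subscribed to a room, can be pictured as an in-memory registry inside each task. This is a minimal sketch under assumptions: `RoomRegistry`, the connection ids, and the `send` callables are hypothetical names, and the real service would be driven by a Redis subscription rather than direct calls.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, Set

Send = Callable[[str], None]  # hypothetical "write to this socket" callable


class RoomRegistry:
    """Tracks which connections are subscribed to which rooms."""

    def __init__(self) -> None:
        self._rooms: DefaultDict[str, Set[str]] = defaultdict(set)
        self._senders: Dict[str, Send] = {}

    def connect(self, connection_id: str, send: Send) -> None:
        self._senders[connection_id] = send

    def subscribe(self, connection_id: str, room: str) -> None:
        self._rooms[room].add(connection_id)

    def disconnect(self, connection_id: str) -> None:
        # Forget the socket and remove it from every room it had joined.
        self._senders.pop(connection_id, None)
        for members in self._rooms.values():
            members.discard(connection_id)

    def broadcast(self, room: str, message: str) -> int:
        """Send `message` to every subscriber of `room`; return the count."""
        delivered = 0
        for connection_id in list(self._rooms.get(room, ())):
            send = self._senders.get(connection_id)
            if send is not None:
                send(message)
                delivered += 1
        return delivered
```

Keeping the room-to-connection mapping local to each task means a broadcast only touches the sockets that task owns.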
Load testing
Broadcasting messages to a few connections over WebSockets is easy. Doing it at scale is far more difficult, so once we’d built the core service, we immediately started load testing.
Our autoscaling policies were set up, like many services at DAZN, to scale up aggressively and scale down slowly. For example, we:
- Scale up by 50% when average CPU is above 60% for 3 minutes
- Scale up by 100% when maximum CPU is above 85% for 1 minute
- Scale down by 10% when maximum CPU is below 30% for 10 minutes
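To make the three rules concrete, here is a minimal sketch that evaluates them against per-minute CPU samples. It is not how the AWS autoscaling service is configured; the `CpuSample` type, the one-sample-per-minute assumption, the rule ordering (most aggressive first), and the rounding choices are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CpuSample:
    """One minute of CPU readings across all tasks (percentages)."""
    average: float
    maximum: float


def desired_task_count(current: int, samples: Sequence[CpuSample]) -> int:
    """Apply the three rules to samples ordered oldest to newest."""

    def sustained(minutes: int, predicate: Callable[[CpuSample], bool]) -> bool:
        # True only if the condition held for each of the last `minutes` samples.
        recent = samples[-minutes:]
        return len(recent) == minutes and all(predicate(s) for s in recent)

    if sustained(1, lambda s: s.maximum > 85):
        return current * 2                   # scale up by 100%
    if sustained(3, lambda s: s.average > 60):
        return math.ceil(current * 1.5)      # scale up by 50%, rounded up
    if sustained(10, lambda s: s.maximum < 30):
        return max(1, current * 9 // 10)     # scale down by 10%, never below 1
    return current
```

The asymmetry is visible in the numbers: a single hot minute can double the fleet, while it takes ten quiet minutes to shed a tenth of it.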
Scaling up
For any service to scale linearly, doubling the number of tasks or instances (scaling up 100%) should also double the maximum capacity. To test this, we turned off all autoscaling and ran 5 tasks. These tasks were able to handle 1 million WebSocket connections being opened over 5 minutes before we saw a slight increase in latency due to high CPU usage.
We expected that when we started 5 new tasks (total of 10), we’d be able to open another 1 million connections before seeing the same increase in latency. This was not the case — most of the new connections saw increased latency. After a few minutes, some of the existing connections were dropped, indicating service instability.
So, what was happening? Well, AWS Application Load Balancers route requests in a round-robin fashion. This works okay for HTTP-based services, as connections don’t stay open for more than a second in most cases. But, when using WebSockets, the connections are persistent, so round-robin request routing has a huge negative impact on our scalability.
When we added 5 more tasks, the original 5 tasks still had 1 million open connections between them, or roughly 200,000 each. The ALB routed the next 1 million connections evenly across all 10 tasks, rather than trying to level out the load. That left each original task with roughly 300,000 connections while each new task had only about 100,000. After a while, the original 5 tasks became overloaded and were shut down due to failing health checks. When that happens, their connections reconnect, the 5 newer tasks become overloaded in turn and the same issue repeats (and again, and again…)
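The imbalance is easy to reproduce with a toy model. The sketch below compares round-robin placement with a least-connections policy, counting in units of 1,000 connections. Treating each open WebSocket as one unit of load is a simplification for illustration, and neither function reflects how the ALB is implemented internally.

```python
from typing import List


def round_robin(loads: List[int], new_connections: int) -> List[int]:
    """Hand out new connections to each task in turn, ignoring current load."""
    result = list(loads)
    for i in range(new_connections):
        result[i % len(result)] += 1
    return result


def least_connections(loads: List[int], new_connections: int) -> List[int]:
    """Always give the next connection to the task with the fewest open ones."""
    result = list(loads)
    for _ in range(new_connections):
        target = min(range(len(result)), key=result.__getitem__)
        result[target] += 1
    return result


# 5 original tasks holding 200 units each, plus 5 freshly started empty tasks.
before = [200] * 5 + [0] * 5
after_round_robin = round_robin(before, 1000)
after_least_connections = least_connections(before, 1000)
```

In this model, round-robin leaves the original tasks at 300 units and the new ones at 100, while the least-connections policy ends with every task at 200.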
Many load balancers support other routing algorithms, such as least outstanding requests (LOR). For HTTP-based services this is usually a nice optimisation, as it selects the host with the lowest load. For WebSocket services, it’s a requirement. Luckily, AWS listened to our feedback and announced LOR routing at re:Invent 2019. Until then, to avoid instability, we were stuck with running our service at a higher capacity than we needed. This had significant cost implications, so we’re relieved to be able to use the LOR algorithm at last.
Scaling down
Scaling down is usually easy. The load balancer will stop routing new requests to the host, then it can wait for up to an hour for any open connections to finish before terminating it. For HTTP-based services, this works well as the connections aren’t open for long. But what happens with WebSocket connections which are open for many hours? The target will still have thousands of open connections after an hour, which are all forcefully terminated. This also impacts deployments as we need to switch out the targets gracefully.
If many targets are de-registered at the same time, we can easily end up with a thundering herd of reconnections. This would immediately cause the system to scale back up again, causing instability. We needed to be able to gracefully drain the WebSocket connections over some time.
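One way to picture draining over time is to split a task's open connections into evenly sized batches and close one batch per tick. This is only a sketch of the idea: the function name, the window and tick parameters, and the even-split strategy are assumptions, not a description of Pubby's implementation.

```python
from typing import List, Sequence


def drain_batches(
    connection_ids: Sequence[str], window_seconds: int, tick_seconds: int
) -> List[List[str]]:
    """Split connections into batches, one batch to close per tick."""
    if window_seconds <= 0 or tick_seconds <= 0:
        raise ValueError("window_seconds and tick_seconds must be positive")
    ticks = max(1, window_seconds // tick_seconds)
    batches: List[List[str]] = [[] for _ in range(ticks)]
    total = max(1, len(connection_ids))
    for index, connection_id in enumerate(connection_ids):
        # index * ticks // total maps positions 0..total-1 evenly onto ticks.
        batches[index * ticks // total].append(connection_id)
    return batches
```

Closing a small batch every few seconds turns one large reconnection spike into a steady trickle that the remaining tasks can absorb.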
The first idea was to listen to a process signal. We hoped that ECS would send one signal to the task as soon as it’s de-registered (such as SIGTERM), and another when it’s killed (such as SIGKILL). This is not the case. Tasks are only sent a SIGTERM 30s before a SIGKILL, regardless of the load balancer deregistration delay. 30 seconds is nowhere near enough time to slowly drain all the connections.
In the end, the only solution was for each task to poll the AWS API for the target health. However, the task needs to work out which instanceId and port it is running on before it can call the API. The instance metadata API returns the instanceId. Finding the port requires querying the task metadata API to get the taskARN and cluster, followed by calling the ECS.describeTasks API function. Altogether, this is far more complex than it should be. We’re hoping to open-source our implementation soon.
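The last two steps of that lookup can be sketched as follows. This is not the implementation mentioned above. It assumes a boto3-style ELBv2 client is passed in, that the task uses dynamic host port mapping so the port appears under `networkBindings` in the DescribeTasks response, and that a target state of `draining` is the signal to start closing connections. The function names are hypothetical.

```python
from typing import Any, Dict, Optional


def find_host_port(
    describe_tasks_response: Dict[str, Any], container_port: int
) -> Optional[int]:
    """Find the host port mapped to `container_port` in a DescribeTasks response."""
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            for binding in container.get("networkBindings", []):
                if binding.get("containerPort") == container_port:
                    return binding.get("hostPort")
    return None


def is_draining(
    elbv2_client: Any, target_group_arn: str, instance_id: str, port: int
) -> bool:
    """Ask the load balancer whether this instance/port target is draining."""
    response = elbv2_client.describe_target_health(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id, "Port": port}],
    )
    states = [
        description["TargetHealth"]["State"]
        for description in response["TargetHealthDescriptions"]
    ]
    return "draining" in states
```

A task would call `is_draining` on a timer and begin closing connections gradually once it returns true, instead of waiting for SIGTERM.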
Improving efficiency — Connection Tracking
Once we’d run a few successful tests of our scaling behaviour, we wanted to see how many connections each task could handle. We also wanted to see whether changing instance types or task sizes could improve efficiency.
We noticed that there was a hard limit on the number of connections which each instance type could handle. After some searching and support from AWS, we found that there are (unpublished) limits on the number of tracked connections a security group can handle, known as connection tracking or CONNTRACK. By opening our ALB and ECS security groups (relying on our private subnet and network ACLs for security), we were able to avoid this limit.
Unfortunately, when re-running our tests, we experienced similar behaviour. After a lot more searching and help from AWS experts, we found that it was limited by nf_conntrack, which Linux uses for NAT and firewall purposes. On c4.large instances, this is set to 65536 by default, meaning that a c4.large instance can only handle 65k connections. sysctl can increase this limit, for example /sbin/sysctl -w net.netfilter.nf_conntrack_max=196608.
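To see how close an instance is to that limit, the kernel exposes the current and maximum tracked-connection counts under /proc/sys/net/netfilter when the nf_conntrack module is loaded. The helper below is a small sketch for reading them. The function name is hypothetical, and the directory is a parameter so the code can be exercised without a real Linux host.

```python
from pathlib import Path

DEFAULT_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(proc_dir: Path = DEFAULT_DIR) -> float:
    """Return tracked connections as a fraction of nf_conntrack_max."""
    count = int((proc_dir / "nf_conntrack_count").read_text().strip())
    limit = int((proc_dir / "nf_conntrack_max").read_text().strip())
    return count / limit
```

Exporting this fraction as a metric would show an instance approaching the table limit before new connections start being dropped.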
Now, our instances can perform to their best ability. Our next load test was able to scale quickly and handle tens of millions of connections.
Failing faster
During another load test, we deliberately degraded one of our regions with simulated latency. We expected Route 53 latency-based routing to detect this and fail over to another region, but this was not the case. Latency-based routing uses the approximate latency between the user and the AWS region, NOT between the user and your service! We resolved this issue by using finely-tuned Route 53 health checks. For more information, read our article on how to implement the perfect failover strategy using Amazon Route 53.
Iteration & improvements
CloudFront-related issues
Whilst running Pubby in production, we noticed occasional spikes in the error rates reported by CloudFront. The spikes would only last a few minutes, but they were large enough to trigger our CloudWatch alarms. This didn’t correlate to any errors from our service or load balancers, but some users were being sent 502 errors so we had to investigate further.
The CloudFront access logs showed that these spikes were always sent from a single CloudFront edge location, such as FRA2 (Frankfurt). AWS Support helped us to diagnose the errors as “transient networking issues” from the edge locations. They were minor and short-lived incidents which weren’t even noticed by other HTTP-based services, so why was Pubby affected?
WebSockets, with their persistent nature, are vulnerable to network issues. When an edge location has an issue, CloudFront fails over within minutes so all new requests are directed elsewhere. With WebSockets, however, all of the open connections would be dropped and have to reconnect, causing the error rate to spike far higher than it would for a few failed HTTP requests.
After speaking with the CloudFront team, we decided to remove CloudFront from Pubby’s infrastructure. It’s not necessary for a pure WebSockets service. We’d like to introduce AWS Global Accelerator so requests enter the AWS global network as early as possible, but it’s currently missing IPv6 support.
Planned Future Architecture
Once we removed CloudFront, the next improvement on our todo list was to make the API multi-region. If our primary AWS region had an outage, our services would be unable to publish messages. This has been made much easier as single-region DynamoDB tables can now be converted to global tables.
Below is our planned future architecture. CloudFront is gone and the API is now multi-region. SNS has been replaced with DynamoDB Streams to simplify the architecture. The API Gateway has been replaced with an ALB.
Conclusions
Pubby has been running in production for over 10 months and has handled billions of messages. We’re continuing to iterate on the architecture and design to reduce costs whilst increasing availability even more.
AWS is releasing new products and features which might remove the need for Pubby altogether. AppSync saw some fantastic new announcements at re:Invent 2019, especially for real-time applications. Pubby was built to solve a very specific problem and is working well with minimal maintenance, but if AWS becomes ready to provide serverless WebSockets at scale, we’d be very happy to switch over and reduce our operational complexity.