Is AWS ready to provide serverless WebSockets at scale?

AppSync vs WebSockets for API Gateway

Simon Tabor
Mar 11, 2019 · 7 min read

DAZN streams sports to millions of loyal fans worldwide. Our user base demands real-time updates on scores, goals and knockouts. The classic solution is to build a WebSockets service, creating a persistent connection between the user’s browser and the server. But if the server goes down, the user has to reconnect. This risks a thundering herd of reconnections that topples your servers like dominoes.

Having dealt with such complexities around high-scale WebSockets services before, we wanted to try a different approach. Could we use AWS managed services to reduce our operational overhead? Could we take up Werner Vogels’ offer to do our “undifferentiated heavy lifting”?

Our specific requirements

  • Publish updates to users, rather than have the users poll
  • Retrieve previous messages and subscribe to new messages
  • Perform with high-traffic on a global scale

These requirements could easily apply to many organisations; the specific considerations for DAZN are outlined further below.

Publish updates to users, rather than have the users poll

Retrieve previous messages and subscribe to new messages

Perform with high-traffic on a global scale

With these requirements in mind, the two AWS products which look most suited to our needs are AWS AppSync and WebSockets for API Gateway.

Both services are relatively new — AppSync was released in April 2018 and WebSockets for API Gateway was announced at AWS re:Invent 2018 and only made available in December 2018.

AWS AppSync review

How it’s sold

AWS AppSync is a serverless back-end for mobile, web, and enterprise applications.

Uses GraphQL, an API query language designed to build client applications by providing an intuitive and flexible syntax for describing their data requirement.

Virtually unlimited throughput and storage, that scale according to your business needs.

Enables real-time subscriptions across millions of devices

It couldn’t sound any better! A flexible, serverless, massively-scalable, real-time, GraphQL-supporting managed service which doesn’t cost the earth. So does it live up to the claims?

What’s good about it

Fully serverless solution with the ability to integrate with many other services.

Where it falls down

Although multiple methods are available, you can only use a single authentication method at any given time.

Our front-end clients must be able to query data and subscribe for updates, but only our internal services should be able to publish data. Ideally, we would use API Key authentication for the front-end clients and a more-secure IAM authentication for our back-end services.

As AppSync only permits a single method, a workaround would be to use a Lambda authoriser. However, our bursty traffic would hit Lambda’s hard limit on automatic scaling of 500 additional concurrent executions per minute.

The only viable solution at this scale is to use API Key authentication, with one key for the front-end and another for the back-end. But then we’d need to hard-code those keys in the GraphQL resolvers to enforce access control, which presents a significant security risk.

May 15th 2019: AppSync now supports multiple authorisation methods

Requires two requests

Two requests are required to retrieve existing data and subscribe to updates. This has the potential to cause race conditions: the client must subscribe before making the GET request to ensure the user doesn’t miss any messages. It also requires the GET request to return a strongly consistent copy of the data, which rules out caching.
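
The ordering constraint can be illustrated with a minimal in-memory sketch. Nothing here is AppSync code: `InMemoryFeed`, `load_then_follow` and the `{"id": ...}` message shape are hypothetical stand-ins, chosen only to show why the client subscribes first, fetches second, and then de-duplicates.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]  # hypothetical shape: {"id": 1, "body": "..."}


class InMemoryFeed:
    """Hypothetical stand-in for the backend: keeps history, pushes new messages."""

    def __init__(self) -> None:
        self.history: List[Message] = []
        self.subscribers: List[Callable[[Message], None]] = []

    def publish(self, message: Message) -> None:
        self.history.append(message)
        for deliver in self.subscribers:
            deliver(message)

    def subscribe(self, deliver: Callable[[Message], None]) -> None:
        self.subscribers.append(deliver)

    def fetch_all(self) -> List[Message]:
        # Stands in for the strongly consistent GET: it always returns
        # everything published so far, never a stale cached copy.
        return list(self.history)


def load_then_follow(feed: InMemoryFeed,
                     during_gap: Callable[[], None] = lambda: None) -> List[Message]:
    buffered: List[Message] = []
    feed.subscribe(buffered.append)   # 1. subscribe first
    during_gap()                      # messages may arrive before the GET returns
    snapshot = feed.fetch_all()       # 2. then fetch the existing data
    seen = {m["id"] for m in snapshot}
    # 3. keep only buffered messages the snapshot does not already contain
    return snapshot + [m for m in buffered if m["id"] not in seen]


feed = InMemoryFeed()
feed.publish({"id": 1, "body": "kick-off"})
result = load_then_follow(feed, lambda: feed.publish({"id": 2, "body": "goal"}))
print([m["id"] for m in result])  # [1, 2]
```

Reversing steps 1 and 2 would lose any message published between the GET and the subscription, which is the race described above.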

Works best with DynamoDB

DynamoDB has limited capacity per partition. Because each new client needs to retrieve existing data, each one must send a query to AppSync, which could result in millions of requests to a single partition. These reads can’t be cached due to the race conditions mentioned above, so we’d hit the hard limit of 3,000 DynamoDB read units per second for a single partition. This causes two issues: it could take 1m users at least 5 minutes to retrieve data, and the cost of those reads makes it too expensive to be a viable option.
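
The five-minute figure follows directly from the partition limit. As a rough check, assuming each client's query consumes exactly one read unit (an assumption; larger or strongly consistent reads would consume more):

```python
PARTITION_READ_UNITS_PER_SECOND = 3_000  # limit quoted above
CLIENTS = 1_000_000
READ_UNITS_PER_QUERY = 1  # assumption: one read unit per client query


def seconds_to_serve(clients: int, units_per_query: int, units_per_second: int) -> float:
    """Minimum time to serve every client from a single partition."""
    return clients * units_per_query / units_per_second


seconds = seconds_to_serve(CLIENTS, READ_UNITS_PER_QUERY, PARTITION_READ_UNITS_PER_SECOND)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 333 s = 5.6 min
```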

To resolve this, we considered using DAX (DynamoDB Accelerator) for caching, but AppSync doesn’t support DAX. Additionally, DAX wouldn’t suit our need to cache a query, as it only supports invalidations for individual items; queries are cached separately with a fixed TTL.

The only workable solution is to duplicate the items and distribute them across multiple partitions, spreading the load more evenly. However, this introduces complexity in maintenance and it would still be prohibitively expensive.
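
A minimal sketch of that duplication idea, assuming a hypothetical `#<n>` suffix on the partition key: writers store a copy of the item under every suffixed key, and each reader picks one copy at random. The key format and function names are illustrative only, and whether the copies actually land on separate physical partitions is up to DynamoDB.

```python
import random
from typing import List


def shard_keys(base_key: str, shard_count: int) -> List[str]:
    """All partition keys a writer must update when an item changes."""
    return [f"{base_key}#{i}" for i in range(shard_count)]


def pick_read_key(base_key: str, shard_count: int, rng: random.Random) -> str:
    """The single copy a reader queries, chosen at random to spread load."""
    return f"{base_key}#{rng.randrange(shard_count)}"


print(shard_keys("event-42", 3))  # ['event-42#0', 'event-42#1', 'event-42#2']
print(pick_read_key("event-42", 3, random.Random()))
```

Every update now costs `shard_count` writes instead of one, which is the maintenance overhead and extra cost mentioned above.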

WebSockets for API Gateway review

How it’s sold

Historically, building WebSocket APIs required setting up fleets of hosts that were responsible for managing the persistent connections […].

Now, with API Gateway, this is no longer necessary. API Gateway handles the connections [and] lets you build your business logic using HTTP-based backends such as AWS Lambda, Amazon Kinesis, or any other HTTP endpoint.

In short, instead of terminating WebSocket connections on your hosts, API Gateway can do it for you, which allows seamless use of services like Lambda.

Operationally, this sounds fantastic. Holding millions of connections open is the most difficult aspect of scaling WebSockets.

What’s good about it

It has the ability to send messages to each individual connection, making it suitable for many different applications.

Fully serverless

Despite being a lower-level solution compared with AppSync, WebSockets for API Gateway is also fully serverless with many integration options.

Where it falls down

Connection metadata needs to be stored in a database, which we don’t need for our application. This means that a Lambda needs to run for every connection and disconnection, and at our scale we’d hit Lambda’s scaling limits. This might be avoidable by using an AWS Integration and having API Gateway call DynamoDB directly, but there would still be a huge amount of unnecessary database usage.

Broadcasting

There’s no way to broadcast messages to all connected clients with one API call. In fact, you need to make an API call to AWS for each connection you want to send a message to. Publishing a single update to 1m clients requires fetching all 1m connection IDs out of the database and making 1m API calls, which is a deal-breaker.
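
The fan-out looks roughly like the sketch below. The `post_to_connection(ConnectionId=..., Data=...)` call mirrors the method of that name on boto3's `apigatewaymanagementapi` client; `broadcast` and `CountingClient` are hypothetical names, and the fake client exists only so the example runs without AWS credentials.

```python
from typing import Iterable, List


class CountingClient:
    """Hypothetical stand-in for the API Gateway management client."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def post_to_connection(self, *, ConnectionId: str, Data: bytes) -> None:
        # A real client would make one HTTPS request to AWS here.
        self.sent.append(ConnectionId)


def broadcast(client, connection_ids: Iterable[str], payload: bytes) -> int:
    """Send one payload to every connection; returns the number of API calls made."""
    calls = 0
    for connection_id in connection_ids:
        client.post_to_connection(ConnectionId=connection_id, Data=payload)
        calls += 1
    return calls


client = CountingClient()
ids = [f"conn-{i}" for i in range(1_000)]  # in practice, read from the database
print(broadcast(client, ids, b'{"score": "1-0"}'))  # 1000
```

The number of calls grows linearly with the audience: one update for 1m connected clients means 1m iterations of this loop, on top of reading 1m connection IDs.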

New Connection Limits

There is a hard limit of 500 new connections per second per account per region. This means we’d only be able to accept up to 1.8m new connections per hour, which is another deal-breaker.
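
The hourly figure, and what it means for a burst of arrivals, can be checked with a few lines of arithmetic. The 1m-user burst is an illustrative number, not a measured DAZN figure.

```python
NEW_CONNECTIONS_PER_SECOND = 500  # limit quoted above


def connections_per_hour(rate_per_second: int) -> int:
    return rate_per_second * 3_600


def minutes_to_connect(users: int, rate_per_second: int) -> float:
    """Minimum time for a burst of users to all get connected."""
    return users / rate_per_second / 60


print(connections_per_hour(NEW_CONNECTIONS_PER_SECOND))  # 1800000
print(round(minutes_to_connect(1_000_000, NEW_CONNECTIONS_PER_SECOND), 1))  # 33.3
```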

The verdict

WebSockets for API Gateway has huge potential through its versatility and could be suitable for a range of software applications. Unfortunately, at the moment it doesn’t meet our requirements for scale or broadcasting to millions of clients.

What next?

In the meantime, we decided to build our own solution using a combination of many other AWS managed services. Stay tuned for the next post on how we designed and built a custom solution capable of handling millions of users worldwide.

Our post about our custom WebSockets solution, Pubby, is now live! Read it now.

DAZN Engineering

Revolutionising the sport industry
