Building a Scalable Playback API Platform: Infinity and Beyond

Published in

FOX TECH

8 min readJul 22, 2021

FOX is known as a industry leader, serving live and on demand content to millions of users simultaneously — including Super Bowl LIV with an average audience of 3.4 million! Delivering live content on a digital platform has significant challenges, especially when you have extremely high levels of concurrent viewers. The last two years have been very busy for the FOX TECH team, live streaming events like FIFA Women’s World Cup, Thursday Night Football, WWE Smackdown, PBC Pay-Per-View fights, Super Bowl in 2020, NFL Championships 2021, and many more. To handle these marquee events and significant future growth, we recently rebuilt our Playback API platform on a flexible, modern architecture.

At FOX, our Playback API platform is a group of backend services which help to:

Generate streaming playback URL for viewers
Capture analytics from the clients
Display various metadata fields to decorate views
Enforce ad related details for the players

Delivering the above depends on numerous external and internal data points like the user’s subscription status, geolocation, time zone, content availability rules, CDN routing policies, content type, device type and device configuration etc.

Here is high level overview of our legacy playback API platform.

The legacy system used to be a monolithic service which was dependent on data stores like Elasticsearch, DynamoDB and other internal/external services for building the consumer response. This platform was built in nodeJS. As diagram below indicates, this service was tightly coupled with all the dependent components. Due to its monolithic nature, we were unable to scale this service independently.

With our significant upcoming events on the horizon, we made the decision to overhaul this system to address the following challenges:

API platform can be scaled up infinitely independent of any dependent services.
We had high dependency on single backend data store which was the biggest bottleneck when it came to serving contents for a high number of concurrent users.
Legacy system didn’t have any way to cache data. Initially, it was intentionally built this way since with live streaming everything must be dynamic. Game action can change in an instant, so adding long lived cache can easily ruin the user experience. Even though this was the valid use case for not keeping the cache, we discovered that our content refresh goals could be achieved with strategic caching techniques.
In the legacy system internal features were tightly coupled. Due to this, even if a non-critical feature failed, it caused a playback failure which was not supposed to happen.
In the legacy system we didn’t have any way to fail forward in case of issues.
Because of its monolithic nature delivering even a small feature used to take months of development and difficult release processes.

To better support events like the Super Bowl and Thursday Night Football we had following goals in mind:

API platform should be scalable to support for 8–10 million concurrent digital streams with all required metadata.
API platform should be able to handle a thundering herd in case of app crash and simultaneous ad breaks.

App crash: Thundering herd is a very common problem for any system which has high volume of concurrent traffic. In case of live event where millions of concurrent users are connected, if any app crashes, then all the logged in users are going to retry the stream at the same time. This could cause sudden amount of high traffic for API platform and if this is not managed properly then this might bring down the whole system. This becomes even worse, if all clients keep retrying to get into stream as this causes 100x of unexpected traffic.
Ad Break: During live event streaming, if we serve a dynamic ad break at a particular moment then everyone who is into stream will hit backend API simultaneously. In this case, we will have a similar kind of thundering herd problem. Surviving these kinds of challenges while the game is going on is very important to deliver the best possible streaming experience.

API platform should be able to support scenarios like game extensions (due to overtime, rain delays, etc) without causing any additional overhead on existing users.

Game Extensions: Live events are configured for given time duration as we have the next program scheduled to stream. But in case any live game runs into overtime, we need to reprogram the schedule and this causes various elements to reset. This can cause same kind of thundering herd problem which we talked about earlier, so graceful approach is required.

API platform should be able to support Multi-CDN features in order to align provide adequate delivery capacity, fault tolerance, and per user path optimization.
API response time must be under 200ms.
In order to have a faster playback experience, it’s a good idea to serve the stream from nearest location the user is in. In order to support this, we need our API to have knowledge of the user location and be able to route accordingly.
API platform should be resilient to common failures such as infrastructure outages, third party system issues, etc.
API platform should support Site Reliability Engineering (SRE) needs for robust monitoring and end to end observability.

In order to address these challenges, developed a comprehensive plan of action enabling a significantly more robust API platform that could support all future needs.

True Micro Services

We broke down the API platform into true micro services based on following criteria:

Dependency on external Services.
Safeguard playback services from the failures caused by external components, so we created simple Golang Proxies around each external service so that our client doesn’t need to interact with external APIs directly.
Dependency on data stores like Elasticsearch.
Since we could not have dedicated an ES cluster for each micro service. We started treating each ES Index as its own data store and based on that, we wrote dedicated and specialized micro services.
Dependency on internal legacy services.
Since we could not rewrite all the internal legacy services, in order to safeguard ourselves from failures, we treated these as external services and wrote simple proxy services around each component.

Strategic Caching

Since serving requests directly from ES data store for high volume of traffic has its own risk, we added caching for services which were dependent on data store but critical for live streaming.

In order to handle various challenges like thundering herd, game extensions, and app crashs, we built our own Golang based caching solution which has following features:

Leaky bucket cache: Helped us to keep our cache warm all the time without any manual effort.
Stale cache: Helped to serve the contents, in case of External services failures.

Circuit Breaker for External and Internal Services

Circuit breaker is a common micro service pattern, that allows you to build a fault tolerant and resilient system that can survive gracefully when key services are either unavailable or have high latency. The purpose of the circuit breaker pattern is to detect the availability of a service and prevent the caller service from continuously making failed requests.

In order to implement this, we used the Golang Hystrix library which is based on Netflix’s Hystrix.

Because of this pattern we were able to fine tune dependent services based on

Service failure
Service timeout

Based on the type of failure, we could retry to call the dependent service or fail forward to serve the response from stale cache.

API Gateway Throttling

To protect external services which could not scale sufficiently, we used AWS API Gateway throttling backed up by client initiated retry for playback.

SingleFlight

For high volume of ad/analytics requests during the live game, we knew that whenever there is any Ad triggered from the client side, then most of the clients are going to have similar metadata requests. Since these requests are dynamic based on the user, a cache feature like leaky bucket could not help. So, we decided to implement a SingleFlight pattern. In this case if the backend received lots of concurrent requests for the same resource, then the service allows only one request to go to origin, get the data and fill in the cache. The rest of the requests are automatically served from local cache.

We used Golang Singleflight to implement this feature which provides a duplicate function call suppression mechanism.

Client Initiated Retry vs Auto Retry

For major events we added configuration to disable auto retry and let the clients retry in case the throttling limit crossed.

Preparation for Partial vs Major Failure

After safeguarding our API, we still needed to be ready for different types of failures which might not be in our control. To handle these at an API level, we configured operational modes depending on failure.

1. Normal (DEFCON-5): This was the default behavior served to users when things are running smoothly without any issues.

2. Major Failure (DEFCON- 0): In case of any major infrastructure failure, we were ready with a backup cloud region to serve content from stale cache.