Overview of FOX’s Resilient, Low Latency Streaming Video Architecture for Super Bowl LIV

Mayur Srinivasan
Published in FOX TECH
Feb 5, 2021

On February 2nd, 2020, FOX delivered the most live-streamed Super Bowl in history, drawing an average minute audience of 3.4 million viewers. Behind the scenes, FOX’s video engineering team designed an innovative and highly redundant video streaming workflow to support this record-breaking audience with a flawless experience.

We built this cloud-based streaming workflow in-house using a collection of vendor services across transmission, encode, storage, origin shielding, delivery and playback. Our focus was on building resiliency for every component of the video workflow.

This article will cover the following topics:

  • Architectural overview of FOX’s resilient, low latency video streaming workflow
  • Monitoring tools that were built/leveraged for the workflow
  • Testing strategies
  • Rehearsals leading up to game day
  • Game day recap

1. Architectural Overview

The signal flow architecture can be broken down into the following components:

  • Transmission
  • Encode
  • Origin
  • Origin Shield
  • Delivery (Multi CDN)
  • Playback

Transmission

The primary broadcast signal originated from Hard Rock Stadium in Miami and was then sent to FOX’s Master Control in Los Angeles, where commercials, ratings watermarks and closed captioning were inserted. From LA, we used a managed fiber network to deliver the finished signal over four diverse fiber paths to multiple cloud regions.

Encode

For encoding resiliency, we used redundant encoder pipelines deployed in multiple regions. For seamless failover in case of transmission issues on one of the paths, we made sure that the signals were sync locked. Each incoming signal was an RTP-FEC feed with video at 720p60, 20 Mbps, AVC. To achieve low latency, we chose 2-second segments with a live window of 15 segments (30 seconds total). On game day, end users experienced latency of roughly 8–12 seconds behind the feed from Master Control!
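
As a rough illustration of how those numbers relate, here is a minimal Python sketch (not our encoder configuration) relating the 2-second segment duration and 15-segment live window to window depth and an approximate end-user latency; the player buffer depth and pipeline overhead values are assumptions for illustration only.

```python
# Minimal sketch, not production configuration: relating segment duration and
# live window size to window depth and an approximate end-user latency.

SEGMENT_DURATION_S = 2        # 2-second HLS segments chosen for low latency
LIVE_WINDOW_SEGMENTS = 15     # segments retained in the live playlist

live_window_s = SEGMENT_DURATION_S * LIVE_WINDOW_SEGMENTS  # 30 seconds total

# Assumption: players typically hold a few segments of buffer behind the live
# edge, and encode/package/CDN hops add a few more seconds of overhead, which
# is consistent with the observed 8-12 seconds behind Master Control.
PLAYER_BUFFER_SEGMENTS = 3
PIPELINE_OVERHEAD_S = 4

estimated_latency_s = PLAYER_BUFFER_SEGMENTS * SEGMENT_DURATION_S + PIPELINE_OVERHEAD_S

print(f"live window: {live_window_s}s, estimated end-user latency: ~{estimated_latency_s}s")
```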

Origin

We used two redundant origins, one in each region. This gave us geo-redundancy in case an entire cloud region suffered an outage.

Origin Shield

To properly implement our multi-CDN strategy and protect against congestion, we deployed an industry-leading origin shield product to ensure that origin servers weren’t getting hammered with requests and to optimize caching. We also had the necessary knobs and logic to fail over in either of the following scenarios:

  • If a particular feed in a region was down, fail over to the backup feed within the same region
  • If a particular region was down (dual feed failure, encoder failure or an origin issue), fail over to the backup origin in the alternate region

Additionally, if the origin shield itself were to go down, we had the option of switching over to a backup shield from a different partner. The backup shield carried the same redundant failover logic as the primary origin shield.

The ability to switch between primary and backup origin shields was handled through DNS.
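
The failover precedence described above can be summarized with a small sketch. This is a hypothetical illustration assuming simple boolean health probes per feed; it is not the origin shield vendor’s API or our production configuration.

```python
# Hypothetical sketch of the failover precedence: in-region backup feed first,
# then the origin in the alternate region. Health probes are assumed inputs.

from dataclasses import dataclass

@dataclass
class RegionHealth:
    primary_feed_ok: bool
    backup_feed_ok: bool

def select_origin(active: RegionHealth, alternate: RegionHealth) -> str:
    if active.primary_feed_ok:
        return "active-region/primary-feed"
    if active.backup_feed_ok:
        return "active-region/backup-feed"      # feed failover within the same region
    if alternate.primary_feed_ok:
        return "alternate-region/primary-feed"  # region failover to the backup origin
    if alternate.backup_feed_ok:
        return "alternate-region/backup-feed"
    raise RuntimeError("no healthy origin available")

# Switching between the primary and backup origin shield vendors themselves was
# a DNS change, e.g. repointing a shield hostname from one partner to the other.
print(select_origin(RegionHealth(False, True), RegionHealth(True, True)))
```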

Delivery (Multi CDN)

We used a pool of five CDNs for the event. We built a backend CDN decisioning service that ran at each session start and chose a CDN based on the following metrics:

  • Latency: We used test objects embedded within FOX properties to capture real time user metrics for latency. The size of the test objects was representative of our typical video segment size, in order to emulate latency measurements for video segment downloads. These test objects were fronted by each of the CDNs using configurations similar to actual segment delivery.
  • Rebuffering ratio: We utilized player rebuffering % metrics as part of the decisioning process.
  • Number of errors: We utilized player error metrics as part of the decisioning process.

We had protocols established so that if a particular CDN approached its reserved capacity, we would rebalance traffic across the rest of the CDN pool.

Additionally, we had a DNS-based decisioning engine available as a fallback if the primary API-based decisioning service failed. This was part of what we internally referred to as “static” video mode, wherein clients were directed to a heavily cached endpoint that resolved to one of the CDN URLs for playback.
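
To make the decisioning flow concrete, here is a minimal sketch of per-session CDN selection. The metric shape, weights and capacity check are illustrative assumptions, not the production service’s actual API.

```python
# Illustrative sketch of session-start CDN decisioning: score each CDN on
# real-user latency, rebuffering ratio and error rate, skip any CDN that is
# near its reserved capacity, and pick the lowest (best) score.

from dataclasses import dataclass

@dataclass
class CdnMetrics:
    name: str
    test_object_latency_ms: float  # real-user latency to CDN-fronted test objects
    rebuffering_ratio: float       # player rebuffering ratio on this CDN (0-1)
    error_rate: float              # player error rate on this CDN (0-1)
    near_capacity: bool            # protocol: rebalance away when near reserved capacity

def score(c: CdnMetrics) -> float:
    # Lower is better; weights are arbitrary, chosen only for illustration.
    return c.test_object_latency_ms + 1000.0 * c.rebuffering_ratio + 2000.0 * c.error_rate

def pick_cdn(pool: list[CdnMetrics]) -> str:
    candidates = [c for c in pool if not c.near_capacity] or pool
    return min(candidates, key=score).name

pool = [
    CdnMetrics("cdn-a", 120.0, 0.004, 0.001, near_capacity=False),
    CdnMetrics("cdn-b", 95.0, 0.012, 0.002, near_capacity=False),
    CdnMetrics("cdn-c", 80.0, 0.003, 0.001, near_capacity=True),
]
print(pick_cdn(pool))  # the "static" DNS-based mode bypasses this and uses a cached endpoint
```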

Playback

We delivered the Super Bowl to end users through the FOX Sports and FOX NOW apps on the following devices:

  • Apple TV, Roku, Android TV, Fire TV, Chromecast, Oculus VR, iPhone/iPad, Xbox, Android Phone/Tablet, FOXsports.com, FOX.com

The stream for non-4K users was a secure HLS stream delivered with a multi-bitrate video ladder.

Audio was encoded in stereo AAC-LC at 96 Kbps, 48 kHz.

Leading up to the Super Bowl, we focused on two primary areas of work for video playback:

Consistent playback of the secure stream across all CDNs

  • Each CDN had its own recommended tokenization approach for unlocking playback URLs. Hence we built a tokenization service that protected the playback URL for each CDN accordingly (a minimal sketch follows below). We then had to ensure that the tokenized stream URLs corresponding to each CDN were playable across the target device set.
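
As a sketch of the tokenization idea, here is a hypothetical HMAC-based signer; each CDN’s real token format, parameter names and secrets differ, so everything below is an assumption for illustration only.

```python
# Hypothetical per-CDN URL tokenization sketch using a simple HMAC scheme.
# Real CDN token formats and parameters vary; this only shows the shape of
# "sign the playback URL with that CDN's secret before handing it to a player".

import hashlib
import hmac
import time
from urllib.parse import urlencode

def tokenize(playback_url: str, cdn_secret: bytes, ttl_s: int = 3600) -> str:
    expires = int(time.time()) + ttl_s
    signature = hmac.new(cdn_secret, f"{playback_url}|{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{playback_url}?{urlencode({'exp': expires, 'token': signature})}"

# Usage: once the decisioning service picks a CDN, the tokenization service
# signs that CDN's playback URL and the player receives the protected URL.
print(tokenize("https://cdn-a.example.com/live/event/master.m3u8", b"per-cdn-secret"))
```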

Consistent integration with Player QoE analytics

  • We used multiple player QoE platforms to gain insights into real-time player metrics like rebuffering, video start failures, video playback failures, average bitrate, average frame rate, etc.
  • We also fed relevant player QoE metrics into the CDN decisioning engine.

2. Monitoring Tools

There were several tools used to monitor the workflow described above. Here are examples of some of the dashboards that were built:

Transmission

We queried APIs to monitor signal health and logged the results to monitoring dashboards.
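
The polling pattern looked roughly like the sketch below; the endpoint, payload and metrics sink are placeholders, not the actual transmission vendor’s API.

```python
# Placeholder sketch of polling a signal-health API and forwarding the result
# to a metrics/dashboard backend. Endpoints and payload shapes are assumptions.

import time
import requests

def poll_signal_health(health_api: str, metrics_sink: str, interval_s: int = 10) -> None:
    while True:
        health = requests.get(health_api, timeout=5).json()  # e.g. packet loss, FEC corrections
        requests.post(
            metrics_sink,
            json={"metric": "transmission.signal_health", "ts": int(time.time()), "values": health},
            timeout=5,
        )
        time.sleep(interval_s)
```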

Encode, storage and origin shield

We pushed detailed metrics to various dashboards to visualize health.

Delivery

We ingested CDN logs and built dashboards to view key metrics like throughput and error rates.

We also used a synthetic monitoring tool for evaluating CDN health.

Playback

We had dashboards to provide key insights into player QoE metrics.

Additional tools and dashboards included:

  • In-house webpage that could play a URL from a specified CDN
  • Detailed dashboards to monitor the health of backend services
  • Profiling dashboards that displayed latency through various parts of our ecosystem

3. Testing Strategies

Here are some of the highlights:

  • Encoder sync lock testing: We verified that the encoder pipelines were sync locked, ensuring that there was no end-user impact when we failed over either within the same region or to a different region.
  • Chaos testing: We spent many days testing how the system would respond to random failures of key components in the workflow described above.
  • Load testing: We used a third-party tool to load test our user-facing endpoints. We ramped request rates up to 30,000 RPS with various ramp-up intervals to emulate anticipated traffic patterns on game day (a minimal ramp sketch follows this list). We also simulated thundering herd scenarios by switching between various failure modes at high request rates, which simulated potential game day failures and tested the performance of the backup systems.
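
Here is a minimal sketch of a stepped ramp toward the 30,000 RPS peak; the step count and hold durations are assumptions, and the actual load was driven by the third-party tool rather than code like this.

```python
# Illustrative ramp schedule only: step the target request rate up toward the
# 30,000 RPS peak, holding each level long enough to observe backend behavior.

PEAK_RPS = 30_000
RAMP_STEPS = 10
STEP_HOLD_S = 60   # assumed hold time per step

def ramp_schedule(peak_rps: int = PEAK_RPS, steps: int = RAMP_STEPS,
                  hold_s: int = STEP_HOLD_S) -> list[tuple[int, int]]:
    """Return (target_rps, hold_seconds) pairs for each ramp step."""
    return [(peak_rps * (i + 1) // steps, hold_s) for i in range(steps)]

for target_rps, hold in ramp_schedule():
    # In practice the load testing tool drives this rate against the
    # user-facing endpoints while dashboards and failover paths are watched.
    print(f"hold {target_rps:>6} RPS for {hold}s")
```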

We also built a scenario planning document that helped us clearly identify roles and responsibilities across everyone involved in incident detection, escalation and eventual resolution.

4. Rehearsals leading up to game day

We had a huge war room set up that allowed us to host all the key people required to make the event a success. We had representation from more than a dozen partners onsite, along with internal team members representing engineering, product, security and client teams. Seating arrangements were made well in advance (think nerd wedding planning!) to ensure that there were clear lines of communication across the various participants, and run books were created to define clear operational roles.

We had many dress rehearsals, live test events and repetitions leading up to the Super Bowl to ensure that enough ‘muscle memory’ was built up. The NFL wild card, divisional and championship games that preceded the Super Bowl gave us opportunities to make incremental improvements and fine tune the workflow. We also did three days of dress rehearsals just before game day to get everyone familiar with operational procedures. We used a ‘Ground Control’ model for issue tracking and resolution and practiced it during the dress rehearsals, so that everyone would be ready on game day.

5. Game day

Looking back, game day feels like a blur. We had an early start to the day (7am PST) to ensure that all necessary components were in place for the big game. Throughout the course of the day, we encountered several minor hiccups that we were able to quickly mitigate without end user impact, thanks to all the systems and processes that we had put in place.

Beyond the architecture discussed in this article, we had numerous layers of additional redundancy, including an entirely separate, secondary end-to-end workflow built on a different technology stack. We initially had traffic split between these two workflows, but as minor incidents were triaged throughout the game, we ended up routing a majority of the traffic to the primary workflow discussed above. Overall, our video ecosystem held up extremely well throughout the event, with video QoE metrics showing green, including overall rebuffering at less than 1% even with more than 3 million concurrent users.

Overall, it was hugely satisfying to see this newly built video workflow sustain and deliver an extremely successful event! We are very grateful to all of our team members and partners who worked countless hours to make this event such a record-breaking success. Our approach has always been to invest heavily in the engineering and software talent required to build scalable platforms. BTW, if you are interested in joining our team… we’re hiring 😀
