Streaming the Super Bowl: Preparing Hulu’s Systems and Operations for the Biggest Game of the Year
by Brandon Lonac, Principal Software Development Manager, Raymond Khalife, Senior Software Development Lead, Kalyani Narayanan, Principal Technical Program Manager, Travis McCann, Director, Product Management, Justin Goad, Senior Technical Program Management Lead
About a year ago on this blog, we outlined how we scaled our live service for large-scale, popular events like March Madness.
Since then, Hulu has grown to more than 25 million subscribers in the U.S. and we continue to break our concurrency records with nearly every major event. Heading into 2019, we knew this Super Bowl would break records again, so we had to up our game to prepare and scale for the biggest sports event of the year — and it paid off: Hulu had four times more live, concurrent viewers for the Super Bowl this year than in 2018, and we successfully delivered a stable, high quality broadcast of the big game to our viewers, across their favorite devices.
The Hulu tech team focused on three key areas over the last year as we evolved our overall approach to readiness:
- Improving Load Projections: Improving our ability to accurately predict load on our systems for any given event.
- Rapidly Scaling Our Systems: Beefing up critical systems on demand or provide redundancy where that wasn’t an option.
- Wargaming and Operational Readiness: Improving best practices for preparing for a major live TV event.
Improving Concurrency Projections With Historical Viewership Data
This year, we leveraged data from last season to forecast concurrent viewership. By taking historical viewership data and combining it with projected subscriber growth estimates, we were able to model week over week concurrency predictions. Throughout the 2018–2019 season, our estimates had an error rate of +/- 10% when compared with actuals. We also built a matrix of concurrent viewership and extrapolated that into requests per second targets for our critical path systems. Combining both of these new capabilities allowed us to better understand what systems needed to scale, by how much, and by when.
The model gave us a confidence range with min, max, and mean range of predictions. As we got closer to the actual date, the confidence interval would get tighter, until we had a result that looked something like this:
Scaling Our Platforms with a Hybrid Cloud Approach
Most of Hulu’s services run on an internal PaaS we call Donki, which leverages Apache Mesos/Aurora in our data centers and a container orchestration system in the cloud. Donki packages services into containers and the same containers can run in our data centers and the cloud. Although we recently moved some of our highest traffic services to run in the cloud we were able to leverage our data centers in possible failover scenarios. Donki allows us to easily orchestrate deployments to either environment depending on the needs of a particular service. We focused on taking advantage of the auto scaling features in the cloud to better handle unexpected surges in traffic to keep our systems performing well.
We leverage auto scaling in two ways:
- Scaling the cluster that hosts the services.
- Scaling the services themselves.
The clusters that host the services auto scale by adding or removing machines. Auto scaling happens according to rules to shrink or grow based on CPU and Memory reservation thresholds.
The services themselves will auto scale by adding or removing instances depending on metrics such as request per minute per instance, cpu usage per instance, or memory usage per instance. Our production engineering team that’s responsible for observability and simulation (chaos) runs load tests to find the capacity per instance and works with the team to set appropriate auto scaling rules.
One of the critical service areas we needed to scale was the stack that powers our user discovery experience. We took an agile approach to scaling the systems based on the traffic projections and automated regular performance testing to incrementally load and spike the complete end-to-end stack. To help service teams with this, our Production Readiness group provided two functions:
- A load testing tool for teams with realistic test simulations.
- Coordination of end-to-end stress testing across entire stacks of services.
This was a huge effort that freed teams to focus on their specific scaling needs.
Our system has distinct architectural domains that needed to be scaled individually and as a whole. We started with stress testing to find weak points in these domains and the overall system. This led to a break / fix cycle as we solidified each domain of the system. Automated system wide tests ran multiple times a week. This allowed for the rapid iteration of verifying team fixes found in previous runs. Individual teams were also able to stress test their services in isolation to verify improvements prior to larger scale tests being run. Since all the services in the domains use Donki, our PaaS, fine tuning the size of each application cluster was easy. The effort could be then focused on application optimizations and tuning application cluster and scale parameters.
Once the system was able to handle the expected load, we moved on to spike testing to simulate large numbers of users logging at game start or simulation of playback failure.
Different domains scale in different ways. The Discovery Experience focuses on personalized, metadata-rich responses. This can be at odds when scaling up for millions of users. The goal is to give the best possible response for the user at that moment. We focus on caching baseline responses and then personalize on top of that to ensure viewers to find the content they want. We built graceful degradation into the system from the ground up. To achieve the scale that was needed in the system, we made these architectural design decisions.
- Use Asynchronous / Non-Blocking application framework
- Use the Circuit Breaker, Rate Limiting, and Load Shedding patterns
- Use both a local and distributed cache
- Cohesive Client behavior
Our API gateway and edge services use a JVM-based asynchronous event-driven application framework and circuit breakers. This allows thousands of connections to be open against a single application instance at a time. If too many requests stay open too long, it can cause memory pressure. All applications have the point at which they become unresponsive. We used the stress and spike testing to fine tune rate limiting requests to the system to protect it from too much traffic. This allows the system to continue functioning and serve users during extreme spikes while it auto scales, rather than crumple under pressure and serve no one. In the event user traffic was beyond our rate limits, our system would begin to shed load. If requests needed to be shed, circuit breakers in our API layer would trip and send requests to the fallback cluster. This was a highly cached version of our core client application experience that supports requests for our unique users. The Discovery Experience’s main goal is to return a response, this combination of patterns helps to ensure that.
To achieve the low latency responses and scale for the Discovery Experience, caching is relied upon heavily. The nodes use both a local JVM based cache as well as a distributed cache. The nodes cache responses and metadata based on MRU and TTLs. In the event of a cache miss or eviction, the data is fetched from the distributed cache. This combined usage of different caches helped to make the Fallback experience nearly indistinguishable from a normal response.
This brings us to the last point which is cohesive client behavior. Using defined and consistent server APIs, clients can help with scaling as well. Respecting HTTP response codes and headers, clients can help to prevent bombarding in error cases and generating more load in error scenarios. We have seen what inconsistent error handling logic in various clients can do in the past. Using strategies like exponential backoff and variable amount of time when calling the API are simple ways that clients can help to scale. This may seem like a reasonable approach, but it requires a coordinated effort with the numerous clients that we have. It also requires best practices on how early the API should be communicated.
Preparing Failover Streams and Wargaming for Operational Readiness and Redundancy
We spent a lot of time stress testing and scaling our systems, but things don’t always go according to plan. That’s why we always plan for redundancy, especially in critical areas like video playback, which is the heart of our system. Our goal is to ensure the highest levels of availability of the source stream itself. Based on the content partner, the architecture of the source stream pathway can greatly differ, and each workflow presents itself with distinct challenges. So, it’s absolutely necessary to ensure that we implement multiple failover options for the live stream.
Implementing these additional signal pathways involves working closely with our signal providers and partners. Some of the best practices we followed here were establishing multiple, non-intersecting signal pathways for the live stream, ensuring that failover scripts are well-tested and can execute toggling between source streams in a matter of seconds, and preserve the live and DVR experience for users in the event of a failover without any hiccups.
In addition to all the technology involved in getting ready for big events, it was critical to also prepare our organization. As our scale continues to grow, a new initiative that we adopted is to conduct periodic tabletop exercises to pressure test failure scenarios to helps teams better prepare and recover from a potential outage. Wargaming has proved to be a very educational process to build a culture of constant operational readiness and ensure our runbooks are all buttoned up. The practice has also uncovered more failure scenarios we need to prepare mitigation plans for. We identified both short and long-term work needed across many teams that we’ll continue to track into the future.
Beyond the Super Bowl — Automating Our Preparations
We have continued to advance the rigor and discipline to how we approach scaling our service for large events. Teams across the entire technology organization have put countless hours into preparing our systems for more and more viewers.
Some key takeaways we had were:
- By automating our production load tests we were able to build a consistency of always being ready and pushing ourselves further.
- Continuing to improve our forecasting really helped drive the outcomes of what teams needed to deliver.
- By providing a central platform for load testing, we freed engineers up to focus on how to best architect and refactor systems to scale.
The culmination of these efforts led to more confidence going into the Super Bowl this year, but this process still involves some manual elements. As we move forward into the rest of 2019, we plan to dedicate more time towards automation and improvements in these areas. Eventually, we want our projections to automatically refine our capacity predictions to take into account more variables. We also plan to integrate load testing more into our CI/CD pipeline and expand the scenarios we test on a consistent basis to drive towards even better reliability. Though there may always be some oversight that is needed for big events, our goal is to prepare and automate as much as possible to make this process more efficient and streamlined for our team.
If you’re interested in working on projects like these and powering play at Hulu, see our current job openings here.