Scaling Hulu Live Streaming for Large Events: March Madness and Beyond

By Andrew McVeigh, Principal Architect and Justin Goad, Senior Technical Program Management Lead

For many of our viewers, Hulu’s live TV service is their destination for major TV and sporting events like March Madness. For events that draw many concurrent viewers, our team spends considerable time building and testing our systems so they scale and stay resilient to change. With hundreds of thousands of live viewers tuning in to watch March Madness last month, we used several strategies to prepare the service for a high number of concurrent streams while providing the best experience possible. Hulu’s system is composed of over 800 microservices that interoperate to provide a seamless blend of live TV and video on demand, and many of these services needed to be scaled or rewritten to account for the complexities of live TV.

In this post, we will outline how we projected numbers, categorized risk, identified degradation strategies, and validated our readiness. The techniques we used are applicable to scaling any large microservice-based system and give insight into how we looked at different surface areas and used this to streamline and simplify our engineering efforts.

We were primarily interested in tackling two things when we started this journey. The first was being able to quickly scale to support a high number of live TV viewers streaming at the same time (in addition to the normal VOD traffic). The second was handling spikes, where many viewers all attempt to do the same thing in a very short period of time. For example, when the First Four games aired this year, we saw a spike of more than double the load we usually see at that time.

Categorizing Systems for Scale

Our approach to categorization was to start by working out how much the requests-per-second (RPS) load on each service was affected by additional concurrent viewers. We asked our service owners to calibrate against two baseline dates for which we knew both the number of viewers and the load on their service: a standard day and a day with a significant sporting event.

We then bucketed the services into three broad categories: category A (linear: 1 extra RPS for each extra viewer), category B (0.5 extra RPS per extra viewer), and category C (0.3 extra RPS per extra viewer).
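The calibration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Baseline` type, the function names, and the "nearest category" rule are hypothetical, and the viewer and RPS figures in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    viewers: int  # concurrent viewers on the baseline date
    rps: float    # requests per second the service handled on that date


def rps_per_viewer(standard: Baseline, event: Baseline) -> float:
    """Extra RPS per extra concurrent viewer between the two baseline dates."""
    extra_viewers = event.viewers - standard.viewers
    if extra_viewers <= 0:
        raise ValueError("event baseline must have more viewers than the standard day")
    return (event.rps - standard.rps) / extra_viewers


def categorize(slope: float) -> str:
    """Assign the category whose slope is closest (assumed rule, not from the post)."""
    categories = {"A": 1.0, "B": 0.5, "C": 0.3}
    return min(categories, key=lambda name: abs(categories[name] - slope))
```

For instance, a service that went from 40,000 RPS at 100,000 viewers to 240,000 RPS at 300,000 viewers gained one RPS per extra viewer, so `categorize(rps_per_viewer(...))` would place it in category A.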

The next step was to form a calendar of upcoming events, and estimate how many viewers would be watching. To create our estimates, we used a variety of historical sources, including Nielsen data. From these numbers, we calculated the graph below to give our team an understanding of how much extra scale they needed for systems in each category and what the timelines were.

Once we had these expectations in place, we developed load tests to measure whether the services could take the future load. This methodology wasn’t perfect, but it was a starting point that allowed us to create structure around how we approached scaling, and we will evolve this approach moving forward.
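As a rough sketch of how a load-test target could be derived from a viewer estimate, the arithmetic is a baseline plus the category slope times the extra viewers. The function name, the constant table, and all numbers below are assumptions made for illustration; they are not figures from our planning.

```python
# Extra RPS per extra viewer for each category, as defined in the post.
CATEGORY_SLOPE = {"A": 1.0, "B": 0.5, "C": 0.3}


def projected_rps(baseline_rps: float, baseline_viewers: int,
                  expected_viewers: int, category: str) -> float:
    """Estimate the RPS a service must sustain for an upcoming event."""
    extra_viewers = max(0, expected_viewers - baseline_viewers)
    return baseline_rps + CATEGORY_SLOPE[category] * extra_viewers


if __name__ == "__main__":
    # Hypothetical service: 40,000 RPS at 100,000 viewers on a standard day.
    for category in ("A", "B", "C"):
        target = projected_rps(40_000, 100_000, 300_000, category)
        print(f"category {category}: target of about {target:,.0f} RPS")
```

Running the same calculation for every event on the calendar gives each service owner a target per date, which is the kind of information the load tests then try to confirm.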

Surface Areas

Determining surface areas for our system was a crucial simplification step — if we didn’t classify different areas, we would have an intractable problem in analyzing our 800+ microservice architecture.

We chose to divide our system into areas of key functionality: sign up, login, playback, browsing and searching, recording, etc. We were then able to look at each area in depth and examine the hot spot services for scaling. Typically each area had one or two key “edge facing” systems that carried most of the scaling burden.

For example, as we dove into the video playback area we realized that our manifest generator was in the linear scaling category (A). Generating manifests is a critical part of our service, because the manifest tells each client device what live TV content to show every ten seconds or so, so we needed to engineer a more efficient approach. The team found that by reducing the number and variability of query parameters to the service, they were able to put a distributed Varnish cache in front of it. This moved the service from category A to category C.
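To illustrate why fewer and less variable query parameters make a service cacheable, here is a minimal Python sketch of URL normalization. The parameter names (`channel`, `bitrate_profile`, `session`, `ts`) and the example host are hypothetical, and in a Varnish deployment this kind of normalization would more likely be expressed in the cache configuration itself than in application code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: the only parameters assumed to change the manifest.
CACHEABLE_PARAMS = {"channel", "bitrate_profile"}


def normalize_manifest_url(url: str) -> str:
    """Drop per-viewer parameters and sort the rest so equivalent
    requests map to the same cache key."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in CACHEABLE_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    a = "https://example.test/manifest?session=abc&channel=7&bitrate_profile=hd"
    b = "https://example.test/manifest?bitrate_profile=hd&ts=123&channel=7&session=xyz"
    # Both requests collapse to one key, so the second can be served from cache.
    print(normalize_manifest_url(a) == normalize_manifest_url(b))
```

When thousands of viewers on the same channel share a key, the origin does roughly one unit of work per channel per refresh interval rather than one per viewer, which is the effect that lowers the scaling category.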

Preparing for Spikes

Another area of focus was planning for spikes — where many viewers all attempt to do the same thing in a short period of time. Spikes can easily bring down a system, as viewers try the action, fail, and then continue to retry. Game systems, for example, often use a login queue to prevent login spikes from interrupting their services.
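The login-queue idea mentioned above can be sketched as a simple admission queue that lets a fixed number of requests through per time step. This is a generic illustration, not a description of anything in our stack; the class name `LoginQueue` and the `admits_per_tick` parameter are hypothetical.

```python
from collections import deque


class LoginQueue:
    """Admit at most `admits_per_tick` waiting users per tick, in arrival order."""

    def __init__(self, admits_per_tick: int):
        self.admits_per_tick = admits_per_tick
        self.waiting = deque()

    def enqueue(self, user_id: str) -> int:
        """Add a user to the line and return their position (1 = next)."""
        self.waiting.append(user_id)
        return len(self.waiting)

    def tick(self) -> list:
        """Release the next batch of users to the login service."""
        batch_size = min(self.admits_per_tick, len(self.waiting))
        return [self.waiting.popleft() for _ in range(batch_size)]
```

Because users wait in line instead of failing and retrying, the downstream service sees a steady rate of at most `admits_per_tick` logins per tick even when arrivals come in a burst.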

We used load testing to simulate the various types of traffic on our system. We simulated short, large spikes of traffic as well as higher-than-average load for a sustained period of time. Our load testing framework runs arbitrary Python scripts to simulate load. It takes a script that simulates a viewer, builds a Docker container from it, and scales the cluster of instances in the public cloud up or down. The containers are then sent to the scheduler in parallel to run the desired number of workers (hundreds of thousands) and replicate the desired scale.
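The two traffic shapes and the idea of a per-viewer script can be sketched locally as follows. This is a toy under stated assumptions: it uses threads on one machine instead of containers and a scheduler, and `simulate_viewer`, `run_step`, the profile helpers, and the `fetch_manifest` callable are all hypothetical names rather than parts of our framework.

```python
from concurrent.futures import ThreadPoolExecutor


def sustained_profile(duration_s: int, workers: int) -> list:
    """Constant, higher-than-average load for the whole test."""
    return [workers] * duration_s


def spike_profile(duration_s: int, base: int, peak: int,
                  spike_start_s: int, spike_len_s: int) -> list:
    """Base load with a short window at peak load."""
    spike_end_s = spike_start_s + spike_len_s
    return [peak if spike_start_s <= t < spike_end_s else base
            for t in range(duration_s)]


def simulate_viewer(viewer_id: int, fetch_manifest, polls: int) -> int:
    """One simulated viewer: poll the manifest and count successful polls."""
    return sum(1 for _ in range(polls) if fetch_manifest(viewer_id))


def run_step(workers: int, fetch_manifest, polls: int = 3) -> int:
    """Run `workers` simulated viewers concurrently; return total successes."""
    pool_size = max(1, min(workers, 64))  # cap local threads for this sketch
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = pool.map(
            lambda i: simulate_viewer(i, fetch_manifest, polls), range(workers)
        )
        return sum(results)


if __name__ == "__main__":
    # Stand-in for a real HTTP call; always succeeds.
    fake_fetch = lambda viewer_id: True
    for step, workers in enumerate(spike_profile(6, 5, 50, 2, 2)):
        print(step, workers, run_step(workers, fake_fetch))
```

In a real run, `fetch_manifest` would issue requests against the service under test, and each profile entry would be translated into a number of workers to schedule rather than local threads.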

Initially, we built a headless Hulu client application to simulate the user behavior on our new live stack. We built a quick UI on top of it to allow people to easily pick the set of parameters they want (how much load, how long the test will run, specific settings relevant to the test case, etc.). We did this to start empowering teams to run tests against their own systems in production. A simple representation of a playback test would look like:

At the beginning of this process, we scheduled testing against the production environment for two days out of the week, every week. We did this to make sure people were available to monitor the system in case something went wrong, watch how things performed, identify any gaps in monitoring or alerting, and to encourage participation across the entire organization. As we kept pushing things higher and fixing errors that came up during testing, we added additional monitoring and began shifting our mindset to stop announcing when we were going to do a load test event. Our goal is to continually push ourselves to get better and be ready for anything — with live TV there is always something on.

Conclusion & Key Learnings

We are continually putting efforts into making our system more reliable at scale. This is just one area of quality we’re focusing on, but we think it’s a pretty big one. We believe the discipline and rigor around our approach helped us streamline the execution, and we are continually refining how we go about categorizing, measuring, and targeting validation.

Some of the key learnings from this process were:

  • Categorizing service scale helps in reducing the cognitive load of scaling a large microservice architecture.
  • Identifying “hot spots” for each major function was crucial in focusing our efforts.
  • Load testing is always needed to test the system holistically; individual service scaling tests were not sufficient.
  • It is difficult to simulate actual viewers accurately!

In the future we’ll be further expanding our suite of tests, running load tests automatically as part of our CI/CD pipeline, and working to make our framework even more self-service. As we learn more each day, we are also constantly refining the data that feeds our methodology for setting targets and validating how well we think we’re doing.

If you’re interested in working on projects like these and powering play at Hulu, see our current job openings here.