How We Scaled Experimentation At Hulu

Sidhant Bakhru · Published in disney-streaming · May 27, 2021

INTRODUCTION

Experimentation has been part of Hulu’s DNA since the start, but it’s only recently that decentralization has enabled product managers and marketers to truly scale the practice. Over the past year, we’ve evolved A/B experimentation into a habit, helping our decision-making become more scientific and objective while delivering significant business results.

Velocity is a key factor for an experimentation program to be successful. For velocity to hit a rate that moves business metrics meaningfully, it is imperative that all practitioners adhere to the same shared set of goals, processes, and incentives. At Hulu, this journey started with the founding of a Center of Excellence (CoE), a group with the collective belief that experimentation can be a key competitive advantage.

After an in-depth audit of existing practices and diagnosis of pain points, the CoE’s top actions included (1) decentralizing experimentation so marketers and product managers were empowered to run experiments, and (2) educating practitioners and giving them a robust framework to operate with.

PROCESSES & PIPELINES

A key aspect of decentralization was the rollout of ‘Optimization Pods’: tiger teams focused on optimization of a single customer touchpoint (or domain), such as Landing Pages, Signup Flow or Account Management.

Each pod comprises 5–7 subject matter experts who collaborate to hypothesize, build, deliver, and measure experiments. These include:

Optimization Strategist: A product manager or marketer who champions experimentation. This individual maintains the hypothesis backlog, builds and launches experiments, has authority to mediate disagreements, and is accountable for program metrics.

UX Designer: Creates the design vision and breaks it down into iterative, testable phases, usually aligned with Engineering sprints.

Data Scientist: Owns product instrumentation and the data pipelines that ensure experimentation integrity. Brings statistical rigor to the interpretation of results.

Engineer (Dev & QA): Manages experimentation infrastructure and assures build/delivery quality.

Technical Project Manager: Intakes experiments that require developer help, determines the level of effort (LoE), and moves work forward across engineering teams.

Finance Analyst: Translates results into financial metrics such as subscriber LTV, which determine go/no-go decisions for experiment winners.

Once the pods were established, Optimization Strategists spent 4–6 weeks documenting workflows and fine-tuning processes under the CoE’s expert guidance. Artefacts included:

  1. RACI chart, providing a clear delineation of responsibilities between pod specialists.
  2. A “How to” Wiki, with intricate details of experimentation architecture for that domain, workflows and handoffs, the QA process, and pre-launch checklists.
  3. A Hypothesis Backlog + Workflow Tracker + Prioritization Framework, combined into one repository (Hulu uses Airtable).

When used together, this documentation proved invaluable because it:

  • Brought Pod members to a shared understanding of their individual responsibilities within the workflow, reducing the Optimization Strategist’s involvement between steps.
  • Provided air traffic control, which reduced friction between teams and minimized experiment downtime. Each specialist could use the same tracker to view upcoming experiments and plan resources for research, design, development, or analysis.
  • Dramatically increased experimentation velocity.

The documentation helped launch the optimization pods on solid footing — but it was the weekly pod ceremonies that helped the pods maintain momentum. Ceremonies were structured around “actions”: ideate and prioritize, approve/align on experiments, eliminate roadblocks, or share results.

When we began experimenting, our data collection and tracking mechanisms were nascent, and tabulating results was a manual process. Our data scientists have since introduced analytical rigor into the interpretation of results while building the data infrastructure and automated dashboards required to scale the practice.

ANALYSIS FRAMEWORK

A prerequisite for running online experiments is a high-quality data collection and tracking process that ensures the validity of the results. For this, we gather and reconcile data from various sources and run all the necessary data quality checks before launching any tests. Once all checks pass, we enter the experiment design and analysis phases described below.
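
As a concrete illustration, one common validity check in online experimentation is a sample-ratio-mismatch (SRM) test, which compares the observed traffic split against the intended split. The sketch below is a hypothetical example of such a check, not a description of Hulu’s actual pipeline; the function name, intended split, and alpha threshold are assumptions.

```python
# Hypothetical data quality check: a chi-square sample-ratio-mismatch (SRM) test.
# The function name, intended split, and alpha threshold are illustrative only.
from scipy.stats import chisquare

def check_sample_ratio(control_users, treatment_users, expected_split=(0.5, 0.5), alpha=0.001):
    """Flag a mismatch between the observed and intended traffic split."""
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare(f_obs=[control_users, treatment_users], f_exp=expected)
    if p_value < alpha:
        print(f"Possible SRM (p={p_value:.2g}): check assignment/tracking before trusting results.")
        return False
    return True

# Example: a 50/50 test where tracking appears to have dropped some treatment users
check_sample_ratio(control_users=101_000, treatment_users=98_500)
```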

Hypothesis definition:
A proposed explanation made on the basis of limited evidence as a starting point for further investigation. Essentially, this is the idea we want to test. Every hypothesis must be testable, declarative, concise, and logical in order to enable us to iterate in a systematic manner, as well as generalize and confirm our understanding. The general form is:

Based on X, we believe that…
If we do Y…
Then Z will happen…
As measured by metric(s) M.

Metric(s) definition:
At a high level, for every experiment, we define [a] a decision metric that is directly linked to the main goal of running a test, [b] guardrail metrics that need to remain within “safe” windows (often tied to business goals), and [c] supplementary/explanatory metrics that allow us to dig deeper into the users’ mindset and the reasons for performing specific actions. Metrics can be continuous or rate-based (binomial or ratios).
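
To make the taxonomy concrete, here is a minimal sketch of how these three metric tiers might be declared for a hypothetical signup-flow test; the metric names, guardrail window, and schema are illustrative assumptions, not Hulu’s actual configuration.

```python
# Illustrative metric declaration for a hypothetical signup-flow experiment.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Metric:
    name: str
    kind: str                                         # "continuous" or "rate" (binomial/ratio)
    role: str                                         # "decision", "guardrail", or "supplementary"
    safe_window: Optional[Tuple[float, float]] = None # guardrails only

experiment_metrics = [
    Metric("signup_conversion_rate", kind="rate", role="decision"),
    Metric("page_load_time_ms", kind="continuous", role="guardrail",
           safe_window=(0.0, 2000.0)),
    Metric("plan_page_clicks_per_visitor", kind="rate", role="supplementary"),
]
```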

Test Design and power analysis:
To design and power a test, we ask ourselves 3 main questions:

  1. How many false alarms are we willing to tolerate?
    Type-I error rate α: commonly set at α = 0.05. We accept that 5% of the time we will conclude that our treatment has an effect when in fact it doesn’t.
  2. If there is a true difference, how confident do we want to be that we will capture it?
    Power (1 − β): commonly set at 0.8 (i.e., β = 0.2); if our treatment has a true effect, we will detect it 80% of the time.
  3. What is the smallest effect that we consider meaningful to detect?
    MDE (Minimum Detectable Effect): the minimum difference that we care about capturing (e.g., an MDE of 2% means we size the test to reliably detect a difference of at least 2%; smaller differences, even if statistically significant, aren’t worth the resources to detect).

Once we have answered these questions, we calculate the sample size necessary to detect the MDE at the chosen α and power, and translate that into the number of days or weeks we need to run the experiment.
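
As an example of how the three answers translate into a run length, the sketch below uses statsmodels to size a test on a rate metric; the baseline conversion rate, MDE, and daily traffic figures are made up for illustration.

```python
# Illustrative sample-size calculation for a rate metric using statsmodels.
# Baseline rate, MDE, and traffic numbers below are made-up assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha, power = 0.05, 0.80            # Type-I error rate and power (beta = 0.20)
baseline_rate = 0.10                 # current conversion rate
mde_absolute = 0.02                  # smallest lift we care about: 10% -> 12%

effect_size = proportion_effectsize(baseline_rate + mde_absolute, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
)

daily_visitors_per_variant = 500     # hypothetical traffic after a 50/50 split
days_needed = n_per_variant / daily_visitors_per_variant
print(f"Need ~{n_per_variant:,.0f} visitors per variant, i.e. ~{days_needed:.1f} days")
```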

Analyzing results:
After running the experiment for the required duration, we check our results for statistically significant differences in the metrics of interest between our control and treatment variants. If the results are “stat-sig”, we declare our conclusions and come up with actionable recommendations based on the results. To accomplish this, we perform various statistical analyses depending on the use case, metrics, and the underlying distribution of the data. In the simplest cases, we check the confidence intervals and p-values: if the p-value is less than the significance level (or, equivalently, if the confidence interval for the difference does not contain zero), the results are “stat-sig”.
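
For a rate metric in the simplest case, the check might look like the following sketch: a two-sided two-proportion z-test plus a normal-approximation confidence interval for the lift. The conversion counts are made up, and this is one generic approach rather than Hulu’s exact methodology.

```python
# Simplified results check for a rate metric (made-up numbers): two-proportion
# z-test plus a 95% normal-approximation CI for the treatment-vs-control lift.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([4_310, 4_020])    # treatment, control successes
visitors = np.array([40_000, 40_000])     # treatment, control sample sizes

_, p_value = proportions_ztest(count=conversions, nobs=visitors)

p_treat, p_ctrl = conversions / visitors
lift = p_treat - p_ctrl
se = np.sqrt(p_treat * (1 - p_treat) / visitors[0] + p_ctrl * (1 - p_ctrl) / visitors[1])
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

alpha = 0.05
stat_sig = p_value < alpha   # here, the 95% CI excluding zero tells the same story
print(f"lift={lift:.4f}, p={p_value:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f}), stat-sig={stat_sig}")
```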

Automation + Visualization:
In order to reach meaningful velocity and scale, we needed to automate the entire design and analysis process from start to finish, from the power calculations above through the dashboards that report results.
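
The sketch below shows one hypothetical way such a loop could be wired together: pull reconciled counts for each live experiment, run the analysis, and publish a summary. The dataclass and the stubbed data-access and reporting helpers are assumptions for illustration, not Hulu’s actual tooling.

```python
# Hypothetical automated analysis loop; the dataclass and stubbed helpers
# (fetch_counts, publish) are illustrative, not Hulu's actual tooling.
from dataclasses import dataclass
from statsmodels.stats.proportion import proportions_ztest

@dataclass
class ExperimentCounts:
    experiment_id: str
    control_users: int
    control_conversions: int
    treatment_users: int
    treatment_conversions: int

def fetch_counts():
    # Stub: in practice, query the reconciled experimentation tables.
    return [ExperimentCounts("signup_cta_copy_test", 40_000, 4_020, 40_000, 4_310)]

def publish(experiment_id, summary):
    # Stub: in practice, write to an automated results dashboard.
    print(experiment_id, summary)

def run_daily_analysis(alpha=0.05):
    for exp in fetch_counts():
        _, p_value = proportions_ztest(
            count=[exp.treatment_conversions, exp.control_conversions],
            nobs=[exp.treatment_users, exp.control_users],
        )
        lift = (exp.treatment_conversions / exp.treatment_users
                - exp.control_conversions / exp.control_users)
        publish(exp.experiment_id, {"lift": round(lift, 4),
                                    "p_value": round(float(p_value), 4),
                                    "stat_sig": p_value < alpha})

run_daily_analysis()
```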

The steps above were challenging to standardize and implement, but it was equally important to build a culture that embraces and prioritizes data-driven optimization. Without it, all our efforts above would have been useless.

CULTURE OF EXPERIMENTATION

The oft-mentioned ‘culture of experimentation’ is real and important to the success of an experimentation program. But what is it, and how can you begin to instill it in your organization?

At Hulu, a few simple principles have proved successful:

1. Democratizing idea generation. After pod formation, the first 4–6 weeks were used to construct the hypothesis backlog. Hulu did not have a dearth of ideas; they had simply dispersed across teams and trackers over the years. Optimization Strategists aggregated and centralized them into a repository, where the pod could groom them and assign priority scores. Hypotheses were solicited from across the org; a simple intake form allowed anyone to submit an idea (the only conditions being adherence to the IF-THEN-BECAUSE format and inclusion of supporting data).

2. Establishing a process or workflow and relentlessly refining it. At Hulu, we’re now able to take an experiment from idea to launch within ~12 hours in certain cases, but reaching this level of efficiency did not happen overnight. As mentioned, we invested time up front constructing our hypothesis backlog, then documenting and refining a Wiki for our workflow. Over time, we have fine-tuned the process by eliminating redundant steps, clarifying roles, establishing SLAs for pod specialists, and communicating clearly and often. The process is a virtue, not a burden.

3. Being passionate about experimentation. Successful experimentation requires investing time in critical thinking: understanding how users behave (user journeys), developing high-quality hypotheses, and designing experiments thoughtfully. It is tempting to run with the first hypothesis that comes to mind, but we’ve realized that the initial idea is almost never the best one. The winning formula for Hulu this past year? Pick a few hypotheses backed by data, groom them, select the right KPI(s), have a decision tree in place before the fact, and launch experiments only when you’ve exhausted your analysis of the idea.

4. Patience and embracing failure. In most workplaces, leaders demand immediate results; they are unwilling to nurture a program or allow it time to mature. Experimentation requires both. At Hulu, we were lucky to have leaders who understood that even the best-designed processes have kinks, and that most experiments fail. They gave practitioners room to be comfortable with failure, and we gradually learned to embrace it. They also helped us set realistic expectations. In fact, our first few experiments failed to make a dent in our chosen metrics. That’s ok!

5. Transparency. We have made a habit of sharing every result as widely as possible, with detailed interpretations in layperson’s terms. Every quarter, we package results into digestible learnings and share them across the org. We also keep designers and engineers abreast of financial results, making them shared owners of impact to the business. The goal, really, is to keep experimentation front and center through constant communication, eventually creating a success loop that feeds itself.

6. Pushing back against HiPPO. You may be familiar with the concept of HiPPO — Highest Paid Person’s Opinion. Moving away from HiPPO is a cultural and attitude shift that requires time and resilience. At Hulu, we encouraged pod members to push back against HiPPO in their meetings and emails, and to make “let’s test it!” part of their lexicon. Over time, as testing becomes the go-to solution to resolve deadlocks, it will instill a shared respect for data-driven decision-making.

7. The right incentives. Experimentation is time consuming and process-heavy. Why should employees, many of whom may be perfectly happy making their own subjective decisions, invest in it? The answer lies in making it worth their while. At Hulu, practitioners have tied the real and measurable dollar gains from experimentation to performance evaluations. In many cases, the number of experiments and conversion rate improvements have become part of quarterly OKRs. When success is shared equitably and generously, more teams and talent become attracted to the practice.

IMPACT

Adopting these processes and principles has created a several-hundred-percent increase in Hulu’s experimentation velocity YoY, which translates into incremental revenue and margin. These are significant dollar gains from what’s essentially a net-zero investment.

Website optimization has had the added benefit of making our media investment work harder. Hulu has seen significant ROI improvements because those same media dollars generated more paying customers from visitors already hitting our website.

While these are obvious, visible improvements, there are other not-so-obvious benefits. Experimentation has made us humble and hungry. As Hulugans, we’ve always approached projects with research and data, but sometimes no amount of either can prepare us for the reality of unexpected, irrational user behavior. Experimentation has challenged some of our most widely held assumptions about our customers, truths and best practices that we as marketers and product managers took for granted. It has shifted our baseline of what to question and made us humble in the process. This culture of “question everything” has created an insatiable hunger to test and learn, and we are a far better organization because of it.

We hope to continue sharing our progress in future blog posts.

ACKNOWLEDGEMENTS

This post was written in collaboration with Moe Lotfy.

Thank you to all the teams that have embraced our culture of experimentation and scaled the practice: Data & Analytics, Engineering, Marketing, Product, Product Marketing, UX, and Creative. Special thanks to the contributors for this blog post: Amna Aftab, Brian Borkowski, Chris Gorski, Dylan Daniels, Jason Wong, Juyeon Lee, and Sarah Greenberg.

Sidhant Bakhru is Director, Growth Product / Marketing & Experimentation at Hulu (Disney Streaming).