How Duolingo Runs Experiments at Scale

A Conversation with Severin Hacker, Duolingo’s CTO

FirstMark
Feb 17 · 5 min read

Severin Hacker is the co-founder and CTO of Duolingo, the world’s most popular language-learning platform and most downloaded education app, with over 500 million users worldwide. Severin joined FirstMark’s Guilds to share how Duolingo has built a test-driven culture and how the company runs experiments at scale.

Why Duolingo Started Testing Everything

Duolingo’s culture is centered on one key operating principle: “test everything.” In practice, the company launches 30 new experiments per week and runs hundreds of experiments concurrently.

Early in its lifecycle, Duolingo used a third-party tool to launch and manage product experiments. Eventually the tool became too expensive and the team reached the limits of the experimentation it allowed; essentially, they had outgrown it. As a company, they had also determined that testing was so core to the business that they needed to run the infrastructure in-house. In the end, they decided to build their own platform.

How to Empower Experimentation

Any team member (product, engineering, marketing) is empowered to propose and run an experiment. Before doing so, they’re required to write an experiment memo consisting of:

  • Hypothesis
  • Expected outcome
  • Links to related work
  • Audience selection
  • Design and interaction specs
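
One way to make a memo like this hard to skip is to treat it as a structured record. The sketch below is only an illustration: the class name, field names, and the completeness check are assumptions, not Duolingo’s actual template.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ExperimentMemo:
    """Hypothetical memo record; the fields mirror the five sections listed above."""
    hypothesis: str
    expected_outcome: str
    related_work_links: List[str]
    audience_selection: str
    design_specs: str

    def missing_sections(self) -> List[str]:
        """Return the names of any sections that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only.
memo = ExperimentMemo(
    hypothesis="A shorter first lesson increases new lesson starts",
    expected_outcome="",  # left blank on purpose
    related_work_links=["link-to-earlier-experiment"],
    audience_selection="New users only",
    design_specs="See attached mockups",
)
print(memo.missing_sections())
```

A check like this could run before an experiment is allowed to launch, so incomplete memos are caught early.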

Experiment results are available through a custom dashboard that shows how a given experiment impacts every important company metric vs. a control group — covering conversion, engagement, and monetization. (For Duolingo’s context, these are things like new lesson starts, the total number of lessons consumed, the total amount of lesson time, conversion to paid.)
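
To make the “versus a control group” comparison concrete, here is a minimal sketch of the kind of calculation such a dashboard might perform. The metric names and numbers are invented for illustration, and a real dashboard would also report uncertainty (confidence intervals or p-values), which this sketch leaves out.

```python
from typing import Dict


def relative_lift(control: float, treatment: float) -> float:
    """Relative change of treatment vs. control; 0.05 means +5%."""
    if control == 0:
        raise ValueError("control value must be non-zero")
    return (treatment - control) / control


def compare_to_control(control: Dict[str, float], treatment: Dict[str, float]) -> Dict[str, float]:
    """Compute the relative lift for every metric present in the control group."""
    return {name: relative_lift(control[name], treatment[name]) for name in control}


# Hypothetical per-user averages for each group.
control = {"lesson_starts_per_user": 2.0, "paid_conversion_rate": 0.040}
treatment = {"lesson_starts_per_user": 2.1, "paid_conversion_rate": 0.038}

for metric, lift in compare_to_control(control, treatment).items():
    print(f"{metric}: {lift:+.1%}")
```

In this made-up example the change helps engagement and hurts conversion, which is exactly the kind of trade-off a single dashboard view makes visible.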

Benchmark: On average, each PM launches and manages one experiment per week (in addition to experiments launched by other teams, like Engineering).

Moving to Production & Guardrail Metrics

After experiments run their course, team members use the internal dashboard to assess their overall impact on company metrics. When an experiment is clearly beneficial, team members are generally empowered to move those changes straight into production.

There is one major exception to this rule: so-called “guardrail metrics.” Guardrail metrics are those that can never be hurt without a very deliberate, senior-level (even CEO-level) review. For Duolingo, guardrail metrics generally fall within the categories of engagement and retention, not monetization. They have taken the long view of the business and very deliberately do not sacrifice long-term usage for short-term gain.
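
The guardrail rule can be thought of as a gate in front of the launch decision. The sketch below is a simplified illustration: the metric names, the zero tolerance, and the decision strings are all assumptions, and it ignores statistical uncertainty entirely.

```python
from typing import Dict, Iterable

# Hypothetical guardrail metrics in the engagement and retention categories.
GUARDRAIL_METRICS = ("daily_active_users", "day_7_retention")


def launch_decision(
    lifts: Dict[str, float],
    guardrails: Iterable[str] = GUARDRAIL_METRICS,
    tolerance: float = 0.0,
) -> str:
    """Decide what happens to an experiment given relative lifts vs. control."""
    # Any guardrail metric that dropped by more than the tolerance blocks a launch.
    hurt = sorted(m for m in guardrails if lifts.get(m, 0.0) < -tolerance)
    if hurt:
        return "needs senior review: " + ", ".join(hurt)
    # No guardrail is hurt; ship if at least one metric improved.
    if any(lift > 0 for lift in lifts.values()):
        return "ship"
    return "no clear benefit"


print(launch_decision({"paid_conversion_rate": 0.03, "day_7_retention": -0.01}))
print(launch_decision({"paid_conversion_rate": 0.03, "day_7_retention": 0.0}))
```

Note that the first call is routed to review even though monetization improved, which mirrors the priority the article describes.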

Best practice: Run experiments for longer than you think you need to. It’s easy to be fooled by statistics or noise and push something to production as soon as it’s “statistically significant.” Run experiments for at least a few weeks.
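
A small simulation shows why stopping at the first “significant” reading is risky. The sketch below runs A/A tests, where both groups are identical and any significant result is a false positive, and compares checking every day against checking once at the end. The traffic numbers, conversion rate, and use of a pooled two-proportion z-test are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def z_statistic(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z-statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(runs=300, days=21, users_per_day=100, rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with daily peeking, rate when checking only at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    flagged_while_peeking = flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(days):
            # Both arms share the same true conversion rate (an A/A test).
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            significant = abs(z_statistic(conv_a, n, conv_b, n)) > z_crit
            ever_significant = ever_significant or significant
        flagged_while_peeking += ever_significant
        flagged_at_end += significant  # result of the final day's check only
    return flagged_while_peeking / runs, flagged_at_end / runs


peeking, fixed = simulate_aa_tests()
print(f"false positives with daily peeking: {peeking:.1%}")
print(f"false positives checking once at the end: {fixed:.1%}")
```

Because the daily-peeking count includes the final-day check, it can never be lower than the fixed-horizon count, and in practice it tends to be noticeably higher than the nominal 5%.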

The Pros and Cons of a Testing Culture

The results of Duolingo’s testing culture speak for themselves, as the company has grown to become the #1 language learning app with over 500M users worldwide.

Below the surface, some of the other benefits include:

  • Repeatability: Having a repeatable process (essentially, the scientific method) to drive continued product improvement
  • Objectivity: Having an objective system for making decisions around product changes (avoiding alternatives like the “HiPPO,” the highest-paid person’s opinion)
  • Autonomy: Encouraging autonomy, which in turn drives higher product velocity
  • Metric-Driven: Creating a system that can drive improvements in the most important business metrics, while also minimizing the chance of launching catastrophic changes

Of course, no system comes without drawbacks. In this case, a heavy emphasis on testing comes with:

  • Requires Investment: Significant investment in infrastructure
  • Incrementalism: A tendency to find local rather than global maxima
  • Tech Debt: Additional QA overhead
  • Tough Metrics: Difficulty driving certain metrics (say, virality or learning)
  • User Volume: A dependence on a high volume of users and data, which is critical to successful experiments

Building a Testing Culture

Duolingo has very deliberately built a culture of testing. They do this by articulating “operating principles,” which are among the very first things new hires learn. These operating principles are so important that the sessions are typically led by a co-founder.

While companies should adopt principles that work within their own specific cultures, Duolingo’s can serve as an inspiration to other teams. Their operating principles include:

  • Learners first (aka users first)
  • Take the long view
  • Prioritize ruthlessly
  • Test everything
  • Ship it
  • Strive for excellence
  • Be candid and kind

Bonus: Experimentation culture even permeates the hiring process. For prospective PMs, hiring managers share a current in-flight experiment and have the PM walk them through what changes, if any, they would make given the results.

How to Get Started

If you’re in the early stages of building a testing culture in your own company, Duolingo’s experience provides some very instructive guidance on how to get started:

  • Make sure you’re at the right scale for A/B testing. If you have dozens of customers, running A/B tests makes no sense, since you simply don’t have the data. While there’s no hard rule, a soft threshold is around 100,000 users or more.
  • Have the right tools to make experimentation easy. If you want to run a lot of experiments, the marginal cost (whether measured in time or dollars) of each experiment should be small — the cost should be “less than 5% of overhead” — for both technical and non-technical employees. You’ll need to invest in the infrastructure to make running tests easy.
  • Document and train relentlessly. If you want to build a culture where nearly every team member can launch an experiment, you will need very precise documentation on how to execute experiments.
  • Know what to measure… and prioritize what matters. Be thoughtful about what you choose to measure when it comes to experiments — only measure the things that actually affect your business. And perhaps even more importantly, make sure your entire team has a shared understanding of what matters the most.
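
On the first point about scale, a standard sample-size approximation for comparing two conversion rates gives a feel for why small user bases fall short. This is a rough sketch, not Duolingo’s method: the baseline rate, the lift to detect, and the 5% significance and 80% power defaults are all illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each group for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenario: 5% baseline conversion, hoping to detect a 5% relative lift.
print(users_per_arm(baseline=0.05, relative_lift=0.05))
# A much larger effect needs far fewer users.
print(users_per_arm(baseline=0.05, relative_lift=0.20))
```

Under these assumptions, detecting a small relative lift takes well over 100,000 users per group, while larger effects can be detected with far fewer.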

Are you a senior functional leader at a venture-backed tech startup? Apply here to join our Guilds. Members enjoy benefits including invites to private expert events, access to a full archive of playbooks and templates, inclusion in the private forum with other senior leaders, and much more.

To be eligible for Guild Membership, you must be a C- or VP-level leader at a technology company that has raised significant venture capital or grown to tens of millions of revenue.
