Mindful Experimentation

How to optimize your A/B testing roadmap

Vesna Doknic
Google Play Apps & Games

--

Most of us quickly pick up on the headline benefit of experimentation: A/B tests provide reliable answers to “what works” questions, which makes them a widely used technique for product and product marketing optimization.

At a recent Playtime event in Amsterdam, we heard from Duolingo, Dailymotion, and Monzo about how they make sure that their A/B testing efforts bring actionable results and avoid the dreaded “spaghetti testing” approach of running tests for questionable hypotheses. While their answers show that there is no one-size-fits-all solution, it is clear that most developers face very similar challenges.

Duolingo

Duolingo is a language learning app with over 300 million learners. According to Karin Tsai, Senior Engineering Manager, they have run over 3,000 experiments and typically have between 150 and 200 running.

“We have had an A/B testing culture since the beginning,” says Karin. “In the 7 years I’ve been with Duolingo, I have seen first-hand how important and critical A/B testing is to us improving as a company and improving our product. We’ve iterated and improved our testing processes, while making improvements to our analytics framework, to our experiment tools, and the decision-making process. The result is that we are a company that strives to make our app a little bit better every week.”

To drive improvements, Duolingo relies on the data from their A/B tests, using it to make informed decisions about how to build a better product. But creating and executing great tests is not always easy, so Duolingo has developed operating principles to guide their testing. The three key principles are: learners first, test everything, and prioritize ruthlessly.

Learners first

“Our first and probably most important principle is to put our learners first,” says Karin. “Many of us joined Duolingo because we care a lot about the mission of equalizing educational opportunities. So, when designing experiments we always ask ourselves ‘what are the learning experiences we want to create?’”

Everyone at Duolingo is educated about the why behind running experiments: to take Duolingo a step closer to achieving their goals. And, because everyone rallies around this concept, Karin believes it has been easy to create an A/B testing culture that has purpose and direction.

“Every single change to our product is an experiment,” says Karin. “A persistent, incremental approach to A/B testing has seen us increase our D1 retention from 13% on launch up to 55% today.”

Test everything

“To improve incrementally every week, we rely on real data from real users,” says Karin. “And experimenting is how we get that data, so every change we make, we test.” To get to a point where they could test everything, Duolingo needed to scale. When Duolingo launched, only a handful of people could create or run experiments. So, in 2017, they introduced an online training system that teaches anyone at Duolingo about their experiments framework.

“Anyone, from junior engineers to PMs, can learn how to create experiments through our online training program,” says Karin. “Then, after they have completed a short quiz to test their understanding, they get access to our experiments framework and can implement and create experiments.”

This approach has dramatically reduced the friction to creating experiments and, since implementing it, Duolingo has seen a significant increase in the number of experiments being run.

Duolingo has also invested heavily in its analytics framework, so everyone has easy access to meaningful reports. “Every morning we generate a report of all running experiments,” says Karin. “We look at what KPIs they are affecting and whether or not those effects are statistically significant. By creating pretty simple but smart tools, in addition to being aligned on why we run experiments, everyone knows how to run experiments.”
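
For illustration, the daily check behind such a report could look something like the minimal sketch below: a two-proportion z-test over each running experiment. The experiment names and numbers here are hypothetical, not Duolingo’s actual data or tooling.

    import math

    def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
        """Two-sided z-test for the difference between two conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return p_b - p_a, p_value

    # Hypothetical report: (conversions_A, users_A, conversions_B, users_B)
    experiments = {
        "new_streak_screen": (1340, 10000, 1480, 10000),
        "shorter_lessons": (1340, 10000, 1355, 10000),
    }
    for name, (ca, na, cb, nb) in experiments.items():
        lift, p = two_proportion_ztest(ca, na, cb, nb)
        flag = "significant" if p < 0.05 else "not significant"
        print(f"{name}: lift={lift:+.2%}, p={p:.3f} ({flag})")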

Prioritize ruthlessly

“We have millions of learners on Duolingo and only 200 employees,” says Karin. “That means that if we try to test everything that could be tested, we’re not going to have enough time to make actual improvements to our app.”

To ensure that testing focuses on the things that matter, Duolingo kicks off their testing process with a “one-pager”. This one-pager is a short template that helps the author think through the right questions when deciding whether to dedicate resources to an experiment. The questions on Duolingo’s one-pager include:

  • What business metrics or KPIs should change?
  • What is the challenge that we’re trying to solve for our learners?
  • Are there teams or the community in general that need to be informed about this change before it happens?
  • What will success look like for this experiment?
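
As a rough illustration, the questions above could be captured in a structured template. The sketch below is hypothetical: the fields paraphrase the questions, and the example answers are invented, not Duolingo’s actual format.

    from dataclasses import dataclass

    @dataclass
    class OnePager:
        """Hypothetical one-pager for proposing an experiment."""
        title: str
        learner_problem: str          # the challenge we are solving for learners
        target_kpis: list             # business metrics that should change
        stakeholders_to_inform: list  # teams/community to notify beforehand
        success_criteria: str         # defined before the test starts

    proposal = OnePager(
        title="Practice reminder copy test",
        learner_problem="Learners forget to practice and break their streak",
        target_kpis=["D1 retention", "daily active learners"],
        stakeholders_to_inform=["notifications team", "community team"],
        success_criteria="D1 retention up at least 0.5% with no drop in lesson completion",
    )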

“The last question is my favorite,” says Karin. “It is really important because defining success early on prevents drama when we need to decide whether to launch or kill an experiment. It also forces you to come up with a hypothesis for what you think will happen. The process of defining a hypothesis lets us test our product IQ and our intuitions. We are then in a strong position to update them if they were wrong or reinforce them if they were right.”

The one-pager is Duolingo’s defense against spaghetti testing. It ensures that every new experiment gets the right amount of advice and buy-in from all stakeholders, including the CEO, before they start implementing the experiment and dedicating resources to it.

“Up to the point where we start testing, we’re making sure that everything is aligned to the one-pager and the mission for that experiment,” says Karin. “If it’s not, we start again with a new one-pager.”

Typically, executing the A/B test and checking the metrics happen together, with the metrics checked every day to make sure nothing’s going horribly wrong. Usually, it’s after about two weeks that Duolingo will make a decision about whether to continue running the experiment, launch it, or kill it.
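
A hedged sketch of that decision rule in code, with made-up thresholds (Duolingo’s actual criteria come from each experiment’s one-pager):

    def decide(days_running, p_value, lift, min_days=14, alpha=0.05):
        """Toy launch/kill/continue rule applied after ~two weeks of data."""
        if days_running < min_days:
            return "continue"  # too early to call
        if p_value < alpha and lift > 0:
            return "launch"    # significant improvement on the success metric
        return "kill"          # no detectable positive effect in the window

    print(decide(days_running=14, p_value=0.01, lift=0.014))  # -> "launch"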

Everyone at Duolingo is encouraged to share their experiment results at the all-hands meeting, called Parliament, as well as through an internal testing email group. Karin believes this practice creates a culture where everyone is learning what is working and what is not, so they can make better decisions for future experiments. Duolingo also encourages a culture of no shame. “We often learn much more from failed experiments than we do from successful ones,” says Karin.

Through these processes and tools, Duolingo keeps the cost of running and analyzing experiments low. Costs are further controlled by the one-pager process: even though it’s very easy for people to set up an experiment, the process ensures they are running the right experiments.

“This balance of just enough process to prevent spaghetti testing together with empowering everyone to run A/B tests is how we strive to be a company that improves incrementally every week,” says Karin.

From their experience, Karin highlights 3 key things as the catalysts for becoming an experiment-first company:

  • Put your users first and align everyone on how testing contributes to their experience.
  • Test everything by empowering your entire company to run experiments — taking thousands of small steps collectively yields big improvements.
  • Prioritize ruthlessly by implementing processes, like the one-pager, that ensure that you’re working on the right things.

Dailymotion

Founded in 2005, Dailymotion has 250 million users who generate more than 3 billion video views each month, according to Jean-Loup Yu, VP of Product. In 2017, Dailymotion decided to revamp their entire product. This was part of a strategy to define themselves as “the home for video that matters” and deliver premium video from official content partners.

Dailymotion integrates A/B testing into their product roadmap, with every roadmap project having at least one A/B test. However, this wasn’t the case when they embarked on delivering their new product vision.

In 2017, Dailymotion had a very small data team of two data scientists. They were running basic A/B tests that focused on the UI. The outcome of these tests usually resulted in small changes in how the UI worked. However, Dailymotion wanted to change their backend APIs and their recommendation engine, and recognized the need to improve A/B testing to support this. After evaluating different solutions, they decided that none of the third-party testing tools met their requirements, so they decided to build an in-house testing framework.

Becoming data aware

Jean-Loup believes that the most important change made during the past two years was becoming more data-aware as a company. “In 2017, we had a design-driven mindset,” says Jean-Loup. “We believed in the new product vision and product positioning, so we built an MVP driven by the design and the user experience and then tested it in the market. It was only after going through this process that we started to get data that allowed us to identify new opportunities and run A/B tests to make better-informed decisions.”

“However, we don’t see ourselves as data-driven,” says Jean-Loup. “We still take a lot of design-based decisions to transform our audience and move towards the new product vision.”

The tool Dailymotion built offers a set of dashboards for A/B tests and has enabled testing to scale. To help them work through the important stages of planning and implementation, the in-house testing framework includes features for:

  • Scoping out a new experiment.
  • Monitoring the experiment as it runs.
  • Deciding whether to implement the change once the test has finished.
  • Monitoring after the change is released to make sure that the benefits measured during the test are seen in production.

Jean-Loup reveals that the introduction of the scoping phase was quite a change for the team. The tool provides a simple dashboard for PMs, enabling them to make informed decisions about a test, such as the region in which to run it, the platform to run it on, and how long to run it for.
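
As an illustration of the scoping behind “how long to run the test”, a standard sample-size estimate for a conversion-rate test might look like the sketch below. The baseline rate, minimum detectable lift, and traffic figures are invented, not Dailymotion’s.

    import math

    def required_sample_size(p_base, min_lift, z_alpha=1.96, z_beta=0.84):
        """Per-variant sample size to detect an absolute lift in a rate,
        using the standard two-proportion approximation
        (z_alpha = 1.96 for two-sided alpha = 0.05, z_beta = 0.84 for 80% power)."""
        p_var = p_base + min_lift
        variance = p_base * (1 - p_base) + p_var * (1 - p_var)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)

    # Hypothetical scoping: 5% baseline click-through, detect a +0.5% absolute lift
    n = required_sample_size(0.05, 0.005)
    daily_users_per_variant = 5000  # made-up traffic for the chosen region/platform
    print(f"{n} users per variant -> ~{math.ceil(n / daily_users_per_variant)} days")

With these made-up numbers, the estimate comes out at roughly a week, which is consistent with the typical run length Jean-Loup mentions next.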

Most A/B tests conducted at Dailymotion run for 7 to 10 days. One of the biggest challenges was determining what to do when an A/B test was finished. “It’s pretty easy to determine success in an A/B test with a hypothesis that is validated: you just implement the change,” says Jean-Loup. “But what happens when it’s not validated? Should you reject the improvement, or try again and iterate until you’ve reached your set KPIs?” Reviewing the decisions they took, Dailymotion found that they run two types of tests:

  • Major product features, which Dailymotion calls foundations, such as a new homepage, video page, or recommendation engine. These foundations are a critical part of Dailymotion’s move toward their new vision.
  • Minor product enhancements and tweaks.

“It’s way easier to reject a small experiment than a project to completely revamp a foundational element,” says Jean-Loup. “One of the challenges we have been thinking about is these major product changes and how you safely release them.”

Handling major product releases

When working on a foundational change, Dailymotion runs multiple A/B tests. The first test is on the MVP, to make sure that there is no user backlash against the change. If there is a backlash, the change is immediately reworked. Once the MVP is stable, Dailymotion starts to test the impact on the set KPIs. Because the investment in foundation changes is very high, Dailymotion runs several A/B tests sequentially, looking for incremental improvements before rolling out the new feature.

“Initially, we had a problem with this stage as we were monitoring too many KPIs at once,” says Jean-Loup. “Now we test only one primary KPI, one very close to the product change that measures the success of the change. We also have a North Star KPI, which is premium views, which is key for us to measure the success of our transformation, but it also serves as a guardrail metric.”
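
A minimal sketch of what such a primary-plus-guardrail check might look like; the metric names and the guardrail floor are hypothetical:

    def evaluate(primary_lift, primary_significant, guardrail_change,
                 guardrail_floor=-0.01):
        """Launch only if the primary KPI improves significantly and the
        North Star guardrail (e.g. premium views) hasn't regressed past the floor."""
        if not primary_significant or primary_lift <= 0:
            return "reject: primary KPI did not improve"
        if guardrail_change < guardrail_floor:
            return "reject: guardrail metric regressed"
        return "launch"

    # Hypothetical result: +2% on the primary KPI, -0.3% on premium views
    print(evaluate(0.02, True, -0.003))  # -> "launch"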

An example of this process in action comes from a change Dailymotion made to its recommendation engine. They were using a third-party engine from their previous product but decided, after scaling the data team, to build their own in-house engine. The first goal for this in-house engine was to perform as well as the third-party one. Once the performance had been matched or bettered, they had their foundation and could start experimenting to find ways to improve the engine. The journey was long, but Dailymotion learned a lot from the experience. In the end, it took 6 months and 5 tests before Dailymotion released the new engine.

From their experience, Jean-Loup highlights 3 key things as the catalysts for transforming a testing process:

  • Reset your mindset — to make the most of A/B testing, you have to be data-aware.
  • For major public releases, set the correct goals and run multiple A/B tests.
  • Set your North Star metric as a guardrail metric.

Monzo

Monzo, started four years ago, is one of the UK’s youngest banks; its goal is to make banking easy by offering an alternative to traditional banking services. According to Bruno Vaz Moço, Product Manager, Monzo is one of the fastest-growing banks in the UK, with just over 3 million customers.

A little over a year ago, Monzo was a very different company, with less than half the customers and less than half of the over 1,000 staff it has today. It also had a completely different way of working. The product team worked in a squad-based system, and the focus of Bruno’s squad was to optimize the activation funnel and introduce customers to the value of Monzo as quickly as possible.

They examined their activation funnel closely and identified their biggest pain points — for example, a 20% drop at the legal documentation step. Here customers are asked to take a photo of their passport, which they may not have on them, and record a selfie video, when they might not be in a suitable environment.

Adopting the growth mindset

“Our product team met and mapped out the onboarding journey on the walls of our team room,” says Bruno. “We then looked at the conversion, screen by screen, trying to understand where we could drive improvement. We knew that even a zero-point-something percent improvement each week on every stage of the funnel would, after a few months, deliver tremendous impact for the business.”
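
The arithmetic behind that claim is simple compounding. A quick back-of-the-envelope check, with a made-up starting conversion rate:

    # A hypothetical 0.5% relative improvement per week, compounded over ~6 months
    rate = 0.30  # invented starting conversion rate for one funnel step
    for week in range(26):
        rate *= 1.005
    print(f"after 26 weeks: {rate:.3f} ({1.005 ** 26 - 1:.1%} relative gain)")

Even at half a percent per week, the funnel step improves by roughly 14% in six months, which is why the squad considered small weekly wins worth chasing.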

So, the team entered what Bruno describes as a mode of high-tempo testing. Every week the team started a new sprint with an ideation session, where the entire team would contribute ideas for new experiments, then completed an impact assessment before prioritizing, designing, executing, and releasing a test. “A week is very quick to launch a new test,” says Bruno. “And then we had to wait 2 to 3 weeks for each experiment to reach a conclusion. So, the more experiments we launched, the higher our stock of running experiments grew, and this eventually started to complicate the development process.”

At the time, Monzo had around 3,000 customers signing up every day across iOS and Android. These customers then had to be split across different experiments, and the team soon started hitting the limit of how many tests they could run simultaneously. This was exacerbated because, as Bruno notes, A/B tests are so cheap to launch that the team was tempted to keep adding more and more of them.

This eventually got them into trouble. They found themselves running out of customer segments to allocate to new tests, so they decided to stretch their limits by reusing the control group. They did this with an experiment about nudging users to add money to their account and a new experiment to assess the impact of offering a small cashback bonus if people moved their Spotify or Netflix payments to their Monzo account.

As they didn’t have enough customers to assign a new segment to the cashback experiment, they decided to share the control group from the adding-money experiment. A few days after launching, the team discovered that the shared control group was also being exposed to a treatment, which meant that both experiments were invalid. “Overall, this was a huge setback for us, costing the team several weeks,” says Bruno.
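
One common way to avoid this class of bug is deterministic, experiment-scoped bucketing, where each experiment hashes users into its own independent assignment instead of borrowing groups from another test. This is a sketch, not Monzo’s actual system, and the experiment names are invented:

    import hashlib

    def assign(user_id: str, experiment: str, variants=("control", "treatment")):
        """Deterministically bucket a user per experiment by hashing
        (experiment, user_id), so assignments never leak across tests."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user gets an independent assignment in each experiment:
    print(assign("user-42", "nudge_add_money"))
    print(assign("user-42", "cashback_direct_debits"))

This keeps groups statistically independent, though it doesn’t remove the underlying constraint Bruno describes: with limited traffic, there are only so many adequately powered experiments you can run at once.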

“This pace also meant that we lost track of some of the tests,” says Bruno. “The new thing is always the most exciting thing, and we sometimes forget about the ones we released a month ago.”

A new era of testing

This was a pivotal moment for Monzo and how they approached A/B testing experiments. “We became much more mindful about how many experiments we can run,” says Bruno. “When it comes to prioritization, we now reinforce principles such as the one-pager. We now have a dedicated team working on A/B testing tools, to make sure that these tools evolve at the same rate as our company. Ultimately, we want any team at Monzo to be able to launch an experiment and to be confident that they can interpret the results.”

From their experience, Bruno highlights 3 key things as the catalysts for becoming a company that experiments better:

  • Look at the cost of running an A/B test, from setup through to assessing the results, because there are only so many you can run.
  • Make sure you fully embrace the constraints on your A/B testing in your prioritization: if your team can only run 5 experiments, only run 5 experiments.
  • Remember that not everything needs to be tested — you can, instead, resort to user research or simply trust your product intuition.

Watch this space for an upcoming deep analysis into A/B testing with Duolingo.

What do you think?

Do you have thoughts on optimizing the A/B testing roadmap? Let us know in the comments below or tweet using #AskPlayDev and we’ll reply from @GooglePlayDev, where we regularly share news and tips on how to be successful on Google Play.
