Strava’s Lean, Mean, Testing Machine

At the beginning of the year, Strava’s growth team was launching three A/B tests each quarter on our mobile apps. Within a few months we were launching over 30 tests per quarter. This blog post covers how we made rapid A/B testing possible by improving three critical processes: analytics, release cadence, and code standards for experiments.

At Strava, the growth product team’s primary focus is onboarding and activation, or the process of guiding an athlete from their first introduction to Strava to uploading activities and engaging with the community. We are advocates for the new members of our community, yet we ourselves are no longer new. To make sure we are effectively serving our new athletes, we must avoid making assumptions and instead use data to drive the team’s roadmap. We do this primarily via A/B testing.

A/B tests are the backbone of the growth team. Beyond helping us better understand the athletes we serve, they let us quantify the impact of our projects, make decisions quickly and confidently, and avoid product changes that hurt our core metrics. Each hypothesis that is either proven or disproven by an A/B test has an immediate impact on the team’s roadmap. The more tests we run, the faster we learn, and the bigger the impact we have.

Enabling the growth team to run more A/B tests was not just a matter of adding more engineers, though that certainly helped. We needed to streamline the process for each test in order to reduce the time from idea generation to a launched test. We identified three areas where we could save engineering time: adding analytics to tests, polishing and reviewing code, and waiting for merged tests to ship. For each bottleneck we defined the issue, implemented short-term fixes within the growth team, and kicked off larger company-wide solutions.

Analytics

Problem: Adding analytics to our A/B tests was more complex than building the tests themselves.

Last year we launched an analytics solution with the goal of gathering insight into our athletes’ interactions with Strava’s feed. This analytics system needed to ensure data consistency across multiple clients, and had to be robust against long-term use. The end result was a strongly typed system with a schema defined in a central repository and shared by all clients. The analytics required to support A/B tests have different needs: they do not have to guarantee cross-platform consistency or longevity, since experiments are designed to run only for a short time and on a single platform. While the new analytics system has served us well for instrumenting long-lived components of Strava’s experience, like the feed, it was clunky and difficult to use for instrumenting a simple event such as a button click. Every new event required multiple stakeholders to agree on the schema updates, and each event required code changes and pull requests in multiple repositories.

Solution: Leverage Apptimize for measuring A/B tests.

Apptimize is a mobile A/B testing tool that we subscribed to for its cohorting capabilities. It also has a suite of tools for analyzing tests. Since we already had Apptimize integrated into our mobile apps, no additional setup was required to start using its analytics tools. An event is tracked with a simple call to the SDK: [Apptimize trackEvent:@"event"]. Apptimize automatically separates events for each cohort and displays them in an easy-to-comprehend dashboard. Switching from our strongly typed analytics system to Apptimize analytics allowed us to spin up A/B tests quickly, and limited the code changes to a single repository.
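To make the idea of loosely typed, per-cohort event tracking concrete, here is a minimal Python sketch. It is not Apptimize’s implementation or API: the class name, method names, cohort names, and event names are all hypothetical, and the data lives in memory purely for illustration.

```python
from collections import Counter, defaultdict


class CohortEventTracker:
    """Toy in-memory tracker (hypothetical, not the Apptimize SDK)."""

    def __init__(self):
        # cohort name -> Counter mapping event name -> number of occurrences
        self._counts = defaultdict(Counter)

    def track_event(self, cohort, event):
        # Loosely typed: any string is a valid event, so no schema change is needed.
        self._counts[cohort][event] += 1

    def count(self, cohort, event):
        # Events that were never tracked simply count as zero.
        return self._counts[cohort][event]

    def summary(self):
        # Plain dictionaries, one per cohort, ready to compare side by side.
        return {cohort: dict(events) for cohort, events in self._counts.items()}


tracker = CohortEventTracker()
tracker.track_event("control", "signup_screen_viewed")
tracker.track_event("control", "signup_screen_viewed")
tracker.track_event("control", "signup_button_tapped")
tracker.track_event("variant_a", "signup_screen_viewed")
tracker.track_event("variant_a", "signup_button_tapped")
print(tracker.summary())
```

The trade-off is the one described above: nothing stops two clients from spelling the same event differently, which is acceptable for a short-lived, single-platform experiment but not for long-lived instrumentation.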

It’s worth noting the few downsides to using Apptimize. As with any third-party service, when things don’t work properly it is difficult and time-consuming to debug the issue, often requiring back-and-forth communication with Apptimize’s support team. Additionally, coordinating data between two different analytics platforms is challenging.

The long-term vision for analytics is to develop an internal system that is as easy to use and loosely typed as Apptimize’s. This would give us full control over the data pipeline and let us run analysis using both strongly and loosely typed data.

Cadence

Problem: Our mobile release cycle required 2–4 weeks of wait time from code complete to launching a test.

At Strava we launch updates to our mobile apps every two weeks. Before a version of the app is shipped to users, it undergoes two weeks of external beta testing. A “train cut” is when a version of the app is cut from the master branch and published into beta. Once a train is in beta, only small, low-risk code changes are made, to keep the app stable and as bug-free as possible. So, when an A/B test is ready to ship, we wait up to two weeks for the next train cut, and another two weeks for the beta to complete its test cycle before the test can start.

Each A/B test is designed to prove or disprove a hypothesis about how athletes interact with our product. Whether the test fails or succeeds, the learnings drive the direction of the product. Even if we can collect data, replan, and build the next iteration of a test in a week, the 2–4 week wait to ship that test limits us to about three iterations of a test in a quarter.
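The arithmetic behind that limit can be sketched in a few lines of Python. The 13-week quarter and the function name are assumptions made for illustration; the one-week build time and the 2–4 week wait come from the text above.

```python
WEEKS_PER_QUARTER = 13  # assumption: a quarter is roughly 13 weeks


def iterations_per_quarter(build_weeks, wait_weeks, quarter_weeks=WEEKS_PER_QUARTER):
    """Whole number of sequential build-then-wait cycles that fit in one quarter."""
    cycle_weeks = build_weeks + wait_weeks
    return quarter_weeks // cycle_weeks


# One week to collect data, replan, and build; then 2 to 4 weeks of waiting to ship.
for wait_weeks in (2, 3, 4):
    iterations = iterations_per_quarter(build_weeks=1, wait_weeks=wait_weeks)
    print(f"wait of {wait_weeks} weeks: {iterations} iterations per quarter")
```

Under these assumptions the result ranges from four iterations (two-week wait) down to two (four-week wait), with three at the midpoint.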

Solution: A/B tests are merged directly into the beta.

By allowing A/B tests to merge directly into the beta, we cut the wait time from 2–4 weeks to 1–3 weeks. The wait cannot drop below one week because we submit the binary to the app store a week before the intended release date, and after that we can no longer change it. To keep the beta stable and bug-free, we limit this exception to simple tests. We also wrap every code change inside an Apptimize A/B test, which gives us the ability to turn off a test even after the code has gone live.
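The value of wrapping every change in a test is that the experimental path can be switched off remotely without shipping a new binary. The following Python sketch shows the general pattern only; it does not use the Apptimize SDK, and the function names, the experiment name, and the hash-based cohort assignment are all assumptions.

```python
import hashlib


def assigned_variant(experiment, user_id, variants=("control", "treatment")):
    """Deterministically bucket a user so they see the same variant every session."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def active_variant(experiment, user_id, enabled_experiments):
    """Kill switch: a disabled experiment sends everyone down the control path."""
    if experiment not in enabled_experiments:
        return "control"
    return assigned_variant(experiment, user_id)


def onboarding_screen(user_id, enabled_experiments):
    # The experimental code path is reachable only through the experiment gate.
    if active_variant("new_onboarding", user_id, enabled_experiments) == "treatment":
        return "experimental onboarding"
    return "existing onboarding"


# While the experiment is enabled, users are split between the two paths.
print(onboarding_screen(user_id=42, enabled_experiments={"new_onboarding"}))
# After it is disabled remotely, every user falls back to the existing flow.
print(onboarding_screen(user_id=42, enabled_experiments=set()))
```

In a real app the set of enabled experiments would come from the testing service rather than a local variable, which is what makes it possible to disable a test after release.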

In the future we’d like to see Strava move from a two-week cadence to a one-week cadence. This would reduce the wait time by an additional week, which would double the number of iterations we can do on a line of testing in a given time period.

Experimental Code

Problem: We were wasting engineering time polishing code that is removed if an A/B test fails.

At Strava, most of the features we’ve built for our mobile apps remain in production today. That will become less true the more we use A/B testing. By their nature, A/B tests end with one variant winning and the other being removed. Our engineering culture, which is strongly rooted in the need to build code that will last, drives us to seek perfection in every code change. As the growth team pushes the boundaries of A/B testing by taking bigger product risks, we encounter tests that fail. And when those tests fail, the work to polish and refactor our original code is wasted because the feature needs to be deleted.

Solution: Redefine code standards for Experimental Code.

Experimental Code is the term we’ve coined for code that supports an unproven feature or product change and is being validated by an A/B test. Experimental Code is held to a much lower standard than non-experimental code. Only the following criteria must be met:

  1. The code must work.
  2. The code must be isolated by a feature switch, Apptimize test, or access group.
  3. The code must not break the mobile applications or website.
  4. The code must have an associated JIRA ticket with a deadline for its removal.
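Criteria 2 and 4 are the kind of thing a pull request check could verify mechanically, while criteria 1 and 3 still depend on testing and review. The Python sketch below is one hypothetical way to express such a check; the class, field names, gate identifiers, and ticket key are invented for illustration and do not describe an actual Strava tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Criterion 2: the accepted ways of isolating experimental code.
ALLOWED_GATES = {"feature_switch", "apptimize_test", "access_group"}


@dataclass
class ExperimentalChange:
    description: str
    gate: Optional[str]               # how the code is isolated, if at all
    ticket: Optional[str]             # tracking ticket key, e.g. "GROWTH-123" (made up)
    removal_deadline: Optional[date]  # when the code must be removed or upgraded


def checklist_violations(change):
    """Return the mechanically checkable criteria that the change does not meet."""
    problems = []
    if change.gate not in ALLOWED_GATES:
        problems.append("not isolated by a feature switch, Apptimize test, or access group")
    if not change.ticket or change.removal_deadline is None:
        problems.append("no ticket with a removal deadline")
    return problems


change = ExperimentalChange(
    description="Shorter signup flow",
    gate="apptimize_test",
    ticket="GROWTH-123",
    removal_deadline=None,
)
print(checklist_violations(change))
```

An empty list means the change passes the mechanical part of the checklist.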

At the conclusion of the experiment, the code must either be removed or be upgraded to production quality: brought in line with our style guides, appropriately refactored, and adjusted to follow best practices. This ensures that we stick to our “built to last” engineering culture.

To monitor the additional risk experimental code poses to the stability of the app and the quality of the codebase, we’ve formed a cross-functional team: the Risk Budget Working Group. The group established two budgets. The Error Budget constrains how many bad user experiences (crashes, severe bugs, etc.) we are willing to accommodate in our products. The Technical Debt Budget constrains how long temporary or experimental code can remain in our codebase before it must be removed or refactored. If either of those budgets is depleted, we pause the addition of new experimental code until the budget resets, or until the overdue code is cleaned up. While budgets feel restrictive, in practice they are also liberating. As long as we stay below the budget, we have room to experiment.
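The pause rule can be written down as a small decision function. This Python sketch is only an illustration: the units (a count of bad experiences per budget period, a maximum age in days) and every name and number in it are assumptions, since the actual budget definitions are not given here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RiskBudgets:
    error_budget: int       # assumed unit: bad user experiences allowed per period
    max_code_age_days: int  # assumed unit: days experimental code may stay in the codebase


def overdue_code(budgets, added_on_by_name, today):
    """Names of experimental changes that have outlived the technical debt budget."""
    return [
        name
        for name, added_on in added_on_by_name.items()
        if (today - added_on).days > budgets.max_code_age_days
    ]


def can_add_experimental_code(budgets, bad_experiences, added_on_by_name, today):
    """New experimental code is allowed only while both budgets have room left."""
    error_budget_left = bad_experiences < budgets.error_budget
    no_overdue_code = not overdue_code(budgets, added_on_by_name, today)
    return error_budget_left and no_overdue_code


budgets = RiskBudgets(error_budget=100, max_code_age_days=60)
experiments = {"new_onboarding": date(2024, 1, 10), "feed_prompt": date(2024, 3, 1)}
today = date(2024, 3, 20)

print(overdue_code(budgets, experiments, today))
print(can_add_experimental_code(budgets, 12, experiments, today))
```

In this made-up example one experiment is past its deadline, so new experimental code is paused until it is removed or refactored, even though the error budget still has room.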

Lean, Mean, Testing Machine

With these changes in place, Strava’s growth team is running multiple tests every week. But we still have a long way to go and a lot to learn before we can truly call ourselves a “lean, mean, testing machine”. What we do know is that A/B testing will remain an essential part of Strava’s culture, so stay tuned for more updates from our growth team.