Objective A/B Test Prioritization

Jeremy Dorn
GrowthBook
Feb 25, 2021
(image by peoplecreations — www.freepik.com)

There are many prioritization frameworks out there for A/B tests (PIE, ICE, PXL, etc.), but they all suffer from one critical problem: subjectivity. The reason A/B testing is so powerful in the first place is that people are really bad at guessing user behavior and the impact of changes. Why, then, are we using those same bad guesses to prioritize? In addition, many of these frameworks place far too much emphasis on effort, i.e. how long a test takes to implement. Except in the rarest of cases, the time it takes to implement a test is far shorter than the time the test needs to run. In most cases, effort should be a tie-breaker, not a core part of prioritization.

At GrowthBook, we came up with a new prioritization framework based on hard data, not subjective guesses. It centers around an Impact Score, which aims to answer one question: how much can you move the needle on your metric? The Impact Score has three components: metric coverage, experiment length, and metric importance.

Metric Coverage

Metric Coverage captures what percentage of the metric's conversions your experiment touches. If you have a Sign Up button in your site-wide top navigation, it has 100% metric coverage for signups because 100% of members will see that button before signing up. Your experiment may not change their behavior, but it at least has the potential to. On the other hand, a Sign Up button on your homepage may only be seen by 20% of potential new members. Even if you do an amazing job optimizing that button, it will have no effect on the 80% of people who come in through other landing pages.

Metric Coverage doesn’t just take into account the URLs an experiment runs on; it also needs to factor in targeting rules. If your experiment only runs for users in the US, that reduces the coverage. If the test only affects the UI on mobile devices, that lowers it as well.

Calculating Metric Coverage is actually fairly simple. You take the number of conversions your test could possibly influence and divide by the total number of conversions across the entire site. Getting these numbers usually requires looking in Google Analytics or SQL and can be tricky for non-technical users. At GrowthBook, we solve this by generating SQL and querying your database automatically given a few simple prompts (the experiment URLs, the user segments being tested, etc.).
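
As a rough illustration, the whole calculation boils down to a single ratio. Here is a minimal Python sketch; the function name and the conversion counts are hypothetical, and in practice the counts would come from your analytics tool or a SQL query.

def metric_coverage(reachable_conversions, total_conversions):
    # Fraction of the metric's conversions the experiment could possibly influence (0 to 1)
    if total_conversions == 0:
        return 0.0
    return reachable_conversions / total_conversions

# Hypothetical example: 4,000 of last month's 20,000 signups came through the
# pages and user segments the experiment touches
print(metric_coverage(4000, 20000))  # 0.2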

Experiment Length

Experiment Length is an estimate of how long the experiment needs to run before reaching significance. In essence, you do a sample size calculation and then divide by the daily traffic each variation will receive.

There are many sample size calculators out there (I recommend this one — https://www.evanmiller.org/ab-testing/sample-size.html) and the statistics to implement your own are not too hard, so I won’t cover that here. I will note that the sample size calculation does require a bit of subjectivity — namely choosing a Minimum Detectable Effect (MDE). If you are making a tiny change that most people probably won’t notice, you are going to need a lower MDE to pick up changes. Conversely, if you are making a major change, a higher MDE will suffice.
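
If you do want to roll your own, here is a minimal Python sketch of one common textbook formulation (a two-proportion z-test with unpooled variance). The linked calculator uses a slightly different formula, so the numbers will not match it exactly, and the baseline rate and MDE below are hypothetical.

import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, mde, alpha=0.05, power=0.8):
    # mde is a relative lift, e.g. 0.2 means detecting a 20% improvement
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical example: 10% baseline conversion rate, 20% relative MDE
print(sample_size_per_variation(0.10, 0.20))  # 3839 per variation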

Let’s say you do that and come back with a sample size of 2000 per variation. If your experiment would receive 500 visitors total per day (for the selected URLs and user segments), and you are doing a simple 2-way A/B test, that means it will take 8 days to finish (2000 / (500 / 2)). A 3-way test with the same traffic would take 12 days.

Because it’s best practice to run an experiment for at least a week, we set a minimum length of 7 days, even if you have very high traffic.
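
In code, that arithmetic might look like the sketch below, which reproduces the worked example above and applies the 7-day floor (the function name and parameters are just for illustration).

import math

def experiment_length_days(sample_size_per_variation, daily_traffic, num_variations=2):
    # Days needed for each variation to reach the required sample size
    daily_per_variation = daily_traffic / num_variations
    days = math.ceil(sample_size_per_variation / daily_per_variation)
    return max(days, 7)  # never plan for less than a full week

print(experiment_length_days(2000, 500, num_variations=2))  # 8 days
print(experiment_length_days(2000, 500, num_variations=3))  # 12 days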

Metric Importance

Not all metrics are created equal. A “revenue” metric is more valuable than an “enter checkout” metric, which is more valuable than a “sign up for newsletter” metric.

This part of the equation simply assigns each metric a number from 0 to 1 on a linear scale. For example, “revenue” might get a 1, “enter checkout” might get a 0.7, and “sign up for newsletter” might get a 0.2.

Coming up with this scale can either be entirely subjective or backed by data science and modeling. Companies usually have a relatively small set of metrics that are fairly stable over time, so this scale can be established once at the organization level instead of doing it for each experiment.

Putting It All Together

Now we come to the actual Impact Score calculation (on a 0–100 scale):

metricCoverage * (7 / experimentLength) * metricImportance * 100

This optimizes for A/B tests that will finish quickly and have a big potential impact on the most important metrics. It does not try to guess how likely the test is to succeed (that’s why we’re testing in the first place).
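
Here is the same formula as a small Python function; the example inputs reuse the hypothetical numbers from earlier (20% coverage, an 8-day test, and an “enter checkout” importance of 0.7).

def impact_score(metric_coverage, experiment_length_days, metric_importance):
    # 0-100 scale; experiment length is floored at 7 days, so the ratio never exceeds 1
    length = max(experiment_length_days, 7)
    return metric_coverage * (7 / length) * metric_importance * 100

print(impact_score(0.2, 8, 0.7))  # ~12.25

A score of 100 would mean an experiment that covers every conversion of your most important metric and finishes within the 7-day minimum.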

The Impact Score removes subjectivity from the equation and lets PMs focus on what they are really good at during prioritization — planning around limited engineering/design resources, conflicting experiments, marketing promotions, and other external factors.

Jeremy is the Co-Founder of GrowthBook, an open-source feature flagging and A/B testing platform.