My Second Lesson in A/B Testing

Anna Marie Clifton
6 min read · Jun 13, 2016

Last week I had to decide whether we should ship a particular feature or kill it. This was the second such call I’ve made in my time at Yammer so far, but the first for a feature that I was PMing from concept to experiment.

In chatting casually with my skip-level boss about my decision, I learned a couple of new things about experiment analysis.

But before I dig into that, let’s level the playing field for those unfamiliar and explain product experiments a bit.

1. A brief primer on data-informed development.

There are a number of ways companies use data to help them make decisions. An incomplete list includes:

  • To validate resource allocation: How many people use that dialog box—is it really worth the design investment to make it better?
  • To understand where their customers find value: Where in the product are most of our upgrades coming from? If it’s from Upgrade Prompt C, perhaps we should invest more in that area of the product.
  • To evaluate the cost of cutting a feature: How many users engage with this feature? What engagement would we lose if we cut it?
  • To determine if they should even ship a feature that’s been built: If we give this feature to some percentage of users, do they perform better or worse than the users we hold this back from?

That last type of data usage is called A/B testing, and it’s where we Yammer PMs at the individual-contributor level spend most of our time.

2. A/B Testing, what is it anyway?

In traditional A/B testing all your users are randomly assigned to one of two groups:

  1. Control (the current experience of your product)
  2. Treatment (the version of your product that you want to test)

As a PM, you want to see whether the behavior of the Treatment group is better or worse than the behavior of the Control group, and use that information to make an informed call: launch this feature to 100% of users, or turn off the experiment and remove that code from your product.

That’s the Ship/Kill call.
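If you’re curious what that random split looks like in practice, here’s a minimal sketch in Python. (The names here, like `assign_group`, are made up for illustration, not Yammer’s actual system.) Hashing on the user ID keeps the split random across users but stable for any one user, so nobody flips between groups mid-experiment:

```python
import hashlib

def assign_group(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into Control or Treatment."""
    # Hash the user + experiment name so the same user always lands
    # in the same group for this experiment, while different
    # experiments get independent splits.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "control" if bucket < 50 else "treatment"

print(assign_group("user_42", "group_posting_prompt"))  # e.g. "treatment"
```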

One of the responsibilities of a PM is to specify in advance the particular metrics you’re going to evaluate between the Control and Treatment.

Pick and prioritize! Construct the story you expect to see and the metrics changes that would prove that story true. Then construct a few other likely scenarios (positive AND negative) and identify the metrics changes that would support those theories.

  • Possible Outcome 1: Users learn more about the value of small group contexts. Metrics to watch… Generic posts would go down, Group posts would go up.
  • Possible Outcome 2: This feature could encourage users to consume more casually. Metrics to watch… Group Visits would go up, but Group posts would go down.
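One way to keep yourself honest is to write those scenarios down as data before the experiment ever launches. A hypothetical sketch (the scenario names and metric names are just illustrations):

```python
# Pre-registered scenarios, written down BEFORE the experiment starts:
# each story maps to the metric movements that would support it.
scenarios = {
    "users learn the value of small groups": {
        "generic_posts": "down",
        "group_posts": "up",
    },
    "users consume more casually": {
        "group_visits": "up",
        "group_posts": "down",
    },
}
```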

3. Understanding Changes in Metrics

But what does it mean for a metric to “go up” or “go down”? This may sound like a silly question; we all know what directionality looks like, right? We have really solid mental models for it in the real world: up means more, down means less.

But nothing can “go” in a particular direction without a reference point it’s moving from. I find that a lot of people entering product roles fundamentally misunderstand what the correct reference point is in A/B testing.

The most natural reference point people land upon is Time. That’s super reasonable in a lot of data contexts:

  • “Sales are up this quarter” (…relative to where they were last quarter)
  • “The price of gas is down this month” (…relative to last month)
  • “Way more snow this year!” (… relative to last winter)
  • Etc., etc., etc.

But when it comes to product testing, Time as a reference point is very, very wrong. Why? Because there are so many contributing factors to a person’s behavior with a product that may have nothing to do with your experiment:

  • “Hmm, site visits are down today compared to last Monday.” Guess what: it’s a holiday in Canada.
  • “Wow, signups are WAY up this month!” Guess what: that marketing post from last year somehow made a big splash on a German forum 3 weeks ago.
  • “Oh no, too few purchases this week.” Guess what: the Chinese stock market took a big hit and people are holding tighter to their wallets this week.
  • Etc., etc., etc.

How do you account for all of this?? Well, that’s the whole reason for A/B testing. If you set a random 50% of your users to see the Control and the other random 50% to see the Treatment version, you can measure ups and downs in one group relative to the baseline of the other.

So, if we’re expecting “group posts to go up”, we mean that we expect Treatment users to do more of that action than Control users did while the experiment was running.

To get into terminology, we look for the Lift, the increase (or decrease) in a metric for the Treatment group relative to Control.
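As a quick sketch with hypothetical numbers (not from our experiment), Lift is just the relative difference between the two groups over the same window:

```python
def lift(treatment_rate: float, control_rate: float) -> float:
    """Relative change in a Treatment metric vs. the Control baseline."""
    return (treatment_rate - control_rate) / control_rate

# Say 5.2% of Treatment users posted in a group vs. 5.0% of Control users:
print(f"{lift(0.052, 0.050):+.1%}")  # prints "+4.0%"
```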

4. One last thing… p-values.

😬😬😬 I knowwww… am I really going to get into statistics here?! No, don’t worry—you can look that up on your own 😉

I won’t get into how they work (honestly, I’m still trying to fully understand that), but I do want to explain what they mean:

If we see a 4% lift on Group Posts with a p-value of 0.01, that means we’re pretty sure it’s not just chance. But if we saw that same 4% lift with a p-value of 0.45, we’d consider that change to likely be just chance.

Two things you need to understand about p-values:

  1. P-values are probabilities measured from 1.00 to 0.00; the closer to 0.00, the less likely it is that the change was just chance. That means as a PM you’d prefer p-values closer to 0.00 so you can have more faith in your data.
  2. Usually, larger Lifts will have lower p-values. And that makes a lot of sense: a p-value is (roughly) the probability you’d see a change this big by chance alone, and the larger the change, the easier it is to distinguish from noise. If you saw a bunny twitch its whiskers, you may not be 100% sure you saw some movement, but if you saw a bunny jump 5 feet in the air, you’d be darn certain it moved.
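If you ever want to sanity-check one of these numbers yourself, here’s a minimal sketch using scipy (the counts are made up): a chi-squared test asks how likely a split at least this lopsided would be if the feature truly did nothing.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: users who posted in a group vs. those who didn't.
control   = [5_000, 95_000]  # 5.0% of 100,000 Control users posted
treatment = [5_200, 94_800]  # 5.2% of 100,000 Treatment users (a 4% lift)

chi2, p_value, dof, _ = chi2_contingency([control, treatment])
print(f"p-value: {p_value:.2f}")  # ≈ 0.04 for these made-up counts
```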

So… what was that new thing I learned? Wasn’t there a point to this post?

Yes! Thanks for reminding me 😊 My Ship/Kill call!

In this particular experiment, we actually ran two different Treatments… an A/B/C test, where 34% of users got the Control experience, 33% got Treatment #1, and 33% got Treatment #2.
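Mechanically, that’s just the same bucketing idea from earlier, with three ranges instead of two (again, a hypothetical sketch rather than our real system):

```python
def assign_group_abc(bucket: int) -> str:
    """Map a 0-99 hash bucket to one of three experiment groups."""
    if bucket < 34:    # buckets 0-33  -> 34% of users
        return "control"
    elif bucket < 67:  # buckets 34-66 -> 33% of users
        return "treatment_1"
    else:              # buckets 67-99 -> 33% of users
        return "treatment_2"
```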

In analyzing the results, I found that the two treatments moved together in most of the Success Metrics. That is, on any given metric, when one treatment did better than control, the other one did as well.

But a couple times, I saw them move in opposite directions: Treatment #1 did better than control here, but Treatment #2 did worse.

Now, at the time I didn’t include those observations in my analysis report because both lifts (the positive lift for Treatment #1 and the negative lift for Treatment #2) had really high p-values (around 0.35). When I talked about these, I called them both “FLAT”, which is the common way to refer to metrics that don’t show any change between Treatment and Control.

Since the p-values indicated that these lifts were highly likely to be random happenstance, I knew I couldn’t use them to support any theories, so I just called them FLAT to indicate their meaninglessness.

However, in speaking with our Director of Product, I learned two new things about A/B testing and experiment results:

  1. FLAT is the wrong way to describe these metrics. They moved, they changed, they didn’t stay flat. Sure, we can’t derive meaning from those changes, but calling them FLAT conveys another (inaccurate) idea. The better way to describe the metric change would be: “A positive lift (or negative lift) that’s insignificant.”
  2. Much more importantly, since larger lifts tend to have lower p-values, I had a great opportunity here for further analysis. Remember how Treatment #1 went up and Treatment #2 went down? Well, that’s relative to Control. The Director of Product pointed out that I could re-run the analysis and compare the two treatments directly against each other. Since the gap between the treatments was bigger than either one’s gap from Control, I had a better shot at assessing whether those lifts were real, and not the product of random chance.

Mind blown. Of course! How did I not think of that? Well, you live and learn, right? And hopefully you find a way to share what you learn.
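In case you want to picture that re-analysis, here’s a rough sketch of the idea with made-up counts (the same hypothetical chi-squared test as before, just with Treatment #2 standing in as the baseline instead of Control):

```python
from scipy.stats import chi2_contingency

def p_value(group_a, group_b):
    """P-value for the difference between any two groups' counts."""
    return chi2_contingency([group_a, group_b])[1]

control     = [5_000, 95_000]  # hypothetical posted / didn't-post counts
treatment_1 = [5_100, 94_900]  # small positive lift vs. Control
treatment_2 = [4_900, 95_100]  # small negative lift vs. Control

print(p_value(treatment_1, control))      # ≈ 0.31: could easily be chance
print(p_value(treatment_2, control))      # ≈ 0.31: could easily be chance
print(p_value(treatment_1, treatment_2))  # ≈ 0.04: the gap between the
                                          # treatments is twice as big, so
                                          # it's much harder to dismiss
```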

Tune in next time for more on what I’m learning.
Till then,
Anna Marie

Like what you read? Give it a 💚 below to help others discover it and ➕Follow me so you don’t miss future pieces 👍💯👌
