Confidence Intervals: or “How I learned to stop worrying and love non-significant results”
In my last post, I looked at how the pressure to make a decision leads to some pretty wacky interpretations of non-significant experiment results. In this post, I’ll talk about how to use confidence intervals to de-risk these decisions without compromising your intellectual integrity.
Confidence intervals express a range of possible improvement values for your metrics. For metrics which haven’t reached significance, that range will be quite large and will include 0 (i.e. there’s a chance that the null hypothesis is true). The good news is that this range of values gives you an idea of the upper and lower bounds for the true improvement that you would see if your test were more powerful. On the Optimizely results page, the “true” improvement for a metric with a significance threshold of 90% will have a 90% chance of existing within the confidence interval.
This lets you say things like “Variation A’s conversion rate is likely not X% worse than the baseline conversion rate.” It might be enough to make a decision if your goal is simply not to hurt performance by making a change, and it sounds a lot better than calling it “a directional winner.”
I’ve encountered this situation myself. Take, for example, a test I ran on the Optimizely Experiment Overview page. The hypothesis: Displaying visitor counts for each experiment on this page will make it easier for users to find relevant data without having to click into the Results page for each test:
The idea is straightforward, validated by customer feedback, and just makes intuitive sense. The trouble is: how to make a data-driven decision about rolling it out? Some on the team thought that users who were exposed to the treatment would view fewer results pages, while others thought it might increase results page views (as users who wouldn’t otherwise view results became curious). And if some users increased consumption of the Results page while others decreased their consumption, how would we be able to tell from our test results which could be flat?
Ultimately, we decided that the only reason we wouldn’t want to make this change was if we saw a big drop in results page consumption across the board.
After running the experiment for over a month, it came time to analyze the results. As we had feared, our primary metric had not achieved significance, and we needed to make a decision about what to do next. By examining the confidence interval for the “Experiment Visitors” variation, we were able to establish a “lower bound” of improvement:
Even though this confidence interval is quite broad, it helped us understand the level of risk we were taking in making this change. In other words, the worst-case scenario was that making this change would decrease conversions to the results page by ~22%. Given the fact that making it easier to find relevant experiment results could reasonably decrease the number of irrelevant results pages viewed, this seemed like an acceptable tradeoff.
Ability to make a statistically rigorous decision with non-significant metrics? Check! Thanks confidence interval!