Reasonable-sounding but fallacious statements heard at Experiment Review

My favorite meeting at Optimizely is our weekly Experiment Review. It’s where people come together to hone product experiment ideas and share the results of past experiments. It’s a great forum for giving and receiving feedback, and I look forward to it every week.

But I, like many experimenters, have been in the unenviable position of presenting an experiment where none of the metrics reached statistical significance. In that moment, the pressure to glean something of value from the experiment can be extremely high. To get to this point, your idea had to beat out countless other awesome ideas to rise to the top of the backlog. An engineer spent valuable cycles coding it up. It ran for weeks. And now, the team is looking to you to make a data-driven decision about the direction of the product. 😬

The dreaded “sea of grey” strikes again!

Watching the mental gymnastics experimenters go through in this situation is one of the true joys of Experiment Review. Here are some of the reasonable-sounding but ultimately fallacious interpretations of non-statistically significant results I’ve heard:

  • “Directionally speaking, Variation A is outperforming the control”
  • “Variation A is trending in the direction of a win”
  • “Variation A has the highest significance of all variations, so that’s a good sign”
  • “If you held a gun to my head, I guess I’d go with Variation A”

We all know that the intellectually honest thing to do would be to focus our energy on designing the next iteration of the test, one that would be more likely to reach significance. When working with p-values, stats should be black and white: either the results show a statistically significant effect or they don’t.
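
To make that binary call concrete, here’s a minimal sketch of a two-sided two-proportion z-test in Python. The conversion counts and the 0.05 threshold are made-up illustration values, and this is not how Optimizely’s Stats Engine computes significance; it’s just the textbook fixed-horizon version of the decision.

```python
# Minimal sketch: a two-sided two-proportion z-test with a fixed alpha.
# The conversion counts below are invented for illustration only.
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    return 2 * norm.sf(abs(z))                         # two-sided p-value

ALPHA = 0.05
p_value = two_proportion_ztest(conv_a=130, n_a=2400, conv_b=110, n_b=2400)

# The verdict is binary: significant or not. There is no "directional" middle ground.
print("significant" if p_value < ALPHA else "not significant", f"(p = {p_value:.3f})")
```

With these made-up numbers the test prints “not significant (p = 0.185)”, and no amount of squinting at the point estimate changes that verdict.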

When I find myself talking about “directional results,” I take comfort in the fact that I’m not alone. In fact, Probable Error (Matthew Hankins’ humorous stats blog) compiled a list of creative language for “non-significant results” found in peer-reviewed academic journals. Some of my favorites include:

  • “A nonsignificant trend toward significance”
  • “Teetering on the brink of significance”
  • “Not significant in the narrow sense of the word”
  • “Approaches but fails to achieve a customary level of statistical significance” 🤔

So if even career academics are prone to this sort of flawed logic, what should we mere experimentation mortals do when faced with non-significant results? Check out my next post on the power of confidence intervals for tips on gleaning actionable insights from non-significant metrics!