2016 in experimentation
(This is my annual round-up of my tweets about A/B-testing. 2015’s review is here.)
Elections are like experiments gone terribly wrong. You have different success metrics for each variation, participants constantly arguing among themselves on Facebook, Russian bots distorting the process, no randomisation, cohort effects all over the place, and a winner chosen based on the largest sample size. Either way, you don’t get the actual results until several years after the decision is made. It’s basically how we made websites before A/B-testing came around.
To win elections, though, A/B-testing alone apparently wasn’t enough in 2016: Bernie Sanders had a nice run with it, and it didn’t get him past the goal posts. This time the winner used A/B-testing on steroids, whatever that is. Perhaps it is a new incongruous category for Matthew Pennell’s otherwise useful D&D alignment guide to A/B-testing, although it is probably just referring to machine learning.
Science isn’t broken, Christie Aschwanden at FiveThirtyEight claimed, while also giving some practical advice on making data show anything you want (cough…). That advice perhaps also explains how 29 teams of academic analysts could find 29 different results in a blind test (original paper). The world of science, however, did as science does and self-corrected after the “replication crisis” of 2015. Gilbert et al. reviewed the Nosek “landmark study” of 100 psychology experiments and concluded that the pessimistic conclusions were quite unwarranted.
“Readers surely assumed that if a group of scientists did a hundred replications, then they must have used the same methods to study the same populations. In this case, that assumption would be quite wrong. Replications always vary from originals in minor ways of course, but if you read the reports carefully, as we did, you discover that many of the replication studies differed in truly astounding ways — ways that make it hard to understand how they could even be called replications.”
The American Statistical Association did release a useful clarification on how to interpret your p-values, but as Yoav Benjamini points out, it’s not the p-values that got us into this mess! In the same vein, Daniel Lakens did a nice review of the journal that banned p-values. Apparently that wasn’t the solution.
My favorite thing that happened in business experimentation in 2016 was by far Arjan Haring’s interview series with leading thinkers. It included the experimentation big kahuna Ronny Kohavi claiming we’re still in the alchemy days of experimentation (totally!), Prof. Jeffrey Pfeffer’s thoughts on leadership that is willing to experiment and learn, and Nobel memorial prize winner Al Roth on progress being made from series of experiments rather than individual experiments.
Arjan also wrote a nice piece on why your c-suite isn’t interested in experimentation. A popular topic, apparently, with ConversionXL also running a piece on how your organization doesn’t want your optimization to succeed. Depending on your organization, perhaps Jonas Downey’s insight that minimalism is not always the answer, or the reckoning that Mills Baker predicts is coming to design, will help:
In order to avoid losing its place atop organizations, design must deliver results. Designers must also accept that if they don’t, they’re not actually designing well; in technology, at least, the subjective artistry of design is mirrored by the objective finality of [data]. A “great” design which produces bad outcomes — low engagement, little utility, few downloads, indifference on the part of the target market — should be regarded as a failure.
If you’re reading this article, though, your organization probably already has a culture like the one Ben Dressler describes at Spotify. And your design methodology is already like the design science that Jason Hreha describes. But if you’re not sure where you’re at, PRWD created a nice conversion optimization maturity audit (although I’m assuming the answer is that you need their services). And if you’re not there yet, perhaps start with this piece by B.J. Fogg on a design process for persuasive technology. I also came across this older declaration that Microsoft wants to make product development more science than art.
This is clearly not for everyone, though. To quote Booking.com’s Stuart Clarke-Frisby:
Execution-wise, there’s finally a good reply to the local maximum/minimum criticism that is occasionally raised against incrementalism. I now just say that I’m obviously not using a Nelder-Mead algorithm but a Conjugate Gradient and forward this beautifully visualized tutorial on numerical optimization. Evan Miller had an interesting review of sequential A/B-testing (i.e. stopping a test early). And Pinterest showed how they A/B-test their SEO practices!
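For anyone wondering why stopping a test early needs special treatment at all, here is a minimal sketch of the underlying problem. It is not Evan Miller’s procedure, just a simulation of A/A tests (both arms share the same true conversion rate) that compares checking significance at every interim look with checking once at the planned end. The function names, the 10% conversion rate, the sample sizes and the number of looks are all made up for illustration.

```python
import math
import random


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions) so far: nothing to test
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_tests=300, n_per_arm=1000, looks=10,
                      rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate without).

    All parameter values are hypothetical. Both arms convert at `rate`,
    so every "significant" result is a false positive.
    """
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p = 1.0
        for look in range(1, looks + 1):
            # Collect the next batch of visitors for each arm.
            for _visitor in range(step):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            n = look * step
            p = z_test_p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look  # would have stopped early
        fixed_hits += p < alpha                  # final look only
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives when peeking at every look: {peeking_rate:.1%}")
print(f"false positives when testing once at the end: {fixed_rate:.1%}")
```

The peeking rate can never be lower than the fixed-horizon rate, because the final look is one of the looks. In practice it ends up far above the nominal 5%, which is the gap sequential methods are designed to close.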
Peep Laja wrote a rant on the engaging topic of what makes a good conversion rate, then managed to answer it wrong. And the only one who got it right was Stuart Clarke-Frisby: “The easiest way to make your conversion rate go up is to stop acquiring new customers”. I promise I’m not a Stuart Clarke-Frisby fanboy, but he does have a knack for phrasing things well.
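To spell out the arithmetic behind that quote, here is a small sketch with entirely hypothetical numbers: returning visitors who convert at 5% and newly acquired visitors who convert at 1%. The function name and the figures are assumptions for illustration only.

```python
def blended_conversion_rate(segments):
    """Overall conversion rate for a list of (visitors, conversion_rate) pairs."""
    total_visitors = sum(visitors for visitors, _ in segments)
    total_conversions = sum(visitors * rate for visitors, rate in segments)
    return total_conversions / total_visitors


# Hypothetical traffic mix, not taken from any real site.
returning = (10_000, 0.05)  # loyal visitors, convert well
new = (10_000, 0.01)        # freshly acquired visitors, convert poorly

with_acquisition = blended_conversion_rate([returning, new])
without_acquisition = blended_conversion_rate([returning])

print(f"with acquisition:    {with_acquisition:.1%}")
print(f"without acquisition: {without_acquisition:.1%}")
```

In this made-up example the rate rises from 3% to 5% once acquisition stops, while total conversions fall from 600 to 500. The ratio improves precisely because the business is doing worse.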
Just one more thing: Tristan Harris wrote an excellent review of the ways technology hijacks people’s minds through vulnerabilities and cognitive biases. I’m not suggesting you use it as inspiration for your next tests (although you probably will), but rather that you give some thought to the control we occasionally get to wield over the decision making of our users. You don’t want to find your experiments backfiring on you four years down the road because you didn’t take your constituency seriously!