Zoltar, Tom Hanks, Big (1988)

These are the ONLY things your testing tool should tell you.

I want to make testing easier for a wider audience — it’s a team sport.

I want more people to play and the stats stuff has to get a lot less nerdy, ok?

I don’t want to dumb anything down — I just want to find a better way to abstract the complexity of the underlying rules you choose, or thresholds you set — in order to give people DECISION SUPPORT.

People need to know what to do at the end of a test — and it’s a lot easier to get people playing large-scale testing if you simplify the often brain-numbing meetings that pass for test signoff.

Craig Sullivan is VERY interested in the stats, the details, the cohorts, the personas, the segments, the way the test was run. The CMO is not remotely interested in this stuff.

If I put a new line of clothing in M&S — I may get a window of time to run my new item, before a decision is made. If a new yoghurt goes on the shelf at your supermarket, sure enough it will have some way of making a decision on what to do after a trial period.

The parameters might be mind-numbing — you could have piloted this at thousands of stores. The price analysis alone could be several spreadsheets’ worth. But none of that matters when you stand in front of the CEO and he asks you, “That new line. We rolling it out nationwide or not?”

That’s it — it’s decision time. And in retail, manufacturing and many other sectors, it’s the decision and execution of that fork in the road that will count, not running an ersatz stats course for bored senior executives.

So — to the meat of this article. And this disclaimer is for the people who might otherwise say, “You’re glossing over some important things.”

I am — but I want to abstract the complexity of this work for human beings, and that’s important enough to warrant the effort. I’m also not prescribing here which statistical thresholds, sample sizes, or running times you should use — or how to handle the other potential biases that can arse your test up. That’s a deeper level, of course.

So, assume that you’ve run a good test estimating tool — that accounts for your expectation of lift — and has given you a prediction of how long you’re going to need to run it for (to get a sample capable of detecting that magnitude of shift).
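Such an estimating tool boils down to the standard two-proportion sample-size calculation: given a baseline conversion rate and the relative lift you hope to detect, it tells you how many visitors (and therefore how many days) you need. Here is a minimal sketch, assuming a 5% significance level and 80% power — the baseline rate, lift, and daily traffic numbers below are illustrative, not recommendations:

```python
import math

def required_sample_per_arm(baseline_rate, expected_lift, ):
    """Classic two-proportion sample size estimate (normal approximation),
    at 5% two-sided alpha and 80% power.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%
    expected_lift: relative lift you hope to detect, e.g. 0.10 for +10%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + expected_lift)
    z_alpha = 1.959964  # standard normal quantile for 97.5% (two-sided 5%)
    z_beta = 0.841621   # standard normal quantile for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def days_to_run(n_per_arm, daily_visitors, n_arms=2):
    """Translate the sample requirement into calendar time."""
    return math.ceil(n_per_arm * n_arms / daily_visitors)

# Hypothetical shop: 5% baseline, hoping to detect a +10% relative lift,
# 4,000 visitors a day split across two arms.
n = required_sample_per_arm(0.05, 0.10)
print(n, "visitors per arm;", days_to_run(n, daily_visitors=4000), "days")
```

Note how unforgiving the arithmetic is: halving the detectable lift roughly quadruples the required sample — which is exactly why the estimate belongs up front, before anyone falls in love with a variation.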

You run your test for this period. You stop the test. You run the analysis.*

At this stage, I want the tool to tell the Marketing, Product, Engineering, Board, Growth or HR departments what to DECIDE. I’m not interested in showing them some graphs.

If you have graphs on your AB test result for stakeholders, ask yourself this question:

“What decision are you expecting people to make based on these graphs?”

If the answer was “Nothing” then delete them. It’s interesting data, but to the audience that needs to make a business decision, it’s noise.

What they need to be told is one of these five answers. I am opening this article to challenge, improvement, tweaks or total ridicule — in order to iterate this idea into a set of rules which will abstract the job of calling tests.

Zoltar is a fortune-telling machine in the movie Big, with Tom Hanks — and is one of the touches I recall most. I want Zoltar to spit out one of these printed cards for the stakeholder:

(1) Winner

“Your test has finished and has won — and the statistics tell us we can be confident.”

In a Bayesian stack — either a threshold is applied to declare a winner or, more likely, a statement like “There’s an 80% chance this will make you 5M a year. There’s a 20% chance you’ll lose 100K a year.”
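That kind of money-shaped sentence can be generated directly from Beta-Binomial posteriors by Monte Carlo. A minimal sketch — the uniform Beta(1,1) prior, conversion value, and annual traffic figures below are illustrative assumptions, not a prescription:

```python
import random

def bayesian_readout(conv_a, n_a, conv_b, n_b, value_per_conversion=50.0,
                     annual_traffic=1_000_000, draws=100_000, seed=42):
    """Draw from Beta(1,1)-prior posteriors over the two conversion rates
    and translate the difference into money per year."""
    rng = random.Random(seed)
    b_wins, gain_if_win, loss_if_lose = 0, 0.0, 0.0
    for _ in range(draws):
        ra = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        delta = (rb - ra) * annual_traffic * value_per_conversion
        if delta > 0:
            b_wins += 1
            gain_if_win += delta
        else:
            loss_if_lose += delta
    p_win = b_wins / draws
    avg_gain = gain_if_win / max(b_wins, 1)       # average annual upside when B wins
    avg_loss = loss_if_lose / max(draws - b_wins, 1)  # average annual downside when B loses
    return p_win, avg_gain, avg_loss

# Hypothetical test: 500/10,000 conversions on control, 560/10,000 on variation.
p, gain, loss = bayesian_readout(500, 10_000, 560, 10_000)
print(f"There's a {p:.0%} chance this makes you {gain:,.0f} a year; "
      f"a {1 - p:.0%} chance you lose {abs(loss):,.0f} a year.")
```

The point is that the output is already the stakeholder sentence — probability, upside, downside — with no graphs required.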

(2) Not enough data

“The sample size so far is smaller than we expected — the test needs (estimate) more time to reach a conclusion.”

(3) Loser

“The original version has won — and the statistics tell us we can be confident.”

(4) Inconclusive

“We can’t be confident that either the original or the variation is any better or worse than the other.”

(5) Warning

“Your test variation is losing [value] per [time period]. By the end of the test period this loss will be [value]. Continue y/n?”
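The five cards above can be sketched as a single decision function. The thresholds here (95% to win, 5% to lose, a money figure for the mid-test warning) are illustrative assumptions, exactly the kind of under-the-hood detail the article argues should be hidden behind the card:

```python
def zoltar_card(p_variant_beats_control, reached_planned_sample,
                projected_loss_to_date=0.0,
                win_threshold=0.95, lose_threshold=0.05, warn_loss=1000.0):
    """Map test stats to one of the five Zoltar cards.

    p_variant_beats_control: probability the variation beats the original
    reached_planned_sample:  has the pre-estimated sample size been reached?
    projected_loss_to_date:  money the variation is losing so far (positive = loss)
    All thresholds are illustrative, not prescriptions.
    """
    if not reached_planned_sample:
        if projected_loss_to_date > warn_loss:
            return "Warning"          # (5) variation is bleeding money mid-test
        return "Not enough data"      # (2) keep running
    if p_variant_beats_control >= win_threshold:
        return "Winner"               # (1) roll it out
    if p_variant_beats_control <= lose_threshold:
        return "Loser"                # (3) the original version has won
    return "Inconclusive"             # (4) neither side can be trusted

print(zoltar_card(0.97, True))
```

Swapping the internals (frequentist p-values, a Bayesian stack, sequential rules) only changes how the inputs are computed — the five cards the business sees stay the same.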

So — I’m looking forward to seeing whether we can argue, shape, tweak or otherwise pummel these into some high level simplification of the test decisions we must all make.

At least we can then have mind-numbingly detailed and argumentative meetings about how Zoltar actually spits out the cards, which drives the logic to support the business. If we decide to change this method (for example, running a Bayesian ‘stack’) then we can change our under-the-hood method without changing the flag we give to the business.

We need to abstract this complexity, to support the business — to help them make the decisions they need to grow, with a proper understanding of the risks and consequences.

I also think Zoltar would be a great name for an internal (or commercial) AB testing tool ;-o

C.