Notes to my younger self: AB testing

Chris Kenwright
3 min read · Dec 30, 2015


Lessons learned from running split tests in an eCommerce environment.

You don’t have the traffic

Open R, call power.prop.test(…) and look at the sample size required. That level of traffic will never be hit because you aren’t working at Google or Netflix or Amazon. Classical statistics are not as helpful as the literature would have you believe; the answers are not black and white.
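For instance, with illustrative numbers (a 3% baseline conversion rate and a hoped-for 10% relative lift), the answer is on the order of fifty thousand visitors per variant:

```r
# Sample size per variant to detect a 3% -> 3.3% conversion lift
# at the conventional 5% significance level and 80% power.
power.prop.test(p1 = 0.03, p2 = 0.033, sig.level = 0.05, power = 0.80)
```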

The purpose of statistics is to be as un-wrong as possible and make informed decisions. Learn about credible intervals. Learn about risk. A split test doesn’t really tell you what is best; it gives you a probability of being better. And a probability of being worse. Managing this risk, and its value, is what the business owners need.
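One way to put a number on “probability of being better” is Bayesian. A minimal sketch, with made-up counts and a flat Beta(1, 1) prior:

```r
# Posterior conversion rates for control (A) and variant (B).
a_success <- 130; a_trials <- 4000
b_success <- 150; b_trials <- 4000
draws <- 100000
post_a <- rbeta(draws, 1 + a_success, 1 + a_trials - a_success)
post_b <- rbeta(draws, 1 + b_success, 1 + b_trials - b_success)
mean(post_b > post_a)                       # probability B beats A
quantile(post_b - post_a, c(0.025, 0.975))  # credible interval for the lift
```

That probability, and the credible interval around the lift, is the risk picture the business owners actually need.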

Look at the business cycles and run tests over complete cycles, which is probably weeks.

Monitor traffic during the test. There will be bugs. You can estimate the likelihood of a problem while the test runs and stop early if it is all going wrong.
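One cheap monitor, with assumed counts: check that the observed split matches the split you configured. A 50/50 test drifting like this usually means a bug in bucketing or tracking, not a winning variant:

```r
# Did 10,000 visitors really split 50/50? A tiny p-value says
# the assignment or the tracking is broken.
binom.test(5200, 10000, p = 0.5)
```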

Run at least one AA test. Just to be sure.

Success Criteria

Determining success is a compromise.

To make a decision: trigger success as soon as possible, on the form submit button, the next stage of the booking process, wherever the optimisation lies.

To make a difference: trigger success when a KPI is hit. That is probably when a sale is made. Trigger on anything else and you aren’t hitting the bottom line. The trade-off: the further the success event sits from the change you made, the more confounding variables sit in between.

Understand Segments

Over what segment are you going to run your test?

Business segments: which product types or markets? which devices? which users?

Rather than looking at this from a simple testing perspective, consider:

Operational constraints: can you take the findings from the tested traffic, roll them out to the business at large and keep a well-maintained codebase? Nobody will be able to manage the complexity of running a different version of the code for every product line.

Reporting constraints: how do we report uplift from testing to the business? What is it worth over the course of the year, and how will this be reconciled with budgets, financing and forecasting? Web development costs money, which is a seriously important reason to report uplift against KPIs.

Technical constraints: your split tests probably depend on cookies. They might depend on users. If you can’t get the data, clean it up and report on the conversion, you can’t run the test. You need the infrastructure to segment your traffic — if you can’t pick out users with 2 or more items in their shopping cart, you can’t test this segment.
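A sketch of that last point, with assumed column names: if session data lands in a data frame, isolating the segment is one line. If you can’t write that line, you can’t run that test.

```r
# Illustrative session data; cart_items and converted are assumed fields.
sessions <- data.frame(
  user_id    = c("u1", "u2", "u3", "u4"),
  cart_items = c(0, 3, 1, 2),
  converted  = c(FALSE, TRUE, FALSE, TRUE)
)
# The "2 or more items in the cart" segment.
segment <- subset(sessions, cart_items >= 2)
```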

Now, which segments are available to you?

Don’t test everything

Which is what people say. Sometimes decisions will be made for reasons other than performance. When a test result is unclear, fall back on strategic objectives and operational practicalities.

There are more ideas than time to test them and most tests won’t be worth running. Not everything can be valued.

What’s an idea worth? Well, you run a split test to find that out, but if the expected difference is too small to detect, don’t run it.
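You can turn the power calculation around to make that call. Given the traffic you actually have (8,000 visitors per variant here, purely illustrative), solve for the smallest detectable effect instead:

```r
# Leaving p2 out asks power.prop.test to solve for it: the smallest
# uplifted conversion rate this much traffic can reliably detect.
power.prop.test(n = 8000, p1 = 0.03, power = 0.80)
```

If the plausible effect is smaller than the answer, the test isn’t worth running.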

Don’t over segment

As soon as there is a quantifiable difference, the qualitative questions will follow. Explaining why you didn’t segment is hard, especially with just a few segments. Be aware of the Bonferroni correction. Adjust for it, but don’t avoid segmentation.
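The adjustment is one line in R. With made-up p-values from four segment breakdowns:

```r
# Bonferroni scales each p-value by the number of comparisons (capped at 1).
p_raw <- c(desktop = 0.012, mobile = 0.240, new = 0.038, returning = 0.650)
p.adjust(p_raw, method = "bonferroni")
```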

That’s where the next ideas come from.

Tools

Building split test reporting is straightforward. Building the reporting application, less so. Optimizely spent a long time running shonky statistics.

Understand benchmarks

Only about one in ten split tests is successful. Google and Microsoft both report numbers in this range.

When you run split tests, as a business you are competing on your ability to identify customer problems, develop better alternatives and execute them.

This drives conversion up. If you drive conversion up faster than your competitors, you can spend more on getting traffic. You win.

Summary

You came to split testing with a rigorous set of rules from dogmatic statisticians. You have learned that split testing is littered with practical constraints and that the goals are often unclear; there is an art to test design and execution.
