A non-scary guide to AB Testing

Sean Pardo
7 min read · Sep 20, 2017


New process, New fear

If I remember correctly, it was about four years ago when I was first told our product team would be introducing AB split testing into our analytics process. I still remember the wave of anxiety that overcame me: the math, the process, and the data all seemed like insurmountable feats of comprehension, better left to analytics gurus and math majors. Fortunately, after being exposed to a few different formats of AB and multi-variable testing over the past few years, I realized that I really didn't have anything to worry about, and that with a few helpful tips I could have been well on my way to getting clear direction on the performance of the features we were building.

With some quick guidelines and best practices, I think anyone with a little time and interest can grasp AB split testing concepts and apply them within their own teams. Hopefully after reading you’ll be able to do the following:

  1. Understand the basic components of an AB Test.
  2. Use Excel to validate your Test results.
  3. Know some core guidelines and common misconceptions.

Foundations

Most of us have heard of AB testing before, but I will provide a review on fundamentals before we move onto its application.

AB testing, when applied to the web, is designed to measure the impact of a new feature on the current web experience. A new feature can be a new button style or a new layout for displaying products on your e-commerce site. The methodology allocates a percentage of your Default production traffic to two independent experiences for comparison: a Control experience that is identical to your Default production experience, and a new alternative experience (Sample A). Having a Control experience independent from your Default is critical to maintaining clean data for comparison. Because Control and Default are the same, their performance metrics should look similar (within +/-4%). Any drastic difference in performance is a key indicator that something is awry. It is equally critical to select a performance metric before the experiment begins (e.g. click conversion).

Figure 1 below illustrates a common AB split test setup. After our Control and Sample A experiences have been established and a performance metric targeted, we can run our test and observe which experience has the desired effect on clicks (our chosen metric).

Figure 1.
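To make the comparison concrete, here is a minimal sketch (in Python) of how click conversion would be computed per experience. The counts and experience names are made up purely for illustration:

```python
# A sketch of the per-experience metric comparison from Figure 1:
# click conversion = clicks per view. All counts below are hypothetical.

def click_conversion(clicks, views):
    """Return clicks per view as a percentage."""
    return 100.0 * clicks / views

experiences = {
    "Default":  {"clicks": 4_100, "views": 80_000},
    "Control":  {"clicks":   520, "views": 10_000},
    "Sample A": {"clicks":   610, "views": 10_000},
}

for name, counts in experiences.items():
    rate = click_conversion(counts["clicks"], counts["views"])
    print(f"{name:9s} click conversion: {rate:.2f}%")

# Default and Control serve the identical experience, so their conversion
# rates should land within roughly +/-4% of each other; Sample A is the
# candidate we are actually evaluating.
```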

Experiment Validation

In order to ensure that we have found a valid outcome, we need to apply some calculations to our experiment results. This step is critical to any AB experiment conclusion. Without it, we could be promoting features to production that end up negatively impacting our current experiences. It would be both very easy and very wrong to declare a winner simply by comparing metric data and choosing the best performer. Without running a statistical significance test, you cannot determine whether the difference observed between the samples was due to the merits (or lack thereof) of each experience or due to random chance. An easy way to avoid this issue is to run a significance test once you have a good amount of stable data (we'll get into what stable means later).

For our AB testing format, we will use the standard two-tailed T-Test. This calculation will allow us to measure the results of our button click conversions for our Control and Sample A experiences and assess whether any difference we measure is statistically significant.

The T-Test will weigh our data and generate two important values: the critical value and the T-stat. The critical value defines a +/- range that, when layered on the test's t-distribution curve (which looks much like a standard normal curve at typical web sample sizes), marks off a zone running from the critical value out to the tail of the curve (colored green in Figure 2). If the T-stat falls within that zone, the result is statistically significant and you can be reasonably confident that any measured differences in metrics are due to the merits or faults of the experience and not random chance. If the T-stat falls outside of that zone, the result is inconclusive: any impacts you observe cannot be distinguished from random chance.
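If you like seeing the decision rule in code before touching a spreadsheet, here is a rough sketch in Python using scipy. The alpha, degrees of freedom, and T-stat values are made up for the example:

```python
# A sketch of the decision rule described above: compare the T-stat against
# the two-tailed critical value for your chosen significance level.
from scipy import stats

alpha = 0.05    # significance level (95% confidence); a common but arbitrary choice
df = 1998       # degrees of freedom (n_control + n_sample_a - 2); hypothetical
t_stat = 2.31   # hypothetical T-stat from the experiment

critical_value = stats.t.ppf(1 - alpha / 2, df)  # the "t Critical two-tail" value

if abs(t_stat) > critical_value:
    print("Significant: the difference is attributable to the experiences.")
else:
    print("Inconclusive: the difference could just be random chance.")
```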

Fortunately, this T-Test analysis can be done within Excel fairly quickly by following the steps below:

Step 1

Make sure you have the Excel Analysis ToolPak add-in installed. It's free and ships with Excel. If you do not see the Data Analysis button under the Data tab, you need to enable it.

[ Click the File tab, click Options, and then click the Add-Ins category. In the Manage box, select Excel Add-ins and then click Go. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK ]

Step 2

Now that you have the tool, select the Data Analysis button.

Step 3

From the drop-down, select "t-Test: Two-Sample Assuming Equal Variances".

Step 4

Export your data from your analytics tool and drop it into a worksheet. Highlight your data ranges for Control and Sample A and select "OK".

Step 5

Your analysis will output a table of results where you'll find both the "t Stat" and "t Critical two-tail" values. Comparing them lets you verify whether your T-stat falls beyond your critical value and conclude whether the test is significant or inconclusive.
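The same analysis can also be reproduced outside Excel. Here is a sketch in Python using scipy's equal-variance t-test; the per-visitor click data below is simulated, standing in for whatever your analytics tool exports:

```python
# A sketch of Excel's "t-Test: Two-Sample Assuming Equal Variances" in Python.
# control and sample_a are hypothetical per-visitor click outcomes
# (1 = clicked, 0 = did not click).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.binomial(1, 0.052, size=10_000)   # ~5.2% click conversion (made up)
sample_a = rng.binomial(1, 0.061, size=10_000)  # ~6.1% click conversion (made up)

t_stat, p_value = stats.ttest_ind(control, sample_a, equal_var=True)
df = len(control) + len(sample_a) - 2
critical_value = stats.t.ppf(0.975, df)  # two-tailed critical value at alpha = 0.05

print(f"t Stat:              {t_stat:.3f}")
print(f"t Critical two-tail: {critical_value:.3f}")
print(f"P(T<=t) two-tail:    {p_value:.4f}")
print("Significant" if abs(t_stat) > critical_value else "Inconclusive")
```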

Traffic Allocation

Another important piece of your testing setup is your traffic allocation. With any test, you would never want to expose a significant share of your users (or their respective revenue) to an unproven experience.

A standard breakdown is: 80% of total traffic to your current Default web experience, 10% to your Control experience, and 10% to your Sample A experience. This keeps most of your flow, and your revenue stream, intact on Default while still giving you vital feedback data from the test samples. Depending on your daily traffic, you might need to increase the traffic to your test samples and push to 15% for both Control and Sample A. Anything above 15% per test sample is a bit too aggressive in my experience. Some web experiences with large traffic flows (10K+ unique visitors) can reduce the allocation to the test samples and still reach stable results more quickly.
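In practice, the split is usually done by hashing a stable visitor ID so each user always sees the same experience. Here is a minimal sketch of the 80/10/10 allocation; the experiment name, bucket labels, and visitor ID are all placeholders:

```python
# A sketch of deterministic 80/10/10 traffic allocation, assuming each visitor
# carries a stable ID (cookie, account id, etc.). Hashing keeps a visitor in
# the same bucket on every page view.
import hashlib

ALLOCATION = [("default", 80), ("control", 10), ("sample_a", 10)]  # percentages

def assign_bucket(visitor_id: str, experiment: str = "button-test") -> str:
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    slot = int(digest, 16) % 100  # 0-99, effectively uniform across visitors
    threshold = 0
    for bucket, pct in ALLOCATION:
        threshold += pct
        if slot < threshold:
            return bucket
    return "default"  # unreachable when the percentages sum to 100

print(assign_bucket("visitor-12345"))  # e.g. "default"
```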

Stability

Stability of results can be observed by monitoring the difference between your performance metrics on Default and Control. Using the example of click conversion: since they are the same experience, they should receive roughly the same number of clicks per view. Anything at or below 4% of variance between Default and Control shows good stability. Usually Default and Control start off drastically different, but stabilize toward parity over time. I would consider about two weeks of data inside 4% of variance as stable.
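One reasonable reading of that rule, sketched below, is to compare daily conversion rates for Default and Control and call the test stable once the relative difference stays within 4% for about two weeks. The daily rates here are invented:

```python
# A sketch of the stability check: Default vs Control click conversion, day by
# day, must stay within a 4% relative difference for a two-week window.

def is_stable(default_rates, control_rates, tolerance=0.04, window=14):
    """True if the last `window` days are all within `tolerance` relative difference."""
    recent = list(zip(default_rates, control_rates))[-window:]
    if len(recent) < window:
        return False
    return all(abs(d - c) / d <= tolerance for d, c in recent)

default_rates = [0.050, 0.051, 0.052, 0.050] * 4   # 16 days of made-up data
control_rates = [0.049, 0.052, 0.051, 0.051] * 4

print(is_stable(default_rates, control_rates))  # True for this made-up data
```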

Duration

I usually set experiment completion dates for roughly 4–6 weeks. Within that time I can usually see a stable Default and Control relationship, plus a good amount of traffic through the Control and Sample A experiences. Traffic volume varies largely from site to site, but with 4–6 weeks of total data and 2 weeks of a stable Default/Control, you can probably run a T-test and see some conclusive data.

Potential Issues

You are now armed with the ability to compare two sample experiences and conclude whether there is a meaningful difference between the two. With this ability comes newfound responsibility to ensure you are testing the right candidates, analyzing the appropriate metrics and using some common sense. Here are some small guidelines I picked up along the way:

Pick a target

A sure-fire way to have tests run awry is to not select a target metric. If you decide to run a test of two experiences simply to "see what happens," something will probably happen, and odds are it had nothing to do with your experiences.

Start Small

It is very easy to get carried away building really cool sample experiences. I found it best to start with the lightest-effort changes. Finding the order of magnitude of impact from a couple of small changes (e.g. button size, layout shifts, copy) can really give you insight into where the value is on your site. That information can help guide bigger changes and save time and experiments.

Quick experiments

Ending experiments too quickly is another sure-fire way to tamper with and disrupt successful AB testing. Often testers will see positive metrics from their candidates and prematurely declare a winner. Candidate metrics can easily fluctuate within the first few weeks of testing. The easiest way to avoid this volatility is to stick to a predetermined test duration. Even if early results look stable and significant, I found more consistent results when allowing metrics to stabilize for a few weeks (~4–6 weeks in total). Prematurely pushing winners to 100% traffic can often result in costly mistakes.

The brilliant Julie Zhou also wrote on the woes of data decision making and listed out some common issues to avoid. Check it out here.

I’m always looking to fine tune this process. Feel free to send responses or contact me directly (seanpardo@gmail.com).
