
Practical A | B Testing for Practically Any Marketer
If you have access to data and a spreadsheet tool, you can conduct powerful A/B tests.
“But wait! Why would I do that when there are countless applications that will A/B test everything for me?”
Yes, there are plenty of applications that will conduct A/B testing for you. Truthfully, they make it quite easy. However, as a marketer, I don’t feel comfortable just accepting a result. I want to understand the result, why it works that way, and why it could be flawed.
This is not to dissuade you from using A/B testing software like Optimizely or VWO (Visual Website Optimizer). I’m merely suggesting you should know the magic behind the numbers.
Instead of throwing a bunch of statistical jargon, formulas, and theories at you, I’m simply going to work through a practical A/B testing example in Excel and interpret the results. I have a Master’s in Economics and a bunch of coursework in applied statistics, and I still find the way most stats-related material is taught incredibly boring and stuffy.
Hopefully, I can change that.
I’m using Excel as a spreadsheet tool. However, other apps like Google Spreadsheets will work.
NOTE: In these examples, I will be referencing the FREE Excel add-in, “The Data Analysis Toolpak.” Microsoft provides some easy-peasy installation instructions for this amazing add-in.
SCENARIO 1
An email marketer is interested in knowing if a landing page call-to-action button variation is better than the existing landing page CTA variation.
Sampling Methods: The marketer randomly selects 200 people from her subscriber list. She then randomly splits the group into two groups of 100.
Length of Time for A/B Test: The marketer decides the correct time frame is 30 days.
Key Experiment Facts: The marketer will send the emails to the two groups over the course of 30 days and record and export all of the clicks on the CTA into a spreadsheet for testing.

These are the click results of the experiment.
We are ready to analyze some data.
Step One: Provide some basic summary statistics. Specifically, we are interested in the average number of clicks for each group, also known as the mean.
This is the Excel formula:
=average(array)

This is my average formula for group Clicks A. Do this for Clicks B as well.
Next, obtain the variance.
What’s the variance? If you were to visualize the data, you would obtain a distribution of the data.

The variance measures the spread of the data, or how far the data points fall from the mean. You could also calculate the standard deviation, but the t-test uses variance, so that’s the more useful metric to obtain here.
This is the formula for variance in Excel, and how it looks in my spreadsheet for group Clicks A.
=var(array)

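If you want to check these summary statistics outside the spreadsheet, Python’s standard library computes the same things. This is a minimal sketch; the click counts below are hypothetical stand-ins, not the actual experiment data.

```python
import statistics

# Hypothetical daily click counts for one variation (not the actual experiment data).
clicks_a = [402, 380, 429, 355, 410, 390]

mean_a = statistics.mean(clicks_a)     # same as Excel's =AVERAGE(array)
var_a = statistics.variance(clicks_a)  # same as Excel's =VAR(array): sample variance (n - 1 divisor)

print(mean_a, var_a)
```

The `statistics.variance` function uses the n − 1 (sample) divisor, which matches Excel’s `=VAR()` rather than `=VARP()`.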
Now, we could easily obtain correlations, skew of the dataset, and much more. But the key with statistics is to not overwhelm yourself with data just for the sake of data. In this case, we have a t-test that will tell us all we need to practically know.
This is how my spreadsheet looks at this step. Your numbers should match.

Because the t-test looks for a meaningful difference in means, I’ll usually calculate the difference in averages to put the results into better context.
Step Two: Conduct a T-test
The next step is to statistically test the differences in group Clicks A and group Clicks B.
This is where A/B testing gets murky.
Many marketers simply want to pick a “winner,” and that winner is usually chosen by which variation gets the most clicks.
Statistical testing provides more rigor by providing us with information on how confident we can be in our results that one variation outperforms the other. Note, there are never winners in statistics — only cases for evidence in favor of something under certain conditions.
In order to understand how to provide more rigorous testing, we will go through the steps required to conduct a t-test.
1. Set a Confidence Level
The significance level, or alpha (α), is the probability we are willing to accept that the result of our t-test happens by random chance alone (or is potentially influenced by other factors not present in the design). The usual α levels are .05 and .01. These correspond to 95% and 99% confidence levels.
2. What’s a Confidence Interval?
The confidence interval is a range of numbers (a lower and upper bound) that is likely to contain the true mean of the data. The tighter and more narrow the confidence interval, the better for picking a “winner.” The wider the interval, the greater the concern that even a statistically significant result may not be enough to pick a winner.
3. Hypothesis Test
The default hypothesis test for comparing means within the t-test context is:
Null Hypothesis: There is no difference between the means of the groups (0 difference)
Alternative Hypothesis: There is a difference between the means of the groups.
Though you can select certain values, I highly, highly suggest sticking with the default and most commonly used hypothesis test.
Rejecting the null hypothesis means there is a statistically significant difference between the means (something greater than 0). Failing to reject the null means the difference between the means is essentially zero.
4. Statistical Interpretation of Results
Once the hypothesis test is set, you’ll need a way of interpreting the results of your test.
There are two commonly used methods that give the same result.
The first method involves the t-statistic. If the t-statistic is greater than the critical t-value for your confidence level, you reject the null hypothesis and conclude the difference in means is statistically significant. The critical t-value for the 95% confidence level is 1.96. For the 99% confidence level, it is 2.576.
The second method uses the p-value.
The p-value is the probability of observing a difference in means at least this large if there were truly no difference between the groups. If the p-value is less than the chosen α, reject the null hypothesis. If it’s exactly equal to α, the common rule is still to reject the null hypothesis, but be very wary of any practical application to marketing efforts.
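This decision rule can be sketched in a couple of lines. The p-value below, .1451, is the one this article’s test produces later on; the alpha is the .05 chosen before the test.

```python
# p-value from the t-test and the alpha chosen before running it.
p_value = 0.1451
alpha = 0.05

# Reject the null only when the p-value falls at or below alpha.
if p_value <= alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(decision)  # prints "fail to reject the null hypothesis"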
5. Choose the Appropriate T-test
The type of t-test you choose depends on the quality and distribution of the data, along with what kind of question you want to answer.
The independent samples t-test requires there be no overlapping subjects.
Practically, this means that those who received variation A did not receive variation B, and vice-versa.
The paired samples t-test is useful when you are testing the mean difference within the same group. This is not used in the common A/B test, but is used extensively in population samples (where all of your subscribers would be tested as opposed to just samples).
We will be choosing independent samples since there are no overlapping subjects.
You will also have to choose between a one-tail or two-tail test. This isn’t the easiest thing to explain. The tail involves the “extremes,” or end tails of a data distribution.
A two-tailed test is useful when the difference in means could go in either direction, whereas a one-tailed test assumes the difference can only go in one direction.
In A/B testing, the direction can go both ways, so choose a two-tailed test. Practically, we always choose a two-tailed test.
Finally, we must select whether our data display equal or unequal variances. This doesn’t have a drastic effect on the final outcome of the test. However, because we already computed the variances above, you can easily decide whether equal or unequal is the better choice for the test.
We end with, for this example, an independent samples two-tailed t-test with unequal variances.
6. Conduct the T-test
Finally!
First, I will do it in the spreadsheet using the following formula:
=ttest(array1, array2, tails, type)
In my spreadsheet, this is:
=ttest(b2:b31, c2:c31, 2, 3)
where 2 represents two tails and 3 represents unequal variances.
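For anyone curious about the arithmetic behind that formula, here is a minimal Python sketch of the unequal-variances (Welch’s) t-statistic. The click counts are hypothetical, and the comparison uses the 1.96 critical value from earlier rather than an exact t-distribution lookup.

```python
import math
import statistics

def welch_t_stat(a, b):
    """Two-sample t-statistic assuming unequal variances (Excel's type 3)."""
    var_a = statistics.variance(a)  # sample variances, like Excel's =VAR()
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Hypothetical click counts (not the actual experiment data).
clicks_a = [402, 380, 429, 355, 410, 390]
clicks_b = [361, 340, 402, 333, 374, 356]

t_stat = welch_t_stat(clicks_a, clicks_b)
# Compare |t| against the 1.96 critical value used earlier for 95% confidence.
if abs(t_stat) > 1.96:
    print("Reject the null: the difference in means is statistically significant.")
else:
    print("Fail to reject the null: the difference is a statistical zero.")
```

Excel’s TTEST goes one step further and converts this statistic into a p-value for you, which is what the spreadsheet formula returns.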

Next, I’ll conduct the test using the data analysis toolpak.

First, click on the “Data” tab.

Next, click on “Data Analysis.”

Now, scroll down to t-test: Two-Sample Assuming Unequal Variances.

Next, fill in the box like I have here. Alpha 0.05 refers to a 95% confidence interval.
7. Analyze Results
Results from formula
The two-tailed P(t) is .1451. Because this is greater than our alpha of 0.05, we cannot reject the null hypothesis. (The TTEST formula returns the p-value directly; it’s up to us to compare it to our chosen alpha.)
.1451 > 0.05
That is, the probability that the difference in means is due to chance is greater than the 5% threshold we allowed. There is a probability of .1451 (roughly 15%) that the difference in means is due to chance alone. 15% is greater than 5%, so we fail to reject the null.
Results from data analysis toolpak

Here, we can see that the t-stat of 1.4782 is not greater than the critical value of 1.96. Therefore, we fail to reject the null hypothesis and conclude there is no meaningful difference between the groups.
The P(T<=t) two-tail is 0.1451, the same as we obtained from the t-test formula method.
Step 3: Discuss the Results
The most important aspect of data analysis is having the ability to discuss the results with fellow marketers and non-technical audiences. That is, you have to be able to put these into real world terms.
While the CTA in group A received, on average, 91 more clicks than group B, a statistically rigorous process like the t-test shows that the results were not statistically significant. Essentially, the difference between the groups was a statistical zero.
Therefore, claiming that A “wins” over B, or that B is somehow inferior to A, is not supported by any justifiable evidence.
Step 4: What’s Next?
You can create another variation and run the same test using the same methods and see if the outcome is different.
However, a potentially interesting experiment would be to sample two groups and give them the same exact CTA. If the results showed no statistical significance between groups, you would conclude that your sampling method worked.
But in some cases, one group performs better with the same CTA.
How does this happen?
This is the concept of randomness and chance at work. Both groups can receive the same CTA and, depending on multiple other factors, show a statistically significant difference in clicks over a month-long period.
This is something marketers don’t consider, as the premise of an A/B test implies that one variation must be better than the other.
Try conducting an A/A test and see what happens.
Bonus! Step 5: Calculating the Confidence Interval
It’s not necessary to know the data range of the confidence interval in order to conduct a t-test. However, it certainly helps to know the data range if you want to embrace your inner statistician.
The confidence interval is the lower and upper bound limits in which the true value of the population exists. Practically, this is the range of values that include the average of the group. The wider the interval, the more unsure we are about the average clicks the CTA would receive if we were to sample our entire subscriber base.
In order to obtain the confidence interval, we need two statistics we haven’t talked about yet: the standard error of the mean and the sample standard deviation.
Using the data analysis toolpak, Excel will provide a sizeable output of descriptive statistics.


And this is the result you obtain:

Since we only sampled part of a (presumably) larger subscriber list, we don’t actually know the standard deviation of the population. In these cases, we use the standard error to calculate a confidence interval.
You can obtain the confidence interval’s margin of error in Excel very easily using the confidence formula. Give it a go:
Conf. Interval: =confidence(alpha, standard deviation, n)
Conf. Interval = confidence(0.05, 270.09, 99) = 53.20
Note: because we are dealing with sample data and not population data, we subtract one from the sample size, giving us 99.
Now, for the lower bound confidence interval, it is simply the mean of group Click A minus 53.20. For the upper bound, you add 53.20 to the mean of group Click A.
This results in a confidence interval of (348, 456).
402 certainly falls within that confidence interval. However, the number 53.20 represents a margin of error, which corresponds to what we saw earlier in the variance of our clicks. That means, during the experiment, clicks on variation A could fluctuate by plus or minus 53.
In Closing
I hope you have a better grasp on the concept of A/B testing.
It’s not about picking the winner on a frequency basis, but rather finding a way to test how variations performed under rigorous statistical conditions.
Frequency is certainly one method by which to accomplish this. However, unless the difference between variation clicks is much larger, a test like the t-test opens the results up for more in-depth discussion.
Consumer choice doesn’t happen in a vacuum. There are always multiple other variables at play that can influence something like CTA clicks. Understanding the how and why of A/B testing expands your business strategy by taking these factors into account and searching for additional data that can better guide your decision making.