Statistics for A/B testing and A/B testing mastery

This is part 10 of 12 in my series reviewing the Conversion Optimization Minidegree, provided by CXL Institute.

Fernanda Leal
7 min read · Nov 11, 2021

Over the past week, I had the opportunity to delve into two more courses that make up the Conversion Optimization Minidegree: “Statistics for A/B testing” and “A/B testing mastery”.

Here is a brief compilation of my main lessons!

A/B testing mastery

A/B testing is one of the most powerful tools a company can adopt to make confident, data-driven decisions. Despite this, some very common mistakes can compromise your results:

  • Insufficient data to support decisions;
  • Misuse of statistical methods;
  • Lack of understanding as to the validity of the experiment;
  • Lack of understanding as to the real winner of the experiment;
  • And many more.

To avoid these and other errors, the “A/B testing mastery” course, taught by Tom Wesseling, aims to clarify the planning, execution, and results of A/B tests.

Intro pillar:

When to use A/B testing: one of the first aspects discussed by Tom concerns the use of A/B testing. According to him, there are 3 main uses:

  1. To prevent a new functionality from negatively impacting KPIs relevant to your company;
  2. To research and understand which elements really impact your website’s conversion;
  3. To optimize the website focusing on the results obtained by the company.

Planning pillar:

There are a few things to consider before you start planning your A/B test. Here they go:

1. Amount of data:
According to Tom, if you have less than 1,000 conversions (transactions, leads, clicks, and so on) per month, A/B testing is not recommended.

Some tools make it easier to calculate these numbers, such as the AB Test Guide Calculator.

To avoid false positives and false negatives, Tom recommends using a high statistical power (above 80%) and a high significance level (above 90%).
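The course doesn't prescribe a formula, but the standard two-proportion sample-size calculation behind tools like the AB Test Guide Calculator can be sketched with just the Python standard library. The baseline rate and minimum detectable effect below are made-up example numbers:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)          # rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p1 - p2) ** 2
    return ceil(n)

# Example: 5% baseline conversion rate, 10% relative lift to detect
print(sample_size_per_variant(0.05, 0.10))
```

With these example numbers you would need on the order of 31,000 visitors per variant, which illustrates Tom's point: below roughly 1,000 conversions per month, most tests simply cannot reach a conclusive sample in a reasonable time.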

2. KPI chosen:
Choosing the KPI is also part of the A/B test planning. Some metrics (like clicks, for example) are less effective than others, but it all depends on your goal.

  • Clicks: can be a good source of insights, especially when the site doesn’t have many transactions yet;
  • Behavior: tends to be a more effective metric than clicks;
  • Transactions: refers to the number of transactions, whether sales, leads, etc.
  • Revenue per user: this is an even more advanced metric than the number of conversions;
  • Potential lifetime value: the “golden goal metric”.

Another concept discussed by Tom is the “Overall Evaluation Criterion” (OEC). It stresses the importance of defining a single goal metric for the entire organization, avoiding potential conflicts of interest and aligning actions. The OEC should be a short-term metric that predicts long-term value.

3. Research:
Many companies and optimizers perform a lot of testing but get unimpressive results because they don’t define the reasons behind the experiments.

To make the research process easier, Tom suggests a framework called 6V Conversion Canvas:

Value:

  • What is your mission?
  • What is your strategy?
  • What are your short- and long-term goals?

Versus:

  • Who are your competitors?
  • What is your audience overlap?
  • Which tools are they using?

View:

  • Where do visitors start their journey?
  • Is there any difference between new and existing customers?
  • Is there any difference per devices?
  • Where do visitors come from?
  • Do they already have a product in mind?
  • Do they already know the brand?

Pro tip: create behavioral segments. A typical e-commerce flow could be:

- All users on your website with enough time to take action;
- All users on your website with at least some interaction;
- All users on your website with heavy interaction;
- All users on your website with clear intent to buy;
- All users on your website that are willing to buy;
- All users on your website that succeed in buying;
- All users on your website that return with intent to buy more.
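Tom doesn't tie these segments to specific events, but as an illustration (the event names and thresholds here are my own assumptions, not from the course), the funnel above could be expressed as ordered predicates over simple per-user counters, checked from deepest to shallowest:

```python
# Hypothetical per-user event counters; names and thresholds are illustrative.
def segment(user):
    """Return the deepest funnel segment a user qualifies for."""
    checks = [
        ("purchased_again", user.get("repeat_purchases", 0) >= 1),
        ("purchased", user.get("purchases", 0) >= 1),
        ("started_checkout", user.get("checkout_starts", 0) >= 1),
        ("added_to_cart", user.get("cart_adds", 0) >= 1),
        ("heavy_interaction", user.get("pageviews", 0) >= 5),
        ("some_interaction", user.get("pageviews", 0) >= 2),
        ("time_on_site", user.get("seconds_on_site", 0) >= 10),
    ]
    for name, qualifies in checks:
        if qualifies:
            return name
    return "bounce"

print(segment({"pageviews": 3, "seconds_on_site": 45}))  # some_interaction
```

Segments like these let you analyze a test result per funnel stage instead of only across all visitors.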

Voice:

  • Check out chat logs
  • Talk with the team
  • Look at social media feedback
  • Ask for feedback online
  • Interview customers
  • Recruit testers

Verified:

  • What do we know from scientific literature?

Validated:

  • Insights of previous tests and analyses

4. Hypothesis setting:
Properly formulating and writing down hypotheses helps align the team. A hypothesis includes:

  • The problem;
  • The solution proposed;
  • The expected result.

Based on a shared hypothesis, different teams can work together in alignment.

5. Prioritize your A/B tests:
Two well-known prioritization models are PIE (Potential x Importance x Ease) and ICE (Impact x Confidence x Effort).

For Tom, however, the A/B testing success formula is hypothesis x location x chance of impact.

To find out where to test, Tom proposes a framework called PIPE.

One of the great advantages of the model proposed by Tom is that it also takes into account practical aspects, such as the number of conversions needed, the duration of the test and the level of significance.

Execution pillar:

It is time to execute your A/B test.

Statistics for A/B Testing

According to Georgi Georgiev, author of “Statistical Methods in Online A/B Testing” and specialist in applied statistics, “data is a proxy for reality”.

However, making data-based decisions is no simple task. Data that are correlated in time or space often have no causal relationship to each other.

That’s exactly why A/B testing is so prominent in business decision-making.

In an A/B test, the only non-random difference between the control and the variant(s) is the intervention itself. Therefore, this type of test allows you to:

  • Establish causal links;
  • Estimate effects (including the uncertainty of the estimate);
  • Manage risk.

Statistical model:

A statistical model is a mathematical description and a set of assumptions that explain the chance regularity of the data.

By translating business questions into statistical hypotheses under a specified statistical model, we can calculate the probabilities of events. This lets us estimate the uncertainty of each claim and therefore manage the upper limit of business risk.

Scheme presented by Georgi.

Statistical Significance & Other Estimates

According to Georgi, a good discrepancy measure has a few preferred characteristics. They are:

  • It comes down to a single number;
  • It reflects the discrepancy of the data from a statistical model;
  • It facilitates comparison across different experiments (A/B tests).

To reach this measure, it is important to understand some concepts.

1. Standard deviation: a measure of the amount of dispersion in a given set of data.

2. Z-score: “measurement of the difference between the mean of a distribution and a given observation expressed in number of standard deviations”. It combines the estimated variance, the observed distance from the model, and the sample size into a single number.

3. P-value: “is the probability, under the specified statistical model for the null hypothesis, of observing a statistic as extreme or more extreme than the observed”.

Once you understand the above concepts, it is possible to understand what statistical significance is. Basically, “a test outcome is statistically significant if the resulting p-value meets an evidential threshold called significance threshold (𝛂). If 𝛂 = 0.05 and p = 0.04, then p < 𝛂 therefore the result is statistically significant at level 0.05. It is also significant for any 𝛂 > 0.04”.
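These three definitions can be tied together in a few lines of Python using only the standard library; the conversion counts below are invented example data, not from the course:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Z-score and two-sided p-value for the difference between two
    conversion rates, using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se                          # distance in standard deviations
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value, p_value < alpha            # significant at level alpha?

# Made-up data: 50/1000 conversions for control, 70/1000 for the variant
z, p, significant = two_proportion_test(50, 1000, 70, 1000)
print(round(z, 2), round(p, 3), significant)
```

With these numbers the two-sided p-value lands just above 0.05, so the result is not significant at 𝛂 = 0.05 despite the variant's higher observed rate — exactly the p < 𝛂 comparison from the quote above.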

Observing a low p-value can mean 3 different things:

  • The null hypothesis is not true;
  • The null hypothesis is true, but a rare outcome occurred;
  • The statistical model is inadequate.
