A/B Testing Foundations, How to Run Tests, and Testing Strategies — Review

This is part 9 of 12 in my series reviewing the Conversion Optimization Minidegree provided by CXL Institute.

Fernanda Leal
9 min read · Oct 29, 2021

Over the past week, I had the opportunity to delve into three other courses that make up the Conversion Optimization Minidegree: “A/B Testing Foundations”, “How to Run Tests”, and “Testing Strategies”.

Here is a brief compilation of my main lessons!

A/B Testing Foundations

According to Peep Laja, many people who run A/B tests achieve merely imaginary victories: in reality, their results are not statistically significant.

Therefore, the aim of the “A/B Testing Foundations” course is to clarify important concepts and reveal how to conduct statistically valid experiments.

The basics:

Initially, Peep warns us that A/B testing and optimization are not the same thing. While optimization is “compound interest for growth”, A/B testing is a way to validate business impact through measurement and learning.

It is entirely possible to do conversion optimization without testing, but the opposite of testing is “intuition”, which can be a problem for an organization.

Another important point about A/B testing is that it doesn’t tell you the reason why one option worked better than another.

What to test:

Another point addressed by Peep is what to test. All A/B testing involves a cost, so it is essential to know how to prioritize what really deserves to be tested.

“If you test useless things, you will get useless results”

To prioritize what should be tested, Peep recommends that you look at A/B testing as a way to solve problems and ask yourself questions:

  • Where are the problems: use digital analytics tools to analyze your funnel and understand which pages on your site can have the greatest impact. An A/B test run on a low-traffic “About us” page, for example, is unlikely to yield large incremental gains.
  • What is the problem and what is the root cause: once you’ve identified where the problems are, start investigating what the problems are and what causes them. For this, use qualitative research, surveys, and interviews.
  • What is a good hypothesis to solve the problem: after investigating the problem and its cause, use the team’s collective knowledge to come up with hypotheses and possible solutions.
  • What should we test first: a problem can be solved in very different ways, so it is important to prioritize the proposed solutions according to their potential impact and the effort/cost involved.

Test prioritization:

The amount of testing you can conduct will always be limited by the amount of traffic to your site. Therefore, it is even more relevant to know how to prioritize your hypotheses and understand what really deserves to be tested.

“Even if you know what the problem is, you don’t know what the solution is”.

There are several frameworks that can help you decide which hypotheses should be tested first:

  1. Potential, Importance and Ease (PIE) Framework: this framework considers the probability of test success (P), the impact of its potential result (I), and the ease of implementation (E). Each criterion receives a score from 0 to 10, and hypotheses are prioritized by the average of their scores (a small scoring sketch follows this list). The problem with this model is that the scores assigned to each criterion are highly subjective.
  2. Impact, Cost and Effort (ICE) Framework: very similar to the framework above. The main difference is the scale: scores range from 0 to 4 points.
  3. PXL Framework: developed by Peep Laja, this framework aims to bring more objectivity to the models above. Scores are assigned based on answers to objective yes/no questions.
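To make the comparison concrete, here is a minimal sketch of PIE-style scoring in Python. It is not from the course; the hypotheses and the scores are invented purely for illustration.

```python
# Hypothetical PIE-style prioritization: each hypothesis gets Potential,
# Importance and Ease scores from 0-10 and is ranked by their average.
hypotheses = {
    "Simplify checkout form": {"potential": 8, "importance": 9, "ease": 5},
    "Rewrite homepage headline": {"potential": 6, "importance": 7, "ease": 9},
    "Add trust badges to cart": {"potential": 5, "importance": 6, "ease": 8},
}

def pie_score(scores: dict) -> float:
    """Average of the three PIE criteria."""
    return (scores["potential"] + scores["importance"] + scores["ease"]) / 3

ranked = sorted(hypotheses.items(), key=lambda item: pie_score(item[1]), reverse=True)
for name, scores in ranked:
    print(f"{pie_score(scores):.1f}  {name}")
```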

A/B testing statistics:

Not every A/B test is conclusive. To ensure that your results really have statistical validity, you need to take some precautions.

1. Do we have enough sample size to run this test?
According to Peep Laja, the first thing you need to know is whether the page receives enough traffic and conversions to validate your test. For this, it is necessary to make a mathematical calculation.

The CXL Institute provides an AB+ Test Calculator.

It is important to note that the fewer weeks a test runs, the greater the minimum detectable effect. In one example from the calculator, the test could only be considered conclusive if one of the versions performed 65.76% better.

Important note: Peep recommends that you run your tests for at least two full weeks (and, as noted below, no more than four).
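As a rough illustration of the kind of calculation behind such a calculator (not the calculator’s actual method), here is a sketch using standard two-proportion power analysis with statsmodels; the baseline conversion rate and target lift are invented for the example.

```python
# Sketch: estimate the sample size per variation needed to detect a given
# lift, using standard two-proportion power analysis (statsmodels).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.03            # assumed current conversion rate (3%)
minimum_detectable_lift = 0.20  # we want to detect at least a 20% relative lift
target_rate = baseline_rate * (1 + minimum_detectable_lift)

effect_size = proportion_effectsize(target_rate, baseline_rate)
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 95% significance
    power=0.80,   # 80% statistical power
    ratio=1.0,    # equal traffic split between A and B
)
print(f"Visitors needed per variation: {n_per_variation:,.0f}")
```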

2. When is the test done?
The second question you need to answer is when can you end the test and consider it conclusive.

While some sites report information such as “after 100 conversions per variation” or “when you reach 95% statistical significance,” there is no magic number that determines whether the test was conclusive.

In general, these are the rules for stopping a test:

  • Adequate sample size;
  • 2–4 week test duration;
  • 95% (or more) statistical significance;
  • 80% (or more) statistical power.

Pro-tip: Peep recommends that tests last no longer than four weeks to avoid polluting the sample. The longer a test runs, the greater the chance of user overlap, since visitors may return on a different device or clear their browser cookies after a while.
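To illustrate the significance rule from the list above, here is a minimal sketch of a two-proportion z-test; the visitor and conversion counts are invented, and real testing tools may use different statistical methods.

```python
# Sketch: checking whether a finished test clears the 95% significance bar,
# using a standard two-proportion z-test. The counts below are invented.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 370]     # conversions for variations A and B
visitors = [10_000, 10_000]  # visitors exposed to A and B

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 95% level.")
else:
    print("Inconclusive: keep the sample-size and duration rules in mind.")
```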

Bandit test:

In an A/B test, the samples for version A and version B must be equal in size. Differences in sample size can even invalidate the test.

Bandit testing (or multi-armed bandits) is a testing methodology that uses algorithms to optimise for your conversion goal during the test.

In other words: while the goal of an A/B test is to validate a hypothesis, the bandit test aims to maximize gains, making it an excellent option for specific moments (a Black Friday campaign, for example).
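To make the “optimize during the test” idea concrete, here is a minimal Thompson-sampling sketch, one common multi-armed-bandit algorithm (not necessarily the one any given tool uses); the conversion rates are simulated.

```python
# Sketch: Thompson sampling, a common multi-armed-bandit algorithm.
# Traffic gradually shifts toward the better-performing variation
# instead of staying split 50/50. Conversion rates are simulated.
import random

true_rates = {"A": 0.030, "B": 0.036}       # unknown in practice; simulated here
successes = {arm: 1 for arm in true_rates}  # Beta(1, 1) prior
failures = {arm: 1 for arm in true_rates}

for _ in range(20_000):  # each loop iteration is one visitor
    # Sample a plausible conversion rate for each arm and show the best one.
    draws = {arm: random.betavariate(successes[arm], failures[arm]) for arm in true_rates}
    arm = max(draws, key=draws.get)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

for arm in true_rates:
    shown = successes[arm] + failures[arm] - 2  # subtract the prior pseudo-counts
    print(f"Variation {arm}: shown to {shown} visitors")
```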

What to measure in a test:

Technically, an A/B test allows you to measure the gains/losses of any metric. Despite this, Peep recommends that you don’t waste time testing micro-conversions and focus on purchases, cart additions, leads, and other aspects that potentially impact revenue.

Another recommendation given by Peep is that you avoid doing simultaneous tests — even in different parts of the site.

Testing strategies:

1. What kind of test to run first?
In the last module of the course, Peep comments on his strategies. According to the instructor, the ideal is to start by testing the “low-hanging fruit”: problems whose solutions are most obvious.

An important point is that innovative testing involves big risks and can result in big losses.

2. One change per test vs. many changes
Many people claim that you should only test one change at a time. For Peep, however, this is nonsense. Since most sites have traffic limitations, making only small changes can generate incremental gains that are difficult to measure. Test one hypothesis at a time, but don’t limit the number of changes.

3. Innovative testing vs. iterative testing:
Most of the time, you will run iterative tests. Bet on groundbreaking tests when you feel you’ve reached a local maximum (i.e., there are no more improvement points within your current structure).

4. What can affect the test outcome:
Peep ends the course by citing some problems that often sabotage tests:

  • History effect: when you disregard the context and don’t realize, for example, that your competitor ran a big campaign that hurt your conversion.
  • Instrumentation effect: when one or more of the versions created have usability errors.
  • Selection effect: when the sample used for the test is not the same sample that would access the website under normal conditions. Example: a test run only on cold traffic bought through Facebook ads.

Testing strategies

Once you know how to run tests, the next step is to know when to employ which A/B testing strategy. During this course, Peep Laja discusses the different types of tests and explains when to use each of them.

What to test?

Start with obvious solutions to obvious problems and then move on to testing creative solutions.

If everything you proposed fails, you may have reached a local maximum: the best possible result within the existing layout structure. In that case, it’s time to rethink the whole structure (which can be quite risky).

  • Step 1: Identify problems with Conversion Research;
  • Step 2: Determine urgent user problems;
  • Step 3: Test possible solutions.

How Many Changes Per Test?

Unless you have a ton of traffic, don’t limit yourself to changing one thing at a time. If you don’t have a lot of traffic, changing just one thing on the site may not be enough to change user behavior.

Changing more things makes it more likely that the user’s behavior will change. The problem is that you have less clarity about learning, as you cannot understand which of the changes brought about the improvement.

Strategies to change a lot at once:

  • All changes made are directly related to a single issue;
  • All changes made are directly related to a single hypothesis.

A/B testing vs MVT:

A/B tests are best recommended for testing drastic changes. Multivariate tests are best suited for testing the interaction between different elements.
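A quick sketch of why multivariate tests demand more traffic: every combination of element variants becomes its own cell that needs enough visitors. The elements and variants below are invented for illustration.

```python
# Sketch: a full-factorial multivariate test multiplies the variants of each
# element into many combinations, each of which needs sufficient traffic.
from itertools import product

elements = {
    "headline": ["control", "benefit-led"],
    "hero_image": ["control", "product shot", "lifestyle shot"],
    "cta_copy": ["Buy now", "Start free trial"],
}

combinations = list(product(*elements.values()))
print(f"{len(combinations)} combinations to test")  # 2 * 3 * 2 = 12
for combo in combinations[:3]:
    print(dict(zip(elements, combo)))
```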

Bandits:

Traffic is not split evenly: it is dynamically split according to version performance. The idea is that you maximize how much you earn per minute while running the experiments. They are ideal for short and/or seasonal campaigns, such as Black Friday.

Tests of this type are also good for situations with little learning potential where you don’t need as much structure.

Existence testing:

Often, the pages of a website are full of different content, and it is not clear which parts increase the conversion rate and which parts hurt it.

Existence testing is a simple way to resolve this issue. It consists of removing parts of the original content and running tests to see what happens.

Iterative Testing & Learning From Results:

Iterative tests allow you to assign results to very specific changes, such as a phrase, a button, and so on.

This test is faster, easier to implement, and cheaper. It’s a great way to start a testing culture in the company.

If an A/B test fails, use iterative tests to explore the same hypothesis in other ways.

When you know that more than half of your tests are likely not to produce a lift, you will have new-found appreciation for learning. Always test a specific hypothesis! That way you never fully fail. With experience, you begin to realize that you sometimes learn even more from tests that did not perform as expected.

What happens if test results are inconclusive?

If your test is inconclusive, go to Google Analytics and check how the test performed for different segments (new users, returning users, different devices, different traffic sources, etc). It is very likely that you will have victories in one of these segments.
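As a small illustration of that kind of segment breakdown, here is a pandas sketch; the column names and numbers are invented.

```python
# Sketch: breaking an inconclusive test down by segment with pandas.
# The data frame and column names are invented for illustration.
import pandas as pd

results = pd.DataFrame({
    "variation": ["A", "A", "B", "B", "A", "B"],
    "device":    ["mobile", "desktop", "mobile", "desktop", "mobile", "desktop"],
    "visitors":  [4200, 3100, 4150, 3050, 1800, 1900],
    "conversions": [120, 130, 118, 160, 50, 95],
})

by_segment = results.groupby(["device", "variation"]).sum(numeric_only=True)
by_segment["conversion_rate"] = by_segment["conversions"] / by_segment["visitors"]
print(by_segment)
```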

Use the data collected to improve your hypotheses and define new tests. Also, log the flaws in your customer theory and see what insights you can extract from them.

Innovative Testing:

It’s not about radically changing the entire site, but rather innovating a part of your site and running an experiment to see if the idea really has merit.

Most of the time, when you make bigger changes you are more likely to get changes in user behavior.

Tests of this type are also a possibility for sites with low traffic, as minor changes tend to cause minor behavioral changes.

Split Path Testing:

It’s the kind of testing that takes the user on different journeys. Example: one-step checkout vs multi-step checkout.

It helps us understand what kind of customer journey leads us to better results.
