The Ultimate Beginner’s Guide to A/B Testing

Or, what you should know to get started

Is this the ultimate guide in general? Or is it meant for ultimate beginners? Maybe both, maybe neither! (Probably more the latter than the former.) But that's beside the point.

What this is, however, is a solid list of the basics of A/B testing, kept really simple, meant as a starting point for anyone with little knowledge of the field or anyone who wants to refresh their memory. It is certainly not a guide for expert testers or seasoned statisticians.

This guide should capture everything you need to whet your appetite and get started with A/B testing, plus a few important aspects that all too often get missed in lurid articles. So without further ado, let's get going!

A/B tests are first and foremost a method or a tool to learn for future developments, and should always be seen as such!

What does A/B testing mean?

  • Testing two variations (or versions) of a digital product against each other
  • To determine which of the variations has a positive or negative effect on the relevant performance indicators
  • To decide which of the variations to continue using
  • To learn which of the changes or differences caused the effect
  • To conclude what steps to take for the next iteration of your product

Common Examples for A/B Testing

  • Email/newsletter subject lines: measuring open rates
  • Email/newsletter content: measuring leads to a website/landing page
  • Landing pages: measuring which of the variants results in higher conversion, generates more leads, etc…
  • CTAs: Which wording, what color, what position etc. leads to more clicks

What is Multivariate Testing?

  • Multivariate testing describes simultaneously testing more than two variations against each other
  • Generally, these variations are formed by combining the different states of multiple variables within the product
  • A variable, in our case, is a specific attribute or feature that is changed relative to the default variant (more on that later)

Let me explain that with a simple example:

You have a landing page that you use to tell people about your product. Ultimately, whatever else your product is meant to do, you want visitors to click a CTA on the page.

  • You suspect that the color and the label of the CTA influence how many people will click it, thus you want to test the effect of both.
  • You don't have a hypothesis about how a certain color and label would work in combination (otherwise a simple A/B test might be the better option)
  • Color and Label are the two variables in our example, each having two possible values or attributes (whatever you want to call them)
     → Color can be red or green
     → Label can be “Read more” or “Try for free”
  • The individual attributes of our variables allow for four unique combinations (red and “Read more”, red and “Try for free”, green and “Read more”, green and “Try for free”)
  • Each of the combinations is one test variation
    → In this case we would have an “A/B/C/D Test”
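If you prefer code to prose, here is a minimal Python sketch that enumerates these four variations (the variable names and values are simply the ones from the example above):

    from itertools import product

    # The two variables from the example, each with two possible attributes.
    variables = {
        "color": ["red", "green"],
        "label": ["Read more", "Try for free"],
    }

    # Every unique combination of attributes becomes one test variation.
    variations = [
        dict(zip(variables, combination))
        for combination in product(*variables.values())
    ]

    for name, variation in zip("ABCD", variations):
        print(f"Variant {name}: {variation}")
    # Variant A: {'color': 'red', 'label': 'Read more'}
    # Variant B: {'color': 'red', 'label': 'Try for free'}
    # Variant C: {'color': 'green', 'label': 'Read more'}
    # Variant D: {'color': 'green', 'label': 'Try for free'}

Note how quickly this grows: with three variables of two attributes each you would already be at eight variations, which is exactly why the warning about visitor numbers below matters.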

Now back to theory:

  • By measuring the results of our variations we now see which combination worked better than the others.
  • Based on the measured results we can also calculate the individual effect of each variable's attributes, to say more precisely which change led to which effect. Attention: Advanced knowledge of statistics is necessary for this (or at the very least highly beneficial).
  • Keep in mind: Though multivariate tests are enticing, they're difficult to conduct effectively. The higher number of test variants greatly increases the number of visitors you need and, subsequently, the time it takes to get reliable results. Whenever possible, conduct multiple A/B tests in iterations, continually improving your product. This approach will most likely lead to better results faster.

Crucial Basics of A/B Testing

These are the things you should have at least heard about and have a rough understanding of before you consider conducting your own A/B tests.

Hypothesis

The hypothesis is the very basis of every test and needs to be set before you can continue planning your test. It describes which change from your base variant you expect to lead to which effect, and why. It captures your intention and motivation for conducting the test. Your reasoning explains why you expect your test to be successful (if you don't expect it to be successful, don't do it!). This can be based on research, data, your knowledge about your customers, best practices, or whatever else... But it has to be based on something you can argue logically! Your hypothesis is the basis for learning from your test, and thus the basis for continuing to work with your results.

Testing Variants

  • default: The base variant you want to test your changes against (in case you have an existing variant).
  • (optional) default_s: Your control variant, and thus exactly the same as the default variant, used to more reliably exclude coincidental effects. The result of default_s should be about the same as that of default; you should not see an effect here.
  • (alternative) var_0: Alternative naming for default, in case all of your variations are new.
  • var_a: The new variant you want to test.

Key Data and Statistics

In order to run an effective test, you need to know how your product performs right now, how you expect the test to improve it, and if you can reach reliable results. Here is what you need to get started:

  • Metrics: Determine and define the metrics that will be affected in the test variants, based on your hypothesis. Pick one that lets you determine whether your test was successful.
    Example: You want to measure the rate of visitors clicking your CTA.
  • Base value: The current value of your metric, measured in your default variant.
    Example: Right now, 10% of your visitors click your CTA.
  • Effect: How much you expect (estimate) your metric to change from your base value. This change is defined as either an absolute or relative percentage.
    Example: You expect a relative uplift of 20%, pushing your click rate from 10% to 12%. In contrast, an absolute effect of 20% would mean you expect the click rate in your new variant to reach 30%.
  • Statistical significance (1 − α): The probability that the effect you measured actually exists, meaning it didn't appear by chance and can reliably be reproduced (see: Type I error). Based on scientific consensus, a significance level of 95% is a good choice in most cases.
  • Statistical power (1 − β): The probability that you will detect an effect that actually exists, i.e. that you will not miss it (see: Type II error). Based on scientific consensus, a statistical power of 80% is a good choice in most cases.
  • Sample size: The number of visitors you will measure per variant

All of these values are necessary to conduct effective A/B tests, and they are mathematically interdependent. To determine whether a test will produce reliable results, and thus makes sense to conduct at all, you can use an A/B testing calculator. A good one I can recommend is part of Evan Miller's Awesome A/B Tools.

Whatever tool you use, make sure it includes significance and power. Some simpler calculators, often provided by paid A/B testing services, hide these values to reduce the number of visitors that appear necessary. This encourages you to run more tests, even though they might produce unreliable results.
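If you are curious what such a calculator computes, here is a rough Python sketch of the standard sample size approximation for comparing two proportions. This is a simplified textbook formula, not the exact method of any particular tool, so treat the result as a ballpark figure:

    from math import ceil, sqrt
    from scipy.stats import norm

    def sample_size_per_variant(base_rate, expected_rate, alpha=0.05, power=0.80):
        """Approximate visitors needed per variant for a two-sided test of two proportions."""
        z_alpha = norm.ppf(1 - alpha / 2)        # 1.96 for 95% significance
        z_beta = norm.ppf(power)                 # 0.84 for 80% power
        p_bar = (base_rate + expected_rate) / 2  # pooled rate under "no difference"
        numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                     + z_beta * sqrt(base_rate * (1 - base_rate)
                                     + expected_rate * (1 - expected_rate))) ** 2
        return ceil(numerator / (base_rate - expected_rate) ** 2)

    # The running example: 10% base click rate, expected relative uplift of 20% (10% -> 12%)
    print(sample_size_per_variant(0.10, 0.12))  # roughly 3,800-3,900 visitors per variant

Note that the result applies per variant, which is also why every additional variation in a multivariate test makes the whole experiment considerably more expensive.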

Type I & Type II Error:

Errors occur whenever the measured result differs from the actual truth. Type I and Type II errors describe the two ways you can be wrong: a Type I error (false positive) means you measure an effect that doesn't actually exist, while a Type II error (false negative) means you fail to measure an effect that does exist. The concept is a bit tricky to grasp, so I tried to explain it with an illustration (idea credit goes to Effect Size FAQs).

Type I & II Errors
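If the illustration alone doesn't do it for you, the two error types can also be made tangible with a small simulation. The sketch below reuses the 10% vs. 12% click rates and the roughly 3,900 visitors per variant from the running example; the exact percentages will differ slightly from run to run:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(42)

    def p_value(clicks_a, clicks_b, n):
        """Two-sided z-test comparing two click rates measured on n visitors each."""
        p_pool = (clicks_a + clicks_b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (clicks_b - clicks_a) / (n * se)
        return 2 * (1 - norm.cdf(abs(z)))

    n, runs, alpha = 3_900, 2_000, 0.05

    # Type I error: both variants truly convert at 10%, yet some tests look "significant".
    false_positives = [
        p_value(rng.binomial(n, 0.10), rng.binomial(n, 0.10), n) < alpha for _ in range(runs)
    ]
    print(f"Type I errors (effect found although none exists): {np.mean(false_positives):.1%}")  # ~5%

    # Type II error: a real uplift from 10% to 12% exists, yet some tests fail to detect it.
    missed_effects = [
        p_value(rng.binomial(n, 0.10), rng.binomial(n, 0.12), n) >= alpha for _ in range(runs)
    ]
    print(f"Type II errors (real effect missed): {np.mean(missed_effects):.1%}")  # ~20%

The two percentages are exactly what α and β stand for: the chosen significance level caps the Type I rate at around 5%, and the chosen power keeps the Type II rate at around 20%.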

Timing and External Influences

Uncontrollable external factors, such as the weather, changing trends, or cultural, ethnographic and social settings, can influence your test and, in the worst case, invalidate your results. To minimize interference from these factors, you should consider the following:

  • Randomize the assignment of visitors to your testing variants in order to get comparable populations for all your variations (see the sketch after this list).
    → Negative example: The population on the default variant mostly consists of young women, while your variant’s population is made up of mostly older men. The results of your test will be inconclusive.
  • Test both of your variants in parallel rather than one after the other. That way, external factors occurring over time will affect both variants equally.
     → Example: Testing variant A of a page selling sunglasses in summer versus the default variant measured in winter will lead to inconclusive results.
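In practice the randomization is usually handled by your testing tool, but a common way to implement it yourself is deterministic hash-based bucketing. The sketch below is a minimal illustration; the function and experiment names are made up:

    import hashlib

    def assign_variant(visitor_id: str, experiment: str, variants=("default", "var_a")):
        """Assign a visitor to a variant: random across visitors, but stable per visitor."""
        digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
        bucket = int(digest, 16) % len(variants)
        return variants[bucket]

    print(assign_variant("visitor-1234", "cta-color-test"))  # the same visitor always gets the same variant

Hashing on the visitor id spreads traffic roughly evenly and independently of who the visitor is, and including the experiment name ensures the same person can land in different buckets across different tests.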

Why Should You Use A/B Tests?

A/B tests can be used to check whether your logical conclusions and assumptions actually hold. That makes them a powerful tool for building successful products. A/B tests have some clear advantages over qualitative testing (which in turn has its own advantages, but that's another article):

  • They examine a more natural setting than traditional user tests and steer clear of any influence created by the test setup.
  • They measure actual effects that include the inner motivation, interest and other personal circumstances of users and their lives.

Common Pitfalls for Running A/B Tests

There are a few common mistakes people tend to make when running tests. These often lead to unusable results and, subsequently, to no real progress in a product's development, which makes testing a waste of time. Avoid these mistakes and reach reliable results faster!

  • Missing knowledge about test setup and statistical basics leads to wrong assumptions, results and conclusions.
     → Acquire at least basic knowledge to make the right conclusions.
  • Ending tests too early without reaching significance makes results unusable and simply a matter of guessing.
     → Let your test run until it reaches significance (see the sketch after this list).
  • Testing only marginal changes often produces no measurable results, as the effects are too small to detect, and thus hinders bigger progress.
     → If possible, go for the biggest improvement of your product you can confidently argue.
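To check whether your finished test actually reached significance, you can run a standard test of independence on the observed counts. Here is a small sketch using made-up final numbers that match the running example:

    from scipy.stats import chi2_contingency

    #                clicked, did not click
    observed = [[390, 3510],   # default: 390 of 3,900 visitors clicked (10%)
                [468, 3432]]   # var_a:   468 of 3,900 visitors clicked (12%)

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"p-value: {p:.4f}")  # significant at the 95% level if p < 0.05

If the p-value stays above your significance threshold, the honest conclusion is "no reliable effect measured yet", not "the new variant won by a little".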

A/B Testing Tools

There is a multitude of tools that let you set up A/B tests. I will not attempt to list them all; you can easily find them by searching the web. Tools generally fall into two categories:

  • Analytics tools, such as Google Analytics
    These tools are often more powerful and provide richer, more in-depth analytics to get all the data you could wish for. However, they require more technical understanding for setup and more statistical knowledge for interpreting test results.
  • Specialised A/B Testing Tools, such as Optimizely
    Specialised tools are easier to use and accessible to almost anyone. They offer a simple drag-and-drop interface for creating variants and take care of all the technical details. They also try to present results in a format that is understandable for everyone. As mentioned above, they tend to exaggerate results to encourage more testing (it's their business after all), so use them with caution and make sure you understand what they present to you.

If you like what I write and want to make me smile, simply clap, comment and/or follow. If you don't, do it anyways, life is short, do something nice! Either way, thanks a ton for reading! — Peace
