A/B Test Using Machine Learning

Abel Mitiku
May 21, 2022

What is A/B Testing?

Split testing, commonly known as A/B testing, is a randomized experimentation procedure in which two or more versions of a variable (web page, page element, etc.) are shown to different segments of website users at the same time to determine which version has the greatest impact on business metrics.

A/B testing, in essence, removes much of the guesswork from website optimization and allows experienced optimizers to make data-driven decisions. In A/B testing, A stands for the ‘control’, the original version of the variable being tested, while B denotes the ‘variation’ (or ‘exposed’) version, a new take on that variable.

The winner is the version that improves your business metrics. Using the changes from this winning variant on your tested pages/elements can help you optimize your website and boost your business ROI.

Classical A/B testing

Metric of choice

Suppose we’d like to compare the results of groups A and B, with group A receiving a dummy ad and group B receiving a creative ad designed by the SmartAd brand.

Data

The BIO data for this project consists of online users’ “Yes” and “No” responses to the following question:

Q: Do you know the brand Lux?______

O Yes or O No

This is a test run, and the main objective is to validate the hypothesis-testing procedure we built. SmartAd ran this campaign from 3–10 July 2020. The users who were presented with the questionnaire above were chosen according to the following rule:

Control: users who have been shown a dummy ad.

Exposed: users who have been shown a creative (ad) that was designed by SmartAd for the client.

We want to look at engagement performance after users see these ads. As a result, engagement is our A/B test metric of choice.

[Image: engagement formula]
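The formula itself is shown as an image; a plausible reading (our assumption, since the exact definition isn’t reproduced here) is that a group’s engagement is the share of its questionnaire responses that were “Yes”:

```
\mathrm{Engagement}_g = \frac{\#\mathrm{Yes}_g}{\#\mathrm{Yes}_g + \#\mathrm{No}_g},
\qquad g \in \{\mathrm{control},\ \mathrm{exposed}\}
```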

Stating the hypothesis

We want to see if there is a performance difference between the two groups. We’re especially interested in seeing if there’s a statistically significant difference in their Engagement results.

Hypothesis testing in statistics is a way to check whether the results of a survey or experiment are meaningful. You are essentially testing whether your results are valid by figuring out the odds that they happened by chance; if they plausibly did, the experiment won’t be repeatable and so has little use. Here is how we state the hypotheses:

Hypothesis 0 (null hypothesis, H0): there is no statistically significant difference in engagement between the control and exposed (variation) groups.

Hypothesis 1 (alternative hypothesis, H1): the engagement results of the control and exposed groups differ, and the difference is statistically significant.

Level of Significance, 𝛼=0.05
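In symbols (the notation below is ours, with p denoting a group’s engagement rate):

```
H_0 : p_{\mathrm{control}} = p_{\mathrm{exposed}}, \qquad
H_1 : p_{\mathrm{control}} \neq p_{\mathrm{exposed}}, \qquad
\alpha = 0.05
```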

First, we aggregate the successes (“Yes” responses) for the control and exposed groups.

[Image: aggregating successes (Yes) for the control and exposed groups]

After that, we calculate the engagement result for each group.

[Image: control and exposed engagement results]

Finally, we generated t- and p-values from the ad campaign data and compared the p-value to the level of significance, alpha.
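The calculations above were shown as notebook screenshots; here is a minimal sketch of the same steps, assuming one row per user with an ‘experiment’ label and 0/1 ‘yes’/‘no’ indicator columns, and using a two-sample t-test from scipy (the file name is hypothetical and the exact test used in the original notebook may differ):

```python
import pandas as pd
from scipy import stats

# hypothetical file name; the frame is assumed to have 'experiment' in
# {'control', 'exposed'} plus 0/1 indicator columns 'yes' and 'no'
df = pd.read_csv("AdSmartABdata.csv")

# keep only users who actually answered the questionnaire
responded = df[(df["yes"] == 1) | (df["no"] == 1)]

control = responded.loc[responded["experiment"] == "control", "yes"]
exposed = responded.loc[responded["experiment"] == "exposed", "yes"]

# engagement per group = share of respondents who answered "Yes"
print("control engagement:", control.mean())
print("exposed engagement:", exposed.mean())

# t- and p-values for the difference in engagement between the groups
t_value, p_value = stats.ttest_ind(control, exposed, equal_var=False)
print(f"t = {t_value:.4f}, p = {p_value:.4f}")
```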

Since the p-value (0.5185) is greater than alpha (0.05), we fail to reject the null hypothesis H0.

So we came to the conclusion that there is no statistically significant difference between the two campaigns.

Limitations of traditional A/B test

A/B testing can be useful when you want to test two specific variants against each other. On the other hand, traditional A/B testing will only get you so far; classic A/B testing has some disadvantages that people are often willing to overlook.

Time and resources: running an A/B test can take longer than other testing methods and can be a drain on both time and resources.

Fluctuating winner: another limitation of A/B testing is that it requires the tester to extrapolate the results indefinitely into the future. Traditional A/B tests assume an unchanging world and don’t take into account changes in trends and consumer behavior, or the impact of seasonal events, for example.

Sequential testing

In sequential A/B tests, the final sample size at which the test is stopped depends on the data we observe during the test, so if we observe more extreme results at the start, the test can be ended earlier. To achieve this, a pair of statistical boundaries is drawn, e.g. based on the Type I error rate we would like to maintain in our test. For every new data point we get, the cumulative log-likelihood ratio of our data is compared to these boundaries.

[Image: plot of the upper and lower decision boundaries]

We must continue the test to determine whether it converges to the blue region (the accept-null area).
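A minimal sketch of this idea, an SPRT-style sequential test on a stream of yes/no engagement data (the boundary formulas are the standard Wald approximations; the error rates, the engagement rates under H0/H1, and the simulated data are assumptions for illustration):

```python
import numpy as np

alpha, beta = 0.05, 0.20        # target Type I and Type II error rates
p0, p1 = 0.45, 0.50             # engagement rate under H0 vs. H1 (assumed)

# Wald's approximate decision boundaries for the cumulative log-likelihood ratio
upper = np.log((1 - beta) / alpha)   # crossing above -> reject H0
lower = np.log(beta / (1 - alpha))   # crossing below -> accept H0

rng = np.random.default_rng(42)
observations = rng.binomial(1, 0.47, size=5000)  # simulated yes/no stream

llr = 0.0
for n, x in enumerate(observations, start=1):
    # log-likelihood ratio contribution of one Bernoulli observation
    llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
    if llr >= upper:
        print(f"Stop at n={n}: reject H0 (the exposed ad performs better)")
        break
    if llr <= lower:
        print(f"Stop at n={n}: accept H0 (no meaningful difference)")
        break
else:
    print("Boundaries never crossed: keep the test running")
```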

Pros

  • Gives a chance to finish experiments earlier without increasing the possibility of false results
  • Optimize necessary observation (sample size)
  • Reduce the likelihood of error

Cons

  • If we are concerned with preserving the Type I error rate, we need to recognize that we are effectively doing multiple comparisons
  • Every interim look at the data is another, non-independent opportunity to produce a Type I error; after three analyses we have had three such chances
  • For a fixed sample size and significance level, sequential testing ends up reducing power compared to waiting until all the data comes in.

Machine Learning (The alternative)

Getting to the winner quickly is key to maximizing ROI, and continuing to check that the winner still performs better than the other options ensures that performance doesn’t drop off over time.

Machine learning is when a system learns the relationship between input and output data without direct human control. The process is used to create algorithms and models that make sense of data without the model being directly written by a human programmer. These algorithms can be used to make decisions, classify data, or perform complex tasks.

A/B testing is the split testing of an altered variable, with success usually measured by user engagement or conversions. The overall aim of A/B testing and machine learning is therefore very different. A major difference in approach is that machine learning models are usually developed in an offline environment before being deployed to live, dynamic data. In comparison, A/B testing is performed on live or online data.

A/B testing and machine learning can be combined to:

  • Test and refine the deployment of new machine learning models.
  • Automate A/B testing to make the process more efficient and effective.
  • Discover useful information about datasets and variables when developing or aligning algorithms.

[Figures: model accuracy scores; feature importance for logistic regression, XGBoost, and decision tree; best models ranked by accuracy score]
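The model results above were shown as images. As a rough sketch of the kind of comparison involved, assuming a pre-processed feature table (the file name, column names, and the choice of scikit-learn/XGBoost here are ours, not from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# hypothetical pre-processed frame: numeric features plus a binary target 'yes'
df = pd.read_csv("AdSmartABdata_processed.csv")
X, y = df.drop(columns=["yes"]), df["yes"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "xgboost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
}

# accuracy score for each candidate model
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")

# feature importance: coefficients for the linear model, tree-based otherwise
print(pd.Series(models["logistic regression"].coef_[0], index=X.columns))
print(pd.Series(models["decision tree"].feature_importances_, index=X.columns))
print(pd.Series(models["xgboost"].feature_importances_, index=X.columns))
```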

Classical A/B test vs Machine learning

With classical A/B testing, we determined whether there was a significant lift in brand awareness, which is instrumental for SmartAd in making its next move.

With machine learning, we discovered that other features, such as the hour of the day and the date, also drive conversion in brand awareness.

This points to a greater potential for a significant lift in brand awareness.
