A/B Testing with Machine Learning

Amdework Asefa
8 min read · Sep 4, 2022


In my previous post I showed you how to implement Twitter data analysis with machine learning models. In this blog, I will show you how to implement A/B testing with machine learning models (including MLOps and MLflow).

A/B testing can be used to evaluate user experience and behavior. Source.

In this article, I’ll show you the role of machine learning in A/B testing. It covers the following contents.

Table of Contents

1. Basics of A/B testing and its use cases
2. Limitations and challenges of classical A/B testing
3. Advantages and disadvantages of sequential A/B testing
4. A/B testing formulation in Machine Learning context
5. Data review and ML A/B testing result
6. The advantage of using MLFlow and DVC in ML experimentation

1. A/B testing and its use cases

What is A/B Testing

A/B testing (also known as bucket testing or split-run testing) is a user experience research methodology. An A/B test consists of a randomized experiment with two variants, A and B. It applies statistical hypothesis testing, or “two-sample hypothesis testing,” as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B and determining which of the two variants is more effective.

Essentially, A/B testing takes the guesswork out of optimization and enables experienced optimizers to make data-backed decisions. In A/B testing, A refers to the ‘control’, or the original testing variable, whereas B refers to the ‘variation’, or a new version of the original testing variable.

A/B Testing is a tried-and-true method commonly performed using a traditional statistical inference approach grounded in a hypothesis test (e.g. t-test, z-score, chi-squared test).

• The two variants are run in parallel:

  • Control Group (Group A) — This group experiences no change from the current setup.
  • Treatment/variation Group (Group B) — This group is exposed to the new web page, popup form, etc.

• The goal of the A/B test is then to compare the conversion rates of the two groups using statistical inference.

  • A successful A/B Testing strategy can lead to massive gains — more satisfied users, more engagement, and more sales — Win.

Types of A/B Testing

There are several types of A/B tests. You should choose the one that best fits your particular situation.

  • Classic A/B test. The classic A/B test presents users with two variations of your pages at the same URL. That way, you can compare two or several variations of the same element.
  • Split tests or redirect tests. The split test redirects your traffic towards one or several distinct URLs. If you are hosting new pages on your server, this could be an effective approach.
  • Multivariate or MVT test. Lastly, multivariate testing measures the impact of multiple changes on the same web page. For example, you can modify your banner, the color of your text, your presentation, and more.

In general, all of these types of A/B testing are traditional, and they share the common structure described below.

The A/B testing process (Use case):

Let’s assume we want to know whether the SmartAd creative has an impact on the conversion rate.

1. Form a hypothesis that the ad will improve user engagement.
2. Create a variation of the ad.
3. Divide incoming traffic equally between the original ad and the variation.
4. Run the test as long as it takes to acquire statistically significant findings.
5. If the ad variation produces a statistically significant increase in conversions, use it for further advertising.
6. Repeat

Consider that we’d like to compare the results of groups A and B, with group A receiving a dummy ad and group B receiving a creative ad from the SmartAd brand.

The BIO data for this project consists of “Yes” and “No” responses of online users to the following question:

Q: Do you know the brand Lux?______

O Yes or O No

This is a test run, and the main objective is to validate the hypothesis-testing algorithm we built. SmartAd ran this campaign from 3–10 July 2020. The users who were presented with the questionnaire above were chosen according to the following rule:

Control: users who have been shown a dummy ad.

Exposed: users who have been shown a creative (ad) that was designed by SmartAd for the client.

We want to look at engagement performance after users see these ads. As a result, engagement is our A/B test metric of choice.

Equation for users Engagement
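The original equation image is not reproduced here. A plausible reconstruction, assuming engagement is measured per group as the share of questionnaire responses that are “Yes”:

```latex
\text{Engagement}_g = \frac{\#\,\text{Yes}_g}{\#\,\text{Yes}_g + \#\,\text{No}_g},
\qquad g \in \{\text{control},\ \text{exposed}\}
```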

Stating the hypothesis

We want to see if there is a performance difference between the two groups. We’re especially interested in seeing if there’s a statistically significant difference in their Engagement results.

Hypothesis testing in statistics is a way of testing the results of a survey or experiment to see if they are meaningful. You’re basically testing whether your results are valid by figuring out the odds that they happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use. So here is the statement of the hypotheses:

Hypothesis and threshold formulation

Hypothesis 0 (null hypothesis): there is no statistically significant difference between the engagement results of the control and variation groups.

Hypothesis 1 (alternative hypothesis): the engagement results of the control and variation groups differ, and that difference is statistically significant.

Level of Significance, 𝛼=0.05
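In symbols, writing p_c and p_e for the engagement rates of the control and exposed groups:

```latex
H_0 : p_c = p_e
\qquad \text{vs.} \qquad
H_1 : p_c \neq p_e,
\qquad \alpha = 0.05
```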

First, we combined the successes (“Yes” responses) for the exposed and control groups:

The combination of success (yes) for both exposed and control groups

After that, we calculated the engagement result for each group:

The engagement result of each group

Finally, we generated p and t values from the Ad campaign data and compared them to the Level of Significance alpha.

The p and t-values

Since the p-value (0.5185) is greater than alpha (0.05), we fail to reject the null hypothesis H0. We therefore conclude that there is no statistically significant difference between the two campaigns.
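For reference, here is a minimal sketch of how these numbers can be reproduced in Python. The DataFrame layout (an `experiment` column with values 'control'/'exposed' and a binary `yes` column) is an assumption for illustration, not the exact SmartAd schema.

```python
# Minimal sketch of the classical test on the campaign data.
# Assumes df has an 'experiment' column ('control' / 'exposed') and a
# binary 'yes' column (1 = answered Yes, 0 = answered No).
import pandas as pd
from scipy import stats

def classical_ab_test(df: pd.DataFrame, alpha: float = 0.05):
    control = df.loc[df["experiment"] == "control", "yes"]
    exposed = df.loc[df["experiment"] == "exposed", "yes"]

    # Engagement (proportion of Yes responses) per group
    print("control engagement:", control.mean())
    print("exposed engagement:", exposed.mean())

    # Two-sample t-test on the binary responses gives the t- and p-values
    t_stat, p_value = stats.ttest_ind(control, exposed, equal_var=False)
    print(f"t = {t_stat:.4f}, p = {p_value:.4f}")

    if p_value > alpha:
        print("Fail to reject H0: no statistically significant difference.")
    else:
        print("Reject H0: the difference is statistically significant.")
```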

2. Limitations and challenges of classical A/B testing

Classical A/B testing works, but it isn’t terribly efficient. You’ll need to spend a significant amount of time and resources to carry out your tests before you can gain any meaningful results.

A/B testing doesn’t work well when testing major changes, like new products, new branding, or completely new user experiences. In these cases, there may be effects that drive higher than normal engagement or emotional responses that may cause users to behave in a different manner.

In general, classical A/B testing:

• Can take a lot of time and resources before producing a result

• Only works for specific goals

• Uses a fixed sample size

• Could end up in constant testing

• Requires, as shown in the SmartAd experiment, that the chance of a mistake (the Type I error, which should not exceed alpha = 5%) be controlled, which means the test results may only be checked at the very end, once the sample size for both variations has been reached.

Not only that: instead of considering each visitor to be equal, as in split A/B testing, ML can take into consideration factors such as demographics, customer status, and previous behavior to dynamically serve up different versions of your site to different groups of users.

The power of ML enables you to personalize and optimize your web properties from thousands of potential variations to display the single version that offers the best chance of conversion for each individual visitor.

3. Sequential testing

In sequential A/B tests, the final sample size at which the test stops depends on the data we observe during the test. So if we observe more extreme results at the start, the test can end earlier. To achieve this, a pair of statistical boundaries is drawn, e.g. based on the Type I error rate we would like to obtain in our test. For every new data point we get, the cumulative log-likelihood ratio of our data is compared to these boundaries.

Plot of the upper and lower decision boundaries

We must continue the test to see whether the statistic converges to the blue region (the accept-null area) …
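To make the mechanics concrete, here is a minimal sketch of a sequential probability ratio test (SPRT) for a binary engagement metric. The baseline rate p0, the alternative rate p1, and the error rates alpha and beta are illustrative assumptions, not values taken from the SmartAd experiment.

```python
# Minimal SPRT sketch for a stream of binary observations (1 = Yes, 0 = No).
# p0, p1, alpha, and beta are illustrative assumptions, not SmartAd values.
import numpy as np

def sprt(observations, p0=0.45, p1=0.50, alpha=0.05, beta=0.20):
    upper = np.log((1 - beta) / alpha)   # cross above -> accept H1
    lower = np.log(beta / (1 - alpha))   # cross below -> accept H0
    llr = 0.0                            # cumulative log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1 (effect detected)", n
        if llr <= lower:
            return "accept H0 (no effect)", n
    return "continue the test", len(observations)
```

Feeding it the stream of Yes/No responses, e.g. `sprt(df["yes"].to_numpy())`, returns either a decision and the number of observations used, or a signal to keep collecting data.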

Pros

  • Gives a chance to finish experiments earlier without increasing the possibility of false results
  • Optimizes the number of observations needed (the sample size)
  • Reduces the likelihood of error

Cons

  • If we are concerned with preserving the Type I error rate, we need to recognize that we are doing multiple comparisons: each interim analysis is another (non-independent) opportunity to produce a Type I error, so after three looks at the data we have had three such chances.
  • For a fixed sample size and significance level, sequential testing ends up reducing power compared to waiting until all the data comes in.

4. Formulating A/B testing with Machine Learning (the solution)

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal/no human intervention.

It is when a system learns the relationship between input and output data. The process is used to create algorithms and models that make sense of data. These algorithms can be used to make decisions, classify data, or perform complex tasks.

A/B testing is the split testing of an altered variable, with success usually measured by user engagement or conversions. The overall aims of A/B testing and machine learning are therefore quite different. A major difference in approach is that machine learning models are usually developed in an offline environment before being deployed to live, dynamic data. In comparison, A/B testing is performed on live or online data.

A/B testing with machine learning can be combined to:

  • Test and refine the deployment of new machine learning models.
  • Automate A/B testing to make the process more efficient and effective.
  • Discover useful information about datasets and variables when developing or aligning algorithms.
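As a sketch of the first step, the campaign data can be framed as a supervised-learning problem: the experiment group and contextual features become the inputs, and the Yes/No response becomes the target. The column names used here (experiment, hour, device_make, browser, platform_os, yes) are assumptions about the campaign data, not a guaranteed schema.

```python
# Sketch of framing the A/B data as a supervised-learning problem.
# Column names are assumed for illustration.
import pandas as pd

def build_features(df: pd.DataFrame):
    # One-hot encode the categorical columns; numeric columns pass through.
    X = pd.get_dummies(
        df[["experiment", "hour", "device_make", "browser", "platform_os"]],
        drop_first=True,
    )
    y = df["yes"]  # 1 = answered Yes, 0 = answered No
    return X, y
```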

Logistic Regression

Accuracy score using browser data frame
Accuracy score using platform data frame
XGBoost feature importance
Decision tree feature importance

After each ML model has been developed, the final step is checking its accuracy.

Best models according to accuracy comparison
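Here is a minimal sketch of how the three models above could be fitted and compared, assuming the hypothetical build_features() helper from the previous sketch; the hyperparameters are scikit-learn and XGBoost defaults, not the exact configuration used in the experiment.

```python
# Sketch of fitting and comparing logistic regression, a decision tree,
# and XGBoost on the campaign data, then reading feature importances.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def compare_models(df):
    X, y = build_features(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=42),
        "xgboost": XGBClassifier(eval_metric="logloss"),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    # Feature importances from the tree-based model
    importances = dict(zip(X.columns, models["xgboost"].feature_importances_))
    return scores, importances
```

Comparing the accuracy scores selects the best model, and the feature importances are what reveal the influence of features such as the hour and the date on brand awareness.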

Classical A/B test vs Machine learning

With classical A/B testing, we determined whether there was a significant lift in brand awareness, which is instrumental to SmartAd in making its next move.

With machine learning, we discover that other features, such as the hour of the day and the date, also influence the conversion in brand awareness.

There is a greater potential to have a significant lift in brand awareness.

References

1. https://conversionsciences.com/ab-testing-guide/
2. https://vwo.com/ab-testing/
3. https://www.analyticsvidhya.com/blog/2020/10/ab-testing-data-science/
4. https://www.business-science.io/business/2019/03/11/ab-testing-machine-learning.html
