Supercharge your A/B Testing using Reinforcement Learning

How to improve the website user experience through faster experimentation using AI and automation

Pramod R
FalabellaTechnology
7 min read · Sep 8, 2021


Introduction

Every decision we make these days is heavily backed by data and by deep analysis that supports it. Whether it is understanding the effects of a vaccine's composition on the human body, gauging voters' choices during an election, or understanding users' preferences on a website, everything boils down to scientifically understanding the patterns in the data and taking action based on statistical tests. One of the popular frameworks for running such experiments and collecting data to enable decisions, particularly in an online setup, is A/B testing.

What is A/B Testing?

In simple words, A/B testing is a method of comparing two (or more) variants of your experiment to see which one performs better. Users are randomly split into two (or more) groups, and each group is assigned one variant of the experiment. The term A/B was coined because we essentially test the A variant against the B variant. While this method is now widely used by the Facebooks, Googles and Amazons of the world, the concept itself was first introduced in the early 1900s by the statistician Ronald Fisher, who ran agricultural experiments comparing different fertilisers on different patches of land before rolling the best one out to the entire field. In a sense, the idea goes back to human evolution itself, where multiple variants were 'tested' before we arrived at our current form! The technique was later adapted to clinical trials and direct marketing campaigns, and is now used extensively on e-commerce, social media and content websites to determine users' responses to different variants of site design and, more importantly, recommendations.

Experimentation Setup

Typically, we start with an objective: what are we testing or comparing, and what are we trying to optimise? For example, does a "Compare similar items" recommendation work better than a "Compare similar brands" one on an e-commerce site? Are we trying to drive a higher click-through rate, a longer dwell time on the page, or a higher conversion (upper funnel vs lower funnel of the marketing chain)? How many experimental variants do we have (some experiments go well beyond the standard A vs B)? Are there any preset filters for the customers who would be shown these recommendations? What proportions of the random split do we want between the variants? How long do we run these experiments for?

These questions help us design the experiment better and plan the data and engineering efforts for implementation. With the answers in hand, we frame the hypotheses for our tests and check for statistical significance on the results collected during the experimentation period [01]. As we gain confidence in the results, we either declare a single winner or iterate again with a new set of experiments and further hypotheses.
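To make the significance check concrete, here is a minimal sketch of a two-proportion z-test on conversion rates, the kind of test that often closes out such an experiment. The visitor and conversion counts are made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical results after the experimentation period (made-up numbers)
conversions = {"A": 480, "B": 560}      # users who converted in each variant
visitors    = {"A": 10_000, "B": 10_000}  # users exposed to each variant

p_a = conversions["A"] / visitors["A"]
p_b = conversions["B"] / visitors["B"]

# Pooled two-proportion z-test: H0 says both variants convert at the same rate
p_pool = (conversions["A"] + conversions["B"]) / (visitors["A"] + visitors["B"])
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors["A"] + 1 / visitors["B"]))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))     # two-sided p-value

print(f"conversion A={p_a:.3%}, B={p_b:.3%}, z={z:.2f}, p={p_value:.4f}")
# If p_value is below our significance level (say 0.05) we would call B the
# winner; otherwise we keep collecting data or redesign the experiment.
```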

Limitations with the current setup

All this is great when it comes to a standard A/B testing setup. But it comes with its own set of challenges:

  • Period of experimentation: A/B tests are typically designed to run for short durations, say 5–8 weeks. They could run shorter, at the cost of collecting less observation data/customer response, or longer, at the cost of losing significant time to enter the market and losing the potential revenue that could have been earned by showing the right recommendations to the right crowd, who happened to be allocated the wrong test variant. Either way, there is a dilemma in choosing the optimal period of experimentation.
  • Evolving/changing demands and preferences: Customers' choices keep evolving over time. What is relevant today may seem totally irrelevant after a while. Given this, it is hard to experiment for a pre-defined duration and then settle on a single winner for the rest of eternity. Moreover, every subsequent iteration of experiments consumes the entire experimentation timeline all over again before a new winner can be declared.
  • Why choose a single winner? With the increased focus on personalisation everywhere, it is not necessary that everyone will like the single winner all the time. There can be a mixed bag of responses where people exhibit personal preferences that contradict the 'popular' sentiment. In such cases, we need to account for those personalised preferences rather than impose a popular winner as a universal choice.
  • Intelligent learning machines: While statistical significance tests offer a data-driven way to measure the effectiveness of an A/B test retrospectively, they do not provide a machine learning mechanism for choosing the optimal suggestion given a customer's past choices and the rich context they bring to the table. We are also not feeding the experimentation results back into our models, which could have led to a more optimal selection of variants.

Enter Reinforcement Learning

Reinforcement Learning is an emerging area of Machine Learning in which models train themselves based on rewards and regrets (analogous to the carrot-and-stick approach). In this setup, an Agent (typically our website engine, recommendation engine, etc.) chooses the optimal action to take (like recommending an ad) in an environment (in our case, the set of customers) by observing its current state (the features representing the customers). Based on these actions, we observe the reward/regret that the environment returns and learn from these responses to decide on the next action.

Reinforcement Learning: An Introduction — Sutton and Barto [02]

Multi Arm Bandit Setup

In strict statistical terminology, the traditional A/B test is pure exploration. In an ideal world, where we knew the outcome of every possible action, we would simply deploy the action that yields the highest reward. But we live in a non-deterministic world where the outcomes of these actions are unknown. The Multi Arm Bandit comes in handy in this setup: it balances 'exploitation' and 'exploration' across the different actions (also called arms, in reference to the arm of a bandit, i.e. the slot machines in casinos). When we need to collect more responses from users for a relatively new action (arm), we explore; as we gain more confidence (i.e. response data) in an arm, we exploit it. At any given point in time, we maintain a balance between the amount of exploration and exploitation based on several parameters [03].

We start by allocating traffic evenly to all our variants (50–50 in the case of an A/B test). As we observe the customers' responses to each of our actions (arms), we start shifting traffic to the arm that yields the maximal reward, while still retaining a statistically meaningful share of traffic for the others (epsilon in the case of Epsilon Greedy, the exploration parameter alpha in the case of UCB, etc.). This way we need not settle on a single winner at the end of the experiment; we keep experimenting with a small proportion of traffic on the other variants indefinitely. Continuous learning happens by capturing the feedback data from the played arm and continuously updating our beliefs about the customers' responsiveness. If customers shift their preference over time, that feedback automatically increases the traffic share of the newly preferred variant over the subsequent days (see the chart below: in the Multi Arm Bandit setup, variant B picks up from the 7th week).
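As an illustration of the idea, here is a minimal epsilon-greedy sketch: a small epsilon share of the traffic keeps exploring all arms forever, while the rest goes to whichever arm currently looks best. The click-through rates are invented purely for the simulation; a real system would of course observe them from live traffic.

```python
import random

# Hypothetical true click-through rates, unknown to the algorithm (made-up numbers)
true_ctr = {"A": 0.05, "B": 0.08}

epsilon = 0.10                                # fraction of traffic reserved for exploration
counts = {arm: 0 for arm in true_ctr}
estimates = {arm: 0.0 for arm in true_ctr}    # running estimate of each arm's CTR

for visitor in range(50_000):
    if random.random() < epsilon:
        # Exploration: with probability epsilon, pick an arm uniformly at random
        arm = random.choice(list(true_ctr))
    else:
        # Exploitation: otherwise, play the arm with the best estimate so far
        arm = max(estimates, key=estimates.get)

    reward = 1 if random.random() < true_ctr[arm] else 0       # did the visitor click?
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update

share = {arm: counts[arm] / sum(counts.values()) for arm in counts}
print("estimated CTRs:", estimates)
print("traffic share :", share)   # most of the traffic drifts toward the better arm
```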

Thus the MAB setup not only helps us play the optimal arm (i.e. send the optimal recommendation based on past feedback), but also helps us remain profitable (earn while you learn!) right from day 1.

There are many algorithms under the Multi Arm Bandit paradigm, such as Epsilon Greedy, Upper Confidence Bound, Thompson Sampling, Bayesian Bandits, Adversarial Bandits, etc., which use various heuristics to play the optimal arms. I'll reserve a deep dive into each of these algorithms for a different post.
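To give a flavour of how these differ, below is a small sketch of Thompson Sampling with a Beta-Bernoulli model of clicks. Instead of a fixed epsilon, each arm is played roughly in proportion to the probability that it is the best one, given the feedback seen so far (the rates are again invented).

```python
import random

true_ctr = {"A": 0.05, "B": 0.08}             # unknown to the algorithm
alpha = {arm: 1 for arm in true_ctr}          # Beta posterior: 1 + observed clicks
beta = {arm: 1 for arm in true_ctr}           # Beta posterior: 1 + observed non-clicks

for visitor in range(50_000):
    # Sample a plausible CTR for every arm from its posterior, play the best sample
    sampled = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in true_ctr}
    arm = max(sampled, key=sampled.get)

    clicked = random.random() < true_ctr[arm]
    alpha[arm] += clicked                     # posterior update from the feedback
    beta[arm] += not clicked

posterior_mean = {a: alpha[a] / (alpha[a] + beta[a]) for a in true_ctr}
print(posterior_mean)                         # concentrates near the true CTRs
```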

Personalization via Contextual Multi Arm Bandit

While the above Multi Arm Bandit framework works well for intelligently allocating the traffic split, it does not take the customer's context into consideration, which can greatly help in personalising the content in addition to their past responses. This requirement gave rise to Contextual Multi Arm Bandits [04], which take into account a customer context derived from demographics and past web interactions (pages visited, dwell time, add-to-cart events, purchases, etc.) that best represent the customer. This setup not only drives the traffic split but also targets the recommendations to the most suitable customers while maximising the reward. All of this is again done in an online learning framework, which enables continuous learning and updating of the models over time while accounting for data drift and changing customer preferences/behaviour.
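As a rough sketch of one well-known contextual approach (a LinUCB-style disjoint linear model per arm, not the exact algorithm of [04]), the snippet below keeps a ridge-regression estimate per arm and adds an uncertainty bonus to its predicted reward. The context features and the simulated feedback are purely illustrative.

```python
import numpy as np

class LinUCBArm:
    """Disjoint linear model for one arm: reward ~ context . theta, plus a UCB bonus."""
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha                    # exploration strength
        self.A = np.eye(n_features)           # ridge-regression Gram matrix
        self.b = np.zeros(n_features)         # accumulated reward-weighted contexts

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                # current estimate of the arm's weights
        return x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Two recommendation variants; context = illustrative customer features
arms = {"similar_items": LinUCBArm(3), "similar_brands": LinUCBArm(3)}
rng = np.random.default_rng(0)

for visitor in range(20_000):
    # Made-up context: [is_returning, recent_sessions (scaled), viewed_brand_page]
    x = np.array([rng.integers(0, 2), rng.random(), rng.integers(0, 2)], dtype=float)

    chosen = max(arms, key=lambda name: arms[name].ucb(x))

    # Simulated feedback: brand-page visitors respond better to "similar_brands"
    base = 0.08 if (chosen == "similar_brands" and x[2] == 1) else 0.04
    reward = float(rng.random() < base)
    arms[chosen].update(x, reward)
```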

The use of Reinforcement Learning in the A/B testing setup is still an active research area, although many companies have already adopted it for their experimentation platforms. We can expect more advances in this area in the days to come.

References

[01] http://www.robotics.stanford.edu/~ronnyk/2009controlledExperimentsOnTheWebSurvey.pdf

[02] https://mitpress.mit.edu/books/reinforcement-learning-second-edition

[03] https://arxiv.org/pdf/1510.00757.pdf

[04] http://proceedings.mlr.press/v9/lu10a/lu10a.pdf
