A/B/n — Always Be Testing (Part 1)

By Ron Thalanki and Sachin Raghavendran, Summer Interns

The Hive
10 min read · Aug 10, 2018

Overview

The benefits of A/B testing have become more obvious as the complexity of platforms has increased and optimizing these platforms requires more than theory. AI/machine learning systems still exhibit issues with reproducibility and explainability that can be lessened (and even removed) by implementing a proper A/B testing framework with statistical selection mechanisms such as multi-armed bandits. This blog will explain the benefits of using A/B testing for AI/ML systems whose results may be difficult to explain, reproduce, or understand, as well as the statistics behind multi-armed bandits and contextual bandits. In the next blog, we will describe The Hive’s A/B testing framework, which is designed to deal with the limits and obstacles of the enterprise setting.

Importance of A/B Testing [1]

Hidden Technical Debt [1]

Software engineering projects accumulate significant long-term costs, or technical debt, through quick development without end-to-end testing, unchecked dependencies, and similar shortcuts; machine learning systems only add to this debt, as they are not resistant to problems like data dependencies and entanglement. As more companies incorporate artificial intelligence into enterprise systems, it becomes important to standardize the approach to reducing technical debt and its effects.

Entanglement (a.k.a. the change-anything-changes-everything paradigm) of features, data, and more results from the interdependency of the ensembles and models developed in the process of solving a specific problem. Correction cascades arise when one takes a model m𝑥 for a problem 𝑥 and attempts to learn a model m𝑥’ for a different problem (with m𝑥 as the input) by learning a fast correction. This is tempting but generally a bad idea: building m𝑥’’ on top of m𝑥’ leads to significant refactoring and retraining costs in the future, and if m𝑥 is later improved, m𝑥’ can end up performing worse. The power of ML/AI algorithms is alluring in many situations, and the general lack of problem-tailored design is a consequence of a complacent approach to solving problems that can severely impact future code reuse and upgrades.

Model Performance

For example, a web design company may be testing the quality of some of its UI elements. Even if its objective success metric is the click-through rate of the page (with the elements themselves as the change factor), a generally good (or bad) click-through rate for an element may end up being misleading: certain elements may induce better click-through for different demographics of race, age, and so on. Indeed, these sorts of model performance issues are more prevalent than expected. After serving models to the public (for testing purposes, for example), aggregating and analyzing the data and model performance is of the utmost importance and must be approached without fixating on a single aggregate metric.

There is always the possibility of domain-specific variations degrading the behavior of pre-trained models. These issues may be alleviated by leveraging subject-domain experts, but the mathematics behind the models themselves may be limited in other respects, so results may be disputable. Additionally, the behavior of the element being tested (for example, a piece of business functionality) may carry subtleties that are unobvious and can produce incorrect results. Issues pertaining to the domain can thus make model performance unexplainable.

Real-World Testing

Although theory can be used to predict which models and elements will optimize a certain metric, it is important to ensure that they perform in the real world. Most people have already heard of A/B testing; it is the standard method of statistical testing in which a null and an alternative hypothesis are compared. Although there is significant mathematics backing A/B testing, this kind of hypothesis testing can drain resources and take too much time (especially in a complex experiment). Still, A/B testing excels at letting companies test multiple versions of a product and determine which version optimizes a given metric on real-world data in as little time as possible.

A/B testing is commonly used in client-facing products (especially user interfaces), where many variations of UI elements can be tested to optimize metrics like click-through rate or time spent on page. To test, say, x variations of a button, one must essentially set up x pairs of null and alternative hypotheses (as part of standard practice for statistical tests), and the number of tests accumulates with the number of factors tested. Testing is also vital in the validation phase of selecting machine learning models for production, where multiple models are evaluated against data to determine which performs best. Since testing the success of the model(s) can take more time than necessary, multi-armed bandits (a.k.a. machine-learning-based optimization) offer a possible alternative.
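To make the hypothesis-testing side concrete, here is a minimal sketch of the classic two-proportion z-test that underlies a two-variant click-through experiment. The click counts are invented for illustration, and the implementation uses only the Python standard library:

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that both variants share one CTR.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: variant A gets 200/2000 clicks, variant B 260/2000.
z, p = two_proportion_ztest(200, 2000, 260, 2000)
```

With these made-up numbers the difference (10% vs. 13% CTR) comes out significant at the usual 0.05 level; each additional variation would need its own such test, which is the bookkeeping burden the bandit methods below avoid.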

Multi-Armed Bandits

A/B Testing vs MABs

Below is a reference diagram summarizing the major differences between A/B testing and MABs. Although we have not yet discussed MABs in detail, most readers of this article have likely encountered them, so we present the diagram here as a summary of the two paradigms.

Reinforcement Learning

The Multi-Armed Bandit (a.k.a. the n-armed bandit) problem is among the quintessential reinforcement learning problems. Given a number of slot machines, each with a different success rate, you need to find the best bandit in the least amount of time while also maximizing your overall payoff. The problem centers on the balance between exploitation and exploration. A traditional greedy algorithm will primarily exploit the best arm seen so far; however, there are instances where such an algorithm never finds the best arm (due to its lack of exploration), so the greedy choice settles on a sub-optimal bandit. MABs are particularly useful because they provide objective evidence of the success of certain algorithms/elements and lend credible data on how to contextually determine the next possible option. Some of the benefits of MABs include:

  1. Exploiting/earning and exploring/learning simultaneously
  2. Automating selection
  3. Accounting for the changing & erratic nature of the world
  4. Automating for scale

One of the most renowned (and easier-to-use) MAB algorithms is UCB1, whose mantra is Optimism in the Face of Uncertainty. Let us lay the groundwork for this algorithm [2]:

Given K actions labeled {1, 2, …, K}, in each round we select an action/bandit and observe its payout; occasionally the algorithm may also pick an action at random. The algorithm’s success is measured by its regret. The cumulative regret of an algorithm A over T rounds is the difference between the expected reward of always playing the best action and the expected reward actually earned by A over those T rounds, where the best action of any round is the one with the highest expected payoff: R(T) = T·μ* − E[r_1 + … + r_T], with μ* the mean payoff of the best action and r_t the reward A receives in round t.

The UCB1 algorithm works by establishing an upper confidence bound: an estimate that, with high probability, an action’s true payoff does not exceed. If our optimistic guess turns out to be wrong, we are compelled to switch to a different action. In essence, the UCB1 algorithm is as follows:
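As a concrete illustration, here is a minimal pure-Python sketch of that loop over simulated Bernoulli slot machines; the 0.2/0.5/0.8 success rates are invented for the example:

```python
import math
import random

def ucb1(payout_probs, rounds=10000, seed=0):
    """Run UCB1 over Bernoulli arms; payout_probs are the hidden success rates."""
    rng = random.Random(seed)
    k = len(payout_probs)
    counts = [0] * k      # times each arm has been played
    rewards = [0.0] * k   # cumulative reward per arm
    # Play each arm once to initialize its estimate.
    for a in range(k):
        counts[a] = 1
        rewards[a] = 1.0 if rng.random() < payout_probs[a] else 0.0
    for t in range(k, rounds):
        # Index = empirical mean + optimism bonus sqrt(2 ln t / n_a);
        # rarely-played arms get a large bonus, forcing exploration.
        scores = [rewards[a] / counts[a]
                  + math.sqrt(2 * math.log(t + 1) / counts[a])
                  for a in range(k)]
        a = max(range(k), key=scores.__getitem__)
        counts[a] += 1
        rewards[a] += 1.0 if rng.random() < payout_probs[a] else 0.0
    return counts

counts = ucb1([0.2, 0.5, 0.8])  # the 0.8 arm should dominate the pulls
```

As the bonus term shrinks for well-sampled arms, play concentrates on the arm whose optimistic estimate stays highest, which is exactly the “optimism in the face of uncertainty” principle.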

If you actually work through the algorithm (as explained by Jeremy Kun), you will realize that UCB1 is not as complicated as it seems and can indeed be implemented and integrated in a business scenario. You can find more about the complexity and mechanics of the algorithm at this wonderful site: https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/. Another simple and easy-to-use algorithm is epsilon-greedy, which randomly explores options a given percentage of the time (100·ε) and exploits the best-known option the rest of the time (100 − 100·ε). You can find more about the specifics of the algorithm here: https://imaddabbura.github.io/blog/data%20science/2018/03/31/epsilon-Greedy-Algorithm.html. We ran a simulation of five different bandits with some initialized rewards (a contrived example, but an example nonetheless), and the bandit pulled most often corresponded to the one with the best payoff. (Feedback mechanisms that lower the exploration rate over time can be implemented to conserve resources.) The overall reward of the algorithm trended upwards:
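Our simulation code is not reproduced here, but an epsilon-greedy loop of the kind just described fits in a few lines of Python; the five payout rates below are illustrative stand-ins, not the values from our experiment:

```python
import random

def epsilon_greedy(payout_probs, epsilon=0.1, rounds=10000, seed=0):
    """Explore a random arm with probability epsilon, else exploit the best mean."""
    rng = random.Random(seed)
    k = len(payout_probs)
    counts = [0] * k
    means = [0.0] * k
    total_reward = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            a = rng.randrange(k)                      # explore / learn
        else:
            a = max(range(k), key=means.__getitem__)  # exploit / earn
        r = 1.0 if rng.random() < payout_probs[a] else 0.0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]        # incremental mean update
        total_reward += r
    return counts, total_reward

counts, total_reward = epsilon_greedy([0.1, 0.3, 0.5, 0.7, 0.9])
```

Decaying `epsilon` over the course of the run is the natural feedback mechanism mentioned above: heavy exploration early, nearly pure exploitation once the estimates settle.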

A natural Bayesian algorithm, Thompson Sampling [3], is another MAB algorithm tailored to minimize regret. The setup for this algorithm is essentially the same as that for UCB1, but a significant difference in Thompson Sampling is its use of the Beta distribution, whose probability density function (pdf) is defined with the gamma function (Γ) as follows:

f(x; a, b) = [Γ(a + b) / (Γ(a)·Γ(b))] · x^(a−1) · (1 − x)^(b−1), for 0 ≤ x ≤ 1

where a and b are the distribution’s two shape parameters. The algorithm itself is as follows:
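As a sketch, assuming Bernoulli payouts and a uniform Beta(1, 1) prior on each arm (the usual textbook setup rather than any particular production implementation), Thompson Sampling can be written as:

```python
import random

def thompson(payout_probs, rounds=10000, seed=0):
    """Thompson Sampling over Bernoulli arms with a Beta(1, 1) prior on each."""
    rng = random.Random(seed)
    k = len(payout_probs)
    successes = [1] * k  # Beta shape parameter a (prior + observed successes)
    failures = [1] * k   # Beta shape parameter b (prior + observed failures)
    counts = [0] * k
    for _ in range(rounds):
        # Sample a plausible payoff for each arm from its posterior,
        # then play the arm whose sampled payoff is highest.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(k)]
        a = max(range(k), key=samples.__getitem__)
        counts[a] += 1
        if rng.random() < payout_probs[a]:
            successes[a] += 1
        else:
            failures[a] += 1
    return counts

counts = thompson([0.3, 0.5, 0.8])  # the 0.8 arm should win most pulls
```

Because the action is chosen by sampling from the posterior rather than by a fixed formula, two runs can diverge, which is the probabilistic character contrasted with UCB1 below.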

The principal difference between Thompson Sampling and UCB1 is that UCB1 is deterministic while Thompson Sampling is probabilistic. MABs do not necessarily make A/B testing irrelevant; rather, they offer a time-saving, machine-learning-based optimization alternative. Furthermore, some of the prototypical MAB algorithms (like epsilon-greedy, Thompson Sampling, and UCB1) are not overly complicated mechanically, in contrast with their asymptotic complexity analysis. In A/B testing, retrieving conclusive (i.e. statistically significant) results for n different variations generally requires running n different tests with n null hypotheses and n alternative hypotheses, which quite obviously takes a prohibitive amount of time. Furthermore, an A/B test explores for a short while (i.e. bucketing the participants of the experiment) and then proceeds to purely exploit. Unlike MABs, A/B tests cannot perpetually run and reform of their own accord.

The principal downfall of MABs is that they may output an action without taking into account information about the state of the environment (the context). A contextual bandit, however, makes its decision conditional on the state of the environment. This is extremely useful in business use cases, where there may be instances in which personalizing the model for different situations is vital. Although bandit-based algorithms can produce superior results to A/B testing, companies can become so reliant on these algorithms that they create too large an abstraction barrier, which can impair future processes and decisions.

Problems in AI World [4]

Explainability

For the majority of use cases, such systems are treated as black boxes; they receive an input, return an output, and tend not to offer any information about why. Indeed, as Casimir Wierzynski (Senior Director, Artificial Intelligence Products Group at Intel) puts it, there are six major concerns:

Bias: What if the human creators of the AI had some unconscious bias? How do I ensure that my system itself does not host a biased world-view?

Fairness: Did my system make decisions fairly? And what does fairness even mean in the respective context?

Transparency: Should I not be able to have decisions — made by the specific AI system — explained to me, especially in terms that I can understand?

Safety: Can I trust my AI system to achieve some result without any (or minimal) explanation of how it reached it?

Causality: Provided that I have learned a model from some data, can I get the correct inferences compounded with some additional explanation?

Engineering: How can I deal with incorrect outputs from trained models?

The explainability issue of an AI system is ameliorated by the use of A/B testing, especially when it is done in a standardized manner. A general lack of foresight and the appearance of biases can hamper the analysis of results, but the emphasis on standardization in A/B testing (along with its heavy-duty statistical backing) benefits everyone attempting to improve their current models and applications in the long run.

Reproducibility

With the advent of AI in recent years, solutions have accrued more and more moving parts (frameworks, model designations, non-transparent models), which has led to difficulty in understanding the differences between what is expected and what is observed. A significant problem arises when a machine learning model merely seems to perform well: it is tempting to just trust the model and ignore why it made certain decisions. The principal problem with this approach is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks.

Only with interpretability can machine learning algorithms be debugged and audited. Even in low-risk environments, like movie recommendation, interpretability is valuable in the research and development stage as well as after deployment. Later, when a model is used in production, things can go wrong, and having an interpretation for a faulty prediction helps one understand the cause of the fault; it gives a direction for how to fix the system, saving time and resources in the long run.

In the next part of this blog, we will introduce you to Hibert, The Hive’s A/B testing framework, and discuss how it leverages different components and designs in order to be integrated into pipelines.


The Hive

The Hive is a venture fund & co-creation studio based in Palo Alto, CA, co-creating startups focused on AI-powered applications in the enterprise.