Multi-Armed Bandits for Model Serving and Experimentation

Introduction

In machine learning engineering we are often concerned with model serving environments and the frameworks that make constant experimentation with machine learning possible in production, safely and robustly. Often we have only rough approximations of how well a model will do on a particular task once it receives real data from users or systems, data which may change over time or look nothing like our training sets. There are a few well-known strategies for doing exactly this sort of experimentation: canary deployments, bespoke A/B testing, and dynamic serving frameworks based on bandits, which is what we will be exploring here.

The benefit of bandit-based model serving over static methods such as A/B testing is that the system adapts to feedback from users. As signals arrive about whether a model is driving the kind of engagement or action you care about, the bandit increases the likelihood that well-performing models are served in subsequent rounds. This is valuable, in part, because we can ship models without fear that customers will be over-exposed to a model that is lacking in production. Before we get too far, though, let's take a look at the Multi-Armed Bandit problem more generally and set up some terminology and notation I will be using throughout the post.

The Multi-Armed Bandit Problem

We will be sticking with our example of serving models throughout this post and avoiding cliche gambling analogies (sorry, not sorry). To restate: we have a set of K models we wish to serve. Each has some reward distribution R_k with mean µ_k; for this example, these distributions are modeled as Bernoulli distributions representing click-through rates (CTR). We obviously want to serve the model that will lead to the highest CTR, but the problem is that the true value of each µ_k is unknown. We therefore turn to various algorithms that use past data to minimize how often we serve models that likely have a lower µ_k than the optimal model. To formalize this a bit more, we are interested in regret minimization, a process by which we measure the difference between the current strategy and some optimal strategy. We define regret as ρ, calculated over T rounds of selecting a model to serve and observing whether a click occurs, like so:
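Here µ* denotes the mean of the best model and r_t the reward observed at round t, giving the standard form:

\rho = T\mu^* - \sum_{t=1}^{T} r_t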

To break this down: the regret is the difference between the reward we would have accumulated by always serving the best model over those T rounds and the sum of the rewards we actually observed. The difficult task for our bandit is learning the R_k distributions while also minimizing regret. In many bandit papers, the authors discuss regret guarantees or the properties of their algorithm with respect to regret. You can get a detailed treatment of regret analysis from Lattimore and Szepesvári in their excellent book Bandit Algorithms.

Solving with E-Greedy

Before we go further, let’s see how we could solve this problem with an approximate algorithm called Epsilon Greedy, or e-greedy.

e-greedy can be implemented in Python like so:
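The full class looks something like the following minimal sketch, simplified for the post rather than production use; select_model and reward_model are the entry points we will use later:

import random

import numpy as np


class EGreedy:
    """Epsilon-greedy selection over a fixed set of models."""

    def __init__(self, models, epsilon=0.1):
        self.models = models
        self.epsilon = epsilon
        # Per-model counts: how often each model was served and how often it "succeeded"
        self.model_tries = np.zeros(len(models))
        self.model_successes = np.zeros(len(models))

    def _increment_model_tries(self, model):
        self.model_tries[self.models.index(model)] += 1

    def _epsilon_greedy_selection(self):
        # Explore with probability epsilon, otherwise exploit the best empirical mean.
        # Untried arms produce 0 / 0 = NaN, which nanargmax skips over.
        if random.random() < self.epsilon:
            random_model = random.choice(self.models)
            self._increment_model_tries(random_model)
            return random_model
        else:
            best_model_so_far = self.models[
                np.nanargmax(self.model_successes / self.model_tries)
            ]
            self._increment_model_tries(best_model_so_far)
            return best_model_so_far

    def select_model(self):
        return self._epsilon_greedy_selection()

    def reward_model(self, model):
        self.model_successes[self.models.index(model)] += 1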

As you can see above, we use numpy arrays to track our successes and tries with each model. These constitute the parameters of the distributions we are trying to learn. The meat of the algorithm happens in the _epsilon_greedy_selection method, where we draw a random number and check it against epsilon. If the number is smaller than epsilon we select a random model; if not, we select the best model so far by taking the max over the means (i.e., CTRs) of the arms:

if random.random() < self.epsilon:
    random_model = random.choice(self.models)
    self._increment_model_tries(random_model)
    return random_model
else:
    best_model_so_far = self.models[
        np.nanargmax(self.model_successes / self.model_tries)
    ]
    self._increment_model_tries(best_model_so_far)
    return best_model_so_far

In this way, we see that epsilon is a parameter controlling how much our bandit explores versus how much it greedily exploits the current best model.

This EGreedy object can now be used as model serving middleware: once a system gets a request for some kind of model output, like recommendations, the select_model method is invoked and the resulting model info is used to request model outputs, which are then returned to the client. If that results in some success client-side, like buying or watching one of the recommended items, the client makes a request which triggers the reward_model method with that model's information. While the e-greedy algorithm is useful for getting up and running quickly on tasks like this, it falls short in a number of ways, two of which will be the focus of our next two solutions. First, e-greedy treats the empirical averages as though they carry equal information: an arm with a mean of .71 from 7 pulls loses to an arm with a steady mean of .8 from 100k pulls with probability 1 − (epsilon * (1 / n_models)) in every round, which is nearly always for most reasonable values of epsilon. Second, it cannot condition the arm selection on the context in which the selection is occurring, i.e. it cannot take into account information about the client or other aspects of the environment which may help it make a better selection.

Solving with Upper Confidence Bounds (UCB)

The Upper Confidence Bound family of algorithms improves upon e-greedy by directly incorporating more information about those R_k distributions we are trying to learn. The family can be summed up as "optimistic": take the arm with the maximum plausible mean after accounting for how often it has been played. Arms with high means and little exposure are, perhaps counter-intuitively, preferred, because either the algorithm is correct and some intended action occurs, or it is wrong and we learn crucial information about something that looked very promising. Contrasting this with e-greedy, we can see why this is a better strategy.

The upper confidence bound algorithms rely upon inequalities that bound our empirical reward distributions. In plain English, the bound lets us select models whose means are likely to be the highest instead of whichever mean is empirically the highest. Back to our example of one arm with a mean of .71 from 7 pulls versus an arm with a steady mean of .8 from 100k pulls: we can be quite sure that the true mean of the arm with 100k pulls is very, very close to .8, but we cannot say the same about the other arm, whose mean could lie in a much wider region because we have so little information about it. The true mean of that arm could reasonably be above the mean of the arm with 100k pulls, and so it should be preferred. These inequalities are how we construct the "reasonable" region within which the true mean could lie. In particular, we will look at Hoeffding's inequality and the Chernoff-Hoeffding bound, which applies to any bounded distribution.

For this example, our X will be a series of 1-subgaussian random variables whose true mean is µ. We observe the following inequality:
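Writing µ̂ for the empirical mean of the first n observations, one standard form of the bound is:

P\left(\mu \geq \hat{\mu} + \sqrt{\frac{2\log(1/\delta)}{n}}\right) \leq \delta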

for all δ in (0, 1)

Our upper confidence bound can then be constructed like so:
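With µ̂_k as the empirical CTR of model k and T_k as the number of times it has been served, at each round we serve the model maximizing:

\hat{\mu}_k + \sqrt{\frac{2\log(1/\delta)}{T_k}}

treating the bound as infinite when T_k = 0.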

So if the model has never been served, we serve it. Otherwise, we construct the confidence width for each model and select whichever has the maximum value when that width is summed with its empirical mean. This algorithm has an important tuning parameter δ, which one can leave fixed or allow to vary.

UCB1 is a simplified version of UCB which removes the δ tuning parameter and instead lets the confidence level depend on the current timestep.
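With t denoting the total number of serves so far across all models, the selection rule becomes:

\hat{\mu}_k + \sqrt{\frac{2\log t}{T_k}}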

So our UCB estimate grows as the global timestep t grows toward infinity and shrinks as we observe more data from a particular arm, that is, as T_k grows. This provides desirable explore-exploit behavior in a principled way.

Let’s now see how we would implement this in Python:
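A minimal sketch in the same style as the EGreedy class above, again simplified for clarity rather than production use:

import numpy as np


class UCB1:
    """UCB1 selection over a fixed set of models."""

    def __init__(self, models):
        self.models = models
        self.model_tries = np.zeros(len(models))
        self.model_successes = np.zeros(len(models))

    def select_model(self):
        untried = np.where(self.model_tries == 0)[0]
        if len(untried) > 0:
            # Any model that has never been served gets served first
            choice = untried[0]
        else:
            t = self.model_tries.sum()  # global timestep
            means = self.model_successes / self.model_tries
            bonus = np.sqrt(2 * np.log(t) / self.model_tries)
            choice = np.argmax(means + bonus)
        self.model_tries[choice] += 1
        return self.models[int(choice)]

    def reward_model(self, model):
        self.model_successes[self.models.index(model)] += 1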

We can see clearly that UCB estimates are an improvement over epsilon-greedy and are guaranteed to be so (see: Bandit Algorithms Chapter 7). However, there are still areas for improvement. Our UCB estimate could be further enriched by taking into account features, usually called context, which may allow us to make more intelligent model selections.

For example, suppose we have a set of recommendation algorithms: some take user history into account, others global trends, and one serves company promotions. We also have context in the form of numerical features describing recent activity, geography, historical interactions with promotions, and subscription information for each user, all of which may help select one recommender over another. Serving these experiences through epsilon-greedy would mean that whichever model had the highest average CTR at the time of serving had a 1 − epsilon chance of being served, and otherwise a random model would be selected, obviously ignoring the important data found in the context.

Solving with LinUCB

To take this useful context into account, we will now look at an algorithm called LinUCB, which builds upon the Upper Confidence Bound algorithm. As in UCB, we have an inequality we will use to construct our "optimistic" estimates of our µ_k values, but this time the inequality is with respect to a linear regression model rather than an empirical mean. To save space, I will omit the rationale, but you can learn more about how linear models fit into this framework in Chapter 19 of the Bandit Algorithms book and, for the particular formulation below, in this paper. The inequality we are interested in takes the form:
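In the notation of the LinUCB paper referenced below, with a indexing the arm, the statement holds with probability at least 1 − δ for a suitable choice of α:

\left| x_{t,a}^\top \hat{\theta}_a - \mathbb{E}\left[ r_{t,a} \mid x_{t,a} \right] \right| \leq \alpha \sqrt{x_{t,a}^\top A_a^{-1} x_{t,a}}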

Expanding on our notation from earlier, we have x_t as the input features to our LinUCB model, a set of arm embeddings used to generate per-arm mean estimates, and α, a tuning parameter we set which determines how much effect the UCB estimate has on our selection. I will omit the arm indices from the following equations since everything is done per-arm. We will be following Algorithm 1 from the paper A Contextual-Bandit Approach to Personalized News Article Recommendation; important to note are the definitions of the arm embeddings (theta hat), the UCB, and how we update the model parameters:
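Dropping the arm index, these are:

\hat{\theta} = A^{-1} b, \qquad A \leftarrow A + x_t x_t^\top, \qquad b \leftarrow b + r_t x_t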

where A is a per-arm covariance matrix and b is a reward vector. We obtain the point-wise mean estimate by taking the dot product between our arm embeddings and our input features x_t:
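In symbols (with r̂_t denoting that estimate):

\hat{r}_t = \hat{\theta}^\top x_t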

we then obtain our estimate of the bound via:
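The confidence width, again dropping the arm index, is:

\alpha \sqrt{x_t^\top A^{-1} x_t}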

So our full LinUCB estimate per-arm is:
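Following the paper's notation and calling this quantity p_t:

p_t = \hat{\theta}^\top x_t + \alpha \sqrt{x_t^\top A^{-1} x_t}

and at each round we serve the model whose arm attains the maximum p_t.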

If this is unclear, perhaps some Python can make it a bit more concrete:
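Here is a minimal sketch of the disjoint version of Algorithm 1, in the same style as the earlier classes; note that the covariance inverse is recomputed on every request here, which is exactly the cost discussed below:

import numpy as np


class LinUCB:
    """Disjoint LinUCB over a fixed set of models (Algorithm 1 of Li et al.)."""

    def __init__(self, models, n_features, alpha=1.0):
        self.models = models
        self.alpha = alpha
        # Per-arm covariance matrix A and reward vector b
        self.A = [np.identity(n_features) for _ in models]
        self.b = [np.zeros(n_features) for _ in models]

    def select_model(self, context):
        x = np.asarray(context, dtype=float)
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)      # recomputed on every request in this sketch
            theta_hat = A_inv @ b         # arm embedding
            ucb = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta_hat @ x + ucb)
        return self.models[int(np.argmax(scores))]

    def reward_model(self, model, context, reward):
        k = self.models.index(model)
        x = np.asarray(context, dtype=float)
        self.A[k] += np.outer(x, x)
        self.b[k] += reward * x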

Complexity-wise, we are linear in the number of models to serve but at worst cubic in the number of features, so we should choose our context wisely! We also see that at each step we calculate the inverse of the covariance matrix. This is a costly operation whose result can be cached: instead of computing the inverse of A on every step, we compute it once every N steps and store the result for use in subsequent requests.

Challenges and Downsides

There are some well-known challenges in the MAB problem space, such as how stationary the reward distributions are. If they are known to be highly non-stationary, it is difficult to model each µ_k appropriately and make the right selection.

With respect to contextual bandits in production systems, we often need to deal with delayed feedback occurring jointly with a changing context. We are therefore required to store the user context for some delay window D, within which the bandit may be rewarded, because the update requires the context as it was at serving time. We store this context in a delay buffer or cache, and the entry is cleared if we reach the end of the delay window without a reward signal. If we do get a reward signal, the value is read from the cache and used to update the model. This can add a significant burden to production systems where the context is large and/or the rate of requests is very high. You can see an excellent explanation of LinUCB and delayed feedback by Claire Vernade from DeepMind here.
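As a sketch of the delay-buffer idea (an in-process version with illustrative names; in practice this state would more likely live in something like Redis with a TTL):

import time


class DelayBuffer:
    """Holds serving-time context until a reward arrives or the window D expires."""

    def __init__(self, delay_window_seconds):
        self.delay_window = delay_window_seconds
        self._entries = {}  # impression_id -> (timestamp, model, context)

    def record_impression(self, impression_id, model, context):
        self._entries[impression_id] = (time.time(), model, context)

    def pop_for_reward(self, impression_id):
        # Returns (model, context) only if the impression is still inside the window
        entry = self._entries.pop(impression_id, None)
        if entry is None:
            return None
        served_at, model, context = entry
        if time.time() - served_at > self.delay_window:
            return None  # window expired; discard the stale context
        return model, context

    def evict_expired(self):
        now = time.time()
        expired = [k for k, (ts, _, _) in self._entries.items()
                   if now - ts > self.delay_window]
        for k in expired:
            del self._entries[k]

On a reward event, pop_for_reward returns the context exactly as it was at serving time, which is what the bandit's update needs.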

Further, most implementations will require you to roll your own bandit. There are very few, if any, holistic packages I have found that would work in production use-cases, which brings me to my next point: how to use these sorts of algorithms in a production setting.

Production Use-Case

The bandits we have seen so far are written with a blog post in mind and are therefore optimized for clarity and not a production use-case. Here I will detail some of the things we have done at Pluralsight to ship bandits into production.

The components of a production system are the following:

  • Caching layer for model state
  • Feature store for client/model context
  • Client-facing API layer
  • Serving layer for models
  • (Optional) Queue for context vectors and impression information within delay buffer D

The caching layer tracks the bandit state, like the tries and success vectors from the epsilon-greedy model or the covariance matrices and reward vectors from LinUCB. Saving this state in an external caching layer, like Redis, allows us to horizontally scale our bandit serving and provides an additional layer of robustness to our system. This way we can have many machines running bandits that fetch their state from Redis and provide updates back in turn.

Providing these updates is tricky, though. Should two or more machines fetch the same state and provide two or more different updates, we could run into race conditions or apply inaccurate updates, thereby losing data or forcing a change to the system which would invariably slow it down. Instead, we have written some custom code for Redis that lets us send commands which perform the vector/matrix updates within Redis itself. This allows us to build a system that scales well, is much more robust to data loss, and is not blocking on writes or prone to race conditions. This Redis instance represents our first system-level building block.
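The custom code itself is not shown here, but one way to get a similar effect with built-in commands is to push the increments into Redis rather than reading, modifying, and writing back. Here is an illustrative sketch of a LinUCB-style b ← b + r·x_t update (the key name is a placeholder, and this is not our exact implementation):

import numpy as np
import redis

r = redis.Redis()


def update_reward_vector(arm_key, reward, x_t):
    # Each component of b is incremented inside Redis itself, so concurrent
    # workers never overwrite one another with stale copies of the vector.
    pipe = r.pipeline(transaction=True)  # MULTI/EXEC: increments apply as one unit
    for i, x_i in enumerate(x_t):
        pipe.hincrbyfloat(arm_key, str(i), float(reward) * float(x_i))
    pipe.execute()


update_reward_vector("bandit:arm:recommender_a:b", 1.0, np.array([0.2, 0.7, 1.0]))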

To get contextual recommendations from the bandit we need to provide it with context, and for that we need some sort of feature store kept up to date with engineered features. We use Postgres for this, but there are many great choices here, including Cassandra and MySQL. This database stores information on user activity, preferences, our content, and more.

When fulfilling requests, the bandit hits both the database for context and the Redis instance for covariance matrices and reward vectors. These reads can happen in parallel, so it is best to choose a framework that can fire off the requests asynchronously. For this client-facing API layer we have chosen Node.js with Express as our API framework, which lets us easily leverage the async model when making these calls, but it is not the only kid on the block. For folks looking for something that can handle the async pattern in Python, look no further than FastAPI, a framework we have been experimenting with more and more at Pluralsight and are seeing some very promising results from.
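To illustrate the pattern in Python terms, here is a sketch of how the two reads could be fired concurrently in an async framework such as FastAPI (the helper names are placeholders):

import asyncio


async def fetch_context(user_id):
    # Placeholder: read engineered features for this user from the feature store
    ...


async def fetch_bandit_state(model_ids):
    # Placeholder: read covariance matrices and reward vectors from Redis
    ...


async def handle_request(user_id, model_ids):
    # Fire both reads concurrently rather than waiting on them one at a time
    context, state = await asyncio.gather(
        fetch_context(user_id),
        fetch_bandit_state(model_ids),
    )
    return context, state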

But, for now, I will describe the system as it stands as of this writing. We deploy these Express endpoints to EC2 instances behind an Application Load Balancer in AWS, which alternates which instance is hit with a request in a round-robin fashion — allowing us to scale up our throughput by scaling horizontally. The bandits are within this API layer where they decide, per-request, which model we will get output from to present to a particular user. A simplified view would be like so:

One final improvement is a separate model serving layer to which the bandit will route requests. This layer is usually wrapped in its own auto-scaling group with a load balancer. I’ve omitted the load balancer to save some space.

The last piece of the puzzle is monitoring. For that, we use Grafana and Postgres: we write metrics around impressions and interactions to Postgres and visualize that data with Grafana. This allows us to detect any issues with the bandit as well as monitor convergence. Having some sort of monitoring is crucial to shipping any custom model to production, bandit or otherwise.

If you found this post interesting and would like to work on challenges like this with us, check out our open positions!
