How To Build A Better Bandit

Tim Schott
Red Ventures Data Science & Engineering
16 min readAug 15, 2022


Improving Contextual Bandits with Rich Interaction Features

san francisco steep street photograph
“We are our choices.” — Jean-Paul Sartre (Image Author’s).

Today’s internet is the culmination of an interconnected constellation of decisions made in the milliseconds between a user clicking on a link and the correct information rendering in their browser.

At Red Ventures, we’re no stranger to grappling with the best way to make these decisions. The content of the 100+ sites we manage reaches more than 750 million unique users across the world every month. This means efficient and accurate decision making is high priority work.

Naturally, our data science team devotes a lot of time to decision optimization, including a fascinating application of multi-armed bandits for scaling out a decision making system. We’ve posted before about the benefits of using bandits for optimizing decisions. At Red Ventures, one popular use case is determining the “best” asset to show a given user.

To approach this problem, our team created a bandit-learning pipeline that allows product owners to easily optimize a desired metric of interest such as average click-through rate (CTR) and start a campaign. Like any data science offering, though, constant iteration is a must– and our decision optimization suite is no exception.

As a Data Science Intern on the RV platform team, I spent the summer exploring how to improve our bandits in order to make even better decisions.

Bandit Learning Overview

To kick things off, let me be honest. If we rolled back the calendar a few months to the start of my internship, like most people reading this article, that introduction would have left me saying, “That all sounds … interesting. But what’s a bandit? Shouldn’t we just be A/B testing?”

A/B testing has its perks, which is why it’s used to aplomb internally at RV and across the technology landscape. However, A/B style tests shouldn’t be seen as a panacea for recommender problems.

Bandits simply adopt a different approach. These distinctions often improve outcomes for everyone involved — in RV’s case: the product owner, the data scientist, the end user, etc. Now, let’s explore this mysterious bandit concept!

Imagine you’re inside a casino — perhaps one floating atop Lake of the Ozarks. Pit boss Ruth Langmore walks you through The Missouri Belle’s opulent game-room filled with slot machines.

Image courtesy of

Each machine will either reward you with 0 dollars or 1 dollar. Assume, though, that the rewards are drawn from some unknown, random distribution, so different machines different machines reward at different rates. Your goal is to maximize the sum of all rewards you receive — your cumulative reward — in spite of Ruth’s distracting stories about the time she “borrowed” a speed boat.

Along the way, it would be great to minimize the regret incurred by juggling the competing tasks of exploring all the different machines in the casino and exploiting the immediate situation to gather the highest possible reward at the current moment in time.

Note: No need to worry if those terms are unfamiliar. We’ll explain them more formally later in the piece.

How should you proceed? If you were smart, you’d head to Chapter 2 of Sutton and Barto’s Reinforcement Learning for a crash course on the multi-armed bandit problem — because you’re living it!

The term bandit refers to the colloquial name for those slot machines “one-armed bandits” who will steal your money if you aren’t careful.

When it comes to using bandits at RV, we work on a problem known as the k-armed bandit problem, where k refers to the number of potential options for assets that we could show to users. Once we decide which asset to show them, we observe a reward (e.g. whether or not they click on the asset). Here’s, formally, what that looks like.

the k-armed bandit problem
The k-armed bandit problem

Line 2 sets up a series of trials. To follow our riverboat example, this is the number of times you try your luck at slots.

Line 3 is where an arm selection occurs. This corresponds to the choice of slot machine.

Line 4 brings it all together: you observe the reward from your action.

There are a few different strategies for solving this problem. At the core of each solution algorithm (Epsilon-Greedy, Thompson Sampling, Upper Confidence Bound, etc.), the arm-selection policy is updated as more reward information becomes available. These methods differ in how they approach the competing tasks of exploration (looking holistically each different arm) and exploitation (focusing on the “best-performing” arm at a given timestep). Ultimately, the solution you utilize typically depends on the amount of information you have access to and the business needs framing your larger problem setup.

Note: A detailed explanation of these solution methods is outside the bounds of this piece. Joe’s bandit piece, that I linked to earlier, has a nice walkthrough for Thompson-Sampling, while external resources abound related to the Epsilon-Greedy and Upper Confidence Bound methods.

The big takeaway, as far as this guide is concerned, is that RV can confidently leverage these strategies to help us decide which assets to surface in a high-traffic environment like RV-operated Bankrate.

Bandits Beat A/B Testing

When deciding which assets to serve at RV, the benefits of using bandits compared to running A/B tests are numerous.

“Multi-Armed Bandit algorithms will always provide a more profitable alternative to A/B testing. The specific choice of algorithm is dependent on whether the user wants to prioritize profit or data collection, and the estimated size and duration of the experiment.” — Elaine Zhang, Towards Data Science

Here are a few more reasons we rely on them across RV.

Efficient Data Usage

If you divided up the casino into k slot machines, the A/B approach would see you evenly divide your initial trials between k machines and maintain this equal selection strategy throughout your visit. Because each machine pays out differently, you’re essentially guaranteeing that a subset of your money will be used inefficiently. With the scale of campaigns run at RV, the last thing we want to do is consign ourselves to unknowingly surfacing the wrong content to the wrong users.

Contextualization Opportunity

Efficient data usage sounds great! But looking at your dwindling cup of tokens, you’d probably like some more concrete advice.

You suddenly remember that your cousin interned with a slot machine manufacturer. And you recall that he got in trouble for posting on LinkedIn about how data integrity issues led to machines with serial numbers ending in 002 unfairly rewarding older players with higher payouts than identical trials from younger players.

Ideally, we would consider traits about who is pulling the arm and information about each machine so we could take advantage of our cousin’s tip. This type of scenario is where contextual bandits emerge.

With a contextual bandit, we have the opportunity to further reduce regret by incorporating information about each asset as well as traits about the user into our decision making process. That nuance is important. Utilizing interactions between action and user features during our selection step sets the contextual version of the multi-armed bandit apart.

To continue, I want to focus on contextual bandits, because incorporating high-quality features into a contextual bandit was the objective of my summer internship.

Context is Key

With contextual bandits, the problem statement changes. Let’s take a look at what shifts.

the contextual k-armed bandit problem
The contextual k-armed bandit problem

Line 2 doesn’t change. We’re still looking around the casino trying to figure out which bandit should take our money.

Line 3 is where things get interesting. With this new contextual setup, we can integrate that tip from your cousin into our decision making process. We can encode the color of each machine as a feature and over time learn whether machine color has an effect on observed payout. This step encapsulates the work of creating features to capture the context of the decision.

Line 4 doesn’t change, either because you ultimately still only pick one machine.

Line 5 is familiar as well, with the twist being our reward is now dependent on the feature vector we create from our observation of a given user and group of actions.

Incorporating a machine’s color into our decision making process is a good first step towards a more comprehensive exploration strategy. But can we do even better?

To start, we can disembark the Belle. Ruth’s stories, and our checking account balance, are getting a bit ragged.

Opportunities to Improve Contextual Bandits at RV

Here at RV, we use contextual bandits in order to serve the most relevant assets to the right users. Typically, assets are digital pieces of content like images, videos, forms that, when clicked, generate revenue.

Effectively connecting an asset to the users that will see it and the pages it will surface within leads to better outcomes for everyone involved.Creating high quality features describing assets, then, is an important effort!

That color feature mentioned earlier actually translates well to our setting, because one simple way to encode a characteristic about an asset is passing through its color. Maybe we have a varying number of words in an asset call-to-action (“CTA”) field; we could propagate that data point as well. If you can provide distinct, credible information about assets, the more easily the bandit will be able to make decisions in an optimal fashion.

Additionally, we have an opportunity to incorporate information about the user who is visiting a page into our context vector. A range of features can be created when user information is on the table, with one simple example being an “isMobile” flag denoting whether or not a user is visiting a page on a mobile browser.

When starting an optimization campaign at RV, it’s simple to inject these user features into the context vector that the bandit considers when making selections.

One area for opportunity is passing through information connecting the page a user is viewing to other relevant traits about their information and asset information. This would help us make better decisions!

That’s where my work comes in. Explicitly encoding the interaction between page information and asset information is a generalizable strategy for equipping a bandit with the tools to quickly make excellent decisions.

I want to quickly clarify why, in a theoretical sense, this sort of feature will add value to the decision making process. Then I’ll run down implementation details.

Contextual Bandit Feature Types

The features you can incorporate into a contextual bandit’s context vector fall into a few different camps. On the one hand, you have context features, which capture the environmental data points relevant to the decision making process. Information about the user and page being decisioned on are typically understood as context features. A simple context feature is an “isMobile” flag. Another context feature could be “content type” which captures some taxonomic indicator of a page’s contents.

Another group is action features. These features encode information about the potential actions that could be selected by the bandit. They are often metadata-like, such as the color of an asset or a boolean flag denoting whether the asset features a video.

These two categories of features, especially in isolation, won’t tell the whole story. For instance, a user reading about sports might prefer blue assets while a user reading about finance might prefer red assets. This valuable piece of information ought to be accessible by the bandit!

To make sure these conclusions are accessible, a bandit should use a classifier like a Random Forest that can learn interaction features — information that results from the collision of these different columns. These interaction features are vital for making accurate decisions.

While interaction features can be imputed by a model over the course of an optimization campaign, that process is contingent on unpredictable user behavior. And it takes time, too. You need enough sessions to have taken place in order for the model to uncover these interaction nuances.

Providing these features upfront to the decision maker ensures that higher quality information is available at the outset of a test. That’s where my project came in. Explicitly defining interaction features capturing the relationship between action and context information would improve optimization campaigns at RV.

Next, I’ll explain how I created interaction features to make this goal a reality.

Creating Even Better Features

I created a group of interaction features that connect an asset to page data. Adding these features to the context vector provides high quality information about how these asset-page combinations interact with one another.

For example, I worked with assets that possessed fairly standard data points: a headline, a subheadline, CTA text, etc. I wanted to connect how compatible these assets were to the content of each page that gets decisioned on.

Normally, with traditional action and context features, there is almost no latency required to make the context vectors. You can cache the data for each asset, pull in user information at page load for features like “isMobile” and then assemble context vectors.

The features I created function differently. At each timestep, I need information about every possible title-asset pairing. This results in dynamic features that need to be generated each time the bandit makes a decision. Thus, if there are 10 assets to choose from, you need to produce that dynamic feature 10 times per time-step.

I used word embeddings to convert the text of an asset headline and a page title into vector space then calculated their cosine similarity. This metric captures the distance between two vectors, and it gets higher the more they’re alike with a floor of 0 and a maximum value of 1. This approach would connect an asset focusing on business schools to a page that highlights the best MBA programs.

Besides embeddings, I also implemented an easy-to-interpret feature denoting keyword overlap that links the tag of an asset (“What’s it about?” “What’s its appeal?”) to word in a page title. This provides another way to capture the similarity between an asset and a page and is, helpfully, both quite simple to implement and relatively fast to initiate in a production setting. This approach, for example, would provide an at-a-glance connection between an asset that’s focused on hiking boots and a page about the best NC hiking trails.

With these new features in hand, the natural next step is to evaluate whether or not their presence increases the performance of our optimization work.

However, the RV data science team was not about to hand me the keys to the car and drop these features into an online, productionized optimization campaign. Thankfully, bandits also offer an interesting way to test out under-the-hood changes in a low-stakes setting. You can run analysis offline and produce a reward estimate. For my project, this allowed me to estimate the CTR gains from my new feature strategy.

This methodology assumes that you have access to offline (i.e. historical) data from an existing bandit campaign. These logs can be used as a baseline dataset to estimate the gains from using new features in offline analysis.

It’s great for RV since we have droves of data on past campaigns!

Using Features: Model Building and Policy Making

Another (reasonable) moment of confusion this summer occurred when my team instructed me to explore “policy making” with respect to bandit learning. Policy making? Isn’t this a data science internship? In the wonderful world of bandits, this phrase takes on an interesting meaning.

While we can employ a traditional classifier model to infer what features of a single asset contribute to rewards, we have to apply these findings in an intelligent fashion. A policy dictates which assets we show out of all of our potential options. It applies what we know about our data in an intelligent fashion. Ideally, a policy should be able to make personalized decisions given different user behavior. A policy can be built out in all different sorts of ways — this relates back to those solution approaches like Epsilon-Greedy I discussed earlier.

With feature generation code in hand, it is time to incorporate them into a decisioning system with the goal of constructing an effective new policy. To start, you need to produce a traditional classifier that utilizes some combination of context features and makes an inference about what data points contribute to a reward. For those familiar with reinforcement learning, this model serves as an estimate for the Q-function. Classifier options, of course, abound. I observed effective results using a straightforward LightGBM implementation

The objective is not to build the world’s greatest classifier on the existing historical data. Rather, use that classifier in an interesting way on a semi-synthetic version of that data, via the explode operation.

pseudocode for explode operation
psuedocode for exploding historical data and creating new policy predictions

We look at every possible combination assets for a given user/page and predict which pairing is the most likely to result in a reward. To keep things simple, my candidate policy selected from these combinations in a fully greedy fashion.

The result of this process is a bank of new predictions from the candidate policy. (To be specific, you’ll have a dataframe containing which assets the new policy predicted and the corresponding probability predictions.)

Importantly, this type of work is counterfactual. That is, after constructing a new policy, the predictions entail what would have occurred if this new learning mechanism was deployed in production. At each context step, our new policy might end up selecting the same asset that was served up historically — or not. In effect, we’ve rewritten history. The next step is analyzing this new timeline.

Offline-Policy Evaluation

Offline-policy evaluation allowed me to create a well-constructed estimate of the expected reward had my new features been used in production. This represented an important step for my project, because I needed an easy-to-digest metric for predicted success.

There are a few different methods for performing offline-policy evaluation. They’re luckily much simpler than the jargony title might let on. The primary source I drew from was a 2011 paper from Dudik et al. This work delves into two differing policy evaluation processes and then synthesizes a novel approach.

All the routines explored by the authors share a common goal: producing an estimate of the expected reward if the candidate policy were used in production. Let’s take a look at each approach.

Note: notation between my explanations and the paper is not always one-to-one, doing my best here on Medium!

Direct Method

The Direct Method is quite intuitive.

Direct Method (Dudik et al. 2011)
Direct Method (Dudik et al. 2011)
  • 𝝔 is the predicted probability of the asset the candidate model chose achieving a reward at each time step.
  • 𝛑 represents our new candidate policy.
  • S is our historically logged examples from the behavioral policy.
  • 𝔁 is the context vector.

If you take a step back, you’ll realize that this equation returns the new policy’s average reward. If using this quantity as an estimator for future performance sounds too good to be true, that’s because it sort of is.

The Direct Method presupposes an accurate approximation of the unknown optimal policy, 𝛑*. This strategy produces a biased estimate of the reward for the target policy. In practice, using this method is challenging because it requires creating an offline model that coherently comprehends the subtleties of real-world user interactions with an appropriate level of bias.

Inverse Propensity Scoring (IPS)

The next method explored is Inverse Propensity Scoring (IPS). This method recalibrates how often an asset that was shown by the historical (aka behavioral) policy should have been served. This correction only occurs when a candidate policy selects the same asset as the historical policy given the same context vector. Taking a look at the formula, there are a few changes compared to the Direct Method.

inverse propensity scoring
Inverse Propensity Scoring (from Dudik et al. 2011)
  • I(𝛑(𝔁) = a) is an indicator function that evaluates to 1 when the asset selections match, otherwise 0. This, so to speak, controls the calibration “kicking in.”
  • r is the observed reward from the historical data.
  • p(a|𝔁, 𝓱) is an estimate of the probability that the behavioral policy chose asset a given the context x. Depending on how the historical bandit you’re inspecting functioned, you should be able to recover these exact values. For my project, the historical data I compared results against selected assets fully at random, so this value was easy to calculate — it was simply the inverse of however many assets were available to show at any decision.

IPS is easy to implement. A simple lambda function running over your dataframe can work out the CTR estimation. If your historical data contains accurate probabilities, the estimate should be unbiased. Plus, since IPS does not use the new policy’s probability predictions, issues of bias don’t emerge. However, IPS suffers from variance issues, especially if the denominator is quite small over the course of your data.

Doubly Robust

To address the shortcomings of both of these methods, the authors propose a combination of the two known as the Doubly Robust estimator. The thrust of this approach lies in the fact that the Direct Method and IPS can produce an unbiased reward estimate, given the correct conditions. So, the Doubly Robust approach merges the implementations. This results in an estimator that is unbiased if at least one of the two estimator terms, 𝝔 and p, are accurate.

doubly robust estimator (dudik et al. 2011)
The Doubly Robust estimator for offline policy evaluation (Dudik et al. 2011)

Looking at the formula you’ll note the familiar indicator function, the historical probabilities, etc. For each counterfactual sample, we’ll inject the prediction from the new policy, and then counterbalance those with a residual of sorts if the new policy selects the same asset that was historically served. In short, Doubly Robust rearranges the pieces of the Direct Method and IPS to produce a more reliable estimation of counterfactual reward.

Ultimately, I utilized the results from IPS and Doubly Robust when collating my projected performance projections for my new policy. These estimations provided an excellent point of comparison to convey the success potential of using rich interaction features in a production setting.

Project Results

The list of things I’ll be taking away from my internship with RV is almost too big to count! To start, I really enjoyed learning about the specifics of policy-making and offline policy evaluation. This is an active area of research, with numerous open questions:

What is the difference between an effective model and an effective policy? How do considerations of variance affect the utility of OPE estimates? Can offline methods be reliably employed to determine feature importance?

Additionally, there is growing demand across RV for incorporating these new, rich features I created into future campaigns. The more connections we can hand off the bandit, the better! As such, I got a jump start on productionizing my implementation, which provided excellent exposure to the engineering dimension of a data science workflow.

Lastly, I observed a 40% CTR lift comparing estimations from offline-policy evaluation to historical performance. Creating business value was an exciting final take away from my internship!

I hope this piece will encourage readers to continue learning about bandits and provide a resource for people who’d like to utilize bandits in creative ways. Drop a comment if you have follow-up questions or want to quibble over notation — I’m sure I missed a parenthesis somewhere!