CTR Optimization via Thompson Sampling

swetha kasireddy
Walmart Global Tech Blog
6 min read · Jan 14, 2020
Photo credit: khfalk

Walmart Labs is a data-driven company. Many business and product decisions are based on insights derived from data analysis. I worked on Expo, which is the A/B testing platform for Walmart. As part of the platform, we developed a feature to optimize content-based experiments on Walmart.com based on CTR. CTR (click-through rate) is the percentage of impressions that result in a click. The objective of CTR optimization is to dynamically allocate more traffic to variations that have a higher CTR, while allocating less traffic to variations that have a lower CTR. This feature was developed based on the multi-armed bandit (MAB) content testing paradigm.

Multi-Armed Bandits

By analogy, a “multi-armed bandit” is a machine that has multiple arms to pull, each with a different expected payoff. The player needs to figure out which arm gives the highest payout. The phase of pulling different arms in order to identify the one with the highest expected payoff is termed “exploration”. The phase of pulling the arm that is most profitable so far, in hopes of maximizing the total return, is termed “exploitation”.

The goal of the “multi-armed bandit” problem is to construct a strategy that strikes a balance between exploration and exploitation so as to maximize a particular criterion. The multi-armed bandit alternates between exploration and exploitation adaptively, based on the arms’ ongoing performance. This paradigm fits website optimization well, where we want to dynamically shift traffic to the winning variation based on a particular criterion such as CTR.

The decision to explore or exploit depends on the bandit algorithm being used; in our case, that algorithm is Thompson Sampling.

Some advantages of MAB over A/B testing:

For short-term campaigns/applications, the cost of waiting for A/B test results makes MAB a better choice, since large gains can be made by exploiting as early as possible.

MAB can also be used to automate the testing process and avoid repeated intervention by analysts to run successive A/B tests.

MAB can be used to capture dynamic changes when the item being tested changes significantly over time, invalidating the results of an A/B test. In this case, MAB provides an optimal strategy by continuously exploring.

Goal of the Algorithm

The goal of the bandit algorithm is to dynamically allocate traffic to variations that are performing well, while allocating less traffic to variations that are underperforming. To begin with, it should display all variations to a random selection of users and measure which variations are clicked on more frequently. Over time, it uses these observations to infer which variation has the higher CTR. Then, once the estimate of the CTR becomes more precise, it gradually moves traffic toward the winning variation.

Optimization Workflow

We provide a UI for the optimization experiment setup. The three key configurations exposed as part of the setup are the following:
1. Optimization wait period (allows the optimization metric to “warm up”, i.e. gather a reasonable number of hits, so we don’t reallocate traffic based on too little data; default is 24 hours)
2. Optimization interval (the frequency of optimization; for example, the home page can be optimized often, say every hour, while a category page with lower views can be optimized every few hours; default is 1 hour)
3. Minimum traffic per variation (prevents the optimization engine from making a too-hasty traffic adjustment in favor of one variation and ensures the other variations keep being explored; default is 10%)
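For concreteness, a minimal sketch of what such a configuration might look like; the field names here are illustrative and not the actual Expo configuration schema:

```python
# Illustrative only -- these field names are not the actual Expo configuration schema.
optimization_config = {
    "optimization_wait_period_hours": 24,   # warm-up before any traffic reallocation
    "optimization_interval_hours": 1,       # how often the bandit recomputes the split
    "min_traffic_per_variation": 0.10,      # floor so every variation keeps being explored
}
```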

Initial traffic is allocated in equal percentages across all variations of the experiment. Once the experiment starts, visitors are assigned to each variation according to its allocated percentage. A real-time Spark Structured Streaming job collects clicks and impressions per minute for each variation and stores them in a time-series database. The clicks and impressions are provided every hour to the bandit model to generate the new traffic percentages. Traffic adjustments themselves begin only after the optimization wait period has elapsed and then occur at the frequency of the optimization interval.
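The minute-level aggregation could look roughly like the following Spark Structured Streaming sketch; the Kafka topic, event schema, and console sink are assumptions for illustration, not the actual production job:

```python
# A minimal sketch (not the production pipeline): aggregate clicks and impressions
# per variation per minute with Spark Structured Streaming. Topic name, JSON schema,
# and the console sink are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ctr-minute-counts").getOrCreate()

schema = StructType([
    StructField("variation_id", StringType()),
    StructField("event_type", StringType()),    # "impression" or "click" (assumed)
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "expo-beacons")      # hypothetical topic name
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Count clicks and impressions per variation in one-minute windows.
counts = (events
          .withWatermark("event_time", "5 minutes")
          .groupBy(F.window("event_time", "1 minute"), "variation_id")
          .agg(F.sum(F.when(F.col("event_type") == "click", 1).otherwise(0)).alias("clicks"),
               F.sum(F.when(F.col("event_type") == "impression", 1).otherwise(0)).alias("impressions")))

# In practice these counts would be written to a time-series store; console sink for the sketch.
counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```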

Explore/Exploit using Thompson Sampling

The basic idea of Thompson Sampling in a multi-armed bandit scenario is that the algorithm starts with a prior belief about the expected reward of each arm. During a trial, a random sample is drawn from each arm's prior, and the arm with the highest sample is chosen. After feedback is received, the prior belief is updated to a posterior belief. Since we use conjugate priors, the prior and posterior take the same form, so we can continue the sampling process treating the posterior as the new prior. The algorithm ensures that the arm with the larger expected reward is exploited more, while the underperforming arms are still explored intermittently.
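As a minimal sketch of this loop, a single Thompson Sampling round for Bernoulli rewards with Beta priors might look like the following; the arm names and weak Beta(1, 1) priors are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Beta(alpha, beta) prior/posterior parameters per arm; a weak uniform prior to start.
posteriors = {"arm_a": [1, 1], "arm_b": [1, 1]}

def choose_arm():
    # Draw one sample from each arm's posterior and play the arm with the highest draw.
    samples = {arm: rng.beta(a, b) for arm, (a, b) in posteriors.items()}
    return max(samples, key=samples.get)

def update(arm, clicked):
    # Conjugacy: the Beta posterior is updated by simply incrementing its parameters.
    if clicked:
        posteriors[arm][0] += 1   # one more positive observation (click)
    else:
        posteriors[arm][1] += 1   # one more negative observation (impression without a click)
```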

The following section describes the explore/exploit approach we took and how we used Thompson Sampling to do dynamic traffic adjustments of the Optimization experiments.

We keep track of positive and negative observations. Positive observations are clicks; negative observations are impressions minus clicks.
The beta distribution is used as the prior. It is a conjugate prior, which means the posterior distribution belongs to the same family as the prior.
The original priors are chosen as Beta(20, 980). Based on the roughly 2% CTR seen historically, we use 20 positive and 980 negative observations over 1,000 impressions as the prior. Before 1,000 impressions we don’t want to make any movement either way. As the amount of data grows, the effect of the prior becomes negligible.

Each time the algorithm receives feedback, we update the positive and negative observation counts. Whenever we need to access the posterior distribution, we compute the parameters of the beta distribution from these counts and use them to draw a random sample from it.
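In code, the posterior parameters can be computed on the fly from the accumulated counts; a small sketch under the Beta(20, 980) prior described above:

```python
import numpy as np

rng = np.random.default_rng()

PRIOR_POSITIVE, PRIOR_NEGATIVE = 20, 980   # Beta(20, 980): ~2% CTR over 1,000 impressions

def sample_ctr(clicks, impressions):
    """Draw one CTR sample from a variation's Beta posterior."""
    positive = clicks                   # positive observations
    negative = impressions - clicks     # negative observations
    return rng.beta(PRIOR_POSITIVE + positive, PRIOR_NEGATIVE + negative)
```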

We also apply a decay factor: everything up to 7 days ago gets weight 1, the week before gets weight 𝛌, the week before that gets weight 𝛌², and so on. Currently, 𝛌 is set to 0.8.
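A sketch of how such a weekly decay could be applied to the observation counts before computing the posterior; bucketing the counts by week is an assumption about the bookkeeping, not the exact implementation:

```python
LAMBDA = 0.8

def decayed_counts(weekly_counts):
    """weekly_counts: list of (clicks, impressions) tuples, most recent week first.
    The most recent week gets weight 1, the previous week LAMBDA, then LAMBDA**2, ..."""
    clicks = sum((LAMBDA ** age) * c for age, (c, _) in enumerate(weekly_counts))
    impressions = sum((LAMBDA ** age) * i for age, (_, i) in enumerate(weekly_counts))
    return clicks, impressions
```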

Around 200K samples are drawn from the beta posteriors, and we construct an empirical distribution by counting the proportion of times each variation was chosen (i.e. had the highest sample). The traffic split is based on the winning percentage of each variation. For example, if variation 1 wins 30% of the time and variation 2 wins 70% of the time, the traffic split is 30% for variation 1 and 70% for variation 2. In the steady state we expect the winning variation to receive 100% of the traffic.

The following is sample code showing how Thompson Sampling can be used to make dynamic traffic adjustments based on observed clicks and impressions.
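A minimal sketch of such a routine is shown below, assuming the Beta(20, 980) prior and the Monte Carlo winner-counting scheme described above; the function name, the dict inputs, and the way the minimum-traffic floor is applied are illustrative assumptions rather than the production interface:

```python
# Sketch: draw many samples from each variation's Beta posterior, count how often each
# variation wins, and use those win rates as the new traffic split (with a minimum floor).
import numpy as np

PRIOR_POSITIVE, PRIOR_NEGATIVE = 20, 980
NUM_SAMPLES = 200_000
MIN_TRAFFIC = 0.10

def new_traffic_split(counts, rng=None):
    """counts: dict mapping variation id -> (clicks, impressions), already decay-weighted."""
    rng = rng or np.random.default_rng()
    variations = list(counts)

    # Draw NUM_SAMPLES CTR samples per variation from its Beta posterior.
    samples = np.column_stack([
        rng.beta(PRIOR_POSITIVE + clicks,
                 PRIOR_NEGATIVE + (impressions - clicks),
                 size=NUM_SAMPLES)
        for clicks, impressions in (counts[v] for v in variations)
    ])

    # Empirical winning probability: proportion of draws in which each variation is highest.
    wins = np.bincount(samples.argmax(axis=1), minlength=len(variations))
    split = wins / NUM_SAMPLES

    # A simple way to apply the per-variation floor; the exact production rule may differ.
    split = np.maximum(split, MIN_TRAFFIC)
    return dict(zip(variations, split / split.sum()))

# Example: variation v2 has a higher observed CTR, so it receives most of the traffic.
print(new_traffic_split({"v1": (180, 10_000), "v2": (260, 10_000)}))
```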

Conclusion

We have done a few pilot runs and observed a favorable lift in CTR. We also observed that short-term campaigns, where large gains can be made by exploiting as early as possible and shortening the exploration phase, are good candidates for MAB.

As a next step, we are looking to enhance the system to support optimization on other metrics such as add-to-cart rate, conversion rate, and revenue, as well as minimizing bounce rate. We are also looking to extend the feature beyond the content-based treatment type. When there are multiple components on a page to optimize continuously, we are looking to use the MAB approach as a framework to automate the optimization process.
