EXPEDIA GROUP TECHNOLOGY — DATA

Recursive Least Squares for Linear Contextual Bandits

A simple algorithm to run a linear contextual multi-armed bandit on continuous rewards

Fedor Parfenov
Expedia Group Technology

--

Abstract image showing purple and blue three dimensional waveforms on a yellow gradient background

TL;DR — Contextual bandits can be modelled as a linear regression; you can sample arms with Thompson sampling; the recursive least squares algorithm can be used to efficiently update the mean and variance of the linear regression weights.


I recently re-discovered the Recursive Least Squares algorithm, which I had not seen in a while, and was struck by its simplicity and elegance. I also noticed that this approach has all the necessary features to be used in a linear contextual bandit algorithm, so I decided to write this blog laying out how I would apply it and why I think it is a neat approach.

Contextual bandit algorithms

As a quick reminder, Contextual Multi-Armed Bandit algorithms (ConBandits) are an extension of the classic multi-armed bandit algorithms that takes the context at the time of decision into account. They are very often used by e-commerce companies to optimise their websites: the algorithm automatically explores different variants of a page and selects the best one while minimising the opportunity cost of exploring. For more on this, please check my previous entry: Multi-Variate Web Optimisation Using Linear Contextual Bandits

Formally, the objective is to explore a set of arms A given possible context C. A is usually a set with K arms:

\[A = \{a_1,…,a_K\}\] \[K \geq 2\]

C is the context space, whose elements are vectors of J variables:

\[ C = \mathds{R}^{J} \]

In a given context c ∈ C, we define the optimal arm a* ∈ A as the one that maximises the expected value of R, which is a random variable with an arbitrary distribution.

\[a^{*}_c = \argmax\limits_a E(R|A=a,C=c)\] \[R \in \mathds{R}\]

In practice, the algorithm will receive a series of randomly chosen contexts, and the objective is to minimise the overall cumulative regret Δ by following the best possible strategy when selecting arms ã ∈ A and learning from the feedback:

\[\Delta = \sum_t E(R|a^{*}_{c_t}, c_t) - E(R|\tilde{a_t}, c_t)\] \[\tilde{a_t} \text{: Selected Arm at time t}\]

ConBandits as a linear model

The contextual bandit can be driven by a linear model. We first need to map both A and C to the linear feature space 𝔛 with a mapping function m. The feature vector x ∈ 𝔛 is of length L and contains an intercept; the one-hot encoding of the arm a, of length K; the context vector c, of length J; and the first-order interactions between the context and the arms, of length J × K:

\[ m: A, C \rightarrow \mathfrak{X} \] \[ \mathfrak{X} = \mathds{R}^{L} \]

Thus the length L = 1 + K + J + J × K. The first-order interactions are required to enable the model to learn context-specific optimal arms. Interactions between context variables are also a possibility, but this modelling decision is out of scope.
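As a concrete illustration, here is a minimal sketch of one possible mapping function (the function name, the one-hot convention and the feature ordering are my own assumptions for this example):

```python
import numpy as np

def map_features(arm: int, context: np.ndarray, n_arms: int) -> np.ndarray:
    """Map an arm index and a context vector to the linear feature space.

    Returns a vector of length L = 1 + K + J + J*K: an intercept, the one-hot
    encoded arm, the raw context, and the first-order arm/context interactions.
    """
    one_hot = np.zeros(n_arms)
    one_hot[arm] = 1.0
    interactions = np.outer(one_hot, context).ravel()  # J*K interaction terms
    return np.concatenate(([1.0], one_hot, context, interactions))

# Example: K=3 arms and J=2 context variables give L = 1 + 3 + 2 + 6 = 12 features
x = map_features(arm=1, context=np.array([0.5, -1.2]), n_arms=3)
```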

The linear relationship between an observed reward r_t and observed feature vector x_t is defined as:

\[r_t = w^{\intercal}x_t + \epsilon_t\] \[\epsilon_t \sim N(0, \sigma^2)\]

Here w is the vector of weights of length L. Given a training set of size N with a vector of rewards r of dimension N×1 and the feature matrix X of dimension N×L, the ordinary least squares (OLS) estimates of w would be:

\[\hat{w}_{ols} = \Gamma X^{\intercal}r\] \[\Gamma \equiv (X^{\intercal}X)^{-1}\]

We also know that, asymptotically, the OLS estimates will follow a multivariate Gaussian distribution:

\[ \hat{w}_{ols} \sim N(w, \Sigma)\] \[\Sigma \equiv \sigma^2 \Gamma\]

This is a convenient property enabling the use of Thompson sampling, which we describe below. For more details, these derivations can be found in many statistics and machine learning textbooks.
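As a reference point for the recursive version introduced later, a minimal batch computation of these quantities might look like the sketch below (the variable names are mine, and the residual variance is estimated with the usual unbiased formula):

```python
import numpy as np

def fit_ols(X: np.ndarray, r: np.ndarray):
    """Batch OLS fit: returns the weight estimates w_hat, Gamma = (X^T X)^-1,
    and the estimated covariance Sigma = sigma^2 * Gamma."""
    n, l = X.shape
    gamma = np.linalg.inv(X.T @ X)                 # Gamma = (X^T X)^{-1}
    w_hat = gamma @ X.T @ r                        # OLS weight estimates
    residuals = r - X @ w_hat
    sigma2_hat = residuals @ residuals / (n - l)   # residual variance estimate
    return w_hat, gamma, sigma2_hat * gamma        # Sigma = sigma^2 * Gamma
```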

Contextual bandit step-by-step

The algorithm below describes the 4 key steps of the ConBandits. Some of these steps can run asynchronously and often do in production; for clarity, they are presented as sequential steps:

Algorithm 1: the four steps of a bandit. (I) Get a context. (II) Choose an action. (III) Observe a reward. (IV) Update the algorithm.

Step I and Step III are given by the environment:

  • In Step I, we will receive a random context c.
  • We assume that Step III produces a random reward r with an unknown function f.

The next two sections introduce the following implementations:

  • For Step II, we will use a Thompson sampling algorithm to select an arm given the context and our knowledge (parameters).
  • For Step IV, we will show how to apply the recursive least square algorithm to update the parameters with the received feedback.

Selecting an arm with Thompson sampling

Thompson sampling is a well-known approach that can be easily extended to the linear ConBandits[1]. This heuristic uses the distribution of the parameters to balance exploration and exploitation:

  • First, sample a set of random parameters w̃ from the current distribution of the OLS estimates.
  • Second, choose the arm ã that maximises the expected reward given those sampled parameters w̃ and context c.

Algorithm 2 describes the implementation:

Algorithm 2: Thompson sampling. Sample random parameters from the current distribution of the estimates and find the arm that maximises the score given the context.
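A minimal sketch of this selection step could look like the following (the helper name, and the convention of passing a K × L matrix of per-arm feature vectors built with the mapping m, are my own assumptions):

```python
import numpy as np

def thompson_select(w: np.ndarray, sigma: np.ndarray, arm_features: np.ndarray) -> int:
    """Thompson sampling step for the linear ConBandit.

    w, sigma     -- current mean and covariance of the weight estimates
    arm_features -- K x L matrix whose k-th row is m(a_k, c) for the current context c
    """
    w_sampled = np.random.multivariate_normal(w, sigma)  # sample w~ from N(w, Sigma)
    scores = arm_features @ w_sampled                    # predicted reward for each arm
    return int(np.argmax(scores))                        # arm with the highest score
```

The arm_features matrix would simply stack the mapped feature vectors of every candidate arm for the observed context.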

Updating the parameters with recursive least squares

When it comes to updating the parameters, one obvious way is to periodically fit a linear regression with all the data accumulated so far. This, however, does not scale well:

  • Memory-wise: all the data X needs to be kept in memory to recompute Γ and σ̂².
  • Computation-wise: inverting XᵀX (the sum-of-squares matrix) is an expensive O(L³) operation.

We need an online way to update the algorithm. There are two notable types of approaches: on the one hand, some authors have approached this with a form of Bayesian updating [2, 3]; on the other, some have used linear algebra methods, as in [4], where block matrix inversion properties are used to update only the relevant parts of the model. (Please check out [7] for a review.)

Closer to the second type of approach, here is an application of the recursive least squares algorithm [5, 6]. It yields a result equivalent to computing the OLS estimates, but with an iterative approach: for each new piece of feedback with reward r and feature vector x, we update the parameters w and Γ using two operations:

\[\Gamma_{t} = \Gamma_{t-1} - \frac{\Gamma_{t-1} x_t x_t^\intercal \Gamma_{t-1}}{1 + x_t^\intercal \Gamma_{t-1} x_t}\] \[w_{t} = w_{t-1} - \Gamma_{t} x_t (x_t^\intercal w_{t-1} - r_t)\]

One key feature to mention is the direct use of Γ, the inverse of the sum-of-squares matrix. This is very useful: not having to invert XᵀX reduces the update complexity from O(L³) to O(L²) compared with periodically recomputing the OLS estimates. Conveniently, Γ, which we need to keep in memory anyway, is also a parameter of the multivariate Gaussian distribution of the weights. Usually, w is initialised as a vector of 0 and Γ as an identity matrix, but it is possible to multiply the identity matrix by a constant λ, which is equivalent to an l2 regularisation.

The last missing part is the variance of the error term, σ̂². The usual approach is to estimate it by calculating the variance of the residuals. However, this would require keeping all the data in memory and recomputing the residuals after each update. To circumvent this, I propose to use a simple exponential smoothing on the squared residuals:

\[ \hat{\sigma}_{t}^2 = \alpha\hat{\sigma}_{t-1}^2 + (1-\alpha) \hat{\epsilon}_t^2 \] \[\hat{\epsilon}_t \equiv (x_t^\intercal w_{t-1} - r_t)\]

This offers a nice compromise, giving more weight to newer observations and progressively ‘forgetting’ the initial ones. One might instead propose Welford’s online algorithm [8], which computes the variance online. The problem with that approach is that it assumes the residuals of previously observed data points remain static over time; in reality the model keeps improving as it ingests feedback, so if we were to recompute all the residuals they would, on average, shrink.

Algorithm 3 summarises the implementation:

Algorithm 3: recursive least squares update. Update the inverse sum-of-squares matrix Γ, the weight means w, and the smoothed residual variance σ̂².
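Putting the pieces together, a minimal sketch of this update step might look like the class below (the class name, the default values of λ and α, and the starting value of σ̂² are my own assumptions; the body follows the update equations above):

```python
import numpy as np

class RLSBandit:
    """Recursive least squares update of the linear ConBandit parameters."""

    def __init__(self, n_features: int, lam: float = 1.0, alpha: float = 0.99):
        self.w = np.zeros(n_features)                # weight means, initialised at 0
        self.gamma = lam * np.identity(n_features)   # Gamma_0 = lambda * I (l2-style regularisation)
        self.sigma2 = 1.0                            # smoothed residual variance (arbitrary start)
        self.alpha = alpha                           # exponential smoothing factor

    def update(self, x: np.ndarray, r: float) -> None:
        # Rank-one update of Gamma, avoiding any matrix inversion: O(L^2)
        gx = self.gamma @ x
        self.gamma -= np.outer(gx, gx) / (1.0 + x @ gx)
        # Residual against the previous weights, then the weight update
        eps = self.w @ x - r
        self.w -= self.gamma @ x * eps
        # Exponential smoothing of the squared residual
        self.sigma2 = self.alpha * self.sigma2 + (1.0 - self.alpha) * eps ** 2

    @property
    def cov(self) -> np.ndarray:
        """Covariance of the weights used by Thompson sampling: sigma^2 * Gamma."""
        return self.sigma2 * self.gamma
```

Together with the earlier thompson_select sketch, cov would provide the Σ used when sampling the weights in Step II.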

Conclusion

I presented here the application of recursive least squares to linear contextual bandits. In my opinion, this is a simple-to-implement and scalable algorithm. Its main advantage is the direct use of the inverse of the sum-of-squares matrix, which makes it computationally efficient. It also enables us to explore interesting future use cases:

  • We can use this model for multi-objective optimisation with non-Bernoulli rewards (e.g. unit-interval or monetary values).
  • We will be able to solve very complex tasks with many arms and interactions while reducing the complexity by updating only relevant regions of the feature space when we receive feedback. In the same spirit as [4], we can have an ever-changing model in which we can add and remove arms.

I hope this blog entry was useful and can serve as a cookbook recipe. Stay tuned for future extensions, simulations and use cases!


References

[1] Agrawal, S. and Goyal, N., 2013, February. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127–135).

[2] Graepel, T., Candela, J.Q., Borchert, T. and Herbrich, R., 2010. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Omnipress.

[3] Chapelle, O., Manavoglu, E. and Rosales, R., 2014. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4), pp.1–34.

[4] Li, L., Chu, W., Langford, J. and Schapire, R.E., 2010, April. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (pp. 661–670).

[5] Kushner, H.J. and Yin, G.G., 2003. Stochastic Approximation and Recursive Algorithms and Applications (2nd ed.). New York: Springer, pp. 8–12.

[6] Wikipedia: Online Machine Learning, section: Recursive Least Squares. (I used the notation from this article.)

[7] Zhou, L., 2015. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326.

[8] Wikipedia: Welford’s Online Algorithm

Image source

https://pixabay.com/photos/wave-signal-communication-850007/

Learn more about technology at Expedia Group
