Explore-then-commit
This is the first in a series of posts providing a pedagogical exposition of algorithms arising in the context of multi-armed bandits. The presentation is heavily influenced by the Bandit Algorithms book and the Online Learning course. The aim, besides clarifying my own understanding, is to help identify some questions that go beyond the obvious.
Before getting into the specifics of explore-then-commit (ETC), a few points on the general setting where the algorithm is applied:
- Multi-armed bandits (MABs) are a useful framework for studying the interaction between an agent and an environment, where the former chooses an action from a set of available ones and the latter assigns a reward for that choice. As a simple example, a user (environment) is presented with some recommendations (action) on the home screen of Netflix (agent), which results in a certain level of user engagement (reward to Netflix, in terms of subscription renewal).
- MABs belong to the general class of Markov decision processes (MDPs) but differ from the full reinforcement learning (RL) setting in that the actions taken by the agent do not cause the environment to change the set of available actions in the future, i.e., the environment is stateless.
ETC is among the simplest algorithms one could imagine for navigating the exploration-exploitation tension. If we’ve committed to play for n time-steps, ETC prescribes playing each of the k arms a fixed number of times m (an algorithm parameter) and then committing to the arm that had the highest empirical mean reward during the exploration phase. The usual prescription is to play the k arms in a round-robin fashion, but I believe that’s not a strict requirement as long as each arm is played exactly m times.
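A minimal sketch of ETC in Python follows; the function name, the toy Gaussian arms, and the specific parameter values are my own choices for illustration, not something prescribed by the algorithm:

```python
import numpy as np

def explore_then_commit(pull_arm, k, n, m):
    """Pull each of the k arms m times (round-robin), then commit to the
    arm with the highest empirical mean for the remaining n - m*k steps."""
    reward_sums = np.zeros(k)   # running sum of observed rewards per arm
    total_reward = 0.0

    # Exploration phase: m*k pulls, cycling over the arms round-robin.
    for t in range(m * k):
        arm = t % k
        r = pull_arm(arm)
        reward_sums[arm] += r
        total_reward += r

    # Commit phase: play the empirically best arm for the rest of the horizon.
    best_arm = int(np.argmax(reward_sums / m))
    for _ in range(n - m * k):
        total_reward += pull_arm(best_arm)

    return best_arm, total_reward

# Toy usage: two unit-variance Gaussian (hence 1-sub-Gaussian) arms.
rng = np.random.default_rng(0)
means = [0.5, 0.0]
best, total = explore_then_commit(lambda i: rng.normal(means[i], 1.0),
                                  k=2, n=1000, m=50)
```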
A couple of points before getting into the analysis of the algorithm:
i) If the reward for each arm is deterministic, m = 1 suffices. In other words, we need m > 1 only when the rewards are stochastic, i.e., have non-zero variance.
ii) It’s reasonable to expect that if the reward of an arm has higher variance, that arm should be pulled more times. So the algorithm above implicitly assumes that all arms have the same variance (and, in addition, that the agent knows this “fact” about the arms).
Bandit algorithms are generally characterized by the asymptotic behavior of the cumulative regret they achieve. Regret, as the name suggests, is the cost of pulling suboptimal arms. Formally, it’s the expected value of the difference between the cumulative reward a genie (i.e., someone who knows the true expected reward for each arm) would collect and the cumulative reward the agent actually collects.
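In symbols, writing μ₁ for the largest mean reward and Xₜ for the reward the agent collects at time-step t, the cumulative regret over a horizon of n steps is

$$R_n = n\,\mu_1 - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right].$$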
For the setting where the arms have mean rewards μ₁ > μ₂ ≥ … and are 1-sub-Gaussian (sub-Gaussian with variance proxy 1), we have the following result for the cumulative regret:
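$$R_n \le m \sum_{i=1}^{k} \Delta_i \;+\; (n - mk) \sum_{i=1}^{k} \Delta_i \exp\!\left(-\frac{m \Delta_i^2}{4}\right),$$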
where Δᵢ = μ₁ − μᵢ is the sub-optimality gap of arm i (the genie, with knowledge of the μ’s, always plays arm 1). The first term is the regret incurred during the exploration phase and the second during the commit phase: one incurs a regret of Δᵢ per time-step if, at the end of mk time-steps, the empirical mean of the i-th arm exceeds that of the 1st arm, which happens with a probability bounded by the exponential factor, since the empirical estimates of μᵢ and μ₁, each based on m i.i.d. samples, are √(1/m)-sub-Gaussian and their difference is √(2/m)-sub-Gaussian.
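Spelling out that probability bound (the standard sub-Gaussian tail inequality applied to the difference of the two empirical means):

$$P\big(\hat{\mu}_i \ge \hat{\mu}_1\big) = P\big((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \ge \Delta_i\big) \le \exp\!\left(-\frac{\Delta_i^2}{2 \cdot (2/m)}\right) = \exp\!\left(-\frac{m \Delta_i^2}{4}\right).$$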
I’m not aware of a closed-form expression for the choice of m for arbitrary k, but for k = 2:
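$$m = \max\left\{1,\ \left\lceil \frac{4}{\Delta^2} \ln\frac{n \Delta^2}{4} \right\rceil\right\}, \qquad \text{where } \Delta = \Delta_2 = \mu_1 - \mu_2,$$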
which can be obtained by minimizing the regret bound above (specialized to k = 2) w.r.t. m under the assumption n >> m.
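Plugging this choice of m back into the bound (and ignoring the rounding due to the ceiling) yields, roughly,

$$R_n \le \Delta + \frac{4}{\Delta}\left(1 + \max\left\{0,\ \ln\frac{n \Delta^2}{4}\right\}\right).$$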
Putting these together, it’s reasonable to expect that, asymptotically, the cumulative regret would vary as ln(n) and 1/Δ. We shall defer the discussion of how close these are to the best achievable bounds to a later post.
Now let us ask some what-if questions, corresponding to scenarios that deviate from the standard setting above:
- How do we handle the case of unequal standard deviations — σ₁, σ₂, … a) when they’re known and b) when they’re unknown?
- How would we adapt ETC to the case where we have an exploration budget lower than mk?
- What if the exploration is relatively low-stakes, i.e., the regret during exploration is, say, a factor C (> 1) lower than that during exploitation?
- How about the opposite scenario, i.e., the regret is discounted by a factor γ(t) ∈ (0, 1), a decreasing function of t? Is ETC clearly the wrong/suboptimal strategy here?
- What if the environment is non-stationary only during the exploration phase with a known Δᵢ(t)? Is round-robin pulling of arms still the best way to explore?