Introduction to RL

Estimation and Exploration

Sai Sasank
The Startup
6 min read · Nov 8, 2020


Welcome to the first post in what will be a series of posts on Reinforcement Learning. I am quite interested in the field of AI Safety and I believe it is crucial to have a good understanding of RL, among other things, to tackle the issues in AI Safety. I am using the incredible work, “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto, as my main source of learning. To really ground my understanding, I will write programs and run experiments, where necessary, to complement the natural programs that run only on our minds.

The simplest version of the k-armed bandit

Learning in Reinforcement Learning involves evaluative feedback, which evaluates the actions taken, as opposed to instructive feedback, which instructs the correct actions. Evaluative feedback depends on the action taken, whereas instructive feedback is independent of it. Although evaluative feedback can be combined with instructive feedback for a more complex and perhaps more useful learning process, we will only look at learning from purely evaluative feedback for now.

Let us further simplify by assuming there’s only one situation to take an action in. This means we don’t have to worry about associating different actions with different situations because there’s really only one situation.

So, here's a purely evaluative and non-associative version of the problem: the k-armed bandit. There are k actions available, and we repeatedly choose one of them. Each action is associated with a reward that is initially unknown to us. The objective is simply to maximize the expected total reward over time.

Let us program this environment to really ground our understanding.

I only post the gist of the code, devoid of unnecessary details and just enough to follow the discussion. If you're looking to reproduce the results, check out the repository.

Ten Armed Bandit where each action has a unique fixed reward.
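To make the discussion concrete, here is a minimal sketch of what such an environment could look like. The class name TenArmedBandit and its interface are illustrative, not necessarily what the repository uses.

```python
import numpy as np

class TenArmedBandit:
    """Ten-armed bandit where each action yields a fixed, initially unknown reward."""

    def __init__(self, k=10, seed=None):
        rng = np.random.default_rng(seed)
        self.k = k
        # One fixed scalar reward per action, drawn once at construction.
        self.rewards = rng.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        # The reward depends only on the chosen action.
        return self.rewards[action]
```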

Action-value methods

So, how do we choose the action — same action every step or maybe a random action each step? Since we need to maximize the total reward received over time, we need to figure out a strategy that can get us there.

There's the idea of action-value methods, where we assign a value to each action. This value is the agent's estimate of the reward for the corresponding action. One natural way to compute it is to start at zero and average the rewards received for each action. At each step the agent can use these action-values to choose an action. Naturally, the agent can choose the action with the highest action-value.
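Concretely, if an action a has been chosen n times and yielded rewards R_1, ..., R_n, its sample-average value is Q(a) = (R_1 + ... + R_n) / n, which can be maintained incrementally as Q(a) ← Q(a) + (R - Q(a)) / n after each new reward R, so there is no need to store past rewards.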

However, there's a problem with this naïve greedy approach. Simply choosing the highest-valued action at each step never explores the actions whose estimates are low but whose true rewards may in fact be higher. So, instead of being absolutely greedy, we can disregard the action-values once in a while and randomly pick one of the actions. We also randomize the selection when multiple actions tie for the highest value. Asymptotically, the action values computed this way converge to the true action values. These agent designs are called Greedy and Epsilon-Greedy respectively, where epsilon refers to the probability of choosing a random action at each step.

Below is the definition of a class for an epsilon-greedy agent, and we will use instances of it to run some experiments.

Epsilon Greedy Agent.
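As a rough, hedged sketch of such a class, using incremental sample-average updates (the names EpsilonGreedyAgent, act, and update are my own, not necessarily the repository's):

```python
import numpy as np

class EpsilonGreedyAgent:
    """Chooses a random action with probability epsilon, otherwise the greedy action."""

    def __init__(self, k=10, epsilon=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        self.q = np.zeros(k)   # action-value estimates, start at zero
        self.n = np.zeros(k)   # number of times each action has been taken

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.integers(len(self.q))   # explore: random action
        # Exploit: break ties among the highest-valued actions at random.
        best = np.flatnonzero(self.q == self.q.max())
        return self.rng.choice(best)

    def update(self, action, reward):
        self.n[action] += 1
        # Incremental sample average: Q <- Q + (R - Q) / n
        self.q[action] += (reward - self.q[action]) / self.n[action]
```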

Let us see how different epsilon values affect the agent's behavior and the reward received. Varying epsilon is the only difference between the agents. For each epsilon value, we run the experiment 300 times, and each run uses a new instance of the environment and agent. Each run is 1,000 steps long; we track the average reward per step at every step, and finally we average over the 300 runs. The results are shown below:

A plot of average reward per step over a run on 10 armed bandit with fixed reward for different epsilon-greedy agents.
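For completeness, the experiment loop that produces such a plot could be sketched roughly as follows, reusing the illustrative classes from the earlier sketches (the actual code is in the repository):

```python
import numpy as np

def run_experiment(epsilon, n_runs=300, n_steps=1000):
    """Average-reward-per-step curve, averaged over independent runs."""
    curves = np.zeros((n_runs, n_steps))
    for run in range(n_runs):
        env = TenArmedBandit(seed=run)
        agent = EpsilonGreedyAgent(epsilon=epsilon, seed=run)
        rewards = np.zeros(n_steps)
        for t in range(n_steps):
            action = agent.act()
            reward = env.step(action)
            agent.update(action, reward)
            rewards[t] = reward
        # Average reward per step so far, at every step of this run.
        curves[run] = np.cumsum(rewards) / np.arange(1, n_steps + 1)
    # Finally, average the curves over the runs.
    return curves.mean(axis=0)
```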

Let's look at the extreme epsilon values first. When epsilon is 0, the agent is absolutely greedy and never corrects its inaccurate action-value estimates. It uses these initial estimates for the entire run, which results in a low cumulative reward over time. On the other hand, when epsilon is close to 1, like 0.8, the agent does explore most of the time, improving the action-value estimates, but it fails to capitalize on that improvement because it keeps exploring instead of exploiting the improving action-values. Too much exploitation and too much exploration both result in low total reward.

When epsilon is low but non-zero, like 0.1–0.3, the agent is much better off: it improves the action-value estimates 10–30% of the time and exploits those improvements 70–90% of the time. This gives rise to a high average reward per step over time, especially when the number of steps is much larger than the number of actions.

Can you see why an epsilon of 0.1 eventually beats an epsilon of 0.2? When epsilon is 0.2, the agent tries to improve the estimates 20% of the time, so it improves faster, but once the estimates are sufficiently accurate it still explores 20% of the time, which yields sub-optimal rewards and barely improves the estimates. On the other hand, when you explore only 10% of the time, you improve the estimates more slowly, but once they are sufficiently accurate you exploit them 90% of the time. So only 10% of the time is wasted improving already accurate estimates while collecting sub-optimal rewards.

A twist in the problem

Now, we will modify the problem a bit. Let us change the fixed scalar reward associated with each action to a Gaussian distribution. So now, for each action, a reward will be sampled from the corresponding Gaussian distribution. Look below to see how the reward distributions are described and how a reward is sampled for each action.

Ten Armed Bandit where each action has a reward sampled from a unique Gaussian distribution.
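A minimal sketch of the modified environment might look like the following (again, the class name and interface are illustrative; see the repository for the actual code):

```python
import numpy as np

class GaussianTenArmedBandit:
    """Ten-armed bandit where each action's reward is drawn from its own Gaussian."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # Each action gets its own mean; a unit standard deviation is assumed here.
        self.means = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        # Sample a fresh reward from the chosen action's distribution.
        return self.rng.normal(loc=self.means[action], scale=1.0)
```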

Below is the plot showing the average reward per step of different epsilon greedy agents over time.

A plot of average reward per step over a run on 10 armed bandit with gaussian reward for different epsilon greedy agents.

The results are pretty similar to those in the fixed-reward case. A small but sufficient epsilon eventually reaches accurate action-value estimates and exploits them 90% of the time. And too much exploration or no exploration at all results in poor rewards, for the reasons we already discussed above.

Can we explore better?

At any given step, when the agent chooses to explore, it picks an action at random. Surely we can do better than random if we can figure out which actions would best improve the action-value estimates. In fact, that's the whole idea behind exploring: to offset the inaccuracies of the action-value estimates. So if we are going to explore, it had better be worth it!

We experiment with the idea of UCB (Upper Confidence Bound). The name belies the simplicity of the idea. At any given time step, we consider the following three factors to evaluate an action's worth: the action-value estimate, the number of times the action has been taken, and the total number of steps taken so far.

We put these factors together in the following way, and the UCB policy simply chooses the action that maximizes the expression below. In the expression, c is the confidence level, which dictates the degree of exploration and needs to be tuned just like epsilon.
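For reference, the standard UCB selection rule (as presented in Sutton and Barto) takes the form

A_t = argmax_a [ Q_t(a) + c * sqrt( ln(t) / N_t(a) ) ]

where Q_t(a) is the current action-value estimate for action a, N_t(a) is the number of times a has been taken so far, and t is the total number of steps taken so far. If N_t(a) is zero, a is considered a maximizing action.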

UCB policy simply finds the action a that maximizes the above expression.

And below is the definition of a class for a UCB agent.

UCB Agent.
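As before, a hedged sketch rather than the repository's exact code (class and method names are illustrative):

```python
import numpy as np

class UCBAgent:
    """Picks the action with the highest upper confidence bound on its value."""

    def __init__(self, k=10, c=1.0, seed=None):
        self.rng = np.random.default_rng(seed)
        self.c = c                 # confidence level: degree of exploration
        self.q = np.zeros(k)       # action-value estimates
        self.n = np.zeros(k)       # times each action has been taken
        self.t = 0                 # total steps taken so far

    def act(self):
        self.t += 1
        # An action that has never been tried is considered maximizing.
        untried = np.flatnonzero(self.n == 0)
        if untried.size > 0:
            return self.rng.choice(untried)
        ucb = self.q + self.c * np.sqrt(np.log(self.t) / self.n)
        best = np.flatnonzero(ucb == ucb.max())
        return self.rng.choice(best)

    def update(self, action, reward):
        self.n[action] += 1
        # Same incremental sample-average update as the epsilon-greedy agent.
        self.q[action] += (reward - self.q[action]) / self.n[action]
```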

The results of runs of UCB agents with varying confidence levels are shown below. At a confidence level of 1, the UCB agent performs about as well as an Epsilon-Greedy agent with an epsilon of 0.1. Just as epsilon must be adjusted for the Epsilon-Greedy agent, the confidence level parameter needs to be adjusted for different environments.

A plot of average reward per step over a run on 10 armed bandit with gaussian reward for different UCB agents.

Although UCB agents fare better theoretically, they might not perform as well as Epsilon-Greedy agents in practice. In particular, when the environment has non-stationary reward distributions or large state and action spaces, UCB runs into difficulties. UCB simply does not fare well in the more general reinforcement-learning settings that are closer to real-world scenarios.

