Reinforcement Learning for Beginners: Q-Learning and SARSA

Dagnachew Azene · Published in The Startup · 5 min read · Aug 25, 2020

Reinforcement learning (RL) is a fast-moving field, and many companies are realizing its potential. Google DeepMind's recent success in training the RL agent AlphaGo to defeat the world Go champion is astounding.

The full documentary is available here:

But what is RL? RL is a branch of machine learning in which an agent learns a behavior by trial and error. The agent interacts with its environment without any explicit supervision; the “desired” behavior is encouraged by a feedback signal called a reward. The agent is rewarded when it takes a “good” action and can be “punished” when it takes a “bad” one.

In RL terminology, observations are known as states. The agent's learning path therefore consists of taking actions in states and receiving rewards as feedback. In the early stages of learning, the agent doesn't know the best action to take in a given state; after all, finding it is the whole learning objective.

The agent's objective is to maximize the sum of rewards over the long term. “Long term” means we are not only concerned with taking actions that yield the highest immediate reward; more generally, the agent is trying to learn the strategy that gives the best cumulative reward in the long run, since some rewards can be delayed. This objective is described as maximizing the expected return, which is expressed as follows:
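Using the notation of Sutton and Barto (the standard form of this equation), the discounted return at time t is

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … = Σₖ γᵏ Rₜ₊ₖ₊₁  (summing over k = 0, 1, 2, …)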

where the R terms are the immediate rewards and γ is known as the discount factor.

When γ is closer to 0, the agent is near-sighted (it places more emphasis on immediate rewards). When the discount factor is closer to 1, the agent is more far-sighted.

The goal of RL algorithms is to estimate the expected return when the agent takes an action in a given state while following a policy. These estimates are known as Q-values, and they quantify “how good” it is for the agent to take a given action in a given state.

Q-learning (QL) is one of the most popular RL algorithms. It lets the agent learn the values of state-action pairs through continuous updates. As long as each state-action pair is visited and updated infinitely often, QL is guaranteed to converge to an optimal policy. The update rule for the state-action values in QL is:
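In the same notation as the return above, the standard Q-learning update reads

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ)]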

where α is known as the learning rate.

Note that, in QL, the agent observes the current state Sₜ, takes an action Aₜ, receives the reward Rₜ₊₁, and observes the next state Sₜ₊₁. While updating, QL considers the best possible action in the next state (the max operator), regardless of the action Aₜ₊₁ that the current policy will actually take. Because of this rule, QL is known as an off-policy algorithm. The learning rate α determines how big a step the agent takes when updating its Q-value estimates.

Another commonly applied algorithm is SARSA. In SARSA, the update follows this equation:
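Again in the same notation, the standard SARSA update reads

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ Q(Sₜ₊₁, Aₜ₊₁) − Q(Sₜ, Aₜ)]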

Note the difference between the QL and SARSA update rules: SARSA uses the value of the action actually chosen by the current policy in Sₜ₊₁. Hence, SARSA is known as an on-policy algorithm.

For more details on the procedures of both the QL and SARSA algorithms, you can refer to “Reinforcement Learning: An Introduction” by R. S. Sutton and A. G. Barto.

Now, let's see an example of applying QL and SARSA to the popular CartPole problem from the OpenAI Gym Python library. Check the link below to learn more about the CartPole environment.

First, the CartPole environment has observations (states) that are continuous. You can check this with the following code:
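A minimal check might look like this (a sketch assuming the classic CartPole-v0 environment and the Gym API as of 2020):

```python
import gym

env = gym.make("CartPole-v0")

# The observation space is a 4-dimensional continuous Box:
# cart position, cart velocity, pole angle, pole angular velocity.
print(env.observation_space)        # Box(4,)
print(env.observation_space.low)
print(env.observation_space.high)

# The action space is discrete: push the cart left (0) or right (1).
print(env.action_space)             # Discrete(2)
```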

But QL and SARSA are applicable to finite state-action spaces, so the continuous observations of position, velocity, angle, and angular velocity should be discretized (similar values are grouped into the same bucket). This can be done with a code snippet like the following:
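Here is one common way to do it; the bucket counts and the clamped velocity bounds below are illustrative choices, not necessarily the ones used in the original notebook:

```python
import numpy as np

# Number of buckets per observation dimension (illustrative values):
# cart position, cart velocity, pole angle, pole angular velocity.
N_BUCKETS = (1, 1, 6, 12)

# The two velocity dimensions are unbounded in the Box space,
# so clamp them to reasonable ranges before bucketing.
STATE_BOUNDS = list(zip(env.observation_space.low, env.observation_space.high))
STATE_BOUNDS[1] = (-0.5, 0.5)
STATE_BOUNDS[3] = (-np.radians(50), np.radians(50))

def discretize(observation):
    """Map a continuous observation to a tuple of bucket indices."""
    indices = []
    for i, value in enumerate(observation):
        low, high = STATE_BOUNDS[i]
        # Clip to the bounds, then scale into [0, N_BUCKETS[i] - 1].
        ratio = (np.clip(value, low, high) - low) / (high - low)
        indices.append(int(round((N_BUCKETS[i] - 1) * ratio)))
    return tuple(indices)
```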

Once the state space is finite, we can write the QL and SARSA training algorithms. The update rules, following the equations above, look like this in code:
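A sketch of the two update rules, assuming the Q-table is a NumPy array indexed by the discretized state tuple plus the action (the function names here are my own, not necessarily those in the original notebook):

```python
def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    # Off-policy: bootstrap from the best action in the next state (the max operator).
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state + (action,)] += alpha * (td_target - Q[state + (action,)])

def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    # On-policy: bootstrap from the action actually chosen by the current policy.
    td_target = reward + gamma * Q[next_state + (next_action,)]
    Q[state + (action,)] += alpha * (td_target - Q[state + (action,)])
```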

These functions let us run the training according to the QL / SARSA procedure. Here is a sample code snippet for QL training.
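A minimal training loop might look like the following; the hyperparameters (learning rate, discount factor, exploration rate) are placeholder values, and the original notebook may use decaying schedules instead:

```python
N_EPISODES = 5000
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # placeholder hyperparameters

# Q-table indexed by the discretized state tuple plus the action.
Q = np.zeros(N_BUCKETS + (env.action_space.n,))

episode_rewards = []
for episode in range(N_EPISODES):
    state = discretize(env.reset())
    total_reward = 0.0
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        observation, reward, done, _ = env.step(action)
        next_state = discretize(observation)

        # Q-learning update (for SARSA, choose the next action first
        # and call sarsa_update instead).
        q_learning_update(Q, state, action, reward, next_state, ALPHA, GAMMA)

        state = next_state
        total_reward += reward

    episode_rewards.append(total_reward)
```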

Note that we are storing the rewards obtained (the agent receives a +1 reward at each step as long as the pole stays upright). The following plots show the cumulative reward obtained by QL and SARSA over 5000 training episodes.

cumulative reward by Q-learning
cumulative reward by SARSA
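Plots like these can be produced with a few lines of matplotlib, assuming the episode_rewards list from the training loop above:

```python
import matplotlib.pyplot as plt

# Running total of rewards across training episodes.
plt.plot(np.cumsum(episode_rewards))
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.title("Cumulative reward by Q-learning")
plt.show()
```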

We can see that with both QL and SARSA the agent accumulates more reward (learns a better strategy) as training progresses. In this run, QL outperforms SARSA, accumulating rewards faster.

Thanks for reading. The Jupyter notebook with the code is available here: https://github.com/Danaze/QL_SARSA_cartpole
