An introduction to Reinforcement Learning

Sara Ghibaudo
Published in Betacom · 9 min read · Sep 13, 2021

Introduction

Welcome to the first article of a series covering Reinforcement Learning and its applications. Today, we will introduce this Artificial Intelligence technique, which learns through experience, and we will show a simple application.

How does it work?

The idea behind Reinforcement Learning (RL) comes from how humans learn from experience. Interacting with the environment, with objects and with people allows us to improve our knowledge of the external world, and this technique tries to replicate that behaviour. In particular, we collect experience by observing the consequences of our actions, and we discover the best way to complete a task by trying different pathways, called episodes. In the same way, RL aims to find a sequence of actions that leads to the highest possible reward. An important aspect of Reinforcement Learning is the trade-off between exploration and exploitation. On the one hand, we want to keep choosing the actions that have already produced a large profit during an episode and exploit them; on the other hand, we would like to explore the alternatives to check whether they yield a higher reward.

Let’s see how this method works. Its goal is to maximize the final return, defined as the sum of the rewards of the actions selected during an episode.

Reinforcement Learning problems can be described as the interaction between an agent, the decision-maker, and the environment. For example, you can think of the relationship between you (the agent) and your city (the environment), where you try to discover beautiful places and avoid the bad ones. Notice that the environment receives as input the state s of the agent and the chosen action a, and it returns a's reward r and a new state s'. In real applications, we do not know anything in advance about the outside world; in particular, we do not know the rewards, which tell us whether an action leads to something positive or negative. For this reason, the decision-maker selects the move that maximizes the expected future reward, a quantity called the state-value function. Moreover, the action is chosen according to a probability distribution 𝛑 called the policy. So, if we denote the current state by s, the state-value function is written v(s).
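
To make this interaction concrete, here is a minimal Python sketch of the loop just described. The agent and environment objects, together with their reset, step, choose_action and observe methods, are hypothetical placeholders rather than the API of any specific library; the environment takes the current state and the chosen action and returns the reward and the new state.

    def run_episode(agent, environment):
        """One episode of the agent-environment interaction described above."""
        state = environment.reset()                      # initial state s
        total_return, done = 0.0, False
        while not done:
            action = agent.choose_action(state)          # a chosen according to the policy
            reward, next_state, done = environment.step(state, action)
            agent.observe(state, action, reward, next_state)   # store the experience
            total_return += reward
            state = next_state
        return total_return                              # sum of the rewards of the episode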

The RL problem can be divided into three parts:

  • prediction: estimate the state-value function under the current policy
  • improvement: find a better policy than the current one, that is, a policy that corresponds to a higher state-value
  • control: find the best policy by repeating the previous two steps until reaching the global maximum (see the sketch below)
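
As a rough sketch of how the three parts fit together, the control step can be seen as a loop that alternates prediction and improvement. Here evaluate_policy and improve_policy are hypothetical helpers standing in for the first two steps, passed in as arguments.

    def control(policy, evaluate_policy, improve_policy, max_iterations=100):
        """Alternate prediction (evaluation) and improvement until the policy stops changing."""
        for _ in range(max_iterations):
            state_values = evaluate_policy(policy)     # prediction: estimate v(s) under the current policy
            new_policy = improve_policy(state_values)  # improvement: pick a policy with higher state-values
            if new_policy == policy:                   # no further improvement: stop
                break
            policy = new_policy
        return policy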

At this point, it is important to introduce the Markov Decision Process (MDP), which is defined by a set of states, actions, rewards, and a state transition probability. It models decision-making where the outcomes are partly random and partly under the control of the decision-maker, as in RL. At each time t, the process is in a state s and the agent can choose one of the actions a available from s. At time t+1, the process moves to the next state s' according to the state transition probability.

The basic block of the Markov Decision Process

This image shows a basic block of a Markov Decision Process. Starting from state u, we select action a. Of course, there are situations where we can choose between multiple moves, each of which leads to one or more successor states. In this case, we can arrive at v or w with probability 0.7 and 0.3 and rewards 3 and -1, respectively.
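
As an illustration, this block can be encoded as a small table of (probability, next state, reward) triples; the state and action names below are just labels taken from the figure.

    # From state "u", action "a" leads to "v" with probability 0.7 and reward 3,
    # or to "w" with probability 0.3 and reward -1.
    transitions = {
        ("u", "a"): [(0.7, "v", 3.0), (0.3, "w", -1.0)],
    }

    # Expected immediate reward of choosing "a" in "u": 0.7*3 + 0.3*(-1) = 1.8
    expected_reward = sum(p * r for p, _, r in transitions[("u", "a")])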

This article focuses on episodic MDPs, that is, processes where there exists a terminal state reachable from every node.

Let’s see the details of the three steps of RL. At time t the agent chooses between different actions according to the conditional probability 𝛑(a|s) = P(Aₜ=a | Sₜ=s). In this formula, 𝛑 can be deterministic (it selects one action with probability one) or stochastic, and Aₜ and Sₜ denote the action and the state at time t, respectively. In the prediction step we define the total return, also called the gain, of an episode ending at time p as the random variable Gₜ = Rₜ₊₁+…+Rₚ. It is used to compute the state-value function v(s) = E(Gₜ | Sₜ=s), which represents how “good” it is for an agent to be in state s at time t. Similarly, we can define the action-value function as q(s,a) = r + v(s'), where r is the reward of action a and s' is the arrival state. In the control step, instead, we are interested in the optimal state-value function v*(s) = max v(s) and in the optimal action-value function q*(s,a) = max q(s,a), where the maximum is taken over all possible policies 𝛑, that is, over all the feasible paths. Once these two quantities are defined, the optimal policy is the one that achieves the maximum state-value or action-value function, and it specifies the best actions that the agent can take.
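
For reference, the quantities introduced in this paragraph can be written compactly in standard notation (the subscript 𝛑 recalls that the expectation is taken while following the policy, and q keeps the simplified one-step definition used above):

    G_t         = R_{t+1} + R_{t+2} + \dots + R_p
    v_\pi(s)    = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]
    q_\pi(s, a) = r + v_\pi(s')
    v_*(s)      = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)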

In real applications, we do not know the underlying model, or it has too many possible paths to be used directly, so, to apply RL techniques, we sample experience. In the prediction step we should compute the expected total return but, as already stated, we do not know the model. So we can only approximate it using Monte Carlo (MC) methods, sampling experience while following the policy 𝛑. MC learns from complete episodes, so we have to wait until we reach the terminal node before computing the estimated value V(s) of the current state. However, this is not something we can replicate in real life, since it requires knowing what will happen in the future while we obviously have to decide in the present. In the following, we show the algorithm called first-visit MC prediction, where “first visit” refers to the first occurrence of a state in the episode. In this example, we suppose that the episode ends at time T, S₀ is the initial state and Aₜ ∼ 𝛑(· | Sₜ). We build the gain G incrementally as the sum of the rewards, and 𝛾 is a constant smaller than 1, the discount factor, used to assign smaller weight to rewards far in the future (for simplicity we will consider 𝛾=1). The if statement checks whether the state has already been visited in the episode; at its first visit, we insert the just-computed gain into the list returns(s) and compute V(s) as the empirical average of its elements. Thanks to the law of large numbers, a higher number of episodes corresponds to a more precise estimate of the mean.
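
A minimal Python sketch of first-visit MC prediction, following the description above, could look like this; generate_episode is a hypothetical helper that samples one complete episode by following 𝛑 and returns its (state, action, reward) triples.

    from collections import defaultdict

    def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
        """Estimate V(s) by averaging first-visit gains over sampled episodes."""
        returns = defaultdict(list)   # returns(s): list of gains observed at first visits of s
        V = defaultdict(float)        # estimated state-value function

        for _ in range(num_episodes):
            episode = generate_episode()   # [(S0, A0, R1), (S1, A1, R2), ..., (S_{T-1}, A_{T-1}, R_T)]
            G = 0.0
            # Walk the episode backwards so that G accumulates the (discounted) future rewards.
            for t in reversed(range(len(episode))):
                state, _, reward = episode[t]
                G = gamma * G + reward
                if all(episode[k][0] != state for k in range(t)):   # first occurrence of the state
                    returns[state].append(G)
                    V[state] = sum(returns[state]) / len(returns[state])  # empirical average
        return V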

In the control step, we want to find the optimal policy 𝛑. Intuitively, we could proceed greedily and select, from state s, the action a such that a = argmax Q(s,a), where Q(s,a) = R(s,a) + V(s') is the estimate of the action-value function and R(s,a) is a's reward. However, this is not the best choice: as we stated at the beginning of the article, we have to take into account the trade-off between exploration and exploitation. So we mix a random policy, used to explore new states, with the greedy one, used to exploit the most promising nodes. This new policy is called 𝜀-greedy, and it selects actions with the random policy with probability 𝜀 and with the greedy one otherwise.
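
A minimal sketch of the 𝜀-greedy rule, assuming the estimates Q(s,a) are stored in a dictionary keyed by (state, action) pairs:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Random action with probability epsilon (exploration), greedy action otherwise."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))   # argmax_a Q(s, a)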

At this point, we can look at a first simple example to understand how to apply this method.

How to exit from a maze

As you can see from the title of this section, one of the simplest applications of Reinforcement Learning is getting out of a maze. Let's look at the maze in the image: it has a green cell that is the starting point, a blue cell that represents the exit, and red cells that are walls where the agent cannot go. The goal of the agent is to reach the blue cell while avoiding the walls in the smallest number of steps.

The decision-maker's position represents its state, and the actions are the directions he can take: North, South, East and West. However, in some positions he cannot choose among all four options but only among a subset of them. For example, in the green cell the agent can choose only between East and South. Moreover, when he hits a wall he remains in the same location as before, because the environment returns the current state as the new one together with a negative reward.

In the beginning, the agent has no information about the maze, the location of the exit, or the walls. So, to explore the unknown environment, he chooses random actions and, after a while, he figures out the position of some walls and learns to avoid them. Indeed, as we said, the environment returns a reward for each of the agent's actions, and it is negative if the action leads him into a red cell. Let's suppose that the reward is -1 if he runs into a brick wall, -0.5 if he moves into a white cell, and +10 when he arrives in the blue one. Since he wants to maximize the total final return, he will choose actions that avoid walls and lead him to the exit in the smallest number of steps.
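
A possible, simplified encoding of this environment is sketched below. The grid layout is made up for illustration and stepping outside the grid is treated like hitting a wall; the reward values follow the ones just described.

    # 'S' = start (green), 'E' = exit (blue), '#' = wall (red), '.' = free (white) cell.
    MAZE = [
        "S..#",
        ".#..",
        "...#",
        "#..E",
    ]
    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}   # row/column offsets

    def step(state, action):
        """Apply one move and return (next_state, reward); hitting a wall keeps the state."""
        row, col = state
        d_row, d_col = MOVES[action]
        new_row, new_col = row + d_row, col + d_col
        blocked = (not (0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0]))
                   or MAZE[new_row][new_col] == "#")
        if blocked:
            return state, -1.0                # ran into a wall: stay put, reward -1
        if MAZE[new_row][new_col] == "E":
            return (new_row, new_col), 10.0   # reached the blue cell: reward +10
        return (new_row, new_col), -0.5       # ordinary white cell: reward -0.5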

Maze with two solutions: the yellow path, which is the best one, and the purple one. The arrows in the cells define the actions that the agent takes in each position.

Let’s start the simulations. In the first run, the agent chooses his actions randomly until he reaches the blue cell. In the following episodes, however, he will select moves using the knowledge stored from past experience. Remember that each cell s has a corresponding state-value V(s), computed with the procedure above, and each pair (s,a) has a corresponding action-value Q(s,a). Now, suppose that we are running the tenth episode. At this point, as already said, the agent chooses actions keeping in mind the trade-off between exploration and exploitation: sometimes he selects the action a that maximizes the action-value, and sometimes he chooses a random move. In this way, among all the different paths that lead to the blue cell, he can find the best one, which is the path highlighted in yellow. The arrows in the cells of the image define the actions that the agent takes in each position. In this case, the return will be -0.5*7+10 = 6.5, whereas if he follows the purple path it will be -0.5*11 - 1*2 + 10 = 2.5. When a cell has two arrows, it means that the agent first chooses an action that leads him into a wall, getting a reward of -1; since he cannot enter a red cell, he remains in the same state and then chooses another action from the same position.

Once the agent exits from the maze, the episode ends, and the state-values and action-values of the visited positions are updated. Let's suppose that we arrived at the blue cell following the purple path and that we have to update the state-value of cell A. First of all, we compute the gain G for this episode up to A: G₆¹⁰ = R₇+R₈+…+R₁₄ = 10 - 0.5*7 = 6.5. Here G₆¹⁰ represents the gain of the tenth episode at t=6, which corresponds to cell A, and Rₜ is the reward received at time t. Now, we add it to the list returns(A) = [G₁, …, G₁₀] and we update the state-value of A: V(A) = (G₁+…+G₁₀)/10.

We can also update the action-value Q(A,a) = R(A,a) + V(B), where R(A,a) is the reward of the action a taken in A, so R(A,a) = R₇, and V(B) is the state-value of cell B, computed in the same way as V(A). When we update these quantities it is as if we were storing our experience, so that, in the next iterations, we will remember the position of some walls and avoid them.
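
The update described in the last two paragraphs can be sketched as follows; the empty returns list and the values of R(A,a) and V(B) are illustrative placeholders.

    def update_state_value(returns_s, gain):
        """First-visit MC update: store the new gain and return the empirical average."""
        returns_s.append(gain)
        return sum(returns_s) / len(returns_s)

    # Gain of the tenth episode from cell A: 7 white cells (-0.5 each), then the exit (+10).
    G_A = sum([-0.5] * 7 + [10.0])              # = 6.5

    returns_A = []                               # in practice it already holds the gains of past episodes
    V_A = update_state_value(returns_A, G_A)     # V(A) = empirical average of returns(A)

    # Action-value of the move taken in A: its reward plus the value of the next cell B.
    R_A, V_B = -0.5, 7.0                         # illustrative numbers only
    Q_A_a = R_A + V_B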

After many episodes, the agent is finally able to reach the blue cell following the yellow path, which avoids all the walls and takes the smallest number of steps to get out of the maze.

Conclusion

In this first article on Reinforcement Learning, we have covered its main theoretical concepts and we have seen a very simple example showing how the mathematical tools are used. In the next articles, we will look at more interesting applications related to real-life problems.

See you next time!
