Basics of Reinforcement Learning

Merve Atasever
Published in KoçDigital · Oct 28, 2022

A “Real” Introduction

Reinforcement learning is the third, and arguably the most sophisticated, branch of machine learning alongside supervised and unsupervised learning. It has a wide variety of applications in autonomous driving, robotics, dynamic pricing, recommender systems, and gaming. Here, we deal with sequential decision-making problems that an agent tries to solve by trying different decisions, i.e., by trial and error. Unlike supervised learning, we do not have labeled data in reinforcement learning. Then, how do we evaluate decisions? Decision-making is built on a loop of trying a decision (i.e., taking an action) and observing the outcome (i.e., getting a reward) at each step.

In a dynamic environment, loop over:

1. Observe the current state of the environment O

2. Take an action A (which would change the state)

3. Get a reward R

4. Observe the next state of the environment O

The agent’s primary goal is to maximize the cumulative reward, called return.
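To make this loop and the notion of return concrete, here is a minimal Python sketch; the ToyEnv class, its reward values, and the purely random agent are illustrative assumptions rather than any real library's API:

```python
import random

class ToyEnv:
    """A hypothetical 1-D environment used only to illustrate the loop:
    the agent starts at position 0 and tries to reach position 3."""

    def reset(self):
        self.position = 0
        return self.position  # initial observation O

    def step(self, action):
        # action is -1 (left) or +1 (right); moving right is rewarded
        self.position += action
        reward = 1.0 if action == 1 else -1.0   # reward R
        done = self.position >= 3               # episode ends at the goal
        return self.position, reward, done      # next observation, reward, done flag

env = ToyEnv()
observation = env.reset()                       # 1. observe the current state
episode_return = 0.0
done = False
while not done:
    action = random.choice([-1, 1])             # 2. take an action (random agent)
    observation, reward, done = env.step(action)  # 3. get a reward, 4. observe next state
    episode_return += reward                    # accumulate the cumulative reward (return)

print("Return of the episode:", episode_return)
```

Real environments expose the same observe-act-reward interface, just with richer states and actions.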

Let us be more specific and consider an episode, i.e., a finite sequence of observations, actions, and rewards:

$O_0, A_0, R_1, O_1, A_1, R_2, \dots, O_{T-1}, A_{T-1}, R_T$

With this notation, the agent's goal is to maximize the total reward collected during the episode,

$R_1 + R_2 + \cdots + R_T,$

by choosing the best action each time. Notice that taking the greedy action, i.e., the action which gives the maximum immediate reward at each step, may not lead us to the maximum cumulative reward in the end. Consider the following example: we have an agent who needs to go from Node-1 (N-1) to Node-5 (N-5). All possible actions and their corresponding rewards are shown in the diagram below.

If the agent takes the greedy action at each step, it will follow the path N-1, N-2, N-5. However, the path which gives the maximum total reward is N-1, N-2, N-3, N-4, N-5. This raises the question: how does the agent decide that one action is better than another in the long run, so that it can choose the best sequence of decisions/actions? To answer this question, let's dive into some basic definitions and then formalize the problem.
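As a quick numerical illustration, the sketch below compares the greedy path with the longer path. Since the diagram's reward values are not reproduced here, the edge rewards are made-up numbers chosen only to show that greedy behavior can lose:

```python
# Hypothetical edge rewards for the N-1 ... N-5 example (illustrative numbers,
# not the ones from the original diagram).
rewards = {
    ("N-1", "N-2"): 1,
    ("N-2", "N-5"): 2,   # tempting immediate reward
    ("N-2", "N-3"): 1,
    ("N-3", "N-4"): 3,
    ("N-4", "N-5"): 3,
}

def path_return(path):
    """Total reward collected along a path of nodes."""
    return sum(rewards[(a, b)] for a, b in zip(path, path[1:]))

greedy_path  = ["N-1", "N-2", "N-5"]                 # always take the best immediate reward
optimal_path = ["N-1", "N-2", "N-3", "N-4", "N-5"]   # better in the long run

print("greedy return :", path_return(greedy_path))    # 3
print("optimal return:", path_return(optimal_path))   # 8
```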

State:

State is the information that the RL algorithm uses to determine the action. There are two states discussed in RL studies: the environment state and the agent state, which are not necessarily the same. When the agent directly observes the environment state, we speak of full observability; otherwise, of partial observability. In the fully observable case, the following equality holds at each time step t:

$S^e_t = S^a_t,$

where the first term denotes the environment state and the second term denotes the agent state. To highlight the difference, consider a maze as the environment. In the partially observable case, the agent observes only a part of the maze, as in the second figure below. [3]

Information State (a.k.a. Markov State):

A state $S_t$ is an information state (a Markov state) if and only if

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t].$ [2]

The equality of the conditional probabilities tells us that past states do not give any further information about the next state once we know the present state. We say, 'The future is independent of the past given the present.' [1,2] Recall the graph example above and assume that we are at Node-3. The next state, i.e., the next node we will move to, is independent of which nodes we passed through before reaching Node-3.

Return & Value Function:

In summary, the agent moves from one state (observation) to another while collecting a reward at each move. If we somehow knew which states lead us to the maximum cumulative reward, we could decide accordingly which state should be next. The simplest way to assign values to states is to sum up all the rewards that come after that state. Suppose that we are at state s at time t. Then it seems reasonable to say that the value of state s is

$R_{t+1} + R_{t+2} + R_{t+3} + \cdots$

In practice, we use the total discounted reward instead:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$

where $\gamma \in [0, 1]$ is called the discount rate. [1] The discount rate is used for two reasons. First, in some problems immediate rewards matter more than future rewards. Second, it guarantees that the infinite sum remains finite (when $\gamma < 1$). We can also write the return recursively as:

$G_t = R_{t+1} + \gamma G_{t+1}.$

A value function v(s), which evaluates the long-term value of the state s, is the expected return starting from state s:

$v(s) = \mathbb{E}[\,G_t \mid S_t = s\,].$

We will go into details of the value function later.
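As a small sanity check on the return formulas above, here is a minimal Python sketch that computes the discounted return both as a sum and via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward sequence is an arbitrary example:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

def discounted_return_recursive(rewards, gamma=0.9):
    """Same quantity via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

rewards = [1.0, 0.0, 2.0, 5.0]               # illustrative reward sequence
print(discounted_return(rewards))            # ≈ 6.265
print(discounted_return_recursive(rewards))  # same value
```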

The multi-armed bandit problem (also known as the k-armed bandit problem) is a Markov decision process with only one state. It provides the basic intuition behind topics such as MDPs, action-selection methods, and more. If you have not heard of it before, you may see this page for more information and details.
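If you want a quick feel for the bandit setting, below is a minimal ε-greedy sketch; the three arms, their true mean rewards, and the Gaussian noise are all assumptions made for illustration:

```python
import random

# A minimal epsilon-greedy agent for a hypothetical 3-armed bandit.
true_means = [0.2, 0.5, 0.8]   # assumed true mean reward of each arm
n_arms = len(true_means)

q_estimates = [0.0] * n_arms   # estimated value of each arm
pull_counts = [0] * n_arms
epsilon = 0.1

for step in range(10_000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: q_estimates[a])

    reward = random.gauss(true_means[arm], 1.0)   # noisy reward from the chosen arm

    # Incremental sample-average update of the estimate.
    pull_counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]

print("estimated arm values:", [round(q, 2) for q in q_estimates])
```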

State Transition Matrix:

The state transition probability from state s to a successor state s' is $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$, and the state transition matrix $\mathcal{P}$ collects these probabilities for all pairs of states, with each row summing to 1. [2]

Markov Decision Process:

A Markov decision process (MDP) is a way to formalize almost all RL problems as a tuple (S, A, P, R, γ), where S denotes the set of states, A the set of actions, R the immediate rewards, γ the discount factor, and P the dynamics of the MDP such that

$p(s', r \mid s, a) = \mathbb{P}[\,S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\,]$

for all $s', s \in S$, $r \in R$, and $a \in A$. It is the probability of observing the next state s' and getting the reward r, given that the agent is in state s and takes the action a.

★ It is an environment in which all states are Markov.[2]

★ The environment is fully observable. (There is also a partially observable Markov decision process (POMDP))[2]
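To see the tuple (S, A, P, R, γ) as a concrete object, here is a tabular, made-up two-state MDP written as plain Python data; the state and action names and all numbers are illustrative assumptions:

```python
# A hypothetical two-state MDP written out as the tuple (S, A, P, R, gamma).
# dynamics maps (s, a) to a list of (next_state, reward, probability) triples,
# i.e., a tabular version of p(s', r | s, a).
states  = ["s0", "s1"]
actions = ["stay", "move"]
gamma   = 0.9

dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}

# Sanity check: for every (s, a), the probabilities must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for (_, _, p) in outcomes)
    assert abs(total - 1.0) < 1e-9, f"p(s', r | {s}, {a}) does not sum to 1"
```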

Policy:

A policy is a mapping that describes the behavior of an agent. There are two types of policies:

Deterministic policy: No uncertainty. The policy function π takes the state as input and gives the action as output, which we write as π(s) = a.

Stochastic policy: There is a probability distribution over actions given states, which we write as $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.

Let us consider our graph example and an agent at N-2. A deterministic policy would always move the agent to the same next node, whereas a stochastic policy would assign a probability to each possible next node (see the short sketch below).

Notice that a stochastic policy must satisfy $\sum_{a} \pi(a \mid s) = 1$ and $\pi(a \mid s) \ge 0$ for every state s. [1]

In reinforcement learning, we aim to find the optimal (best) policy that will lead to the maximum expected cumulative reward.
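Returning to the graph example, the sketch below contrasts a deterministic policy with a stochastic one for an agent at N-2; the action probabilities are illustrative assumptions, not taken from the original figure:

```python
import random

# Hypothetical policies for the graph example; the "action" here is simply
# which node to move to next.

# Deterministic policy: pi(s) = a
deterministic_policy = {"N-2": "N-3"}

# Stochastic policy: pi(a | s) is a probability distribution over actions.
stochastic_policy = {"N-2": {"N-3": 0.7, "N-5": 0.3}}

def act(state, policy, stochastic=False):
    """Pick the next node according to the given policy."""
    if not stochastic:
        return policy[state]
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("N-2", deterministic_policy))                # always N-3
print(act("N-2", stochastic_policy, stochastic=True))  # N-3 about 70% of the time
```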

★ From the dynamics function, we can compute the entries of the state transition matrix as follows: [2]

$\mathcal{P}^a_{ss'} = \mathbb{P}[\,S_{t+1} = s' \mid S_t = s, A_t = a\,] = \sum_{r \in R} p(s', r \mid s, a).$

Next, we will dive into the state-value and action-value functions. The state-value function measures the expected return starting from state s, whereas the action-value function measures the expected return starting from state s and taking action a in that state.

State-Value Function & Action-Value Function & Bellman Equations:

We define the state-value function (i.e., the long-term value of the state s) and the action-value function (i.e., the long-term value of taking the action a in state s) under a policy π as:

$v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,], \qquad q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,].$

It is very important to understand these two functions because they form the basis of many algorithms, in particular the DQN variants and Actor-Critic methods that we will talk about later. Using the recursive form of the return, $G_t = R_{t+1} + \gamma G_{t+1}$, both the state-value and the action-value functions can be decomposed into the immediate reward plus the discounted value of the successor state:

$v_\pi(s) = \mathbb{E}_\pi[\,R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\,],$

$q_\pi(s, a) = \mathbb{E}_\pi[\,R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\,].$

These equations are known as the Bellman expectation equations, and they are the cornerstone of the path from value functions to the optimal policy.

The properties of expectation give us a useful way to rewrite the Bellman equation for the state-value function as

$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma v_\pi(s') \,\big].$

Moreover, we can express the Bellman equation in matrix form,

$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi,$

where $v_\pi$ and $\mathcal{R}^\pi$ are vectors of length n, $\mathcal{P}^\pi$ is the n × n state transition matrix under the policy π, and n denotes the number of states. [2] Now the problem has become a system of linear equations, so if we know the state transition matrix and the rewards, we can find the state-value function by solving it directly: $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$.
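Here is a minimal sketch of that direct solution with NumPy, using a made-up three-state Markov reward process (a fixed policy is assumed, so only P, R, and γ appear):

```python
import numpy as np

# Direct solution of v = R + gamma * P v for a tiny, made-up 3-state example
# (P is the state transition matrix under a fixed policy, R the expected
# immediate reward in each state).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^(-1) R, solved as a linear system rather than by
# forming the inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print("state values:", v)
```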

However, this approach is not applicable in general: because of the computational complexity of the matrix inversion, direct solutions are practical only for problems with a small number of states. For large state spaces, we use iterative methods instead, e.g. (a minimal policy-evaluation sketch follows the list):

  • Dynamic Programming
  • Monte Carlo
  • TD Learning
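As promised, here is a minimal iterative policy-evaluation sketch in the dynamic-programming spirit: it repeatedly applies the Bellman expectation backup on the same made-up three-state example until the values converge:

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply v <- R + gamma * P v
# instead of inverting a matrix. P, R, and gamma are the same made-up
# example as in the direct-solution sketch above.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

v = np.zeros(3)
for _ in range(1000):
    v_new = R + gamma * P @ v          # one sweep of the Bellman backup
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print("state values:", v)   # converges to the same values as the direct solve
```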

In many textbooks and other sources, you may come across the following versions of the Bellman equations, relating the state-value and action-value functions for an agent following the policy π: [2]

$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma v_\pi(s') \,\big].$

(Do not worry if it seems complicated; the implementation is easier than it looks. :) )

Optimal Policy & Optimal Value-Function:

The optimal state-value function is defined as the maximum state-value function over all policies: [2]

$v_*(s) = \max_{\pi} v_\pi(s).$

In the same manner, the optimal action-value function is the maximum action-value function over all policies: [2]

$q_*(s, a) = \max_{\pi} q_\pi(s, a).$

It may not be clear now, but we have the optimal policy if we know the optimal value function. Let’s see how!

I have mentioned the optimal policy, though I have not given a proper definition until now. Define a partial ordering over policies: $\pi \ge \pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all states s. Then, for any MDP, there exists an optimal policy $\pi_*$ that is better than or equal to all other policies; all optimal policies achieve the optimal state-value function, $v_{\pi_*}(s) = v_*(s)$; and all optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$. [2]

Consider the optimal policy function below to understand the last two points: [2]

$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise.} \end{cases}$

This is exactly why knowing the optimal value function gives us the optimal policy: we simply act greedily with respect to $q_*$, as the construction above shows.
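A tiny sketch of this greedy construction: given a (made-up) table of optimal action values, the optimal policy simply picks the arg-max action in every state:

```python
# Extracting a greedy (optimal) policy from a known optimal action-value
# function. The q-table below is a made-up example, not a computed q*.
q_star = {
    "s0": {"left": 1.0, "right": 3.5},
    "s1": {"left": 2.0, "right": 0.5},
}

# pi*(s) = argmax_a q*(s, a): act greedily with respect to q*.
pi_star = {s: max(q_values, key=q_values.get) for s, q_values in q_star.items()}
print(pi_star)   # {'s0': 'right', 's1': 'left'}
```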

Bellman Equations for Optimality:

$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma v_*(s') \,\big],$

$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma \max_{a'} q_*(s', a') \,\big],$

where $p(s', r \mid s, a)$ is the dynamics function defined above and the maximum is taken over the available actions.

The Bellman optimality equations are non-linear, hence we cannot solve them with the linear-algebraic approach above. Instead, there are many iterative solutions, which are divided into two categories: model-based and model-free methods.
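As a taste of the iterative, model-based side, here is a minimal value-iteration sketch that applies the Bellman optimality backup on the made-up two-state MDP from earlier; the convergence tolerance and γ are arbitrary choices:

```python
# A minimal value-iteration sketch (a model-based, dynamic-programming method)
# on a made-up 2-state MDP with known dynamics p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        # Bellman optimality backup: v(s) <- max_a sum p(s',r|s,a) [r + gamma v(s')]
        best = max(
            sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-8:
        break

print(v)
```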

Our choice of method depends on how much we know about the MDP we want to solve. For instance, model-based methods require that we know the dynamics of the environment; in such cases we solve the problem via dynamic programming. When we do not know much about the environment, we rely on model-free methods to find the optimal policy: the agent first collects samples/experiences by interacting with the environment and then learns from these experiences. In real-life RL problems, we often benefit from model-free methods such as Monte Carlo, Q-learning, and PPO, so I will continue with these algorithms and their implementations in my next blog post.

What is next?

  • Monte Carlo Method
  • Temporal Difference Learning
  • Value-Based Methods (Q-learning)
  • Policy-Based Methods (Policy Gradient, PPO)
  • Actor-Critic Methods
  • Multi-Agent Learning

Resources:

1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction

2. DeepMind & UCL (David Silver), Reinforcement Learning Lecture Series

3. DeepMind & UCL (Hado van Hasselt), Reinforcement Learning Lecture Series

4. Udacity, Deep Reinforcement Learning Nanodegree Program

5. Coursera & University of Alberta, Reinforcement Learning Specialization

6. https://medium.com/kocdigital/derin-pekistirmeli-ogrenme
