Introduction, MDP — Reinforcement Learning #1

Minkyu Kim
6 min read · Jul 19, 2019

Introduction to Reinforcement Learning

3 types of machine learning

  1. Supervised learning (including semi-supervised learning)
  2. Unsupervised learning
  3. Reinforcement learning

1. Definition

Reinforcement learning is a branch of machine learning in which an agent learns how to interact with an environment so as to maximize some notion of cumulative reward.

2. Background concept

(a) Supervised learning: “learn by example”

  • Here are some examples of good and bad outcomes; try to learn the patterns of each.

(b) Reinforcement learning: “learn by experience”

  • Here is a world (environment); try to learn its patterns by exploring (interacting with) it.

3. Basic Mathematics

In RL problems, we describe the whole world as a Markov Decision Process (MDP). The agent is supposed to choose an optimal action based on its current state so as to obtain the best reward. How can we find optimal policies from the experience gathered as the agent interacts with the environment?

Markov decision processes (MDP)

An MDP is defined as the collection of the following (a small code sketch follows the list):

Agent : The learner or decision maker

Environment: The thing it interacts with, comprising everything outside the agent

Agent-environment interaction in an MDP (Figure 3.1 from [1])
  • States: S
  • Actions: A
  • Transition model: T(s,a,s’) ~ P(s’|s,a)
  • Rewards: R(s), R(s,a), R(s,a,s’)
  • Policy: π(a|s) ~ P(Aₜ=a | Sₜ=s)
  • Discount factor, γ ∈ [0,1]
  • Sequence or trajectory: S₀, A₀, R₁, S₁, A₁, R₂, …
  • Dynamics: p(s’, r | s, a) = Pr{Sₜ=s’, Rₜ=r | Sₜ₋₁=s, Aₜ₋₁=a}
  • In a Markov decision process, the probabilities given by p completely characterize the environment’s dynamics. That is, the probability of each possible value for Sₜ and Rₜ depends only on the immediately preceding state and action, Sₜ₋₁ and Aₜ₋₁, and, given them, not at all on earlier states and actions.
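
To make these ingredients concrete, below is a minimal Python sketch of an MDP and a single agent-environment interaction step. The two states, two actions, rewards, and probabilities are made up purely for illustration; they are not from the textbook.

```python
import random

# Illustrative two-state MDP; every name and number here is an assumption for demonstration.
states = ["s0", "s1"]
actions = ["left", "right"]

# Dynamics p(s', r | s, a): for each (s, a), a list of (next_state, reward, probability).
dynamics = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

# A stochastic policy pi(a | s): probability of selecting each action in each state.
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    next_state, reward, _ = random.choices(outcomes, weights=[p for _, _, p in outcomes], k=1)[0]
    return next_state, reward

# One step of the agent-environment loop: S_t, A_t -> R_{t+1}, S_{t+1}
s = "s0"
a = random.choices(list(policy[s]), weights=list(policy[s].values()), k=1)[0]
s_next, r = step(s, a)
print(f"S_t={s}, A_t={a}, R_t+1={r}, S_t+1={s_next}")
```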

In RL, the goal of the agent is formalized in terms of the reward signal passed from the environment to the agent. (The goal is a slightly different notion from the reward itself.)

The agent’s goal is to maximize the total amount of reward it receives, i.e., the expected value of the cumulative sum of rewards.

In general, we seek to maximize the expected return, where the return, denoted Gₜ, is defined as some specific function of the reward sequence. In the simplest (episodic, undiscounted) case it is just the sum of the rewards: Gₜ = Rₜ₊₁ + Rₜ₊₂ + Rₜ₊₃ + … + R_T, where T is the final time step.

This approach makes sense in applications in which there is a natural notion of a final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any sort of repeated interaction.

Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.

There are also cases with no terminal state → continuing tasks

Discounting

This concept is needed because we would like to give more weight to near-term rewards when calculating the cumulative sum (and, in continuing tasks, to keep the return finite). With a discount factor γ the return becomes Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …, and the agent tries to select actions so that this sum of the discounted rewards it receives over the future is maximized.
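
As a quick numeric illustration, the discounted return can be computed directly from this definition; the reward sequence below is made up for the example.

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # a made-up sequence R_{t+1}, R_{t+2}, ...

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 0.9**2 * 1 + 0.9**4 * 5 = 0.81 + 3.2805 = 4.0905
```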

Policies and Value Functions

Value functions (v(s), q(s,a)): functions that estimate how good it is for the agent to be in a given state (or state-action pair)

A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy π at time t, then π(a|s) is the probability that Aₜ = a when Sₜ = s.

Value function of a state s under a policy π: vπ(s) = Eπ[ Gₜ | Sₜ = s ]
Action-value function of a state s and action a under a policy π: qπ(s, a) = Eπ[ Gₜ | Sₜ = s, Aₜ = a ]
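
Because vπ(s) is an expected return, one simple (if inefficient) way to picture it is to average the sampled discounted returns of many episodes that start from s and follow π. The sketch below does exactly that; the toy environment, its reward of 1 per step, and the 0.5 termination probability are assumptions made only for illustration.

```python
import random

gamma = 0.9

def sample_episode(start_state, policy, step, max_steps=100):
    """Roll out one episode under the policy and return its discounted return G_0."""
    s, G, discount = start_state, 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = step(s, a)
        G += discount * r
        discount *= gamma
        if done:
            break
    return G

def mc_value_estimate(start_state, policy, step, n_episodes=1000):
    """Monte Carlo estimate of v_pi(start_state): the average return over sampled episodes."""
    returns = [sample_episode(start_state, policy, step) for _ in range(n_episodes)]
    return sum(returns) / len(returns)

# Hypothetical toy environment: every step gives reward 1 and ends the episode
# with probability 0.5, regardless of the action taken.
def toy_step(s, a):
    return s, 1.0, random.random() < 0.5

print(mc_value_estimate("s0", lambda s: "stay", toy_step))
```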

For any policy π and any state s, the following consistency condition holds between the value of s and the values of its possible successor states (the Bellman equation):

vπ(s) = Σ_a π(a|s) Σ_{s’,r} p(s’, r | s, a) [ r + γ vπ(s’) ]

Value Function Example (Grid World)

Figure 3.2: grid world example from the textbook [1]

  • The cells of the grid correspond to the states of the environment.
  • Four actions are possible in each state: north, south, east, and west, each chosen with equal probability.
  • Rewards:
    — Actions that would take the agent off the grid: −1 (the state does not change)
    — All other actions: 0
    — From state A, all four actions yield a reward of +10 and take the agent to A’
    — From state B, all four actions yield a reward of +5 and take the agent to B’
  • Discount factor: γ = 0.9 (a sketch of computing the resulting value function follows this list)
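
A sketch of how these values can be computed: repeatedly apply the Bellman consistency condition (iterative policy evaluation) under the equiprobable random policy until the values stop changing. The cell coordinates used for A, A’, B, B’ follow the textbook figure, and the code itself is an illustrative implementation rather than the textbook's.

```python
# Iterative policy evaluation for the 5x5 grid world under the equiprobable random policy.
N, gamma = 5, 0.9
A, A_prime, B, B_prime = (0, 1), (4, 1), (0, 3), (2, 3)   # coordinates as in Figure 3.2 of [1]
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # north, south, west, east

def step(state, action):
    """Deterministic grid dynamics: return (next_state, reward)."""
    if state == A:
        return A_prime, 10.0
    if state == B:
        return B_prime, 5.0
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0   # off-grid move: stay put, reward -1

# Sweep Bellman expectation backups until the value function stops changing.
V = {(row, col): 0.0 for row in range(N) for col in range(N)}
while True:
    delta = 0.0
    for s in V:
        new_v = sum(0.25 * (r + gamma * V[s2]) for s2, r in (step(s, a) for a in actions))
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print(round(V[A], 1), round(V[B], 1))   # roughly 8.8 and 5.3, matching the figure in [1]
```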

Discussion

State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A’, from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B’, which has a positive value. From B’, the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.

Optimal Policies and Value Functions

Solving an RL task → finding an optimal policy

A policy π is defined to be better than or equal to a policy π’ if its expected return is greater than or equal to that of π’ for all states. In other words,

π ≥ π’ if and only if vπ(s) ≥ vπ’(s) for all s ∈ S

An optimal policy: there is always at least one policy that is better than or equal to all other policies; such a policy is called an optimal policy, denoted π*.

Optimal value (action-value) functions: v*(s) = max_π vπ(s) and q*(s, a) = max_π qπ(s, a), where the maximum is taken over all policies π.
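
When a full dynamics model is available, one standard way to obtain v* and a greedy optimal policy is value iteration, which repeatedly applies the Bellman optimality backup v(s) ← max_a Σ_{s’,r} p(s’,r|s,a) [ r + γ v(s’) ]. The sketch below runs it on the same made-up two-state MDP used earlier; all states, actions, and numbers are assumptions for illustration.

```python
# Value iteration sketch: v(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v(s')).
gamma = 0.9
states = ["s0", "s1"]
actions = ["left", "right"]
dynamics = {  # (s, a) -> list of (next_state, reward, probability); made-up numbers
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

def backup(s, a, V):
    """Expected one-step return of taking a in s and then following the values in V."""
    return sum(p * (r + gamma * V[s2]) for s2, r, p in dynamics[(s, a)])

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        new_v = max(backup(s, a, V) for a in actions)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

# Read off a greedy (optimal) policy from the converged optimal values.
pi_star = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
print(V, pi_star)
```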

Optimality and Approximation

Although we defined optimal value functions and optimal policies above, and could compute them exactly for a toy problem (the grid world example), in practice this rarely happens: in real-life problems, optimal policies can be generated only at extreme computational cost. The basic remedy is to approximate them.

Memory availability is another important constraint in RL problems. In tasks with small, finite state sets, we can store the values in arrays or tables (one entry per state or state-action pair). When there are far more states, the value functions must be approximated using some sort of parameterized function representation.

How can we approximate these values effectively? Some state-action pairs are more worth learning about than others in terms of the reward they lead to. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states (exploration vs. exploitation, e.g., the ϵ-greedy algorithm sketched below).
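
A minimal sketch of ϵ-greedy action selection, which mostly exploits the current action-value estimates but explores a random action with probability ϵ; the Q values below are hypothetical.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit

# Made-up action-value estimates for illustration.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"]))
```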

Reference

[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

[2] https://youtu.be/zR11FLZ-O9M
