Ch 12:Reinforcement learning Complete Guide #towardsAGI

Yo what’s up people?? How it it going?

Great that’s good to hear.

So far I talked about Supervised learning, In this chapter I am focusing on Reinforcement learning, another type of machine learning and it is a complete different approach to make the system intelligent.

Let’s get started!

Why??? Why would I have to learn this??? Why RL????

I am a big fan simon sinek so I always start with “why”

You probably know, in supervised learning, we give the data and it’s labels to the algorithm ( mapping between X and Y ).

in unsupervised, we just give the data and let the system find the patterns in the data and give the different classes/outputs/odd man outs.

let’s take a simple baby example

You want your baby to recognize animals so what do you do?

you show animals images to the baby and tell the baby , this is dog, this is cat, that is donkey and etc… many times.

and finally baby learns it and starts recognizing animals from pictures.

That’s cool! we know that it’s fully supervised and its under our supervision.

okay! second task is you want your baby to walk? what do you do??

Initially, the baby can’t walk so you support the baby as much as you can, but you can’t be there all the time so the baby has to learn it by itself.

Finally the baby learns to walk and come closes to you #it steals your heart.

The first task is supervised learning and The second task is Reinforcement learning.

You can’t teach the baby fully supervised, cause we know many things fall in the process.

like when the baby tries to attempt to take a step forward, the baby might fall or roll in the slippery and the environment is not much predictable.

so the baby has to learn it from experience (trial and fail).

So in the same way , when we want a system to accomplish a task by understanding the task by itself, then the system has to apply the “trail and error” rule, Where it is really difficult to use supervised or unsupervised learning methods to accomplish the tasks, there we need new methods like RL.

So that’s why we need Reinforcement learning. #Period.

Ok. How does it work???

Okay we already know where we can apply RL so let’s think of a problem first ( let’s just take earlier baby walking example only)

In RL, Here are few steps that should be defined before applying RL

  1. We need to define a clear “Goal” and an “Environment”
  2. We need to define a set of possible “Actions” an agent can take
  3. We can define a system called “Rewards” if goal is achieved, +ve reward else -ve reward.

Note : we assume that we know the environment, its state, action, and reward and all are finite for a while (it is possible that environment is completely unknown but still RL can solve it) #willdiscuss

The core Idea of RL is

“An agent takes an action from the present state to enter into a new state,based on the action we give an instant reward (+ve or -ve), The agent has to get the maximum cumulative reward and complete the goal”

The rewards system should be set by us, it depends on the problem. In some problems we only give the reward at the end (either +ve reward for accomplishment or -ve reward for failure).

In some cases, we give a little/constant instant reward for every action the agent takes and at the end, the agent will be given a huge +reward for accomplishment or -ve reward for failure. The agent has to get the maximum cumulative reward.

The reward feedback does not tell the agent directly which action to take. Rather, it indicates how valuable some sequences of states and actions are

A typical RL follows this


Things we should define before the RL solution for the problem we took are

  1. The goal of the baby is to take 10 steps without falling and the environment is a floor with tiles.
  2. The baby can take steps(Actions) forward right leg step, forward left leg step, fall and get up.
  3. If the baby completes the task then the baby gets a candy(+ve reward)

Note: Here there is no negative reward for falling and no instant reward for the steps(We may add it later) so the goal is to get the final reward .

So the baby follows above approach to complete the goal to get the candy.

Initial position of the baby is get up, then the baby tries to take a forward step (right or left), if the step is taken successfully then the baby moves other leg to take an another step and so on..

if anywhere the baby is fallen , then the baby has to count it from step 1. It’s like a game.

Initially the baby falls many times in the process of trailing and failing, but the baby tries a lot to walk so eventually the baby manages to walk or take 10 steps without falling.

I know this is a simple example for sake of explanation and

That’s how Reinforcement learning works! #Period.

So you mean to say “Systems learn to do tasks by itself” right???

Cool! Now What is Reinforcement learning????

Reinforcement learning is the process of making the systems learn from the environment by itself.

There is a clear goal that the system needs to accomplish so the system has to complete the goal by applying the trail and error approach.

Supervised & Unsupervised → Learning from the data.

Reinforcement learning → learning from the environment.

So far we have the knowledge of RL ( Why, How, What ) theoretically only but what is the practical way or solution to make this work???

Markov Decision Process(MDP).

You can think of MDP as a framework for RL or a design pattern or an approach/process/technique for solving RL problems.

if MDP is solved then RL problem is solved.

Before I dive into MDP, there is something you should understand which is

Markov Property

“ The future is independent of the past given present.”

Which means Knowing the current state of the agent helps the agent decide what action to take ( the past history can be thrown ).

The baby example: let’s say the baby took 4 steps then the baby fell to the ground so the current state of the baby is “Fall” .

so based on “Fall” the next best action is “get up” and it does not matter if the baby completed 4 steps.

The probability of taking the “get up” action is much higher than taking the “step” action based on the current state which is “Fall”.

#Wikipedia → In probability theory and statistics, A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state.

Markov decision process is a tuple of 5 components →(S,A,P,R, γ)

S → A set of states in the environment

A → A finite set of actions an agent can take in the environment

P → Transition Probability Matrix (for all states)

R →Reward component for each state

γ → Discount factor ( γ ∈ [0,1] )

And it follows the Markov property.

let’s relate to our problem (I made it simple to explain)

S = {“fallen”, “Stand”, “Rightlegstep”, “leftlegstep”} #think of body position

A ={“ForwardstepRight, “ForwardstepLeft”, “Fall”, “Getup”}

P = State Transition probability Matrix

Just sample probability distribution.

to illustrate, if current state is S1 (Fallen) , the probability of get up(S2) is high 0.7. if current state is S3 then the probability of the baby falling is same as making it to the next step(S4).

We can also think like “ from S1 if we take an action A1 we get to S2 (lets say 0.5 ) and if we take an action A2 we get to S2 (lets say 0.3 ) and so on.. for all states S1,S2,S3 and S4 “ # in above picture I did not specify actions but those are also included #

We assume for now that the transition probability is given explicitly, although in many practical circumstances we might need to estimate this from examples (e.g. supervised learning).

Let’s add some rewards. for each step the agent takes gets a candy , for falling the agent looses all the candy the agent might have, for getting up after falling 2 candies, and for finishing the task(10 continue steps) gets 10 candies.

The circles represent the states and the values in the states represent the rewards in candies.

Assume there are no self loops and arrow indicates one side otherwise two sides(can be a loop) except stop(it is not a state, it is the end after 10 continues steps).

I apologize if the diagram/concept doesn’t make sense ( i could think of this only at the time of writing ,hope it gets clear as we go through).

and last but not least γ → Discount factor

so the discount factor tells how much discount is applicable for the future rewards. It must be between 0 and 1.

γ is so and so percentage that gets multiplied with the reward for every step the agent takes so we don’t get full reward,we get discounted reward.

let’s say at time step(t=1) I got 10 candies but if γ → 0.9 then 0.9 * R → 0.9*10 =9 candies #will make sense as we go through.

The larger the gamma, the smaller the discount (so We get decent future rewards). This means the agent focuses more about the long term reward.

On the other hand, the smaller the gamma, the bigger the discount(so We get terrible future rewards). This means our agent focuses more about the short term reward.

Let’s just keep it aside for our problem so set γ → 1 (so we look for long term reward which is 10 candies).

Why γ → Discount factor???

well, there are few reasons as far as I know

  1. It is for Math convenient(uncertainty of future rewards)
  2. to avoid infinite returns in cyclic Markov process

For this problem it is set as 1, the more problems you solve the more idea you get about the γ Discount factor.#let’s ignore for a while.

So RL Goal using MDP is to get the maximum expected cumulative reward (Gt)

we set γ as 1. if γ value is present then the actual equation is

so far We just understood the components now let’s focus on the process and how it works / solves the problem.

Before we understand the process letmme define some concepts.

Policy (π)

You can think of policy as a strategy or agent’s behavior function

It maps from state to action. in other words, it specifies what action to take at each state.

  • Deterministic policy → a = π(s)
  • Stochastic policy → π(a|s) = P(At=a,St=s) → based on some probabilities

where π(a|s) is the probability of taking action a in state s under policy π


The model stands for the simulation of the dynamics of the environment

It predicts what the environment will give next

Transition probability → predicts the next state

Reward model → predicts the next reward

what is the next state based on current state s and action a and what is the expected reward based on the current state s and action a.

if we use policy then

Value function

The value function estimates the expected future reward by following the policy π for each state.

v(s) = E [Gt | St = s]

There are two types of value functions

Taken from David silver notes.

State value function determines how good it is for the agent to be in a given state while Action value function determines how good it is to perform a given action in a given state. ( How good → Expected future rewards).


we know now all the pieces of MDP and required vocabulary for solving MDP. so all we need to do now is try connecting the dots.

Connecting the dots……

so a policy chooses actions from states so the best policy or optimal policy chooses the best/right actions from states.

so choosing the right actions from states is the main aim of RL so finding the optimal policy is the solution.

so What’s the optimal policy???

The optimal policy is the policy which maximizes the expected reward for each state.

A policy π is defined to be better than or equal to a policy π 0 if its expected reward is greater than or equal to that of π 0 for all states.

In other words, π ≥ π 0 if and only if vπ(s) ≥ vπ0(s) for all s ∈ S.

There is always at least one policy that is better than or equal to all other policies. This is an optimal policy.

Note: We can have more than one optimal policy for a problem so all can be denoted π∗

π∗ = the policy of maximum expected reward for each state.

so we know that expected reward means value function( state and action ) so

State value function
Action value function.

Okay! lets go way back, to find the optimal policy we need to have the value function, to calculate value function , we need to have Gt(Total cumulative expected reward), to calculate the Gt we need to have R ( rewards for every action) Rt+1,Rt+2,Rt+3+…..

so how can we predict future rewards as we don’t know what action to take every timestep ??? and

how long can go I into the future ???

before I explain this, we really need to thank this mathematician Richard E. Bellman

Value function = immediate reward + Value function of next state (ignore gamma)

This is called Bellman equation.

we have two value functions so the bellman equation for those two are (in david silver classes , this has been explained way better).

We got the value functions that’s cool. now what??

Our job is to find the optimal values for V and Q

The optimal state-value function v*(s) is the maximum value
function over all policies.

The optimal action-value function q*(s, a) is the maximum
action-value function over all policies

These equations are called Bellman Optimality Equations.

so now we can find the optimal policy as we have optimal value functions

I know your head is spinning, its normal in the beginning( the more effort you put the easier it will get).

Thanks Steve!

Okay! if we think these values are parameters ( just like in supervised ) then we need to learn these parameters, in other words we need to do training.

No…. we are not. it is not that of a training, we don’t need a GPU.

There are few ways for this learning

  1. Dynamic programming methods
  2. Monte carlo learning method
  3. Temporal-Difference learning

In this story I cover DP only which is a Model based Algorithm

so in next stories I will explain other two methods.

Model based Algorithms

If we remember we have transition probability P ( it tells us how likely to enter a specific state given current state and action)

this transition probability has to be given explicitly in Model based algorithms these work well for finite states and actions, however these are impractical as the state space and action space grows.

Model free Algorithms

model-free algorithms rely on trial-and-error to update its knowledge. As a result, it does not require space to store all the combination of states and actions. Q-Learning is an example of model-free learning algorithm.

All the algorithms discussed in the next stories fall into this category.

Dynamic programming

Our job is to find the optimal policy right!, one can try out all the policies and pick the best one #sounds simple right?? but

The no of possible policies is equal to the number of actions to the power of the number of states.

for many RL problems , the number states would be way more so it’s a problem. and this is one of main challenges in RL called

“curse of dimensionality” by bellman.

DP can be used to compute the value functions we can easily obtain optimal policies once we have found the optimal value functions, v∗ or q∗, which satisfy the Bellman optimality equations.

A efficient method is to incrementally find the value function for a specific policy and then use the policy which maximizes this value function for the next round. Called “Policy iteration”

The policy iteration algorithm

This algorithm has 3 steps

  1. Policy evaluation (based on policy π (initially random) ,calculate value function)
  2. Policy improvement (improve the policy based on the value function)
  3. Repeat 1 and 2 until π converges
David silver’s

This process of policy iteration always converges to π∗

You may find different ways of writing this algorithm ( notation may change) but just understand “all components are interconnected.”

If you know V, you can get Q and policy and if you know Q, you can get V and policy.

Another way of explaining the same algorithm

Taken from standford RL pdf

Don’t worry if you don’t understand, when we code we will understand well.

Policy iteration = { Policy evaluation + Policy improvement }

This is just pseudo code only to get some idea but when we code these algorithms, a lot of details and computation will be added.

Here is the python script for the algorithm.

Policy evaluation and improvement.

You can find the full code at my GitHub Deep math machine

The code repo takes a simple grid problem and solves it using this algorithm.

so another way of finding optimal policy is

The value iteration algorithm

This algorithm has two steps

  1. it focuses on finding the optimal state value function, once the optimal state value function is found then
  2. it extracts the optimal policy from it.

so how does it find the optimal value function??

We can easily derive the optimal policy from the optimal state value function

Value iteration = optimal value function + policy extraction

I said we need to repeat these blocks right? so there might be a question like how long or how many iterations??

This depends on the problem so we will write a small piece of code ( checks ) to see if the algorithm (policy ) converged.

From Stackoverflow

Here is the python script for the algorithm

You can find the full code at my GitHub Deep math machine

The code repo takes a simple grid problem and solves it using this algorithm.


→ Reinforcement learning is all about learning from the environment through interactions.

→ Finding the optimal policy / optimal value functions is the key for solving reinforcement learning problems.

→Dynamic programming methods are used to find optimal policy/optimal value functions using the bellman optimality equations.

→Dynamic programming methods are model based methods, require the complete knowledge of environment. such as transition probabilities and rewards.

well, that’s pretty much it solving RL using these algorithms.

There is a lot more information to talk (we can write a book on RL) so my intention is to give you a start! we will slowly cover all if possible.

A lot of details and other methods will be explained in next stories.

so That’s all for this story, I will see you next story. #seeya

PS: Don’t worry if you did not understand right away, There is so much content out there on internet so make sure to check them out.


Reinforcement Learning: An Introduction by standford book

David silver video lectures

Markov Decision Process pdf book