Ch 13: Deep Reinforcement Learning — Deep Q-learning and Policy Gradients (towards AGI)

Madhu Sanjeevi ( Mady )
Deep Math Machine learning.ai
Sep 10, 2018


One of the most exciting developments in AI is #DeepRL. Today we are gonna talk about that #getready

In this story I talk about two different algorithms in deep reinforcement learning: Deep Q-learning and Policy Gradients.

Spoiler alert!!!

Before I get started, I assume you have checked my stories from the previous chapters, “Reinforcement Learning Part 1 and Part 2”. If not, please check those out, otherwise it’s difficult to catch up with me.

okay.

Why Deep reinforcement learning???

You know I am a Simon Sinek fan. I always start with “why”, then “how” and “what”.

In the last story we talked about “Q-learning”.

Q-learning estimates the state-action value function Q(S, A) for a target policy that deterministically selects the action of highest value.

Here we have a table Q of size S×A (states × actions).

To calculate the current state-action value, it takes the best next action from the table Q (max_A' Q(S', A')).

Let’s say we have a problem with a total of 4 different actions and 10 different states; then we get a Q table of size 10×4.

It works perfectly fine if we have a limited state space/action space.

But what if the state space is much bigger? Let’s say we take an Atari game as an example, where each frame of the game is treated as a single state; then we have millions of states (depending on the type of game).

Or we can think of a robot walking/moving (here the action space could be larger and unknown).

So we would need to store millions of records in a table in program memory. That’s not something we do, so we need a better solution.

So we use a function approximator, e.g. a neural network, to estimate the Q values.

This is where we borrow ideas from supervised learning to build better things with reinforcement learning.

so how do we do this??

The Q-learning update for estimating Q values, from the last story:

Q(S, A) ← Q(S, A) + α (R + γ max_A' Q(S', A') − Q(S, A))
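As a quick refresher, here is a minimal tabular sketch of that update (the table size matches the 10×4 example above; the environment hookup is left out):

```python
import numpy as np

n_states, n_actions = 10, 4          # the 10x4 example above
alpha, gamma = 0.1, 0.99             # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # the Q table of size S x A

def q_update(s, a, r, s_next):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))"""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```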

And this is how the neural-network Q function works.

The network takes a state (S) as the input and produces the Q-values for every action in the action space (the number of actions depends on the game/environment).

The neural network’s job is to learn the parameters, so assume that the training is done and we have the final network ready.

At prediction time, we use this trained network to pick the next best action in the environment: we give it an input state, the network outputs the Q-values for all actions, and we take the action with the maximum Q-value.

best_action = argmax(NN-predicted Q-values)

Which means → for this state, this is the best action to take in the env.
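A minimal sketch of such a network in Keras (which the author uses later for the Atari code); the state size, layer sizes and action count here are placeholders, and for image states you would swap the dense layers for conv layers:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

state_dim, n_actions = 4, 2   # placeholder sizes; they depend on the environment

# A tiny Q-network: state in, one Q-value per action out (linear output, since this is regression)
q_net = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(state_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_actions, activation="linear"),
])
q_net.compile(optimizer="adam", loss="mse")   # mse = the L2 loss discussed below

def best_action(state):
    q_values = q_net.predict(state[None, :], verbose=0)[0]  # Q-values for all actions
    return int(np.argmax(q_values))                          # best_action = argmax(Q-values)
```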

Simple right!

Now let’s talk about the training part.

This is a regression problem so we use any regression loss function to minimize the total error of the training data.

neural network loss function for predicting the Q values

l2_loss = (predicted − actual)**2

Assume “predicted” comes from the list of action values output by the neural network, where we take the maximum value (the maximum expected reward).

Since we don’t know the “actual” value (this is reinforcement learning, not supervised learning, so there are no labels), we have to estimate it:

actual = R + γ max_A' Q(S', A')

R → the current immediate reward

S' → the next state

max_A' Q(S', A') → the max of the NN’s output list of Q-values for S'

γ → the discount factor, γ ∈ [0, 1]

The reason for this is

If you take that particular action, you observe the reward R and predict the next max reward with some discount.

Assume the next max reward predicted by the model is completely wrong because the model has just started training; we still have the reward R, which is 100% right, so slowly we adjust the network parameters towards reasonable predictions.

In a sense the network predicts its own target value, but since R is not biased (when we step into the environment we observe the reward, which is fully right), the network is gonna update its parameters using backprop and converge.

So we have the inputs, labels, rewards and loss function; we can build a neural network which predicts the Q value for every action.
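Putting those pieces together, here is a hedged sketch of the target and the L2 loss for a single transition (it reuses the q_net from above; the target-network trick comes later, so the same network is used for both terms here):

```python
import numpy as np

gamma = 0.99

def td_target_and_loss(state, action, reward, next_state, done):
    """Estimate the 'actual' value and the squared error for one (S, A, R, S') transition."""
    q_values = q_net.predict(state[None, :], verbose=0)[0]
    next_q_values = q_net.predict(next_state[None, :], verbose=0)[0]

    # actual = R + gamma * max_a' Q(S', a'); no bootstrap term if the episode ended
    actual = reward if done else reward + gamma * np.max(next_q_values)

    predicted = q_values[action]          # the Q-value of the action we actually took
    l2_loss = (predicted - actual) ** 2
    return actual, l2_loss
```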

Which means → for this state, these are the actual values for all actions.

Here is the final image

At prediction time it seems fine, but at training time you may have some doubts about how this is gonna work.

I expect you to have a doubt here; if you don’t, here is my question to you:

“Since we don’t have labels, we said actual = R + γ max_A' Q(S', A'),

but how do we get the values for R and max_A' Q(S', A') from the environment for a particular state?”

Hint: usually we only get the reward R and the next state S' when we perform an action in the env. #thinkaboutit

There are a few tricks that I have to explain to make you understand it well; for the time being just assume that we know the action and the reward R (we’ll see more about it down below).

Anyway, that’s how it works: we have labels, rewards, losses and parameters, so we can train the network, and it acts as a function approximator to estimate the Q values.

so now what is Deep reinforcement learning???

Deep reinforcement learning = Deep learning+ Reinforcement learning

“Deep learning with no labels and reinforcement learning with no tables”.

I hope you get the idea of Deep RL.

Now let’s take a problem to understand its implementation better.

In 2013 DeepMind developed the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

They applied their method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. They found that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them. They published a paper.

Let’s take that paper, “Playing Atari with Deep Reinforcement Learning”.

I am not gonna explain the paper and the code deeply because there are a lot of awesome articles out there talking about it; here are some:

https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning

I would mainly focus on the concepts, math and some ideas.

In simple terms

  • Take the image, turn it into a grayscale image and crop the necessary part of it (sketched below).
  • Apply some convolution filters and fully connected layers with the output layer.
  • Give that image (we call it a “state”), calculate the Q-values, find the error and backpropagate.
  • Repeat this process as long as you want.
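A hedged sketch of that preprocessing step; the 84×84 target size follows the DQN paper, but the exact crop box and the use of OpenCV here are my own assumptions:

```python
import numpy as np
import cv2  # opencv-python, assumed to be available

def preprocess(frame):
    """Turn a raw RGB game frame into a small grayscale 'state' image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)    # RGB -> grayscale
    resized = cv2.resize(gray, (84, 110))             # shrink width to 84, height to 110
    cropped = resized[18:102, :]                      # crop to 84x84, dropping the score area (assumed box)
    return cropped.astype(np.float32) / 255.0         # normalize pixel values to [0, 1]
```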

There are some problems in this process which led us to the Tricks I mentioned before.

Deep Q learning Tricks

  1. Skipping frames: from just one frame we can’t observe motion, e.g. we can’t tell whether the ball is moving up or down by seeing a single frame, so we use this trick called skipping frames, where the agent only takes a new action every 4 frames.

Earlier in this story I asked a question: how do we get the values for R and max_A' Q(S', A') from the environment for a particular state?

The answer is that we can take random actions to collect the data and learn from it. But the environment is a continuous stream of states, so there is a lot of correlation between one frame and the next; trained on them in order, the network is gonna forget what it has seen long ago, which makes it very biased and bad. So…

2. Experience replay: we store every experience (current state, current action, reward, next state) in a memory buffer, then we sample batches from that memory for training.

This way the samples in a batch are less correlated, training behaves better and learning is reasonable.
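A minimal sketch of such a replay memory (the capacity and batch size are arbitrary choices here):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences fall off automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # a (mostly) decorrelated mini-batch
```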

3. Fixed target network: since we don’t have labels, we use this equation to get the labels for training:

actual = R + γ max_A' Q(S', A')

and max_A' Q(S', A') is the max of the NN’s output list of Q-values.

Here we maintain another network, which we call the target network, and we copy the main network’s weights into the target network every N frames/iterations.

So now max_A' Q(S', A') is taken from the target network’s output list of Q-values.

The reason for this is that we update the main network for every frame/action, so the network keeps changing, and it’s not feasible to use that same moving network to calculate the target value.
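A sketch of how that periodic copy might look in Keras (cloning the online q_net and syncing every N frames; the interval is an assumed value):

```python
from tensorflow import keras

# target_net starts as a copy of the online q_net (clone_model copies the architecture only)
target_net = keras.models.clone_model(q_net)
target_net.set_weights(q_net.get_weights())

SYNC_EVERY = 10_000  # copy the weights every N frames (assumed value)

def maybe_sync_target(frame_idx):
    if frame_idx % SYNC_EVERY == 0:
        target_net.set_weights(q_net.get_weights())
```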

4. Reward clipping: every game/environment has its own reward system; some games give points like 100, 200, 300 and so on, some games give 1, 2, 3 and so on. So to normalize rewards and penalties uniformly across all environments, reward clipping is used (in the DQN paper the rewards are clipped to the range [−1, 1]).
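In code this is essentially a one-liner; a sketch clipping to [−1, 1]:

```python
import numpy as np

def clip_reward(reward):
    return float(np.clip(reward, -1.0, 1.0))  # e.g. +300 -> +1.0, -5 -> -1.0
```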

So here is the final algorithm, putting all the tricks together.

The training network Q and the target network Q' get updated as we run more and more episodes.
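A hedged sketch of that loop, stitching together the pieces sketched above (preprocess, ReplayMemory, clip_reward, best_action, q_net/target_net and maybe_sync_target); the environment name, ε-greedy schedule and hyperparameters are assumptions, not the author’s exact setup, and for real Atari frames q_net would need a convolutional architecture:

```python
import gym
import numpy as np

env = gym.make("Breakout-v0")           # assumed environment (classic gym API)
memory = ReplayMemory()
gamma, epsilon, batch_size = 0.99, 1.0, 32

frame_idx = 0
for episode in range(1000):
    state = preprocess(env.reset())
    done = False
    while not done:
        frame_idx += 1
        # epsilon-greedy exploration: random action sometimes, greedy action otherwise
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = best_action(state)

        next_frame, reward, done, _ = env.step(action)
        next_state = preprocess(next_frame)
        memory.store(state, action, clip_reward(reward), next_state, done)
        state = next_state

        # learn from a random batch of past experiences
        if len(memory.buffer) >= batch_size:
            batch = memory.sample(batch_size)
            states = np.array([b[0] for b in batch])
            next_states = np.array([b[3] for b in batch])
            q_targets = q_net.predict(states, verbose=0)
            next_q = target_net.predict(next_states, verbose=0)
            for i, (_, a, r, _, d) in enumerate(batch):
                q_targets[i, a] = r if d else r + gamma * np.max(next_q[i])
            q_net.fit(states, q_targets, verbose=0)    # minimize the L2 loss on the taken actions

        maybe_sync_target(frame_idx)
        epsilon = max(0.1, epsilon * 0.9999)           # slowly decay exploration
```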

I trained this model on a 16-core CPU with 16 GB RAM (i5 processor); it took 2 days to produce the results below.

Here is a small snippet of the code (written in #python #keras #tensorflow #gym #ubuntu #sublime).

You can find the full code here,

Note: the code has been slightly modified after these results; I made little tweaks, so feel free to play with it. The final model files are also available there so you can let the computer play the game.

With that, Deep Q-learning is done!

Since DQN was introduced, a lot of improvements have been made to it, like Prioritized Experience Replay, Double DQN, Dueling DQN etc., which we will discuss in the next stories.

Policy Gradients

In Q-learning, for a given state we calculate the Q value for every action in the action space and pick the max value and its corresponding action (so choosing actions depends on the Q values).

This is called a value-based method, and there is another approach called a policy-based method.

So why policy-based methods?

There are a couple of reasons.

The ultimate goal of RL is to find an optimal policy (it can be done using the value function, which is what we have seen so far, and it can also be done without the value function).

In the above approach, we calculate the Q values for all possible actions in the action space (we only had a few actions, so we could follow that approach, but in the real world the number of actions can be huge or even unbounded).

Ex: robot walking (the action space is large, maybe not even enumerable).

This is a continuous and high-dimensional action space.

So we need to learn the optimal policy for larger action spaces, which is exactly what policy-based methods do.

In policy-based methods, we directly learn our policy function π without worrying about a value function.

Which means we can choose actions without calculating the Q(S, A) values.

Deep-Q-learning is a value based method while Policy Gradient is a policy based method.

There are a couple of advantages of using policy gradient methods:

  1. It can learn a stochastic policy (it outputs probabilities for every action), which is useful for handling the exploration/exploitation trade-off.
  2. Often π is simpler than V or Q.

How does policy gradient work???

As we know, in policy gradients we directly learn the policy function π, which gives a probability distribution over actions.

π (A|S) → Probability of taking action A given state S.

we use function approximation ( ex: neural net) to find this policy function π

π (A|S;θ) → Probability of taking action A given state S with parameters θ.

The diagram is similar to DQN but this is a classification problem while DQN is a regression problem.

The network takes state as an input and produces the probability distribution of actions.

The process in simple terms → we give a state, get the action probabilities, take the action given by π in the env, observe the next state and reward, repeat till the end of the game, and at the end update the network parameters based on the rewards we collected (backprop).
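A minimal sketch of such a policy network and the action-sampling step (the sizes are placeholders; the softmax output is the action distribution π(A|S; θ)):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

state_dim, n_actions = 4, 2   # placeholders; they depend on the environment

# pi(A|S; theta): state in, one probability per action out
policy_net = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(state_dim,)),
    layers.Dense(n_actions, activation="softmax"),
])

def sample_action(state):
    probs = policy_net.predict(state[None, :], verbose=0)[0]   # the action distribution
    return int(np.random.choice(n_actions, p=probs))           # sample an action, don't argmax
```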

let’s talk about the cost function

DQN is a regression problem, so we used the L2 loss as the cost function and minimized it during training. But PG is a classification problem, so we use the log likelihood, i.e. cross entropy.

Since we don’t have labels (“for a given state, this is the best action”), we have to create “fake labels” to continue with training.

How to make the fake labels??

We just sample the actions to be performed in the environment, and we use those sampled actions as the fake labels for their corresponding states.

Σ_t log p(y_t | x_t; θ)

x_t and y_t are the training examples (states) and labels (sampled actions) respectively.

Our goal now is → to increase the log probability of those actions at the last layer.

Policy gradients are similar to supervised learning except for a change in the loss function.

The fake labels can’t be real, so we focus on the outcome (in the case of a game, the final reward) to improve the fake labels so they get close to the actual labels (so to speak).

Let’s take the example of the Pong game (there is an excellent explanation by Andrej Karpathy on this here, make sure to check it).

I borrowed this image from that post, and I will explain the concept by adding a couple of things to it for the rest of the story.

Process

  1. We take a policy and roll out an episode (a game).

There are some number of frames per second, and each frame is considered a state.

2. For every state we sample an action (fake label), perform it in the environment and observe the reward and next state.

3. At the end, if the reward is positive (it tells us the actions we took in the episode were good), we push up the probabilities of all the actions taken in that episode; if the reward is negative (the actions we took in that episode were kinda bad), we push down the probabilities of all the actions taken in that episode.

So our final loss function should look like this:

Σ_t log p(y_t | x_t; θ) * A_t

A_t = R_t (the reward at time step t)

Our goal now is → to increase the log probability of actions that worked and decrease it for those that didn’t, scaled by an advantage A.

We can use the reward as the advantage, because if we get a positive reward we push up the probabilities, and if we get a negative reward we push down the probabilities.
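A sketch of that weighted loss for one episode, reusing the policy_net above; using a GradientTape and Adam here is my own choice, not necessarily the author’s code:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)

def policy_gradient_update(states, actions, advantages):
    """states: [T, state_dim]; actions: [T] sampled 'fake labels'; advantages: [T], e.g. (discounted) rewards."""
    states = tf.convert_to_tensor(states, tf.float32)
    actions = tf.convert_to_tensor(actions, tf.int32)
    advantages = tf.convert_to_tensor(advantages, tf.float32)
    with tf.GradientTape() as tape:
        probs = policy_net(states)                              # [T, n_actions]
        taken = tf.gather(probs, actions, batch_dims=1)         # p(y_t | x_t; theta) for each taken action
        # maximize sum_t log p(y_t|x_t) * A_t  ->  minimize the negative
        loss = -tf.reduce_sum(tf.math.log(taken + 1e-8) * advantages)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
```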

but there is a problem here

How I Met Your Mother (The Blitz)

Yes! Let’s say we lost the game and got a negative reward, so we are about to push down the probabilities of all the actions we took in that episode. Some actions may have been super good in that episode, but at the end we got a negative reward, so we blame those actions too, which is bad.

Blaming all the actions equally is not a good idea, so we slightly modify the raw reward.

At each time step/state we get a reward, so we use the discounted reward (remember it from the last stories?) with a little change in notation.

I notate time step 1 as the first step; some start with time step 0.

G_t = R_t + γ R_(t+1) + γ² R_(t+2) + … is the total discounted reward from time step t, γ being the discount factor.
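A small sketch of computing these discounted returns for one episode, working backwards through the reward list:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """rewards: [R_1, R_2, ...] for one episode -> [G_1, G_2, ...] with G_t = R_t + gamma * G_(t+1)."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```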

Let’s talk about the training!

So we have our labels, loss function, discounted rewards and gradients:

  1. Set up the network architecture (based on the game/environment).
  2. Run the policy π(A|S; θ) to pick the actions (this includes the net’s output and the action-sampling choice).
  3. Take those actions, observe the reward and next state, calculate the loss and gradient, and store them all in memory till the end of the episode.
  4. At the end, calculate the discounted rewards, compute the gradient w.r.t. θ and update the network weights using any optimization algorithm (see the sketch below).
  5. Repeat 2–4 as long as you want.
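Tying the previous sketches together, here is a hedged episode loop (the environment and hyperparameters are assumptions; it reuses sample_action, discounted_returns and policy_gradient_update from above):

```python
import gym
import numpy as np

env = gym.make("CartPole-v1")   # assumed environment with a small discrete action space

for episode in range(2000):
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = sample_action(np.asarray(state, dtype=np.float32))
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state

    # one policy-gradient update per episode, weighted by the discounted returns
    returns = discounted_returns(rewards)
    policy_gradient_update(np.array(states, dtype=np.float32), np.array(actions), returns)
```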

That is the vanilla policy gradients algorithm (often called REINFORCE).

Andrej Karpathy has already written a blog post and a ~130-line Python script from scratch for Pong using policy gradients here.

Some things you can add to it to make it better:

  1. Andrej used only one neuron (one action) at the output layer; we can try multiple actions in the network.
  2. We can try CNNs in the first layers of the network.
  3. We can try advanced optimizers like Adam for the network.

Summary

→ Deep reinforcement learning is like adding a neural network to an environment to accomplish the goals in that env.

→ DQN is like taking some random actions and learning from them through the Q value function; it’s a regression problem (the L2 loss is used) where two networks are used for training.

→ PG is like learning the optimal behavior directly from experience (no value function); it’s a classification problem (maximum log likelihood is used, with a minor change).

Here is the final image that summarizes PG.

And That’s it for this story.

We only covered the vanilla versions of deep reinforcement learning; in the next stories I will talk about more advanced stuff in DRL.

Have a great day/night. See ya!

More resources

John Schulman lectures : https://youtu.be/aUrX-rP_ss4

Deep RL bootcamp videos : https://youtu.be/qaMdN6LS9rA

PG : http://www.scholarpedia.org/article/Policy_gradient_methods

David Silver’s notes : http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/pg.pdf
