[ Archived Post ] Deep Reinforcement Learning (John Schulman, OpenAI)

Jae Duk Seo
7 min read · Dec 26, 2018


Please note that this post is for my own educational purposes.

Outline of this talk: introduce RL, give an overview of some leading methods, and discuss them.

RL → the study of how an agent makes decisions in some environment. Any setting in which there is a goal we want to achieve can be described in these terms. (Deep RL → neural networks as function approximators → they can approximate the policy, or even a model of the world itself.)

Robotics is a classical example of where RL can be used (learning to walk, stay upright, etc…). Other application areas exist as well, such as inventory management. (Even within ML itself, there are methods such as attention that people are looking into.) (Biology → protein structure prediction.)

RL problems can also be attacked with gradient-free optimization methods.

The supervised learning setting. (This setting can be related to the contextual bandit problem.)

Bandit-style problems apply to personalized recommendation and highlight the trade-off between exploration and exploitation. (The big difference from supervised learning is that we do not have a loss function; there is no function to take the gradient with respect to.)
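As an illustration of the exploration/exploitation trade-off, here is a minimal epsilon-greedy sketch for a (non-contextual) bandit. The `pull` reward function and all names are made up for this example, not taken from the talk.

```python
import random

def pull(arm):
    """Hypothetical environment: returns a noisy reward for the chosen arm."""
    true_means = [0.1, 0.5, 0.3]
    return true_means[arm] + random.gauss(0, 0.1)

def epsilon_greedy_bandit(n_arms=3, steps=1000, eps=0.1):
    counts = [0] * n_arms        # how many times each arm was pulled
    values = [0.0] * n_arms      # running estimate of each arm's reward
    for _ in range(steps):
        if random.random() < eps:                       # explore
            arm = random.randrange(n_arms)
        else:                                           # exploit
            arm = max(range(n_arms), key=lambda a: values[a])
        r = pull(arm)                                   # only the chosen arm's reward is observed
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
    return values
```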

With full RL there are multiple states, and now previous actions matter.

A very good summary of the difference between RL and supervised learning. But when should we use RL?

Other optimization methods can work as well; additionally, a lot of problems can be thought of as contextual bandit problems. (There is no state, and that area has a much better theoretical understanding.)

However, some areas can definitely benefit from full RL (such as robotic manipulation or Go playing).

At the heart of RL is the MDP, which has states, actions, and transition probabilities. (Typically we optimize the cumulative reward.)
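In standard notation (a reconstruction, not the exact slide), the objective is to maximize the expected cumulative reward:

```latex
\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right],
\qquad a_t \sim \pi(\cdot \mid s_t),\quad s_{t+1} \sim P(\cdot \mid s_t, a_t).
```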

Training an RL agent can happen in an episodic setting, where the reward is tallied at the end of each episode. (The termination can be good, bad, or something in between.)

Policy → the function that the agent uses to choose its action; it can be either deterministic or stochastic.

Sample the first state → first action → next state and reward, and this repeats.
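A minimal rollout loop matching this sampling pattern, assuming a gym-style `env` with `reset()`/`step(action)` and a `policy(state)` callable that returns an action (both interfaces are placeholders for illustration):

```python
def rollout(env, policy, max_steps=1000):
    """Collect one episode of (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()                               # sample the first state
    for _ in range(max_steps):
        action = policy(state)                        # sample an action from the policy
        next_state, reward, done = env.step(action)   # environment returns next state and reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```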

A graphical point of view. The policies can be parameterized in different ways.

Now we are going to cover policy gradient methods. The objective is the same: maximize the expected reward over some policy. (Run a bunch of episodes and optimize toward the best decisions.)

These are methods in which we improve the actions by taking derivatives of the objective with respect to the policy parameters.

A very fundamental concept in policy gradient methods: the score function gradient estimator (SFGE). (We want to compute the gradient of an expectation with respect to theta.) (It is an unbiased estimator of the gradient → with a lot of samples it converges to the right value.)
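In standard notation (reconstructed, not quoted from the slide), the score function gradient estimator is

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p(x;\theta)}\big[f(x)\big]
= \mathbb{E}_{x \sim p(x;\theta)}\big[f(x)\,\nabla_\theta \log p(x;\theta)\big]
\approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\nabla_\theta \log p(x_i;\theta).
```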

Another way of deriving the derivative: the importance sampling approach.

If the function value is really good, then we push up the log probability of that sample. (Simple intuition; the theory still applies even if f(x) is discontinuous, or if we do not have access to the function itself.)

What the SFGE is doing, in a graphical sense. (It is a general method and can be used in different settings.)

Now that estimator is applied to the policy.

And now we just move terms inside the expectation. (But we only look at the future reward from each timestep → lower variance.)

By introducing a baseline we can further reduce the variance, without adding bias. (Estimator → the quantity inside the expected value.)

Discounting is also one of the methods that reduce variance. (Nearer-term rewards matter more.)
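Putting the last few points together (reward-to-go, a baseline b(s_t), and a discount gamma), the commonly used estimator takes this standard form (written from memory rather than copied from the slides):

```latex
\hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\left( \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'} \;-\; b(s_t) \right).
```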

This looks like a hack, but there is a solid mathematical reason behind it.

A general outline of the policy gradient algorithm. (This method by itself is already good.)
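A sketch of one iteration of such a policy gradient loop, assuming PyTorch, a gym-like `env` whose `step` returns `(state, reward, done)`, and a `policy` network producing action logits; all of these names are placeholders, and this is a simplification of the outline in the talk.

```python
import torch
from torch.distributions import Categorical

def run_episode(env, policy):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()
        state, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    return log_probs, rewards

def policy_gradient_step(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = run_episode(env, policy)
    # Discounted reward-to-go for each timestep (lower variance than the total return).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)
    baseline = returns.mean()  # crude constant baseline
    loss = -(torch.stack(log_probs) * (returns - baseline)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```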

RL → the current policy affects what input data we will see in the next episode. (So the learning rate is very important.) If a step makes the policy worse, it then collects worse data.

We can constrain the update so that the new policy is not too different from the old one. (Trust regions.)
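The trust region idea, roughly as in TRPO: maximize a surrogate objective while keeping the new policy close to the old one in KL divergence (standard form, not quoted from the slides):

```latex
\max_\theta \;\; \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}\big[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right) \big] \le \delta.
```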

Now, instead of using the function b just as a baseline, we do something more advanced with it. (This decreases variance a lot, but it increases bias.)

Q-function learning methods (not optimizing the policy directly → rather, just measure how good the actions are. They can solve MDPs very well.)

Q function → how good a given state-action pair is
V function → the expected reward starting from this state
A function → how much better this action is, compared to what the policy would have done on average (formal definitions below)
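In symbols (standard definitions, reconstructed rather than quoted):

```latex
Q^{\pi}(s,a) = \mathbb{E}\!\left[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a,\; \pi \right], \qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi}(s,a) \right], \qquad
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).
```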

So we are going to explicitly store the Q function instead of the policy pi, and we are going to use the Bellman equation (a consistency equation).
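The Bellman consistency equation for Q under a fixed policy, in its standard form:

```latex
Q^{\pi}(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}
\big[ r(s,a) + \gamma\, Q^{\pi}(s', a') \big].
```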

So we are just manipulating the mathematical formula. Q-star → involves the optimal policy.

So at each state, we pick the action with the best expected reward.
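The corresponding Bellman optimality equation and the greedy policy it induces (standard form):

```latex
Q^{*}(s,a) = \mathbb{E}_{s'}\big[ r(s,a) + \gamma \max_{a'} Q^{*}(s', a') \big],
\qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a).
```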

The bandit problem can also be solved using the Bellman equation.

A different method of solving the MDP. (This is when we have full access to the MDP, i.e. the full table of transition probabilities.)
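One classic method in this known-MDP setting is value iteration; a minimal sketch, where the table layout `P[s][a] = [(prob, next_state, reward), ...]` is just an assumption for illustration:

```python
def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """P[s][a] is a list of (prob, next_state, reward) triples (assumed layout)."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Back up each action's expected value, then keep the best one.
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```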

But in RL those quantities are not known; they have to be estimated by interacting with the environment.

And now we are going to use a neural network to estimate Q.

But there are more recent algorithms.

These have either an experience replay buffer or a target network. (SARSA is another such algorithm.) Policy gradient methods vs. Q-function methods.
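A sketch of the target-network piece: computing one-step Q-learning targets from a replay batch with a separate, frozen target network (DQN-style; PyTorch assumed, names and batch layout are placeholders):

```python
import torch

def q_learning_targets(batch, target_net, gamma=0.99):
    """batch = (states, actions, rewards, next_states, dones) as tensors (assumed layout)."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():                                # targets are held fixed during the update
        next_q = target_net(next_states).max(dim=1).values
    # y = r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states.
    return rewards + gamma * (1.0 - dones) * next_q
```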

Reference

  1. Deep Reinforcement Learning (John Schulman, OpenAI). (2018). YouTube. Retrieved 9 December 2018, from https://www.youtube.com/watch?v=PtAIh9KSnjo
