[ Archived Post ] A Tutorial on Reinforcement Learning I

Please note that this post is for my own educational purpose.

video from this website

What is RL → An agent acts in the world by taking actions. It interacts with the env and gets some response back from the env (and takes that as its input). This is a form of learning from a critic: in this case the critic is the env, and this is different from supervised learning since we don’t have a direct gradient.

RL is about how to make decisions under uncertainty, and there are many applications. How is it different from ML and AI planning?

We don’t know how the world works: do we know what the domain is? The number of states? We have to explore (how do we actively gather data?) and deal with delayed reward.

RL → a lot of overlap with active learning. Standard RL setting → MDP → a set of states and a set of actions → random transitions between states.

The agent tries to go forward but sometimes stays in the same place, so there is some randomness. (Discount factor → do we care about reward right now or later?) Compute a policy: we saw how to optimize this, which involves solving a linear system of equations.
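The linear system can be written out for a fixed policy. This is the standard policy-evaluation identity for a finite MDP with discount below one; the symbols $P^\pi$ (transition matrix under the policy), $R^\pi$ (expected reward vector) and $\gamma$ (discount) are my own notation, not taken from the video.

```latex
V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}
\quad\Longrightarrow\quad
(I - \gamma P^{\pi})\, V^{\pi} = R^{\pi}
\quad\Longrightarrow\quad
V^{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}
```

Note that this evaluates one given policy; finding the best policy still needs an improvement step on top of it.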

Planning → we know how the world works and just want to know the optimal behavior.

The Markov property is already assumed there, but we can put the full history into the state as well.

First set all Q values to zero, then update the Q values in the table iteratively. (This will converge, at least in the finite setting → we basically have a table.)
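The iterative table update can be sketched as below. This is a minimal sketch of tabular Q-value iteration, assuming the model is known; the data layout (`transitions`, `rewards`) and the two-state toy MDP are made up for illustration and are not from the video.

```python
def q_value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular Q-value iteration for a known, finite MDP.

    transitions[s][a] is a list of (probability, next_state) pairs.
    rewards[s][a] is the expected immediate reward (assumed layout).
    """
    n_states = len(transitions)
    # Start with every Q value at zero.
    q = [[0.0 for _ in transitions[s]] for s in range(n_states)]
    for _ in range(max_iters):
        delta = 0.0
        new_q = [row[:] for row in q]
        for s in range(n_states):
            for a in range(len(transitions[s])):
                # Bellman optimality backup for one state-action pair.
                backup = rewards[s][a] + gamma * sum(
                    p * max(q[s2]) for p, s2 in transitions[s][a]
                )
                delta = max(delta, abs(backup - q[s][a]))
                new_q[s][a] = backup
        q = new_q
        if delta < tol:  # largest change in this sweep is tiny: stop
            break
    return q


# Hypothetical toy MDP: in state 0, action 1 moves forward with prob 0.8,
# otherwise the agent stays put; state 1 is a rewarding self-loop.
toy_transitions = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
toy_rewards = [[0.0, 0.0], [1.0, 1.0]]
toy_q = q_value_iteration(toy_transitions, toy_rewards)
```

Each sweep applies the backup to every state-action pair, which is why this only makes sense when the table is small enough to enumerate.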

Another approach is policy iteration. At any point, if we have Q values → we can take the argmax to decide how to act in this state → this is very similar to knowing which direction to take in the maze world.
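A minimal sketch of policy iteration, assuming a known finite MDP. The policy is evaluated approximately with repeated sweeps instead of an exact solve, and the names (`transitions`, `rewards`, `eval_sweeps`, `max_rounds`) are my own choices for illustration.

```python
def policy_iteration(transitions, rewards, gamma=0.9, eval_sweeps=500, max_rounds=100):
    """Alternate policy evaluation and greedy (argmax) improvement.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the expected immediate reward (assumed layout).
    """
    n_states = len(transitions)
    policy = [0] * n_states  # arbitrary starting policy: action 0 everywhere
    v = [0.0] * n_states
    for _ in range(max_rounds):
        # Evaluation: iterate the Bellman expectation backup for the fixed policy.
        v = [0.0] * n_states
        for _ in range(eval_sweeps):
            v = [
                rewards[s][policy[s]]
                + gamma * sum(p * v[s2] for p, s2 in transitions[s][policy[s]])
                for s in range(n_states)
            ]
        # Improvement: act greedily with respect to the one-step lookahead Q values.
        new_policy = []
        for s in range(n_states):
            q_s = [
                rewards[s][a] + gamma * sum(p * v[s2] for p, s2 in transitions[s][a])
                for a in range(len(transitions[s]))
            ]
            new_policy.append(max(range(len(q_s)), key=q_s.__getitem__))
        if new_policy == policy:  # no state changed its action: stop
            break
        policy = new_policy
    return policy, v
```

The argmax step is the "which direction to take" part: once values are available, acting is just a lookup.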

MDP → learn the model (dynamics), the state-action value function, or the policy directly. (These approaches overlap, and there are more.)

Model-based RL → use experience to estimate a model of how the world works (maximum likelihood estimation). (But this is only an estimated model.)
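For a tabular problem, the maximum likelihood estimate is just counting. Here is a minimal sketch, assuming experience arrives as `(s, a, r, s_next)` tuples; that tuple format and the function name are my own assumptions.

```python
from collections import defaultdict


def estimate_model(experience):
    """Maximum-likelihood tabular model from (s, a, r, s_next) tuples.

    Returns (p_hat, r_hat): p_hat[(s, a)] maps next_state -> estimated
    probability, and r_hat[(s, a)] is the average observed reward.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in experience:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r

    p_hat, r_hat = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        # Empirical frequencies are the MLE of a categorical distribution.
        p_hat[sa] = {s2: c / n for s2, c in next_counts.items()}
        r_hat[sa] = reward_sum[sa] / n
    return p_hat, r_hat
```

State-action pairs that were never visited simply do not appear in the result, which is one reason exploration matters for model-based methods.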

Why? → Use data very efficiently, think of this as a simulator. But can be computationally expensive.

Q-Learning → does not directly learn how the world works, but implicitly takes it into account. Going from one state and action to the next state → TD error → temporal difference error.

In the model-based approach → recompute the model after each new data point, then update the Q values.

In Q-learning → update the Q function for a single state-action pair: the state we were in and the action we just took. The Q function starts out random → then slowly moves toward the new estimate → and only for the current state and the action we just took.
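The single-pair update can be sketched as follows. This is the standard tabular Q-learning rule; the function name, the list-of-lists table and the default `alpha` and `gamma` values are assumptions for illustration.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update q[s][a] in place from one transition and return the TD error.

    q is a table (list of lists) indexed as q[state][action].
    """
    # TD error: (reward + discounted best next value) minus the current estimate.
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    # Move the estimate a small step (alpha) toward the target.
    q[s][a] += alpha * td_error
    return td_error
```

Only `q[s][a]` changes; every other entry in the table is left alone, which is what makes the update cheap.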

Direct Q-learning is less robust but computationally cheap, and there are ways to get around that. (For example, after walking a long way to reach the goal → Q-learning would only update the current state and action → model-based would update all of the states that lead up to the goal.)

When the state space is big → model-based is very slow; Q-learning is also slow, but better.

Policy Search → does away with the Q function → just say there is a function space that describes the policy and search within it.

Exploring (How to act? How to gain the goals?)

Does it matter? → We do not know how the world works and use some data to estimate the Q values. (The data might be misleading, since we have only seen a sample.) → This can lead to wrong estimates.

The idea here is to try things that you have not tried, but the end goal is to gather high reward. The agent itself is responsible for gathering data about the world in order to achieve high reward over the long term. (Greedy or Bayesian).

Multi-Armed Bandit Problem → no state, or a single state. Each arm has an unknown reward distribution, and we gather data about these distributions in order to maximize reward. But we do not know in advance which arm will give the best value.

Second idea → Optimism under uncertainty. Compute the upper confidence bound.

It means → either the arm that I pull is the best arm, or it is not a good arm and pulling it gives us more information to show that it is not a good arm. (There is no assumption other than that the reward is bounded.)
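One concrete way to compute an upper confidence bound is UCB1, sketched below. The notes only say "upper confidence bound", so picking UCB1 (mean plus `sqrt(2 ln t / n)`) is my assumption, as are the Bernoulli arms and the function names.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Pick the arm with the highest upper confidence bound at step t."""
    # Pull every arm once first, so each bound is well defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )


def run_ucb1(probs, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms with (hypothetical) success rates probs."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means
```

The bonus term shrinks as an arm is pulled more often, so a bad arm stops looking attractive once enough evidence against it has been collected. Rewards here lie in [0, 1], matching the bounded-reward assumption.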

Distribution over worlds → Bayesian

Choose each arm with the probability that it is the optimal action → we can do this since we have a distribution over the arms. (Thompson sampling simplifies this.)

Each arm is a Bernoulli distribution → we put a Beta prior on it and update the posterior. (Bayesian regret style.)
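The Beta-Bernoulli update with Thompson sampling can be sketched as below. This is a minimal sketch assuming a uniform Beta(1, 1) prior on every arm; the prior choice, the function name and the simulated arm probabilities are assumptions for illustration.

```python
import random


def thompson_bernoulli(probs, steps, seed=0):
    """Thompson sampling on Bernoulli arms with Beta posteriors.

    probs holds the (hypothetical) true success rate of each arm and is
    used only to simulate rewards.
    """
    rng = random.Random(seed)
    n_arms = len(probs)
    alpha = [1] * n_arms  # Beta(1, 1) prior: one pseudo-success ...
    beta = [1] * n_arms   # ... and one pseudo-failure per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        # Sample one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < probs[arm] else 0
        # Conjugate update: successes go to alpha, failures to beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha, beta
```

Taking the argmax over one posterior sample per arm selects each arm with the probability that it is the best one under the current posterior, without computing that probability explicitly.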

Want to do → Maximize the Goal

The above reasons directly about exploration vs. exploitation. This is the best we can ever hope to do. (Not yet computationally feasible.)

Learning Problem → planning problem. (Depend on the number of steps left) (Sparse sampling and Monte Carlo tree search)

Exploration → e-greedy, be optimistic, or Bayesian.
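Of the three options, e-greedy is the simplest to write down. A minimal sketch, assuming the Q values for the current state are given as a list; the function name and the `rng` parameter are my own.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest current Q estimate.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

For example, `epsilon_greedy_action([0.1, 0.9, 0.3], 0.0)` always exploits and returns index 1, while `epsilon=1.0` always explores.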

In general, RL is the same as the bandit setting but with delayed reward. Optimism under uncertainty carries over in a very similar way.

How good are they?

Performance → the reward we get back at the end of the day, or the amount of data we need. (Computational efficiency is another measure.)

Quality of the algorithms → pick a particular domain and see how they perform relative to one another. (But theory can differ from practice.) Theoretical guarantees are very important as well. (Atari)

AO (asymptotically optimal) → we assume that we are going to converge at the end of the day. Decaying the rate at which we explore does work out. (Converges in the limit.)

The algorithm is PAC-MDP → on all but N steps the algorithm is near-optimal. This needs to be guaranteed with high probability. (Gamma → discount)
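Written out, the usual PAC-MDP statement bounds the number of steps on which the algorithm is not near-optimal. This is the standard textbook form as I understand it, not a formula from the video; $\mathcal{A}_t$ (the algorithm's policy at step $t$), $\epsilon$, $\delta$ and $N$ are my notation.

```latex
\Pr\Big[\, \big|\{\, t : V^{\mathcal{A}_t}(s_t) < V^{*}(s_t) - \epsilon \,\}\big| \le N \Big] \ge 1 - \delta,
\qquad
N = \mathrm{poly}\!\Big(|S|,\ |A|,\ \tfrac{1}{\epsilon},\ \tfrac{1}{\delta},\ \tfrac{1}{1-\gamma}\Big)
```

So with probability at least $1 - \delta$, the number of steps that are worse than $\epsilon$-optimal is at most a polynomial in the size of the problem, and the discount $\gamma$ enters through the horizon term $1/(1-\gamma)$.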

Greedy → maintain a Q function (estimate) and take the argmax over the Q function → that is how it behaves. (A couple of different instances of the MDP.) Take a Q function as input → mark each state-action pair as known or not → if known, transition using the real MDP → if not known, use a self-loop → and set the reward from the input Q. (This process seems overly complicated.)

Have the form of algorithm → Accuracy / Optimism some axis of measurement.

PAC RL proof and PAC RL seems useful, in some cases.


  1. e-greedy
  2. Optimism under uncertainty
  3. Optimize the planning

Another Good Short Video

Video from this website


  1. A Tutorial on Reinforcement Learning I. (2018). YouTube. Retrieved 29 November 2018, from https://www.youtube.com/watch?v=fIKkhoI1kF4&t=202s
  2. An introduction to Reinforcement Learning. (2018). YouTube. Retrieved 29 November 2018, from https://www.youtube.com/watch?v=JgvyzIkgxF0