[ Archived Post ] A History of Reinforcement Learning — Prof. A.G. Barto

Please note that this post is for my own educational purposes.


Video from this website (see the reference below).

Prof. A.G. Barto did his PhD at the University of Michigan.

Supervised learning → There is an error signal telling the model how much it got wrong (sometimes called a gradient), and this error is used to optimize the model (i.e., its parameters).

Unsupervised → Clustering

RL → Now there is no error signal, only an evaluation of what the model did. In other words, this means learning with a critic rather than a teacher. (See the sketch below contrasting the two.)
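To make the distinction concrete, here is a minimal sketch (my own illustration, not from the talk) contrasting an error-correction update, where a teacher supplies a target, with an evaluative update, where only a scalar score of what was tried comes back; the learning rates, noise level, and `reward_fn` are assumptions for the example.

```python
import random

# Error correction (supervised): the teacher gives the target, so the update
# can follow the signed error -- a gradient-like correction (delta / LMS rule).
def supervised_step(w, x, target, lr=0.1):
    error = target - w * x                      # how wrong, and in which direction
    return w + lr * error * x

# Evaluative feedback (RL-style): no target, only a score for what was tried.
# The learner perturbs its output (search) and keeps perturbations that scored well.
def evaluative_step(w, x, reward_fn, baseline=0.0, lr=0.1, noise=0.5):
    action = w * x + random.gauss(0.0, noise)   # trial: a noisy guess
    score = reward_fn(action) - baseline        # critic: how good was it, with no direction given
    return w + lr * score * (action - w * x) * x
```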

This idea came from Edward L. Thorndike, who put cats in puzzle boxes and let them learn to get out. After a few trials, most of the cats were able to get out of the box easily. (Trial-and-error learning.)

Prof. Barto sees RL as search + memory.

Rather than searching from the ground up every time, we can start from a state stored in memory (the result of earlier search). The search / trial-and-error side has not been studied much from an ML point of view.

TRIAL AND ERROR ≠ ERROR CORRECTION!

There is no gradient!

Of course, there were exceptions; however, in general, there was little progress.

However, Prof. Barto created the associative search network: a neural network whose units had noise in them, and this noise turned learning into a stochastic search.
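A rough sketch of one such unit, as I understand the idea (the sizes, noise level, and `reward_fn` are assumptions, not details from the talk): the weighted sum is perturbed by noise, so the unit's output is a stochastic guess, and reward decides which guesses get reinforced into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def asn_unit_step(w, x, reward_fn, lr=0.1, noise=0.3):
    s = w @ x                                             # deterministic part of the activation
    y = 1.0 if s + rng.normal(0.0, noise) > 0 else -1.0   # noise turns the output into a stochastic search
    r = reward_fn(x, y)                                   # evaluative signal, not a target
    return w + lr * r * y * x                             # reinforce weights when the noisy choice paid off
```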

And this evolved into the actor-critic architecture. (This is where the temporal-difference algorithm came in: it is error correction, but different, because the prediction target combines the external signal with the network's own future prediction.)
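A minimal tabular actor-critic sketch (my own illustration; the state/action counts, learning rates, and `env_step` interface are assumptions). The critic learns V(s) from the TD error δ = r + γ·V(s′) − V(s), and the same δ is used to update the actor's action preferences.

```python
import numpy as np

n_states, n_actions = 5, 2
V = np.zeros(n_states)                     # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))    # actor: action preferences (softmax policy)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(s, env_step, gamma=0.99, alpha_v=0.1, alpha_pi=0.1):
    probs = softmax(prefs[s])
    a = np.random.choice(n_actions, p=probs)
    s_next, r = env_step(s, a)                # assumed to return (next state, reward)
    delta = r + gamma * V[s_next] - V[s]      # TD error: reward + own future prediction - current prediction
    V[s] += alpha_v * delta                   # critic: error correction toward the bootstrapped target
    grad_log = -probs
    grad_log[a] += 1.0                        # d log pi(a|s) / d prefs[s] for a softmax actor
    prefs[s] += alpha_pi * delta * grad_log   # actor: make well-evaluated actions more likely
    return s_next
```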

Optimal control and dynamic programming came into play (dynamic programming was originally developed in 1953).
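For reference, the Bellman optimality equation at the heart of dynamic programming, in standard notation (my addition, not transcribed from the talk):

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
```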

Multi-layer networks trained via backprop (together with Monte Carlo sampling) → a way to beat the curse of dimensionality. RL had a reputation of being slow, but function approximation with backprop made it far more practical.
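A sketch of what this looks like (my own, with assumed feature and layer sizes): instead of a table with one entry per state, a small backprop network approximates V(s) and is trained toward TD targets (semi-gradient TD(0)).

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                                # state-feature and hidden sizes (assumed)
W1, b1 = rng.normal(0, 0.1, (h, d)), np.zeros(h)
w2, b2 = rng.normal(0, 0.1, h), 0.0

def value(x):
    z = np.tanh(W1 @ x + b1)                # hidden layer
    return w2 @ z + b2, z                   # scalar value estimate and hidden activations

def td_update(x, r, x_next, gamma=0.99, lr=0.01):
    global W1, b1, w2, b2
    v, z = value(x)
    v_next, _ = value(x_next)
    delta = r + gamma * v_next - v          # TD error; the bootstrapped target is held fixed
    dz = delta * w2 * (1 - z ** 2)          # chain rule through tanh (uses pre-update w2)
    # Move every parameter along the gradient of v, scaled by the TD error.
    w2 = w2 + lr * delta * z
    b2 = b2 + lr * delta
    W1 = W1 + lr * np.outer(dz, x)
    b1 = b1 + lr * dz
```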

A connection between the TD error and dopamine → dopamine has many effects on its targets, including reward-related ones. (Dopamine is triggered by reward, but after a long period of trials the response to the reward itself decreases, shifting to the cues that predict it → this is exactly what the TD error does.)

People have hypothesized that an actor-critic architecture may be implemented in the human brain as well. (Monte Carlo Tree Search is also a key method in AlphaGo.)

Reward-Modulated Spike-Timing-Dependent Plasticity → a very interesting theory.

Challenges in RL → designing the reward function.

(Habits → a non-model-based, i.e., model-free, learning system. Model-free systems have advantages, as do model-based ones. All of these algorithms can support each other.)


Reference

1. A History of Reinforcement Learning — Prof. A.G. Barto. (2018). YouTube. Retrieved 29 November 2018, from https://www.youtube.com/watch?v=ul6B2oFPNDM&t=614s