# Reinforcement Learning — Part 2

Part 1 here.

Basic Idea:

- Receive feedback in the form of rewards.
- Agent’s utility is defined by the reward function.
- Must act so as to maximize expected rewards.
- All learning is based on observed samples of outcomes.

We still assume an MDP:

New twist: we don’t know T or R.

- That is, we don’t know which states are good or what the actions do.
- We must actually try actions and states out to learn.

**Unknown MDP : Model Based Learning**

- Learn an approximate model based on experiences.
- Solve for values as if the learned model were correct.
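- Example: to compute an expectation E[A] = Σ P(a)·a when P is unknown, first estimate P(a) from sample counts, then plug the estimate into the sum.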
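To make the model-based idea concrete, here is a minimal sketch that estimates T and R from counts over observed transitions. The function name `learn_model`, the tuple layout `(s, a, s', r)`, and the sample experience are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def learn_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed (s, a, s', r) tuples."""
    counts = defaultdict(Counter)      # (s, a) -> how often each s' followed
    reward_sums = defaultdict(float)   # (s, a, s') -> total reward observed
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r

    T, R = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total                        # empirical frequency
            R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average reward
    return T, R

# Hypothetical experience, made up for this example.
experience = [
    ("A", "go", "B", 0.0),
    ("A", "go", "B", 0.0),
    ("A", "go", "C", 1.0),
    ("B", "go", "C", 2.0),
]
T_hat, R_hat = learn_model(experience)
print(T_hat[("A", "go", "B")])  # fraction of times "go" from A led to B
```

Once `T_hat` and `R_hat` exist, they can be handed to any MDP solver as if they were the true model.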

**Unknown MDP : Model Free Learning**

Passive Reinforcement Learning :

- Simplified task: policy evaluation

1. Input : fixed policy Π(s)

2. You don’t know the transitions T(s,a,s’)

3. You don’t know the rewards R(s,a,s’)

- Direct Evaluation:

Goal : Compute values for each state under Π

Idea : Average together observed sample values.

Act according to Π.

Every time you visit a state, write down what the sum of discounted rewards turned out to be.

Average those samples.
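The three steps above can be sketched in a few lines. This is only an illustration: the episode format (a list of `(state, reward)` pairs, where the reward is received on leaving that state), the function name `direct_evaluation`, and the four sample episodes are assumptions.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted return over every visit to each state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so the return from each state accumulates in one pass.
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            totals[state] += ret
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}

# Hypothetical episodes gathered while following a fixed policy.
episodes = [
    [("B", -1), ("C", -1), ("D", 10)],
    [("B", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("A", -10)],
]
values = direct_evaluation(episodes)
print(values)
```

With these episodes and no discounting, the averages come out to B = 8, C = 4, D = 10, E = -2 and A = -10. Note that B and E both lead to C, yet they end up with very different estimates.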

What’s good about direct evaluation?

It’s easy to understand.

It doesn’t require any knowledge of T or R.

It eventually computes the correct average values, using just sample transitions.

What’s bad about it?

It wastes information about state connections.

Each state must be learned separately, so it takes a long time to learn.

- Sample based Policy Evaluation:

Take samples of outcomes s’ (by doing the action!) and average.

We want to improve our estimate of V by computing these averages.

- Temporal Difference Learning:

Big idea: learn from every experience!

Update V(s) each time we experience a transition (s,a,s’,r).

Likely outcomes s’ will contribute updates more often.

Temporal difference learning of values:

Policy still fixed, still doing evaluation!

Move values toward the value of whatever successor occurs: a running average.

Sample of V(s): sample = R(s,Π(s),s’) + γV(s’)

Update to V(s): V(s) ⇐ (1-α)V(s) + α·sample

Can also be written as: V(s) ⇐ V(s) + α[sample - V(s)]
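A minimal sketch of this update, assuming values live in a dictionary that defaults to 0. The helper name `td_update`, the starting value for state D, and the two transitions are made up for illustration.

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """Nudge V[s] toward the one-step sample r + gamma * V[s_next]."""
    sample = r + gamma * V[s_next]
    V[s] += alpha * (sample - V[s])   # same as (1 - alpha) * V[s] + alpha * sample
    return V[s]

V = defaultdict(float)   # every state starts at 0
V["D"] = 8.0             # hypothetical current estimate for state D

# Two observed transitions under the fixed policy: (s, r, s').
td_update(V, "B", -2, "C")   # sample = -2 + 0 = -2, so V[B] moves from 0 to -1
td_update(V, "C", -2, "D")   # sample = -2 + 8 = 6,  so V[C] moves from 0 to 3
print(V["B"], V["C"])
```

Each transition only touches the state that was just left, so the estimate improves after every single step instead of waiting for the episode to end.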

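Problems with TD value learning:

TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, turning those values into a new policy means picking the action with the best expected outcome, and computing that expectation needs T and R, which we don’t have.

Idea: Learn Q-values, not values. That makes action selection model-free too!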

Active Reinforcement Learning :

- Q-value iteration:

1. Start with Q_0(s,a) = 0, which we know is right.

2. Given Q_k, calculate the depth k+1 Q-values for all Q-states:

Q_{k+1}(s,a) ⇐ ∑_{s’} T(s,a,s’) [R(s,a,s’) + γ max_{a’} Q_k(s’,a’)]

- Q-learning:

Learn Q(s,a) values as you go

1. Receive a sample (s,a,s’,r)

2. Consider your old estimate : Q(s,a)

3. Consider your new sample estimate:

sample = R(s,a,s’) + γ max_{a’} Q(s’,a’)

4. Incorporate the new estimate into a running average:

Q(s,a) ⇐ (1-α)Q(s,a) + α·sample
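The four steps above can be sketched as follows. Everything about the environment is an assumption for the sake of the example: a tiny deterministic chain with states 0 to 3, a reward of 10 for entering state 3, and an agent that picks its actions uniformly at random.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3  # terminal state of the toy chain 0 - 1 - 2 - 3

def step(s, a):
    """Toy deterministic environment: reward 10 for entering GOAL, else 0."""
    s_next = s + 1 if a == "right" else max(s - 1, 0)
    r = 10.0 if s_next == GOAL else 0.0
    return s_next, r

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)        # purely random behaviour
            s_next, r = step(s, a)         # 1. receive a sample (s, a, s', r)
            old = Q[(s, a)]                # 2. old estimate
            best_next = 0.0 if s_next == GOAL else max(
                Q[(s_next, b)] for b in ACTIONS)
            sample = r + gamma * best_next               # 3. new sample estimate
            Q[(s, a)] = (1 - alpha) * old + alpha * sample  # 4. running average
            s = s_next
    return Q

Q = q_learning()
# Read off the greedy policy from the learned Q-values.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The agent never acts greedily while learning, yet the greedy policy read off from Q at the end heads right toward the goal from every state.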

Q-learning converges to the optimal policy, even if you’re acting suboptimally!

This is called off-policy learning.

- Caveats:

1. You have to explore enough.

2. You have to eventually make the learning rate small enough.

3. …but not decrease it too quickly.

4. Basically, in the limit, it doesn’t matter how you select actions!
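One learning-rate schedule that fits caveats 2 and 3 is α = 1/n, where n counts the samples seen so far. This particular schedule is an assumption added here for illustration, not something the notes above prescribe. It shrinks toward zero, but slowly enough that the running average still ends up equal to the plain sample mean.

```python
def running_average(samples):
    """Running average with a decaying learning rate alpha = 1/n."""
    estimate = 0.0
    for n, sample in enumerate(samples, start=1):
        alpha = 1.0 / n   # shrinks over time, but not too quickly
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate

samples = [4.0, 8.0, 6.0, 2.0]   # made-up sample values
avg = running_average(samples)
print(avg, sum(samples) / len(samples))
```

A constant α, by contrast, keeps weighting recent samples more heavily, so the estimate never fully settles when the samples are noisy.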

Alright that’s it for now! Thank you for spending your time. Cheers!