# Reinforcement Learning — Part 3

Part 1 here.

Part 2 here.

**Exploration**

How to explore?

*Several schemes for forcing exploration:*

Simplest: random actions (ε-greedy)

1. Every time step, flip a coin.

2. With (small) probability ε, act randomly.

3. With (large) probability 1-ε, act on the current policy.
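The coin-flip scheme above can be sketched in a few lines (a minimal sketch; the action list and the Q-value dict are illustrative placeholders):

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon act randomly (explore);
    otherwise take the greedy action under the current Q-values (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    return max(actions, key=lambda a: q_values[a])      # exploit

# Illustrative usage: with epsilon=0 the choice is purely greedy.
q = {"left": 0.2, "right": 0.8}
epsilon_greedy(q, ["left", "right"], epsilon=0.0)  # "right"
```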

Problems with random actions?

- You do eventually explore the space, but you keep thrashing around once learning is done.
- One solution: lower ε over time.
- Another solution: exploration functions.

An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g.: f(u, n) = u + k/n
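A minimal sketch of such an exploration function; k is a tunable exploration constant, and the handling of never-visited states is an assumption:

```python
def exploration_value(u, n, k=1.0):
    """Optimistic utility f(u, n) = u + k/n.
    Rarely visited states (small n) get a large bonus, so the agent is
    drawn toward them; as n grows the bonus fades to the plain estimate u."""
    if n == 0:
        return float("inf")   # never visited: maximally optimistic (an assumed convention)
    return u + k / n

# A state visited once gets a big bonus; one visited 100 times barely any.
exploration_value(0.5, 1)    # 1.5
exploration_value(0.5, 100)  # ~0.51
```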

**Regret**

- Even if you learn the optimal policy, you still make mistakes along the way.
- Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and the optimal (expected) rewards.
- Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal.
- Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.

**Approximate Q-Learning**

Generalizing across states:

- Basic Q-learning keeps a table of q-values.
- In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training.
- Too many states to hold the q-tables in memory.

Instead, we want to generalize:

- Learn about some number of training states from experience.
- Generalize that experience to new, similar situations.
- This is a fundamental idea in machine learning.

Solution: describe a state using a vector of features (properties)

- Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
- Can also describe a q-state (s,a) with features.
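For instance, a hypothetical feature extractor for a grid-world q-state might look like this (the state layout and the 1/10 distance scaling are illustrative assumptions, not a standard API):

```python
def extract_features(state, action):
    """Features of a q-state (s, a) for an assumed grid-world:
    state is a dict with 'pos', 'goal', and 'walls' (illustrative)."""
    dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[action]
    x, y = state["pos"]
    nx, ny = x + dx, y + dy          # position after taking the action
    gx, gy = state["goal"]
    return {
        "bias": 1.0,
        # Manhattan distance to the goal, scaled so feature values stay small
        "dist-to-goal": (abs(gx - nx) + abs(gy - ny)) / 10.0,
        "hits-wall": 1.0 if (nx, ny) in state["walls"] else 0.0,
    }
```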

**Linear Value Functions:**

- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:

V(s) = w1f1(s) + w2f2(s) + … + wnfn(s)

Q(s,a) = w1f1(s,a) + w2f2(s,a) + … + wnfn(s,a)

- Advantage: our experience is summed up in a few powerful numbers.
- Disadvantage: states may share features but actually be very different in value!

Q-learning with linear Q-functions: on each transition (s, a, r, s'), compute the error and adjust each weight in proportion to its feature:

difference = [r + γ max over a' of Q(s',a')] − Q(s,a)

wi ← wi + α · difference · fi(s,a)

Intuitive interpretation:

- Adjust weights of active features.
- Example: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features.
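The update above can be sketched as follows (a minimal sketch with features as a name → value dict; alpha and gamma are the usual learning rate and discount):

```python
def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as a dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update(weights, features, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Linear Q-learning update:
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    w_i <- w_i + alpha * difference * f_i(s,a)
    Features that were 'on' (nonzero) get the blame or the credit."""
    difference = reward + gamma * max_next_q - q_value(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

Note that a feature with value 0 leaves its weight untouched, which is exactly the "adjust weights of active features" intuition.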

**Policy Search:**

Problem: often the feature-based policies that work well aren’t the ones that approximate V/Q best.

Solution: learn policies that maximize rewards, not the values that predict them.

Policy search: start with an OK solution (e.g., Q-learning), then fine-tune by hill climbing on feature weights.

Simplest policy search:

- Start with an initial linear value function or Q-function.
- Nudge each feature weight up and down and see if your policy is better than before.
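A crude sketch of this nudge-and-compare loop, assuming a hypothetical `evaluate(weights)` helper that runs sample episodes with the induced policy and returns the average reward (the helper, step size, and iteration count are all illustrative):

```python
import random

def hill_climb(weights, evaluate, step=0.05, iterations=100):
    """Nudge one feature weight up or down per iteration and keep the
    change only if evaluate() reports a better policy."""
    best = evaluate(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))
        for delta in (step, -step):
            trial = dict(weights)
            trial[name] += delta
            score = evaluate(trial)
            if score > best:            # keep the nudge only if it helped
                weights, best = trial, score
                break
    return weights, best
```

Each candidate weight vector requires a fresh round of sample episodes inside `evaluate`, which is exactly why this gets impractical with many features.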

Problems:

- How do we tell whether the policy got better?
- We need to run many sample episodes!
- If there are a lot of features, that can be impractical.

Better methods exploit look-ahead structure, sample wisely, and change multiple parameters at once.

Alright that’s it for now! Thank you for spending your time. Cheers!