Reinforcement Learning — Part 3
Part 1 here.
Part 2 here.
How to explore?
Several schemes for forcing exploration:
Simplest: random actions (ε-greedy), sketched after these steps:
1. Every time step, flip a coin.
2. With (small) probability ε, act randomly.
3. With (large) probability 1-ε, act on the current policy.
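Here is a minimal sketch of that coin flip in code; the `q_values` dict keyed by (state, action) and the `actions` list are illustrative names, not from any particular library:

```python
import random

def epsilon_greedy_action(q_values, state, actions, epsilon=0.1):
    """With probability epsilon act randomly (explore); otherwise
    act on the current Q-value estimates (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))
```

The `0.0` default just means q-states we have never seen start out at zero.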
Problems with random actions?
- You do eventually explore the space, but keep thrashing around once learning is done.
- One solution: lower ε over time.
- Another solution: exploration functions.
Take a value estimate u and a visit count n, and return an optimistic utility, e.g. f(u, n) = u + k/n.
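A sketch of that exploration function, assuming we also keep a visit count n for each q-state; k is a tunable optimism bonus:

```python
def exploration_value(u, n, k=1.0):
    """f(u, n) = u + k/n: the raw value estimate plus a bonus that
    shrinks as the q-state gets visited more often."""
    return u + k / n if n > 0 else float("inf")  # unvisited q-states look maximally attractive
```

Using f(u, n) in place of the raw q-value when choosing actions makes under-visited actions look temporarily better, which is what pushes the agent to try them.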
Regret:
- Even if you learn the optimal policy, you still make mistakes along the way.
- Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
- Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal.
- Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.
Approximate Q-learning
Generalizing across states:
- Basic Q-learning keeps a table of q-values.
- In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training.
- Too many states to hold the q-table in memory.
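To make that concrete, here is a minimal sketch of the tabular version (the learning rate and discount are placeholder values): one entry per (state, action) pair, which is exactly what stops fitting in memory.

```python
from collections import defaultdict

q_table = defaultdict(float)  # one float per (state, action) pair -- this is what blows up

def tabular_q_update(s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update toward the observed sample."""
    sample = r + gamma * max((q_table[(s_next, a2)] for a2 in next_actions), default=0.0)
    q_table[(s, a)] += alpha * (sample - q_table[(s, a)])
```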
Instead, we want to generalize:
- Learn about some number of training states from experience.
- Generalize that experience to new, similar situations.
- This is a fundamental idea in machine learning.
Solution: describe a state using a vector of features (properties)
- Features are functions from state to real numbers (often 0/1) that capture important properties of the state.
- Can also describe a q-state (s,a) with features.
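For example, here are two made-up q-state features for a small grid world; the state layout and grid size are assumptions, just to show the shape of a feature function:

```python
# Hypothetical setup: state = ((x, y), (goal_x, goal_y)), actions are compass moves.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def f_bias(state, action):
    """Constant bias feature, commonly included so the weights can shift all values."""
    return 1.0

def f_goal_distance(state, action):
    """Normalized Manhattan distance to the goal after taking the move."""
    (x, y), (gx, gy) = state
    dx, dy = MOVES[action]
    return (abs(gx - (x + dx)) + abs(gy - (y + dy))) / 20.0  # 20.0: assumed grid size

features = [f_bias, f_goal_distance]
```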
Linear Value Functions:
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
V(s) = w1f1(s) + w2f2(s) + … + wnfn(s)
Q(s,a) = w1f1(s,a) + w2f2(s,a) + … + wnfn(s,a)
- Advantage: Our experience is summed up in a few powerful numbers.
- Disadvantage: States may share features but actually be very different in value!
- Q-learning with linear Q-functions:
- Adjust weights of active features.
- Example: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features.
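Putting the last two ideas together, here is a minimal sketch of approximate Q-learning with a linear Q-function; the learning rate and discount are placeholders and the feature functions are whatever extractor you defined for your problem:

```python
def q_value(weights, feats, s, a):
    """Linear Q-function: Q(s, a) = w1*f1(s, a) + ... + wn*fn(s, a)."""
    return sum(w * f(s, a) for w, f in zip(weights, feats))

def update_weights(weights, feats, s, a, r, s_next, next_actions, alpha=0.05, gamma=0.9):
    """Approximate Q-learning update: every active feature absorbs a share
    of the error, so similar states get corrected together."""
    sample = r + gamma * max((q_value(weights, feats, s_next, a2) for a2 in next_actions),
                             default=0.0)
    difference = sample - q_value(weights, feats, s, a)
    return [w + alpha * difference * f(s, a) for w, f in zip(weights, feats)]
```

If something unexpectedly bad happens, `difference` comes out negative and every feature that was on (non-zero) gets its weight pushed down, which is exactly the "blame the active features" behavior in the example above.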
Problem: often the feature-based policies that work well aren’t the ones that approximate V/Q best.
Solution: Learn policies that maximize rewards, not the values that predict them.
Policy Search: start with an OK solution (e.g. Q-learning), then fine-tune by hill climbing on feature weights.
Simplest policy search:
- Start with an initial linear value function or Q-function.
- Nudge each feature weight up and down and see if your policy is better than before.
- How do we tell whether the policy got better?
- Need to run many sample episodes!
- If there are a lot of features, that can be impractical.
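Here is what that nudge-and-check loop might look like, assuming a hypothetical evaluate_policy(weights) that runs many sample episodes and returns the average reward (that evaluation is the expensive part):

```python
def hill_climb(weights, evaluate_policy, step=0.1, rounds=20):
    """Crude policy search: nudge each weight up and down and keep
    whichever change improves the empirical returns."""
    best = evaluate_policy(weights)
    for _ in range(rounds):
        for i in range(len(weights)):
            for delta in (step, -step):
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate_policy(candidate)  # runs many sample episodes -- expensive!
                if score > best:
                    weights, best = candidate, score
    return weights
```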
Better methods exploit look-ahead structure, sample wisely, and change multiple parameters at once.
Alright, that’s it for now! Thank you for spending your time here. Cheers!