# Reinforcement Learning — Part 3

Part 1 here.

Part 2 here.

Exploration

How to explore?

Several schemes for forcing exploration:
Simplest: random actions (ε-greedy)
1. Every time step, flip a coin.
2. With (small) probability ε, act randomly.
3. With (large) probability 1−ε, act on the current policy.
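As a rough sketch, the ε-greedy scheme above might look like this in Python (the shape of `q_values` as an action → estimate dict is an assumption for illustration):

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: dict mapping each action to its current Q estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)      # explore: random action
    return max(actions, key=lambda a: q_values[a])  # exploit: follow current policy
```

With ε = 0 this always exploits; with ε = 1 it always explores.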

Problems with random actions?

• You do eventually explore the space, but you keep thrashing around once learning is done.
• One solution: lower ε over time.
• Another solution: exploration functions.
Take a value estimate u and a visit count n, and return an optimistic utility, e.g. f(u, n) = u + k/n.
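A minimal sketch of such an exploration function, with the convention (an assumption here) that unvisited pairs get an unbounded bonus so they are tried first:

```python
def exploration_value(u, n, k=1.0):
    """Optimistic utility f(u, n) = u + k / n.

    u: current value estimate; n: visit count; k: exploration bonus weight.
    The bonus shrinks as a state-action pair is visited more often.
    """
    if n == 0:
        return float('inf')  # never visited: maximally optimistic
    return u + k / n
```

Acting greedily on f instead of u automatically pushes the agent toward under-explored states.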

Regret

• Even if you learn the optimal policy, you still make mistakes along the way.
• Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
• Minimizing regret goes beyond learning to be optimal; it requires optimally learning to be optimal.
• Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.

Approximate Q-learning

Generalizing across states:

• Basic Q-learning keeps a table of q-values.
• In realistic situations, we cannot possibly learn about every single state!
• Too many states to visit them all in training.
• Too many states to hold the q-tables in memory.

• Learn about some number of training states from experience.
• Generalize that experience to new, similar situations.
• This is a fundamental idea in machine learning.

Solution: describe a state using a vector of features (properties)

• Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
• Can also describe a q-state (s,a) with features.
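A feature function of this kind might look like the following toy sketch; the state keys and feature names here are made up for illustration (the action is accepted but unused in this simple version):

```python
def features(state, action):
    """Hypothetical feature vector for a Q-state (s, a).

    Each feature maps (state, action) to a real number, often 0/1.
    """
    return {
        'bias': 1.0,  # always-on feature
        'dist-to-closest-food': state['food_dist'] / state['grid_size'],
        'ghost-one-step-away': 1.0 if state['ghost_near'] else 0.0,
    }
```

Two different states that share these feature values will be treated identically, which is exactly what gives generalization.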

Linear Value Functions:

• Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
V(s) = w1f1(s) + w2f2(s) + … + wnfn(s)
Q(s,a) = w1f1(s,a) + w2f2(s,a) + … + wnfn(s,a)
• Advantage: our experience is summed up in a few powerful numbers.
• Disadvantage: states may share features but actually be very different in value!
• Q-learning with linear Q-functions: on each transition (s, a, r, s'), compute difference = [r + γ max over a' of Q(s',a')] − Q(s,a), then update each weight: wi ← wi + α · difference · fi(s,a).

Intuitive interpretation:

• Adjust the weights of active features.
• Example: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features.
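A minimal sketch of this update, assuming features are stored as a name → value dict; the standard approximate Q-learning rule is difference = [r + γ max Q(s',a')] − Q(s,a), with each weight nudged by α · difference · fi(s,a):

```python
def q_value(weights, feats):
    """Linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a)."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def update(weights, feats, reward, max_next_q, alpha=0.5, gamma=0.9):
    """One approximate Q-learning step on a transition (s, a, r, s').

    Only features that were "on" (nonzero) have their weights changed,
    so they get all the credit or blame for the surprise.
    """
    difference = (reward + gamma * max_next_q) - q_value(weights, feats)
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

If a bad reward arrives while 'ghost-one-step-away' was 1.0, its weight drops, lowering Q for every state that shares that feature.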

Policy Search:

Problem: often the feature-based policies that work well aren’t the ones that approximate V/Q best.

Solution: learn policies that maximize rewards, not the values that predict them.

Policy search: start with an OK solution (e.g., Q-learning), then fine-tune by hill climbing on feature weights.

Simplest policy search:

• Nudge each feature weight up and down and see if your policy is better than before.
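That nudging loop can be sketched as naive hill climbing; here `evaluate` stands in for running sample episodes and averaging the returns (in practice an expensive, noisy estimate), and the function names are assumptions for illustration:

```python
import random

def hill_climb(weights, evaluate, step=0.1, iterations=100):
    """Naive policy search: nudge one weight at a time, keep improvements.

    weights: dict of feature weights; evaluate(weights) -> estimated return.
    """
    best = evaluate(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))   # pick a weight to perturb
        for delta in (step, -step):           # try nudging up, then down
            candidate = dict(weights)
            candidate[name] += delta
            score = evaluate(candidate)
            if score > best:                  # keep the change only if better
                weights, best = candidate, score
                break
    return weights, best
```

Each accepted nudge costs a fresh batch of episodes, which is why this gets impractical with many features.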

Problems:

• How do we tell the policy got better?
• Need to run many sample episodes!
• If there are a lot of features, that can be impractical.

Better methods exploit look-ahead structure, sample wisely, and change multiple parameters at once.

Alright that’s it for now! Thank you for spending your time. Cheers!
