Reinforcement Learning — Part 3

Part 1 here.

Part 2 here.


How to explore?

Several schemes for forcing exploration:
 Simplest: random actions (ε-greedy)
 1. Every time step, flip a coin.
 2. With (small) probability ε, act randomly.
 3. With (large) probability 1-ε, act on the current policy.
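The coin-flip scheme above can be sketched in a few lines (a minimal sketch; the `q_values` dictionary and `actions` list are assumed placeholders, not names from the text):

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon act randomly (explore);
    otherwise act on the current policy (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: q_values[a])  # exploit
```

With epsilon=0 this is purely greedy; with epsilon=1 it is purely random.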

Problems with random actions?

  • You do eventually explore the space, but you keep thrashing around even once learning is done.
  • One solution: lower ε over time.
  • Another solution: exploration functions.
     Take a value estimate u and a visit count n, and return an optimistic utility, e.g. f(u, n) = u + k/n
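A minimal sketch of such an exploration function, with k as a tunable optimism constant (treating unvisited states as maximally optimistic is one common convention, not something the text specifies):

```python
def exploration_value(u, n, k=2.0):
    """Optimistic utility f(u, n) = u + k/n: rarely visited
    states (small n) get a large exploration bonus."""
    if n == 0:
        return float('inf')  # never visited: assume the best (convention)
    return u + k / n
```

As the visit count n grows, the bonus k/n vanishes and the value estimate u dominates.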


  • Even if you learn the optimal policy, you still make mistakes along the way.
  • Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
  • Minimizing regret goes beyond learning to be optimal — it requires optimally learning to be optimal.
  • Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.

Approximate Q-learning

Generalizing across states:

  • Basic Q-learning keeps a table of q-values.
  • In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training.
  • Too many states to hold the q-table in memory.

Instead, we want to generalize:

  • Learn about some number of training states from experience.
  • Generalize that experience to new, similar situations.
  • This is a fundamental idea in machine learning.

Solution: describe a state using a vector of features (properties)

  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
  • Can also describe a q-state (s, a) with features.
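As an illustration, here is one hypothetical feature function for a q-state in a small gridworld (the goal position, the deterministic transition rule, and the distance scaling are all invented for this example):

```python
GOAL = (3, 3)  # hypothetical gridworld goal

def step(state, action):
    """Hypothetical deterministic transition on a grid."""
    x, y = state
    dx, dy = {'up': (0, 1), 'down': (0, -1),
              'left': (-1, 0), 'right': (1, 0)}[action]
    return (x + dx, y + dy)

def features(state, action):
    """Features of the q-state (s, a): a bias term plus the (scaled)
    Manhattan distance to the goal after taking the action."""
    nx, ny = step(state, action)
    dist = abs(GOAL[0] - nx) + abs(GOAL[1] - ny)
    return [1.0, dist / 10.0]
```

Any two states equally far from the goal share the same feature vector, which is exactly how generalization across states happens.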

Linear Value Functions:

  • Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
     V(s) = w1f1(s) + w2f2(s) + … + wnfn(s)
     Q(s,a) = w1f1(s,a) + w2f2(s,a) + … + wnfn(s,a)
  • Advantage: our experience is summed up in a few powerful numbers.
  • Disadvantage: states may share features but actually be very different in value!
  • Q-learning with linear Q-functions: on each transition (s, a, r, s′), compute the difference [r + γ maxa′ Q(s′,a′)] − Q(s,a), then update each weight wi ← wi + α · difference · fi(s,a).
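The standard linear Q-learning update can be sketched as follows (a minimal version; the learning rate alpha and discount gamma are the usual Q-learning parameters, with values chosen arbitrarily here):

```python
def q_value(weights, feats):
    """Linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a)."""
    return sum(w * f for w, f in zip(weights, feats))

def update(weights, feats, reward, max_next_q, alpha=0.1, gamma=0.9):
    """One approximate Q-learning step:
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    w_i <- w_i + alpha * difference * f_i(s,a)"""
    difference = reward + gamma * max_next_q - q_value(weights, feats)
    return [w + alpha * difference * f for w, f in zip(weights, feats)]
```

Note that each weight moves in proportion to its feature value, which is what the intuitive interpretation below describes: only active features get adjusted.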

Intuitive interpretation:

  • Adjust weights of active features.
  • Example: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features.

Policy Search:

Problem: often the feature-based policies that work well aren’t the ones that approximate V/Q best.

Solution: learn policies that maximize rewards, not the values that predict them.

Policy Search: start with an OK solution (e.g. Q-learning), then fine-tune by hill climbing on feature weights.

Simplest policy search :

  • Start with an initial linear value function or Q-function.
  • Nudge each feature weight up and down and see if your policy is better than before.


  • How do we tell the policy got better?
  • Need to run many sample episodes!
  • If there are a lot of features, that can be impractical.
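The simplest policy search above might look like this (a sketch; `evaluate` stands in for running many sample episodes and returning the average reward, which is exactly the expensive step just mentioned):

```python
import random

def hill_climb(weights, evaluate, step=0.1, iters=20):
    """Nudge one feature weight up or down per iteration and keep
    the nudge only if the resulting policy scores better."""
    best = evaluate(weights)
    for _ in range(iters):
        i = random.randrange(len(weights))       # pick a weight to nudge
        for delta in (+step, -step):
            cand = weights[:]
            cand[i] += delta
            score = evaluate(cand)               # run sample episodes
            if score > best:                     # keep only improvements
                weights, best = cand, score
                break
    return weights
```

With many features, each iteration needs fresh episodes to evaluate, which is why this naive approach quickly becomes impractical.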

Better methods exploit look-ahead structure, sample wisely, and change multiple parameters at once.

Alright that’s it for now! Thank you for spending your time. Cheers!
