Baby steps in Reinforcement Learning — Q Learning
Q-learning. What is the definition of Q in the first place? Why does Q have to learn while A, B, C, D and E are playing?
With these doubts in mind, let's crack them one by one. First, we start with terminology:
- State : a position / situation of an object ([Dark, Light…], [Head, Tail…])
- Action : something done to change state ([increase, stay, reduce]. Most of the time it is a verb)
- Reward/Penalty : a score awarded to the action taken at that particular state.
- Policy : defines the allowable actions from one state to another state.
- Q-table (quality of policies) : stores the score of each policy based on the given reward.
Usually, a state describes the position or situation of an object. For example, when we flip a coin (the object), the coin can only be in either the Head state or the Tail state when it lands on the floor.
Next, we define the allowable actions the coin can take to change from one state to another: we could have head-flip-tail, head-throw-tail, head-blow-tail… These allowable (state, action) combinations are called policies. But which policy is the best? To answer this question, we will construct a Q-table and assign a score to each policy. You might have question marks in your mind now. Let me clear your doubts with the state diagram below.
In the policy, you specify that state_1 can only go to state_2 by taking the action “increase”. Similarly, you specify that state_3 can only go to state_2 by taking the action “reduce”. A policy can be written in list form as follows:
policy 1 = [state_1, increase, state_2],
policy 2 = [state_2, increase, state_3],
policy 3 = [state_3, reduce, state_2],
policy 4 = [state_2, reduce, state_1]
Q-table = [policy 1, policy 2, policy 3, policy 4]
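This Q-table can be sketched directly in code. Below is a minimal sketch as a Python dict mapping each policy's (state, action) pair to a score; the zero initial scores are placeholders, not values from the article:

```python
# Q-table sketch: map each policy's (state, action) pair to a score.
# Initial scores of 0.0 are placeholders; they get updated during learning.
q_table = {
    ("state_1", "increase"): 0.0,  # policy 1: state_1 --increase--> state_2
    ("state_2", "increase"): 0.0,  # policy 2: state_2 --increase--> state_3
    ("state_3", "reduce"): 0.0,    # policy 3: state_3 --reduce--> state_2
    ("state_2", "reduce"): 0.0,    # policy 4
}
```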
I hope that policy and Q-table are clear to you now. Let's crack the learning part of the Q-table. This is the interesting part.
The learning objective of the Q-table is to select the best policy, which comprises the best action at each state.
Here, we need to know the definition of “best action”: the action that gives us the highest reward. Let me give you a scenario to visualise state, action and reward.
Exam season is coming. Usually, I study for 5 minutes. I increase my study time from 5 minutes to 10 minutes, hoping to get a good result so that my parent will reward me with 1 dollar. If my result is poor, my parent will steal 1 dollar from my piggy bank. Such a bad parent!
On the exam day, I get a good result. This means my action (increase 5 min to 10 min) was correct. It also convinces me that I should always take the “increase” action during exam season.
From this simple scenario, we clearly understand the states (5 min, 10 min), the actions (increase, reduce) and the reward (1 dollar). With these terms understood, we can formulate the scenario: we will train the policy to always take the “increase” action at state_1 (5 min) to optimise the reward (+1 dollar).
Initially, we assign a random value to every action of every state. In code, this means filling a 2D array with random values: Q[s][a] = random(). Then, we update only the value of the action taken at that particular state. From our example earlier, we would like to update only the “increase” action taken from state_1 (5 min) using the formula below:
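In Python, this random initialisation might look like the following. The state and action names come from the study-time scenario; the dict-of-dicts layout is one possible choice, not the only one:

```python
import random

states = ["5 min", "10 min"]
actions = ["increase", "reduce"]

# Q[s][a] = random() for every state-action pair
Q = {s: {a: random.random() for a in actions} for s in states}
```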
Q[s][a] = Q[s][a] + ALPHA * (reward + GAMMA * max(Q[s_next][A]) - Q[s][a])
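The update formula translates directly into a small function. This is a sketch that assumes the Q-table is a dict of dicts; the ALPHA and GAMMA values here are chosen for illustration:

```python
ALPHA = 0.1  # learning rate (value chosen for illustration)
GAMMA = 0.9  # discount factor (value chosen for illustration)

def q_update(Q, s, a, reward, s_next):
    # Best score reachable from the next state, taken over all its actions A
    best_next = max(Q[s_next].values())
    # Move Q[s][a] a small step (ALPHA) toward: reward + GAMMA * max Q[s_next][A]
    Q[s][a] += ALPHA * (reward + GAMMA * best_next - Q[s][a])
```

Note that only the taken (s, a) entry changes; every other entry of the table is left untouched, exactly as described above.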
This is a simple formula. Let me fill in the information from our example to make it clearer:
Q[5 min][increase] = Q[5 min][increase] + ALPHA * (1 dollar + GAMMA * max(Q[10 min][increase], Q[10 min][reduce]) - Q[5 min][increase])
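With concrete numbers the arithmetic is easy to follow. Assume ALPHA = 0.1, GAMMA = 0.9 and a Q-table initialised to zeros instead of random values (all assumptions made here for a clean worked example):

```python
ALPHA, GAMMA = 0.1, 0.9  # assumed values
Q = {"5 min": {"increase": 0.0, "reduce": 0.0},
     "10 min": {"increase": 0.0, "reduce": 0.0}}

reward = 1.0  # the 1 dollar from the scenario
best_next = max(Q["10 min"]["increase"], Q["10 min"]["reduce"])  # 0.0 here
Q["5 min"]["increase"] += ALPHA * (reward + GAMMA * best_next - Q["5 min"]["increase"])
# Q["5 min"]["increase"] is now 0.1 * (1 + 0.9 * 0 - 0) = 0.1
```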
Basically, ALPHA is the learning rate: a constant between 0 and 1 that controls how quickly the values are updated. GAMMA is the discount factor applied to the future state. The idea behind GAMMA is that you want some percentage of the future state's value to flow back into the current state. If you are 100% confident that the future state (10 min), regardless of the action, should affect the current state and action, you can set GAMMA to 1. If you do not want the future state to affect the current update at all, you can set GAMMA to 0.
One more important thing to note is the selection of the best action in the next state using the max function: max(Q[10 min][increase], Q[10 min][reduce]). Here we select the larger of Q[10 min][increase] and Q[10 min][reduce]. This is the gist of Q-learning. By now, I am sure you have all the ingredients to update only the taken action of a particular state. If you are familiar with gradient descent, you may notice this formula resembles a gradient-descent update.
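Putting the ingredients together, a tiny training loop for the study-time scenario could look like the following. The toy environment, the +1/-1 rewards, and the step count are all assumptions built from the scenario above, and actions are picked at random here for simplicity:

```python
import random

random.seed(0)  # reproducible sketch
ALPHA, GAMMA = 0.1, 0.9
states = ["5 min", "10 min"]
actions = ["increase", "reduce"]
Q = {s: {a: random.random() for a in actions} for s in states}

def step(state, action):
    # Toy environment: "increase" earns 1 dollar, "reduce" loses 1 dollar.
    if action == "increase":
        return "10 min", 1.0
    return "5 min", -1.0

state = "5 min"
for _ in range(500):
    action = random.choice(actions)            # random exploration for this sketch
    next_state, reward = step(state, action)
    best_next = max(Q[next_state].values())    # best action in the next state
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
    state = next_state

# After training, "increase" ends up with the higher score in both states.
```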
Q-learning is as simple as that. If you are interested in the formula, you can dive deeper into the Bellman equation. Here, I have only explained the exploitation part, which means I will always increase the 5-minute study time during exam season. There is a possibility that we could reduce the study time and still get a good result. What strategies could we take to explore the “reduce” possibility? Three strategies (Epsilon Greedy, UCB1 and Thompson sampling) are explained here: https://alex-yeo.medium.com/reinforcement-learning-baby-steps-to-epsilon-greedy-and-ucb-1-ae84a5907f49
If you find this article helpful, please like and share with your friends. Thank you!
You may also refer to this clip.