Q-Learning — An Introduction

Jeremi Nuer
18 min read · Dec 2, 2021


Machines are getting smarter, and at many tasks, better than humans. Recently, an AI beat the reigning world champion of Go at his own game. How did the AI do it? Reinforcement Learning.

Real footage of Go champion Ke Jie questioning life after being completely demolished

This article is quite long, so feel free to stop and pick it back up later. The article is divided into 7 parts:

A Brief Overview of Reinforcement Learning

Transition Probabilities & Expected Return

Policies & Value Functions

The Bellman Equation

Introducing a Q-Learning Problem

Q-Learning Equation & Implementation

Final Thoughts

A Brief Overview of Reinforcement Learning

Feel free to skip this if you already feel comfortable about the basics of RL.

Reinforcement Learning is a unique type of machine learning, one that falls under neither supervised nor unsupervised learning. But the core concept is actually easy to understand and digest. Let’s break it down.

Reinforcement Learning is composed of two main entities: the agent and the environment. The agent is the AI, or model, that you are building, and the environment is, well, just that: the world the agent interacts with and resides in. These two entities interact with each other in a way that can be described by the Markov Decision Process.


The Markov Decision Process is the process in which an agent interacts with the environment, tries to fulfill an objective, and learns how to take correct actions in order to achieve that goal.

Firstly, an agent can take actions. These are predetermined and, for our cases, finite (though in more complex situations this may not be the case). These actions affect the environment in some way and change it, just as would happen in real life. Since the environment has changed, it is now in a new state.

The state is the set of current aspects of the environment that are sent back to, or observed by, the agent. In real life, this could be the information an animal gains from its five senses as it moves through its environment. Along with the state, the agent receives the reward from the environment.

The reward is an appraisal of the last action the agent took, and it is centered completely around the agent’s goal. If the goal of a squirrel is to get as many acorns as possible, then rewards would be based on how effectively the squirrel’s actions got it an acorn. After getting the information from the state of the environment, the agent takes another action, and the whole process repeats. This is the framework and backbone of reinforcement learning.
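To make the loop concrete, here is a minimal Python sketch of the agent-environment cycle described above. The `env` and `agent` objects and their method names are placeholders of my own, not the API of any particular library:

```python
# A minimal sketch of the agent-environment loop (Markov Decision Process).
# `env` and `agent` are hypothetical objects; the method names are placeholders.

def run_episode(env, agent):
    state = env.reset()          # observe the initial state of the environment
    done = False
    total_reward = 0
    while not done:
        action = agent.choose_action(state)             # agent picks an action
        next_state, reward, done = env.step(action)     # environment reacts
        agent.learn(state, action, reward, next_state)  # agent updates itself
        state = next_state                               # the new state becomes current
        total_reward += reward
    return total_reward
```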

Within Reinforcement Learning, there are many branches and methods of executing a Markov Decision Process, each with its own benefits and costs. Today, we will be focusing specifically on Q-learning. But before we dive deeply into it, be warned: there are quite a few complicated-looking equations, all strangely notated, so it’s important we understand what the different variables mean.

Transition Probabilities & Expected Return

This isn’t specific to Q-learning, but it is very important to understand the workings of any RL model! (including Q-learning, which is why I’m mentioning it lol)

The first thing you should know is that every state has a value, and every action in a specific state has a value. The ‘value’ of an action in a specific state is simply how good taking this action is in order to maximize the reward. The question is, how does our agent find out how good any action is to take, and in which states are these actions good?

This is where transition probabilities come in. If our agent is trying to guess how good any action is to take, it must be able to make good educated guesses on A: what the reward from taking this action will be, and B: what state we will end up in after taking this action. This information changes based upon two factors: the preceding state, and the action taken at that state. Formally, this is written as:

p(s', r | s, a) = Pr( St+1 = s', Rt+1 = r | St = s, At = a )

The left side is simply a condensed version of the right side

Written as a pseudo-equation, it states that the probability of s' (any chosen state) actually being the next state (at time step t+1), and of any given reward actually being the reward we receive, depends on the preceding state St and the preceding action At.

The reason transition probabilities are important is that once we can make educated guesses about future rewards, we can start to work out the return.

The return is the sum of all future rewards. The expected return is what our agent expects that sum to be. At any time step t, we can analyze what we expect the return to be from that point onwards. The expected return starting from any time step is denoted Gt, and written out formally, we state:

Gt = Rt+1 + Rt+2 + Rt+3 + ... + RT

You’ll often see Gt show up in equations for RL

The capital T is the final time step in our series of time steps, where our agent either ‘succeeds’ in its goal or ‘fails.’ In the case of a video game, the final reward RT could correspond either to defeating the boss and winning, or to dying and needing to restart.

A series of time steps leading to a clear break or end is called an episode (not the Netflix kind!). Not all scenarios where Reinforcement Learning is applied will have a clear separation into episodes, but basically any video game can be broken up into multiple episodes (another way to think of them is as ‘attempts’ at achieving the goal).

However, the expected return as currently written does not quite serve our goal of Q-learning. Consider this: if from any state you could theoretically achieve the goal at some point in the distant future, then every state and action, besides the ones that result in immediate failure, would have roughly the same expected return, because far enough into the future you could still achieve that goal.

We must find a way to prioritize speed, to favor short-term rewards. This might seem counter-intuitive, but prioritizing short-term rewards actually gives our agent a clearer sense of far-off rewards and a clearer goal to work towards. We do this by applying a discount to rewards that we will receive in the future. This is defined by the discount rate, or gamma. In formal equations, we write gamma as γ, a weird-looking y.

The discount rate is a number between 0 and 1, and you can think of it exactly like a discount rate. For future rewards, we will only take a certain percent of the future reward into account, instead of the full reward.

In mathematical terms, this means:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ...

This would go on until the final time step

Notice that the further into the future a reward is, the more heavily it is discounted: its weight shrinks exponentially. This helps our agent prioritize the actions that will get us the maximum reward as fast as possible. We can also do some handy math, through simple factoring, to make this equation even more powerful.

Gt = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ...) = Rt+1 + γGt+1

We’re just factoring. No magic to see here

This is super handy. Essentially, this is saying that the discounted expected reward at any time step is simply the reward you will get after this next action, plus the discounted expected reward at the next time step (t+1).

What this means for us, is that when our agent is calculating the discounted return at a time step, it simply needs to add the reward we will get at the next time step, and the discounted expected return from that next time step.

Our agent will never have to calculate down a long chain of rewards to find the discounted expected return. It will always just add the reward Rt+1 to the discounted expected return at t+1. How does it know the discounted expected return of that next time step, I hear you ask? Patience, my young Padawan. We will get there.
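As a quick sanity check, here is the discounted return computed both ways in Python, for a made-up list of future rewards and a discount rate of 0.9 (both are arbitrary example numbers):

```python
# Discounted return for a made-up list of future rewards [Rt+1, Rt+2, Rt+3, Rt+4]
rewards = [1, 0, 0, 10]   # arbitrary example rewards
gamma = 0.9               # discount rate

# Direct sum: Gt = Rt+1 + gamma*Rt+2 + gamma^2*Rt+3 + ...
direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive form: Gt = Rt+1 + gamma * Gt+1
def discounted_return(rewards, gamma):
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return(rewards[1:], gamma)

print(round(direct, 2))                              # 8.29
print(round(discounted_return(rewards, gamma), 2))   # 8.29
```

Both forms give the same number; the recursive one is exactly the factored equation above.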

For now, what I want you to really grasp is that the goal of the agent is to maximize the discounted expected return. Q-learning, as a whole, is predicated on the discounted expected return.

Although our equations will get more complicated, the simple truth is that the agent is trying to choose the actions that will maximize the sum of the rewards it will get in the future, and this sum is the expected return. In Q-learning, the agent learns which states and actions lead to which rewards, and then it bases its actions on that knowledge.

Policies & Value Functions

We will briefly discuss policies, although you don’t need to understand them in great depth here. The policy is denoted π. The policy is simply which action our agent will pick in a specific state. Often it is said that an agent “follows a policy.”

In more mathematical terms, the policy is a probability distribution that maps the likelihood of our agent choosing any of the actions at any given state. This can be described as:

π(a | s) = Pr( At = a | St = s )

The probability of taking action At depends on the state St

The policy really is just which action the agent chooses in any given state. For Q-learning, our policy is simply to choose the action that has the highest discounted expected return.

Now, let’s talk about value functions. There are two types of value functions: state-value functions and action-value functions. In Q-learning, we only use action-value functions, but we will briefly mention state-value functions, as they appear and are important in other areas of reinforcement learning.

State-value functions describe how good a specific state is for the agent, under a policy. Remember that a policy just lets us know exactly which actions the agent will take in different states. So, if for any policy we know which action we will take at every state, we can conclude what rewards we expect to get if we start at this state and follow this policy until the final time step T.

In other words, the state-value function is simply the discounted expected return from starting in that state, knowing which actions you will take at any states you encounter until the episode is over.

Mathematically, we define the value of a state as a function Vπ(s), and we set it equal to

Vπ(s) = Eπ[ Gt | St = s ]

The capital E stands for ‘expected.’ Remember Gt is the return

Action-value functions are similar to state value functions. The action value function helps determine how valuable any given action is in a specific state. Once again, the value of a state-action pair can only be stated in terms of the policy we are using. What is the action value function? Well, it’s actually just the discounted expected return of being in a state, choosing an action, and following a policy from then onwards.

No really, it’s just as simple as that.

qπ(s, a) = Eπ[ Gt | St = s, At = a ]

Value of action a, in state s

The capital E means ‘expected.’ Remember that we declared that Gt stands for the discounted return. So, in pseudo-equation, this is really just saying: the value of taking an action in a state, and following a policy onwards is equal to the discounted expected return, considering we take this specific action in this specific state.

The Q in Q-learning stands for the quality of taking an action in a specific state. So when we refer to the value of an action a, in state s, we call this the q-value of a state-action pair. This equation we marked above is called the q-function.

However, we must consider something. This q-function was defined only in terms of the policy we have. Well, how do we know which policy will maximize the rewards? How do we know what the right actions even are to take?

A policy is just which actions our agent takes in different states. The policy for q-learning will always simply be to choose the action that has the highest q-value from the possible actions that we can take in any given state. Well, most of the time. But we’ll get into that in a later section of this article.

Another way to describe this is the optimal action-value function.

The optimal action-value function, denoted q*(s, a), gives the largest expected return achievable by any policy π for each state-action pair. Since a policy can be thought of as simply which action you choose in any given state, this can be reworded as: “the optimal value function gives the largest expected return achievable from taking any actions in the future states.”
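In symbols, this is the standard definition of the optimal action-value function: the best q-value any policy can achieve for that state-action pair.

```latex
q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)
```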

When we calculate the q-value of a state-action pair, we always use the largest expected return achievable after taking this action. Why? Well, let’s think of it in terms of a video game.

Let’s say there is an action that can advance the agent towards the goal of defeating a boss. However, to be successful, it must perform a sequence of actions after that one, and if the agent messes up any of those actions, it will not receive the reward. Should we then base the value of this action on the scenario where it succeeds, or the scenario where it doesn’t? Obviously, we want to base it on the best-case scenario.

The Bellman Equation

One of the properties of the optimal action-value function is that it must satisfy the Bellman equation. This equation can look really intimidating, so I’ll introduce it, and then I’ll break it down and explain each little part.

Bellman Equation:

q*(s, a) = E[ Rt+1 + γ max a' q*(s', a') ]

Before I begin explaining this, I want to make one thing come across as strongly as possible: there is literally nothing in this equation that is new. You know all of it. Trust me.

Let’s start with the left side of the equation, q*(s, a)

This part simply represents the optimal value function for any given state-action pair. So, the right side of the equation must equal the highest possible discounted expected return we can get from taking a certain action in a certain state. Remember, the optimal value function is simply the largest return we can expect to get after taking a specific action in a given state.

The E outside of the brackets simply represents ‘expected.’ So we can assume that the things that are in the brackets are the highest possible return. To explain this, I want to remind you of the equation for discounted return.

Equation for discounted return:
Gt = Rt+1 + γGt+1

The part of the Bellman equation that represents the same thing:
Rt+1 + γ max a' q*(s', a')

We stated that the discounted return is simply the reward received at time step t+1, plus the discounted return from time step t+1. Well, for the maximum expected return possible from a state-action pair, we can state that it is simply the reward we will get after taking this action, plus the discounted value of the action in the next state that has the highest maximum expected return.

Another way you can think of it is like this: there is an optimal value function, or maximum expected return, for every action. Some actions will have a larger q-value (the output of the optimal value function) than others. You can assess the q-value of a state-action pair by simply looking at the next state (s') and taking the largest q-value among the actions possible in that state. Discounting that q-value by γ and adding the reward received (Rt+1) gives you the q-value of the state-action pair at the preceding time step.

Why do we only have to look at the very next state? Why do we not have to go through every state possible until the final time step T? Because, the q-value for any of the actions at that state (s’) will contain the maximum expected return you could ever get from taking that action. The q-value of a state-action pair represents the highest expected discounted reward ever achievable after taking said action.
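As a tiny sketch with made-up numbers, the right-hand side of the Bellman equation, often called the Bellman target, is just one reward plus the discounted best q-value in the next state:

```python
import numpy as np

gamma = 0.9
reward = -1.0                              # Rt+1 for this transition (made-up)
q_next = np.array([0.0, -1.0, 2.0, 10.0])  # q*(s', a') for each action in s' (made-up)

# One reward plus the discounted best q-value available in the next state.
bellman_target = reward + gamma * np.max(q_next)
print(bellman_target)   # -1 + 0.9 * 10 = 8.0
```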

This Bellman equation is used all over reinforcement learning. We use it in Q-learning to calculate the optimal action-value function of any state-action pair (i.e., the q-value).

With the ability to calculate the maximum expected discounted reward (whew that is a large term 😉), we can begin to see what a q-learning model even looks like.

Introducing a Q-Learning Problem

To do this, we’re going to use a simple game as an example. Let’s take a 3x3 grid.

My little makeshift game :)

In this grid, there are 9 possible states; each cell represents a different state. The stick figure, our agent (let’s call him Mario), has 4 possible actions: move up, move down, move left, and move right. The game ends in victory if Mario can make it to the 10 gold coins safely, and it ends in defeat if Mario ends up in the lava. The agent gets a reward of -1 if it moves to a state with nothing in it, a reward of +1 if it finds the single gold coin, a reward of +10 if it finds the 10 gold coins, and a reward of -10 if it gets electrocuted by lightning.

In Q-learning, we keep a matrix, or table, that maps every state-action pair to its q-value. Reminder: the q-value of a state-action pair is the maximum discounted expected return achievable by taking that specific action in the given state.

Our Q table will look something like this:

At first, all of the q-values will be initialized to zero. But as we take different actions in different states, we can update the q-values by looking back at the state-action pair and setting it to the reward we got for taking that action, plus the discounted largest q-value among the actions we can take in the new state.

Since in the beginning we do not know the q-values of any states and actions, the first q-values we calculate will simply be the immediate reward we got for taking that action. But as we fill in more and more q-values, when we visit a state for the second time we can update its q-value, now that we know much more about the rewards we will receive in future states.
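As a rough sketch, for the 3x3 grid above the table can simply be a 9 × 4 array of zeros. The use of NumPy and the row/column layout here are my own choices:

```python
import numpy as np

n_states = 9    # one state per cell of the 3x3 grid
n_actions = 4   # move up, down, left, right

# Rows are states, columns are actions; every q-value starts at zero.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)   # (9, 4)
```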

Reinforcement Learning truly is trial by fire.

There are just a few things left we have to cover before we can actually begin updating this q-table. Firstly, we must discuss the age-old dilemma: exploration vs. exploitation.

In the beginning of the learning phase, where the agent knows nothing, it will randomly choose actions, and in the process learn more about the environment. This is called exploration.

But imagine a scenario where the agent discovers the single coin in the middle square, but never learns about the 10 gold coins. If it were to follow the Q-table, it would simply move in and out of the single-coin square forever, since that is the action it perceives to have the most value, and it would never learn about the 10 gold coins.

To combat this, we have our agent spend all of its time in the beginning exploring, and as time progresses we slowly allow it to look at the Q-table and begin exploiting its knowledge. This is called the epsilon-greedy strategy.

In practice, we do this by defining a variable epsilon, ε. This variable starts at 1 and slowly decays over time. In the beginning, when it is 1 or close to 1, the agent’s actions will be completely random. As epsilon decays towards 0, the agent becomes more likely to exploit the knowledge it has gained.

Code example in Python
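Here is a minimal sketch of that selection rule in Python (the decay schedule and its numbers are illustrative choices of mine, not something Q-learning prescribes):

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() > epsilon:
        return int(np.argmax(q_table[state]))   # exploit: best known action
    return random.randrange(q_table.shape[1])   # explore: pick a random action

# One possible decay schedule: shrink epsilon a little after every episode,
# but never below a small floor so the agent keeps exploring occasionally.
epsilon, min_epsilon, decay = 1.0, 0.01, 0.995
for episode in range(1000):
    # ... run one episode, picking actions with choose_action(q_table, state, epsilon) ...
    epsilon = max(min_epsilon, epsilon * decay)
```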

In a program, we would do this by having a random number between 0 and 1 generated each time we’re about to take an action. While epsilon is close to 1, the likelihood that the random number is greater than it is very low. So, the likelihood that we exploit in the beginning is very low.

The last thing we have to be aware of is how we update the Q-table. When we visit a state a second time and take an action, we don’t update the q-value of that state-action pair by completely erasing the old q-value and replacing it with one calculated from more current information. Instead, we define a learning rate that determines how far we move in the direction of the new q-value.

This is helpful for non-deterministic environments, where the result of your action is not guaranteed. Let’s say there is an objectively good action to take in a certain state, but one time you got unlucky and died while taking that action. We shouldn’t override our previous knowledge of the value of this action in this certain state, but we certainly shouldn’t ignore that sometimes we do die when we take this action.

The learning rate, or alpha (α), helps us take both into account. Often the learning rate will be set somewhere between 0.9 and 0.99. I won’t go into detail about how the number is used to weigh the old q-value against the new one when calculating the update. All I’ll say is that a higher learning rate places more emphasis on the new q-value, whereas a lower learning rate places more emphasis on the old q-value.

Q-Learning Equation & Implementation

Finally, with all of this knowledge, we can introduce the equation for updating the q-value for any state-action pair.

qnew(s, a) = (1 - α) · q(s, a) + α · ( Rt+1 + γ max a' q(s', a') )

The learned value is simply the part of the Bellman equation used to calculate the new q-value

This equation essentially uses the Bellman equation to update the q-value, while making sure to take the learning rate into account. Let’s do a small example on the Q-table we made above, with arbitrary q-values.

We start in empty 1 (first row), and take action move right to end up in empty 2 (second row)

Let’s say we start in empty 1, and we move right into empty 2. Since there was nothing there, our agent gets a reward of -1. The alpha, or learning rate, is set to 0.99. The discount, or gamma, is set to 0.9. We just found Rt+1, which is -1. The old q-value for the state-action pair we just took was -1. If we look at the new state we are in (empty 2), the q-values for the possible actions in that state are (0, -1, 2, 10). We can clearly see that max q*(s', a') = 10. We can plug all of these numbers into our equation and update the q-value accordingly. Do the math yourself as you follow along, to make sure you understand how these values are calculated.

We have all of our necessary values: α = 0.99, γ = 0.9, Rt+1 = -1, the old q-value = -1, and max q*(s', a') = 10.

qnew = (1 - 0.99)(-1) + 0.99( -1 + 0.9 · 10 ) = -0.01 + 7.92 = 7.91

Therefore, our new q-value for moving right, when in the state ‘empty 1’ is 7.91. Our agent has learned that taking this action in this certain state has positive benefits for it, and it will prioritize it in the future.

Since all artificial intelligence is, in the end, programmed, let’s see what this looks like in code:

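Here is a minimal sketch of the update in Python, checked against the worked example above. The state and action indices are my own encoding of the grid (0 = ‘empty 1’, 1 = ‘empty 2’, action 3 = move right):

```python
import numpy as np

def update_q(q_table, state, action, reward, next_state, alpha=0.99, gamma=0.9):
    """Q-learning update: blend the old q-value with the Bellman target."""
    old_value = q_table[state, action]
    target = reward + gamma * np.max(q_table[next_state])   # Bellman target
    q_table[state, action] = (1 - alpha) * old_value + alpha * target

# Worked example from above: 'empty 1' -> move right -> 'empty 2'.
q_table = np.zeros((9, 4))           # 9 states, 4 actions (up, down, left, right)
q_table[0, 3] = -1.0                 # old q-value for (empty 1, move right)
q_table[1] = [0.0, -1.0, 2.0, 10.0]  # q-values for the actions in empty 2

update_q(q_table, state=0, action=3, reward=-1.0, next_state=1)
print(round(q_table[0, 3], 2))       # 7.91, matching the hand calculation
```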

Final Thoughts

Just like that, we have created a Q-learning algorithm! Unfortunately, this method is only really useful in scenarios with small state and action spaces. For little video games, it’s very effective (and fun!). But if we want to model a real-world situation with reinforcement learning, and have an agent learn to do truly incredible things, we will need to use deep reinforcement learning.

Deep reinforcement learning combines reinforcement learning and neural networks, to tackle some truly huge problems. But, that is an article for another day. 👀

Regardless, you should be inspired! This is the first step towards Reinforcement Learning mastery, and we have some foundation down. Even if you don’t walk away feeling awed, I want to leave this very important impression on you:

Reinforcement Learning can do big things, and they’re coming sooner than you think.

Don’t believe me? Check this out.

Self Driving Cars:

This car was taught to drive itself with RL. In a DAY.

Self driving cars are a hot topic these days, and for good reason! This car trained on only a limited area, but it’s still remarkable for how fast it learned. Check them out, they’re called Wayve.

Robots:

Boston Dynamics, the company that created this robot, relied mostly on hard-coded programming. But there are reinforcement learning robots out there right now that can open doors and do basic tasks! It’s only a matter of time before little robot helpers are running around in our everyday lives.

But besides the future, RL is being utilized right now.

Product Recommendation

Ever wondered how Netflix recommends their movies to you? Well, Reinforcement Learning is a key method used by not only Netflix, but many other companies to recommend products.

Congratulations on making it to the end! I know this was a lot to digest, so if you’re feeling unsure about any of the concepts, I highly recommend giving this online course a watch. It breaks everything down in an easy-to-understand way, and isn’t too long.

Happy learning!

Me, being oh so very proud of you
