A Comprehensive Guide to Deep Q-Learning

Jeremi Nuer
13 min read · May 19, 2022

Have you ever seen clips of artificially intelligent robots playing video games, and crushing humans at them?

Well, as an avid gamer myself, I decided that I wanted to understand exactly how people were building AI that could outperform humans (a mind-blowing task if you ask me!). The answer, of course, was human ingenuity. And a method called Deep Q-Learning.

Deep Q-Learning is the amalgamation of Reinforcement Learning and Neural Networks. Simple, yet very effective. It is a powerful tool for creating agents that can solve complex tasks. From Chess to Atari Breakout to FPS games, DQNs (Deep Q-Networks) can be applied to them all.

This article is quite long, so feel free to stop whenever and come back later. It’s broken up into 3 major parts:

1. Briefers

a) Reinforcement Learning Briefer

b) Q-Learning Briefer

c) Convolutional Neural Networks Briefer

2. Main Concepts

a) Deep Q-Learning Fundamentals

b) Fixed Q-Targets

c) Experience & Replay Memory

d) Epsilon-Greedy

3. Contextualization

a) Putting it All Together

b) Implementation

There are some prerequisites to this article: you should understand the concepts of Convolutional Neural Networks and Q-learning. I’ll briefly go over those topics, but for an easier understanding, you can read the articles I’ve written on both subjects here (for CNNs) and here (for Q-learning).


1a. Reinforcement Learning Briefer

As a brief reminder, Reinforcement Learning has two main components: the agent and the environment. The agent is our AI, which takes a different form depending on the RL method you are using. The environment is what the agent resides in, be it a game, a simulation, or real life.

The agent receives some information from the environment called the state. This information could be image data, or perhaps the coordinates of the agent and its surroundings. Regardless, the agent has to use the information from the state to choose an action. The action it chooses will then cause the environment to change, and the environment will return a new state to the agent.

Think of this as Mario deciding to jump in the video game: he might hit a brick, and a gold coin would appear. This is all information which would be conveyed in the next state. This state comes along with a reward, an appraisal of the agent’s last action. The reward is completely based upon the goal that you set for your agent to achieve. If the goal of an RL agent is to survive in a video game, it might receive a positive reward for continuing to survive, and a negative reward for dying. Each time the agent receives the appraisal of its previous action, it is able to learn which actions are the best to take, and under what circumstances.
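To make the loop above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `CorridorEnv` is a hypothetical toy environment (a five-cell corridor, not a real game), and the reward values are arbitrary. The point is only the shape of the interaction: state in, action out, then a new state and a reward come back.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk along a 5-cell corridor to reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the agent's position

    def step(self, action):
        # action 0 = move left, action 1 = move right (walls stop the agent)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed rewards: goal is good, wasted steps are bad
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # the agent picks an action from the state
        state, reward, done = env.step(action)  # the environment answers with state + reward
        total_reward += reward
        if done:
            break
    return total_reward


# A hand-written "agent" that always moves right.
total = run_episode(CorridorEnv(), lambda state: 1)
```

Here the agent is just a function from state to action. The rest of this article is about replacing that hand-written function with something that learns.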

1b. Q-Learning Briefer

Q-learning is a Reinforcement Learning method. It is based on value iteration, and works in a finite state and action space. This just means there is a specific number of actions the agent can take (like moving left, right, up, or down) and a specific number of states the agent can be in (like the number of spots on a Monopoly board). The fundamental idea behind Q-learning is that each state-action pair (taking a certain action at a specific state) has some value attached to it.

The same action taken at two different states would have different values, and two different actions taken at the same state would have different values.

The ‘value’ of each state-action pair corresponds to the immediate reward after taking said action at said state added to the expected sum of all future rewards if we take the best action possible until we succeed or die. Essentially, the value is how good our action was at progressing us towards our goal.

I say expected because it is impossible for the computer to know for sure what the best action is, or what the reward will be so far in the future. This is where value iteration comes into play.

Each time an action is taken, a Q-value is computed. If we’re at the very beginning of training the agent, there might not be any information on the actions that follow the one we just took. If so, the Q-value would only amount to the immediate reward we just received from taking that action.

But later, as we explore more and more states and take more and more actions, we’ll have information about other state-action pairs. We’ll have calculated more Q-values.

Remember that the Q-value represents the best information we have regarding that state-action pair: the immediate reward gained from it, summed with the greatest reward we have seen it lead to. From this comes a handy trick: we can calculate the Q-value of a state-action pair by summing the immediate reward we get from it with the highest Q-value we predict from the following state.

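As such, the equation for calculating the Q-value for any state-action pair is as follows:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

Here $s$ and $a$ are the current state and action, $r$ is the immediate reward, $s'$ is the following state, $a'$ ranges over the actions available there, and $\gamma$ is the discount factor (a number between 0 and 1 that makes future rewards count for slightly less than immediate ones).

(Note that this is not the full equation, but it describes essentially what is going on. The only difference in the full equation is that there is a learning rate, so the update also takes into account what the previous calculation for the Q-value was.)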

In normal Q-learning, the Q-values for every state-action pair are calculated and stored in a large table. That is where Deep Q-Networks differ: they use a neural network for that task instead.
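Here is a minimal sketch of what the table-based version looks like in Python, including the learning rate mentioned above. The standard update is $Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,]$. The function name `q_update`, the `done` flag, and the values of `alpha` and `gamma` are my own choices for illustration.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, n_actions,
             done=False, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to q_table in place and return the new Q-value."""
    # If the episode ended, there is no future reward to add.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way towards the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # the "large table": every unseen pair starts at 0

# First we see the final move into the goal, which gives a reward of 1.
first = q_update(q_table, state=3, action=1, reward=1.0, next_state=4, n_actions=2, done=True)
# Later, the move *before* it inherits part of that value, even though its own reward is 0.
second = q_update(q_table, state=2, action=1, reward=0.0, next_state=3, n_actions=2)
```

Notice how the second update gets a non-zero value purely from the Q-value of the state that follows it. That is the iterative part of value iteration: information about rewards slowly flows backwards through the table.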

The two most important things to understand going into the bulk of this article are: that the Q-value of each state-action pair is the reward received from taking that action, along with the sum of all future rewards if we take the best actions going forward, and that the Q-values of the state-action pairs are constantly being iteratively updated, as we learn more information about what rewards we get from different state-action pairs.

1c. Convolutional Neural Networks Briefer

I won’t spend too much time talking about CNNs, and I would highly recommend you check out my article on them to get a fundamental understanding of how exactly they work. For now I’ll just talk about what they do.

Convolutional Neural Networks are a form of neural network used mostly for image data analysis. Neural networks are essentially a composition of functions, which take data and transform it. For CNNs, image data is input as an array of pixel values, which the convolutional layers transform. The output of those layers is still an image-like array of values, but the values are altered to exaggerate certain features of the image, such as edges. This makes it much easier to analyze the image, which is done in the second half of a CNN using normal neural network layers.
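As a rough illustration of "altering pixel values to exaggerate features", here is a single convolution written in plain Python. The tiny image and the `vertical_edge` kernel are made up for this example. In a real CNN the kernel values are learned rather than hand-written, and (as in most deep learning libraries) the kernel is slid over the image without being flipped.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the kernel with the patch of pixels under it and add everything up.
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k_rows) for j in range(k_cols))
            row.append(total)
        feature_map.append(row)
    return feature_map


# A 3x4 "image": dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# A kernel that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
```

The resulting feature map is zero everywhere except in the column where the dark region meets the bright one, so the edge has been "exaggerated" while the flat regions have been suppressed.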

Main Concepts

2a. Deep Q-Learning Fundamentals

Deep Q-Learning is the combination of Q-learning and neural networks. Whereas regular Q-learning is only good enough to solve simple problems, ones that humans could easily solve themselves, Deep Q-Learning is truly where the power of Reinforcement Learning shines.

The concepts behind Deep Q-Learning are much the same as regular Q-learning. Each state-action pair has an associated Q-value, which tells us how much value we can get from taking the specified action at the specific state.

As mentioned earlier, this value is essentially the immediate reward plus the sum of all expected future rewards, discounted (future rewards count for slightly less than immediate ones).

The way DQNs (Deep Q-Networks) differ from normal Q-learning is that instead of having a table that stores the Q-values for every single state-action pair, the values are approximated as the output of a neural network.

The way this plays out is that as our agent moves through the environment, it interprets the state by looking at the image displayed on the screen (just as a human would). It does this with a neural network: the pixel values of the image are the inputs to the network, and the Q-values for each possible action at that state are its outputs.

In this sense, the purpose of the agent is to get as good as possible at estimating the Q-values of state-action pairs. At first, of course, the guesses will be completely random, because the neural network’s weights and biases are initialized randomly.

But we need a way to train the agent, to calculate some form of loss for the estimated Q-values. So how do we do this? We compare the original guess our neural network made to a slightly better guess it makes once it has more information on what happened. So, we need a way of calculating a slightly better guess.

Well, we use the same method of calculating Q-values as in regular Q-learning for our educated guesses. We state that the true Q-value is equal to the immediate reward from taking the action at the specified state, plus the discounted best Q-value among the actions available at the next time step.

In practice, we calculate this by taking the immediate reward received from the state-action pair, and adding it to the (discounted) highest Q-value that our neural network spits out for the next state.

So, we are comparing our originally guessed Q-value to a slightly better educated guess, one that takes into account the reward received from the state-action pair and the estimated best potential reward in the future (in other words, the highest Q-value we get from any of the actions at the next state).
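In code, the comparison between the original guess and the educated guess might look like the sketch below. The lists of Q-values stand in for the outputs of the neural network, and the numbers are invented. I am also assuming a squared-error loss and a discount factor of 0.9 here; other loss functions are used in practice too.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """The 'educated guess': immediate reward plus the discounted best next Q-value."""
    if done:
        return reward  # nothing comes after a terminal state
    return reward + gamma * max(next_q_values)


def squared_error_loss(q_values, action, target):
    """Compare the original guess for the action that was taken against the target."""
    return (q_values[action] - target) ** 2


q_values = [0.2, 0.5]       # network's guesses for the current state (one per action)
next_q_values = [1.0, 0.3]  # network's guesses for the state we ended up in

# Suppose the agent took action 1 and received a reward of 1.0.
target = td_target(reward=1.0, next_q_values=next_q_values, gamma=0.9)
loss = squared_error_loss(q_values, action=1, target=target)
```

Only the Q-value of the action that was actually taken enters the loss. The network's guess for action 0 is left alone, because we have no new information about it.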

2b. Fixed Q-Targets

There’s one issue with this approach, however. By using the same network for both our guess and our optimization target, we are effectively chasing our own tail. Let me explain.

If our neural network makes a bad estimate for the optimal Q-value (the second calculation), our AI will be optimizing for bad habits, as we are not actually moving in the right direction. Our network’s weights and biases will be updated in a way that reinforces that bad guess. Since the same network produces both values, when the same state-action pair comes up later, the bad guess will be reinforced even further, because our ‘optimal Q-value’ is now based on weights and biases that came from an invalid update.

The solution to this is to use two neural networks. The first (the policy network) is the one we train, and it estimates the Q-value of the state-action pair (the original guess). The second (the target network) is used to calculate the optimal Q-value (the educated guess), which we use to compute the loss for the first network. The second network starts with the same weights and biases as the first, but it is only updated to equal the first network every so often. This way, the first network is given enough time to ‘reach’ the values that the second network had calculated, before the second network is updated to more accurately represent the agent’s knowledge of the environment.

In practice, this means there is a second neural network for calculating the Q-values of the state after the one you are training on. That second neural network is updated to equal the first neural network after a certain number of time steps.
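The syncing schedule is easier to see in code than in words. Below is a minimal sketch where each "network" is just a dictionary of weights (a hypothetical stand-in class, `TinyNetwork`), and "training" is faked by adding 1 to every weight. The sync interval of 3 steps is an arbitrary assumption.

```python
import copy


class TinyNetwork:
    """Hypothetical stand-in for a neural network: nothing but a dict of weights."""

    def __init__(self, weights):
        self.weights = weights


policy_net = TinyNetwork({"layer1": [0.0, 0.0]})
# The second network starts out as an independent copy of the first.
target_net = TinyNetwork(copy.deepcopy(policy_net.weights))

SYNC_EVERY = 3  # assumed value; the right interval depends on the problem
history = []

for step in range(1, 7):
    # Pretend a training step nudged the first network's weights.
    policy_net.weights["layer1"] = [w + 1.0 for w in policy_net.weights["layer1"]]

    # Only every SYNC_EVERY steps does the second network catch up.
    if step % SYNC_EVERY == 0:
        target_net.weights = copy.deepcopy(policy_net.weights)

    history.append((step, policy_net.weights["layer1"][0], target_net.weights["layer1"][0]))
```

Between syncs, the second network's weights stay frozen while the first network keeps changing, which is exactly what keeps the target from moving every time we update. With PyTorch modules, this kind of copy is typically done with `target_net.load_state_dict(policy_net.state_dict())`.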

2c. Experience & Replay Memory

Another important aspect of Deep Q-Learning that is very effective, and broadly (but not always!) used, is experience and replay memory.

Essentially, the idea behind this concept is that instead of training on each state-action pair right after we go through it, we store the ‘experiences’ (which contain the state, the action taken, the reward received, and the next state) in one long list. This list is called the replay memory, because it holds a ‘memory’ of all the experiences the agent has had. We then ‘replay’ the memory for training by randomly sampling experiences from it.

The reason we do this is to eliminate the correlation that comes up when we train on experiences that happen one right after another. By randomly training on different situations at different times, we’re able to more objectively learn what actions actually correlate to a high q-value.
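A replay memory takes only a few lines of Python. This is a sketch under a couple of assumptions that go beyond the description above: I give the memory a maximum `capacity` (so the oldest experiences are dropped once it is full), and I store an extra `done` flag that marks whether the experience ended the game.

```python
import random
from collections import deque, namedtuple

# One experience: what the agent saw, did, received, and saw next.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayMemory:
    """A fixed-size list of experiences that can be sampled at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks up the order in which experiences happened.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = memory.sample(2)  # two randomly chosen experiences out of the three kept
```

After five pushes into a memory of capacity three, only the three most recent experiences remain, and each call to `sample` returns a different random subset of them.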

Finally, let’s talk about an important concept from Q-learning that is also used in Deep Q-Learning.

2d. Epsilon-Greedy

The main idea behind reinforcement learning in general is to use the knowledge we have gained about the environment to take good, informed actions that maximize the reward.

One issue that often comes up is that an agent can find one thing that works semi-effectively and just stick to that tactic, maximizing the reward for it, when it isn’t even the best one that exists. In this scenario, the agent did not explore the environment enough to learn about other, better tactics.

The best way for an agent to act is to explore the environment in the beginning phase, and learn as much as possible about which scenarios lead to which rewards. Once it has sufficiently explored, it can begin to exploit its knowledge, and take the actions that it knows will result in the highest Q-values.

This strategy is called the epsilon-greedy strategy.

In practice, we do this by defining a variable epsilon, ε. At every step, the agent takes a random action with probability ε, and otherwise takes the action with the highest Q-value. Epsilon starts at 1 and slowly decays over time. In the beginning, when it is 1 or close to 1, the agent’s actions will be almost completely random. As epsilon decays towards 0, the agent will be more and more likely to exploit the knowledge it has gained.
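Here is a minimal sketch of epsilon-greedy action selection with a decaying epsilon. The multiplicative decay rate of 0.995 and the floor of 0.01 are assumptions for illustration; there are many other decay schedules.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit (best action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.995, minimum=0.01):
    """Shrink epsilon a little, but never below a small floor."""
    return max(minimum, epsilon * decay)


epsilon = 1.0  # start fully random
for _ in range(1000):
    epsilon = decay_epsilon(epsilon)

# With epsilon at 0 the choice is purely greedy: always the highest Q-value.
greedy_choice = select_action([0.1, 0.9, 0.3], epsilon=0.0)
```

Keeping a small floor on epsilon means the agent never stops exploring entirely, so it can still stumble on a better tactic late in training.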


3a. Putting it All Together

This might all seem like a jumbled mess, so let’s put the pieces together to show the step by step process of how Deep Q-Learning works in practice.

Initialization: We have two neural networks that are initialized with the same weights, and a list for storing all the experiences. We also initialize a host of numbers that control the different processes, e.g. the batch size for the neural networks, the learning rate, etc. It’s really not important to go over these here; we’ll see them in the code. A lot of research has already gone into good values for these variables, so besides a little fiddling, there isn’t much that needs to be done there.

The “training loop”

1. The agent starts the game. It will choose an action via exploration/exploitation.

a) In the beginning, the actions will be mostly random. Towards the end, the actions will be mostly intentional.

b) When actions become intentional, they are chosen by the following process: the state is input into the neural network, the network outputs a list of Q-values (one per action), and the action with the highest Q-value is chosen.

2. After taking an action, the agent records the following information: the state the agent started in, the action it took, the reward it got, and the following state it was in.

a) One ‘set’ of these numbers is known as an experience.

b) The experiences are stored in the Replay Memory, which is just a list of all the experiences. We will use the experiences later.

3. Every couple of time steps, the agent randomly samples a set of experiences for the networks to train on.

a) The policy network takes the original state as input, and spits out the Q-values for each possible action. But we only care about the Q-value of the action that was actually taken in the experience.

b) Since we have the reward and following state in the experience from taking that action, we have slightly more information on what the Q-value should be. We calculate the optimal Q-value (using the target network!) by adding the reward to the discounted largest Q-value that the target network outputs for the following state. This satisfies the Bellman equation (see 1b).

c) We train the policy network by comparing its original Q-value for the action the agent took in the experience to the optimal Q-value that we calculated in step 3b. The difference between the two is the ‘loss’ of our neural network.

d) The target network is not trained at this step; its parameters remain fixed and are only updated infrequently.

4. Every dozen or so time steps (the number varies), the target network’s parameters are updated to match the parameters of the policy network.
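To make the four steps above concrete, here is a minimal sketch of the whole loop in plain Python. It is not the CartPole implementation from the video. The environment is a hypothetical five-cell corridor, and the "neural network" is replaced by the simplest possible stand-in (one weight per state-action pair), so that the example runs without any libraries. All the constants (`GAMMA`, `LR`, `BATCH_SIZE`, `SYNC_EVERY`, the epsilon schedule) and the `done` flag are assumptions chosen for illustration, and for simplicity it trains on a batch at every step.

```python
import random
from collections import deque

N_STATES, N_ACTIONS = 5, 2       # corridor cells; actions: 0 = left, 1 = right
GAMMA, LR = 0.9, 0.1             # assumed discount factor and learning rate
BATCH_SIZE, SYNC_EVERY = 8, 20   # assumed batch size and target-sync interval


def env_step(state, action):
    """Toy environment: reaching the last cell gives reward 1 and ends the episode."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, seed=0):
    rng = random.Random(seed)
    # Initialization: two "networks" with identical weights, plus the replay memory.
    policy = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    target = [row[:] for row in policy]
    memory = deque(maxlen=1000)
    epsilon, step_count = 1.0, 0

    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap the episode length
            # Step 1: choose an action via exploration/exploitation.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: policy[state][a])

            # Step 2: act, then store the experience in the replay memory.
            next_state, reward, done = env_step(state, action)
            memory.append((state, action, reward, next_state, done))
            state = next_state
            step_count += 1

            # Step 3: sample random experiences and train the policy "network".
            if len(memory) >= BATCH_SIZE:
                for s, a, r, s2, d in rng.sample(list(memory), BATCH_SIZE):
                    best_next = 0.0 if d else max(target[s2])  # 3b: uses the target network
                    td_error = r + GAMMA * best_next - policy[s][a]
                    policy[s][a] += LR * td_error              # 3c: move towards the target

            # Step 4: every so often, copy the policy weights into the target.
            if step_count % SYNC_EVERY == 0:
                target = [row[:] for row in policy]

            if done:
                break
        epsilon = max(0.05, epsilon * 0.98)  # explore less as training goes on
    return policy


policy = train()
# The greedy action in every non-terminal cell after training.
greedy_actions = [max(range(N_ACTIONS), key=lambda a: policy[s][a])
                  for s in range(N_STATES - 1)]
```

After training, the greedy action in every cell is "move right", which is the shortest way to the goal. A real DQN keeps this exact structure and swaps the table of weights for a neural network trained by gradient descent on the same error.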

3b. Implementation

So how can we apply our new-found knowledge to actual scenarios, real video games? Well, there are three ways to go about it. The first option is to create your own environment. Maybe you want to simulate solitaire: you would have to simulate all the rules, figure out how much reward to give the agent depending on what happens, and figure out exactly what information to convey in the state.

The second option is to create your own way to stream data from a video game into the program. If you can code a way to take each frame the video game displays, record its pixel values, and send the numbers into your program, then your agent could play traditional video games.

I don’t know how to do either of these options, yet! In my next project I will dive into creating my own environment. But for now, I’ll settle for the third option.

The third, and easiest, option is to use a library which already exists, and has pre-made games specifically for reinforcement learning agents. Enter: the OpenAI Gym library. OpenAI (one of the biggest companies in the AI space) has a gym library with ready-made game environments that you can easily interact with and gather information from. It makes connecting the program to the video game much easier.

This is what I’ll be doing: creating a DQN agent on OpenAI Gym’s CartPole environment. This is a simple game where the AI tries to balance a pole that is on a cart by moving the cart left and right.

If you’re confident in your coding abilities, and want a challenge, try to code a DQN agent which can play this game effectively on your own! But if you want to follow along more closely with a tutorial, check out this YouTube video. I go in depth into every line of code, and what it means.

Part 1 of a 4 part series. You can find the other videos on my channel

There aren’t many prerequisites to this. You can use any coding platform you would normally use that supports Python. If you don’t have anything downloaded, look up any tutorial for downloading a platform (it could be VS Code, it could be anything). Additionally, if you don’t want to download any software, you can use Google Colab.

Once you have a way to write code, there are some important libraries that we will need to install. I’ll go more in depth into that in the YouTube video. The most important thing to note is that we will be using PyTorch as our machine learning framework, and we’ll be training on the CartPole environment from OpenAI Gym.

And thus concludes this article! Good luck if you’re planning to continue by coding an implementation of this on the CartPole environment. It may not seem like a lot, but what we’re doing here is a big step on our journey towards AI mastery.

I’ll see you on the other side


