Introduction to Reinforcement Learning with Deep Q-Learning

Neural Networks (NNs) are widely used to solve computer vision, regression, and classification problems. Another branch of Machine Learning that has attracted a lot of attention is Reinforcement Learning (RL).

Reinforcement Learning

To explain how RL works, we can compare it to the way animals learn. Picture a person who does not know the flavor of some food and tastes it for the very first time. This person may judge the food to be good or bad, and will use this new knowledge to decide whether or not to eat it next time.

The same concept applies to RL: the algorithm builds up knowledge about the task it is trying to perform and uses it to make the best choices to accomplish that task.

In this context, we have some components that are present in Reinforcement Learning algorithms: agent, environment, state, action and reward. The image below shows the relationship among these parts.

Relationship between agent and environment. Source:

The main idea is that the agent chooses the best action for a given state, that is, the action with the highest expected value. The challenge of RL is to create algorithms that make these choices as well as possible.

In this article, the focus is the solution called Deep Q-Learning proposed by DeepMind Technologies researchers.

Deep Q-Learning

Before we try to figure out what “deep” stands for in this context, let’s understand what Q-Learning is.

Q-Learning is an algorithm that uses a table (the Q-Table) to map each state-action pair to a value. So our agent only needs to look up the current state in the table and choose the action with the highest value.

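The values in the Q-Table are calculated during an exploration phase. During this stage the agent chooses random actions and, based on the corresponding rewards, the table is updated according to the equation:

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

Here s′ is the state reached after taking action a, and a′ ranges over the actions available in s′. The remaining symbols are:
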
  • Q(s, a) is Q-Function;
  • s is the state;
  • a is the action;
  • α is the table learning rate;
  • r is the reward;
  • γ is the discount factor, a penalty applied to the value of the next state's
    actions, since these future rewards are not guaranteed to happen.
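
As a rough illustration of how these symbols fit together, the sketch below applies one table update, assuming the standard Q-Learning rule Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, a′) − Q(s, a)]. The function name `q_update`, the dictionary-based table and the numbers used are assumptions made for this example.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.99):
    """Apply one Q-Learning update in place and return the new Q(s, a)."""
    # Best value the table currently assigns to any action in the next state.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    # Target: immediate reward plus the discounted value of the next state.
    td_target = reward + gamma * best_next
    # Move Q(s, a) a fraction alpha of the way towards the target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical two-state, two-action problem.
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_table[(1, 0)] = 2.0         # pretend we already learned something about state 1

new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, n_actions=2, alpha=0.5, gamma=0.9)
print(new_value)  # approximately 0.5 * (1.0 + 0.9 * 2.0) = 1.4
```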

The exploration phase mentioned previously is important because it prevents the agent from only exploiting actions that are known to produce good rewards and never discovering new states that could offer even better ones. Think of this problem as choosing a dish in a restaurant: you are more likely to order a dish you already know than a new one that you may not like. In ML, this trade-off is known as Exploration/Exploitation.
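
One common way to balance the two is an epsilon-greedy policy: with probability epsilon the agent picks a random action (exploration), otherwise it picks the action with the highest known value (exploitation). This is a swapped-in illustration rather than something the text above prescribes, and the name `epsilon_greedy` and the sample values are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        # Explore: any action, regardless of its current value.
        return random.randrange(len(q_values))
    # Exploit: the action with the highest current value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical values for three actions in some state.
q_values = [0.1, 0.9, 0.3]

greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # never explores
random_action = epsilon_greedy(q_values, epsilon=1.0)   # always explores
```

In practice epsilon often starts close to 1 and is reduced over time, so the agent explores a lot at first and relies more on what it has learned later.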

A drawback of solutions that use a Q-Table is that, for problems with a huge number of states, the computational cost of building the table is high. Imagine games where each frame is a new state: depending on the number of pixels per frame, the number of possible states can be enormous.
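
To get a feel for the scale, the short calculation below counts the possible frames for a hypothetical 84x84 grayscale screen with 256 intensity levels per pixel. These dimensions are an assumption chosen for illustration, not values taken from the text above.

```python
import math

# Hypothetical frame format, chosen only for illustration.
WIDTH, HEIGHT, LEVELS = 84, 84, 256

pixels = WIDTH * HEIGHT
# Each pixel can take LEVELS values, so there are LEVELS ** pixels distinct frames.
# That number is far too long to print, so we count its decimal digits instead.
digits = math.floor(pixels * math.log10(LEVELS)) + 1

print(f"{pixels} pixels -> a state count with {digits} decimal digits")
```

A table with one row per possible frame is clearly out of reach, even though most of those frames would never appear in a real game.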

With that in mind, the researchers proposed Deep Q-Learning, which uses Neural Networks to replace Q-Tables and allows higher-dimensional inputs.

The principle is an NN that receives a state as input and outputs one value for each possible action, so that the action with the highest value can be chosen.
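
A minimal sketch of that idea, using NumPy and randomly initialized weights instead of a trained network. The layer sizes (4 inputs, 16 hidden units, 2 actions) and the names `params` and `q_values` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 4-dimensional state and 2 possible actions.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2

# Untrained weights of a network with one hidden layer.
params = {
    "w1": rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "w2": rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}


def q_values(params, state):
    """Forward pass: state in, one estimated value per action out."""
    hidden = np.maximum(0.0, state @ params["w1"] + params["b1"])  # ReLU
    return hidden @ params["w2"] + params["b2"]


state = np.array([0.01, -0.02, 0.03, 0.04])  # hypothetical observation
q = q_values(params, state)
action = int(np.argmax(q))  # greedy choice: the action with the highest value
```

Compared with a Q-Table, the network never stores a row per state; it computes the values on demand, so states it has never seen still produce an output.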

What we want from the NN is to minimize the error between the Q-Function’s value for the current state and the target computed from the reward and the next state:

L = (r + γ max_a′ Q(s′, a′) − Q(s, a))²
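
The worked example below computes this kind of error for a tiny batch of two transitions, assuming a squared-error loss between Q(s, a) and the target r + γ max Q(s′, a′). All numbers are made up, and the `dones` flag, which drops the next-state term when an episode has ended, is an assumption that is common in practice but not stated above.

```python
import numpy as np

gamma = 0.9

# Hypothetical network outputs for two transitions and two actions.
q_current = np.array([[1.0, 2.0],    # Q(s, .) for transition 0
                      [0.5, 0.0]])   # Q(s, .) for transition 1
q_next = np.array([[3.0, 1.0],       # Q(s', .) for transition 0
                   [0.0, 4.0]])      # Q(s', .) for transition 1
actions = np.array([1, 0])           # action actually taken in each transition
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])         # assumption: 1.0 marks a terminal transition

# Target: reward plus discounted best next value (no next value after the end).
targets = rewards + gamma * q_next.max(axis=1) * (1.0 - dones)

# Prediction: the value the network gave to the action that was taken.
predicted = q_current[np.arange(len(actions)), actions]

# Mean squared error between prediction and target.
loss = np.mean((targets - predicted) ** 2)
print(targets, predicted, loss)
```

Training then consists of adjusting the network weights to reduce this loss, typically with gradient descent.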

Now we are going to implement a simple Deep Q-Learning solution.


For our example, we are going to use OpenAI Gym, a toolkit that provides several reinforcement learning problems. Specifically, we will train an agent to play the “CartPole-v1” environment.

In this environment the challenge is to keep an inverted pendulum in a vertical position by moving a cart to the left or right. The episode ends if the pole tilts more than 12º from the vertical axis or the cart moves too far from the center.

The complete implementation is available in the repository:

As we can see in the code, one of the optimizations used is called Experience Replay. It is a buffer that records the state, action, reward and next state of each step; during the training phase, a random batch is sampled from this buffer to update the network. This prevents the network from "forgetting" past behaviors and breaks the correlations caused by using consecutive frames.
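
A replay buffer of this kind can be sketched in a few lines of plain Python. This is not the code from the repository; the class name `ReplayBuffer`, its methods and the capacity used are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the ordering of consecutive steps.
        batch = random.sample(list(self.memory), batch_size)
        # Regroup into (states, actions, rewards, next_states).
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.memory)


# Hypothetical usage with integers standing in for real states.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=step % 2, reward=1.0, next_state=step + 1)

states, actions, rewards, next_states = buffer.sample(batch_size=2)
```

With a capacity of 3 and 5 pushes, only the last three transitions remain, which shows how old experience is eventually replaced. Implementations often also store a flag marking whether the episode ended at that step.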

The training result is shown in the GIF below:

Trained agent


With a simple Neural Network we were able to train an agent to play the “CartPole-v1” environment from OpenAI Gym.

Since the Deep Q-Learning algorithm is generic, it could be applied to other environments without many modifications. DeepMind, for example, has already tested it on Atari 2600 games, as described in their article.