Geek Culture
Published in

Geek Culture

Photo by @cpwrm on Unsplash

A Simple Guide To Reinforcement Learning With The Super Mario Bros. Environment


Let’s say we want to design an algorithm that will be able to complete the first level of the Super Mario Bros. game. How will we do that?

The goal of this game is quite simple — get the flag at the rightmost end of the level as fast as possible. To do that, we need to observe the current state of the game s, move Mario by pressing control buttons (let’s call it an action a), and check how far we are from the goal after that (in reinforcement learning action’s performance is evaluated by reward r), observe some new state s’ (new frame in our case) and make the next move. As a result, we will get a sequence of actions, states and rewards called a trajectory:

A trajectory of (s,a,r) tuples

In more formal terms, this environment can be described as a Markov Decision Process if we act by considering multiple frames at once to incorporate the character’s speed. It means that transitions between states only depend on the latest pair of state and action, without a prior history.

Let’s call an estimator (it could be a neural network) that takes a state and tells us which action to take a policy, so our goal is to maximize the expected return over a trajectory:

Policy performance J

Policy Gradient

To maximize the expected return, we would like to optimize the policy parameters by gradient ascent, just like any other neural network:

The gradient of policy performance J is called the policy gradient. Now we need to express the policy gradient in a form that will be computationally feasible. First, let’s derive the analytical form of policy gradient:

As you can see in the last formula, here we take a logarithm of the probability of a trajectory that could be expressed like this:

As a result, derivation of policy gradient takes the form:

But, actually, we should only increase the possibility of actions considering their consequences, and also introduce discount factor γ between [0,1] as a penalty to the uncertainty of future rewards:

We can estimate this expectation by sampling by collecting a set of trajectories where an agent produces actions according to the current policy. This family of optimizations is called on-policy methods.


Let’s consider a parameterized probability distribution over a random variable, so:

This means we could add or subtract from the policy gradient expression any baseline function which only depends on the state without changing the resulting expectation:

In practice, the most common choice of the baseline is the on-policy value function V(s) that is approximated by a neural network (sometimes called critic), which is updated concurrently with the policy. In general, baseline helps to reduce variance and increase the stability of the training process.

Proximal Policy Optimization With Generalized Advantage Estimation

The vanilla policy gradient method has quite a poor sample efficiency because of the high variance of gradient estimation. It is caused by the difficulty of credit assignment to the actions which affected the future rewards, especially in environments with long trajectories. This variance is sometimes reduced by employing the baseline as mentioned above. But there is another way to estimate the rewards of a trajectory — importance sampling. In this method, the expected reward can be computed with a different policy and later refined by the importance ratio r:

So the objective function takes the following form, where the A could be the discounted return as in the policy gradient method or the advantage function as in GAE (explained below):

If both policies are the same, then this formula directly translates into the policy gradient:

Actually, the importance ratio only depends on the policies, not the environment:

Another reason for the low sample efficiency of the policy gradient is that it requires collecting new trajectories of samples using the new policy to compute the next gradient update. Old samples collected using the old policy are not reusable. So, to overcome this problem, we could introduce a surrogate objective function optimization of which guarantees a policy improvement as long as new and old policies are similar. This error could be bounded by the KL divergence between these two policies. The added constraint helps us not to take over-optimistic actions that will hurt the training progress.

Considering these facts, the TRPO paper introduced a novel approach, which aimed to maximize the objective function while enforcing the KL-distance between old and new policies small enough (so-called trust region constraint):

TRPO objective function

The TRPO authors proposed a quite sophisticated method of approximately solving this problem, but OpenAI created a new way to limit the size of the policy update by introducing a Clipped Surrogate Objective function in their Proximal Policy Optimization paper:

The first term inside the minimization is the same as above, but in the second term, the ratio is clipped between (1 ε, 1+ε).

When applying PPO on the neural network with shared parameters for both policy (actor) and value (critic) functions, in addition to the clipped surrogate, the objective function is combined with an error of the value estimation (the second term) and an entropy bonus (the third term) to encourage sufficient exploration (with c(1) and c(2) as hyperparameters):

The error of the value estimation is just an MSE loss between actual returns and the ones estimated by the critic.

Also, because of the clipped surrogate objective function, the PPO algorithm allows us to run multiple epochs of gradient ascent on the same samples without harming our policy. This approach dramatically increases sample efficiency.

Advantage Function

The last thing we need to figure out about the PPO algorithm is how to calculate the advantage values. Unlike the Policy Gradient method, the PPO algorithm implies the usage of two estimators — actor and critic. The second one — the critic — returns an estimation of the expected return for a given state. So, naturally, we want to use this estimation to calculate advantage — a value that tells us whether the actions chosen by the model lead to a better or worse reward than we expected:

We take an exponentially weighted average of the advantages to balance bias and variance. This is called Generalized Advantage Estimation:

Double Deep Q-Network

Another approach to solving reinforcement learning problems is to approximate an optimal action-value function, which gives us the maximum expected return if we start in the state s and take some action a:

Optimal Action-Value function

This function obeys the Bellman equation, which tells us that if the optimal value Q for the state s is known for all possible actions a, then the optimal strategy is to select the action a maximizing the reward plus the value of state you land next:

The Bellman equation (the next state s’ is sampled from the environment’s transition rules P)

The simplest (and the oldest) algorithm for estimation optimal Q values for all state-action pairs is Q-learning. It assumes we have a large look-up table for all such pairs, and within one episode, estimation occurs as follows:

  • Observe the current state s at time step t.
  • Select and perform an action a with the highest Q value for the current state (or pick the random action sometimes — it’s called the ϵ-greedy approach).
  • Observe the next state s’ and the reward r’ at time step t+1.
  • Update the Q-values according to the formula below. Here we estimate Q’ out of the best Q values for the next state, but which action a’ leads to this maximal Q is not quite important:

Memorization of all the Q values for all state-action pairs is impractical. A much better way is to approximate Q values using a function. This is where the neural network comes into play.

An algorithm called Deep Q-Network greatly improved training stability and reduced resource requirements by introducing two innovative approaches — experience replay and the periodically updated target network.

Experience replay mechanism uses a single replay memory of fixed size where the N last (s, a, r, s’) tuples are stored. These samples are randomly drawn from the replay memory during training. It significantly increases sample efficiency and reduces correlations between sequences of observations.

A periodically updated target network means keeping a separate cloned instance of the neural network whose weights are frozen and synced with the leading network only every C step. This network is used to estimate target Q values during training. This mechanism reduces the impact of short-term fluctuations and thus stabilizes the training process.

The loss function for the Deep Q-Network looks like this (where U(D) is a uniform random sampling from the replay memory):

Deep Q-Network loss function

Unfortunately, Q-Learning (and the DQN based on it) algorithm has a major drawback — it tends to significantly overestimate action values. This happens because the Q-Learning algorithm uses the same set of samples to find the best action (with the highest expected reward) and to estimate the action value. So if the action’s value was overestimated and it was chosen as the best action, then the Q value is overestimated too. If the overestimations over all the Q values are not uniform (this largely depends on the environment’s transition rules and size of the action space), we will spend more time exploring such non-optimal states, and the learning process will be slow. You can find a more formal explanation here.

That’s why in 2010, Hado van Hasselt introduced a new way to estimate Q values called Double Q-learning. This method uses two estimators A and B, which are updated alternately. If we want to update estimator A, then the Q value for the next step is evaluated using estimator B. This approach fixes the overestimation problem because one of the estimators might see samples that overestimate action a1, while the other sees samples that overestimate action a2.

So in 2015, the same team published the paper with an updated version of the DQN algorithm called Double Deep Q-Network that has the following loss function:

Double Deep Q-Network loss function

As you can see here, the selection of the action in the argmax is decided by the online network, so this value estimation follows the greedy policy according to the current values, but we use the target network to evaluate the value of this policy. We can update the target network by periodically syncing its weights with the online network or switching roles of these two networks. This loss function will be used later in the practice section of this article.


In the practice section of this article, we will use the first level from the Super Mario Bros. game as an environment. By default, the single observation is a 240 x 256 pixels RGB image, so we need to write a few wrappers to transform it to a grayscale image with a resolution of 84 x 84 pixels. Also, not all observations are useful, so that we will use only every fourth observation and stack them together:

Now we can create our environment and set the random seeds to the fixed values to get reproducible results. Also, for simplicity, the action space was limited to two actions — moving to the right and a combination of moving to the right and jumping:

In all the implementations below, I tried to use the same model architecture, optimizer, learning rate, and common hyperparameters such as gamma to directly compare the algorithms’ efficiency later.

Policy Gradient

This implementation of the Policy Gradient algorithm directly follows the algorithm described in the previous section and uses the PyTorch Distributions package:

Policy Gradient implementation
The plot of average reward per 10 episodes

Double Deep Q-Network

The DDQN implementation differs from the theory described above only in one detail — SmoothL1Loss is used instead of MSE for better results:

Double Deep Q-Network implementation
The plot of average reward per 10 episodes

Proximal Policy Optimization With Generalized Advantage Estimation

The PPO+GAE algorithm is implemented by following the theoretical explanation as close as possible with the introduction of random minibatches for training stability:

PPO+GAE implementation
The plot of average reward per 10 episodes


As you can see from the plots above, the relative sample efficiencies of the described algorithms follow their theoretical estimations.

This project is also available on my GitHub.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store