# A Simple Guide To Reinforcement Learning With The Super Mario Bros. Environment

# Theory

Let’s say we want to design an algorithm that will be able to complete the first level of the Super Mario Bros. game. How will we do that?

The goal of this game is quite simple — get the flag at the rightmost end of the level as fast as possible. To do that, we need to observe **the** **current state of the game s**, move Mario by pressing control buttons (

**let’s call it an action**), and check how far we are from the goal after that (

*a***in reinforcement learning action’s performance is evaluated by reward**observe some

*r*),**new state**and make the next move. As a result, we will get a sequence of

*s’*(new frame in our case)**actions, states and rewards**called

**a trajectory:**

In more formal terms, this environment can be described as a **Markov Decision Process**** **if we act by considering multiple frames at once to incorporate the character’s speed. It means that transitions between states only depend on **the latest pair of state and action, without a prior history.**

Let’s call an estimator (it could be a neural network) that takes **a state** and tells us which action to take **a policy**, so our goal is to maximize the expected return over a trajectory:

# Policy Gradient

To maximize the expected return, we would like to optimize the policy parameters by gradient ascent, just like any other neural network:

The gradient of policy performance ** J **is called the

**policy gradient.**Now we need to express the policy gradient in a form that will be computationally feasible. First, let’s derive

**the analytical form of policy gradient**:

As you can see in the last formula, here we take a logarithm of the probability of a trajectory that could be expressed like this:

As a result, **derivation of policy gradient** takes the form:

But, actually, we should only increase the possibility of actions considering their consequences, and also introduce discount factor **γ **between **[0,1] **as a penalty to the uncertainty of **future rewards:**

We can estimate this expectation by sampling by collecting a set of trajectories where an agent produces actions according to the current policy. This family of optimizations is called **on-policy methods.**

## Baselines

Let’s consider a parameterized probability distribution over a random variable, so:

This means we could add or subtract from the policy gradient expression any **baseline **function which only depends on the state without changing the resulting expectation:

In practice, the most common choice of the baseline is **the on-policy value function** ** V(s) **that is approximated by a neural network (sometimes called

**critic**), which is updated concurrently with the policy. In general,

**baseline**helps to reduce variance and increase the stability of the training process.

# Proximal Policy Optimization With Generalized Advantage Estimation

The vanilla policy gradient method has **quite a poor sample efficiency **because of the **high variance of gradient estimation**. It is caused by the difficulty of credit assignment to the actions which affected the future rewards, especially in environments with long trajectories. This variance is sometimes reduced by employing the **baseline as **mentioned above. But there is another way to estimate the rewards of a trajectory — **importance sampling.** In this method, the expected reward can be computed with a different policy and later refined by the **importance ratio r:**

So the objective function takes the following form, where the ** A** could be

**the discounted return as in the policy gradient method**or

**the advantage function**

**as in**

**GAE**

**(explained below)**:

If both policies are the same, then this formula directly translates into **the policy gradient**:

Actually, **the importance ratio** only depends on the policies, not the environment:

Another reason for the **low sample efficiency of the policy gradient** is that it requires collecting **new trajectories** of samples using **the new policy** to compute **the next gradient update**. Old samples collected using the old policy are not reusable. So, to overcome this problem, we could introduce a **surrogate objective function** optimization of which guarantees a policy improvement as long as **new and old policies are similar**. This error could be bounded by **the KL divergence** between these two policies. The added constraint helps us not to take over-optimistic actions that will hurt the training progress.

Considering these facts, the **TRPO**** **paper introduced a novel** **approach, which aimed to maximize the objective function while enforcing the **KL-distance** between old and new policies small enough (so-called **trust region constraint**):

The **TRPO** authors proposed a quite sophisticated method of approximately solving this problem, but OpenAI created a new way to limit the size of the policy update by introducing a **Clipped Surrogate Objective** function in their **Proximal Policy Optimization** paper:

The first term inside the minimization is the same as above, but in the second term, **the ratio is clipped between (1 **–** ε, 1+ε)**.

When applying PPO on the neural network with shared parameters for both policy **(actor)** and value** (critic)** functions, in addition to the **clipped surrogate**, the objective function is combined with an **error of the value estimation (the second term)** and an **entropy bonus**** (the third term)** to encourage sufficient exploration (with **c(1)** and **c(2)** as hyperparameters):

**The error of the value estimation** is just an MSE loss between actual returns and the ones estimated by the critic.

Also, because of the clipped surrogate objective function, **the PPO algorithm allows us to run multiple epochs of gradient ascent on the same samples** without harming our policy. **This approach dramatically increases sample efficiency.**

## Advantage Function

The last thing we need to figure out about the PPO algorithm is **how to calculate the advantage values**. Unlike the Policy Gradient method, the PPO algorithm implies the usage of two estimators — **actor and critic**. The second one — **the critic **— returns an estimation of the expected return for a given state. So, naturally, we want to use this estimation to calculate advantage —** a value that tells us whether the actions chosen by the model lead to a better or worse reward than we expected**:

We take an exponentially weighted average of the advantages to balance bias and variance. This is called **Generalized Advantage Estimation****:**

# Double Deep Q-Network

Another approach to solving reinforcement learning problems is to approximate **an optimal action-value function**, which gives us the maximum expected return if we start in the state ** s **and take some action

**:**

*a*This function obeys **the Bellman equation, **which tells us that** **if the optimal value** Q **for the state

**is known for all possible actions**

*s***, then the optimal strategy is to select the action**

*a***maximizing the reward plus the value of state you land next:**

*a*The simplest (and the oldest) algorithm for estimation optimal ** Q** values for all state-action pairs is

**Q-learning**. It assumes we have a large look-up table for all such pairs, and within one episode, estimation occurs as follows:

- Observe the current state
at time step*s**t.* - Select and perform an action
with the highest*a*value for the current state (or pick the random action sometimes —*Q***it’s called the ϵ-greedy approach**). - Observe the next state
and the reward*s’*at time step*r’**t+1.* - Update the
**Q-values according to the formula below. Here we estimate**out of the best*Q’***Q values**for the next state, but which actionleads to this maximal*a’***Q**is not quite important:

Memorization of all the ** Q** values for all state-action pairs is impractical. A much better way is to approximate

**values using a function. This is where the neural network comes into play.**

*Q*An algorithm called **Deep Q-Network** greatly improved training stability and reduced resource requirements by introducing two innovative approaches — **experience replay** and the **periodically updated target network**.

**Experience replay** mechanism uses a single replay memory of fixed size where the ** N** last

**tuples are stored. These samples are randomly drawn from the replay memory during training.**

*(s, a, r, s’)***It significantly increases sample efficiency and reduces correlations between sequences of observations.**

**A periodically updated target network **means keeping** **a separate** **cloned instance of the neural network whose weights are frozen and synced with the leading network only every ** C** step. This network is used to estimate target

**values during training.**

*Q***This mechanism reduces the impact of short-term fluctuations and thus stabilizes the training process.**

The loss function for the **Deep Q-Network **looks like this (where ** U(D)** is a uniform random sampling from the replay memory):

Unfortunately, **Q-Learning** (and the **DQN **based on it) algorithm has a major drawback — **it tends to significantly overestimate action values**. This happens because the **Q-Learning** algorithm uses the same set of samples to find the best action (with the highest expected reward) and to estimate the action value. So if the action’s value was overestimated and it was chosen as the best action, then the ** Q** value is overestimated too. If the overestimations over all the

**values are not uniform (this largely depends on the environment’s transition rules and size of the action space), we will spend more time exploring such non-optimal states, and the learning process will be slow. You can find a more formal explanation**

*Q***here**.

That’s why in 2010, **Hado van Hasselt** introduced a new way to estimate ** Q** values called

**Double Q-learning**.

**This method**

**uses two estimators**

**and**

*A***which are updated alternately. If we want to update estimator**

*B,***, then the**

*A***value for the next step is evaluated using estimator**

*Q***. This approach fixes the overestimation problem because one of the estimators might see samples that overestimate action**

*B***, while the other sees samples that overestimate action**

*a1*

*a2.*So in 2015, the same team published the paper with an updated version of the **DQN **algorithm called **Double Deep Q-Network**** **that has the following loss function:

As you can see here, the selection of the action in the **argmax **is decided by** the online network**, so this value estimation follows the greedy policy according to the current values, but we use **the target network** to evaluate the value of this policy. **We can update the target network** by periodically syncing its weights with the **online network** or **switching roles of these two networks. **This loss function will be used later in the practice section of this article.

# Practice

In the practice section of this article, we will use the first level from **the Super Mario Bros****. **game as an environment. By default, the single observation is a **240 x 256 **pixels RGB image, so we need to write a few **wrappers**** **to transform it to a grayscale image with a resolution of **84 x 84** pixels. Also, not all observations are useful, so that we will use **only every fourth observation** and **stack them together**:

Now we can create our environment and **set the random seeds to the fixed values to get reproducible results**. Also, for simplicity, **the action space was limited to two actions** — moving to the right and a combination of moving to the right and jumping:

In all the implementations below, I tried to use the same model architecture, optimizer, learning rate, and common hyperparameters such as gamma to directly compare the algorithms’ efficiency later.

## Policy Gradient

This implementation of **the Policy Gradient** algorithm directly follows the algorithm described in the previous section and uses the **PyTorch Distributions** package:

## Double Deep Q-Network

The **DDQN** implementation differs from the theory described above only in one detail — **SmoothL1Loss**** **is used instead of **MSE**** **for better results:

## Proximal Policy Optimization With Generalized Advantage Estimation

The **PPO+GAE** algorithm is implemented by following the theoretical explanation as close as possible with the introduction of random minibatches for training stability:

## Result

As you can see from** **the plots above, the relative sample efficiencies of the described algorithms follow their theoretical estimations.

**This project is also available ****on my GitHub****.**