Deep RL Debugging and Diagnostics

Taking a stab at the fine art of parameter tuning

Gili Karni
The Startup
Feb 4, 2021


Rubber duck debugging 2.0

If you are here, you have probably realized by now that developing a working reinforcement learning agent is not an easy task. In a recent project, struggling to debug my agent, I dodged hours of careless parameter tweaking and instead sat down and came up with a list of tangible debugging steps.

This post summarizes what I have learned while probing the web for reliable advice on methodically debugging deep reinforcement learning models. It presents a diagnostic toolkit that will, hopefully, help you as well.

Note that this post assumes a solid understanding of basic reinforcement learning theory.

Let’s not. [source: XKCD]

Deep RL models are notoriously difficult to debug because they consist of many closely interconnected systems. Regardless of where an issue originates, it is likely to propagate quickly and affect all other modules. Organizing deep RL models into tractable components is a great first step in the debugging process.

So let’s break it down and take a proper look at a deep RL model’s components individually. I generally recommend testing the modules independently (either via unit tests or isolated, structured development). Using ‘assert’ statements to verify expected dimensions, ranges, etc., can also support this process, as in the sketch below.
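
For instance, a handful of assert statements at the boundary between the agent and the environment can catch shape and range bugs early. A minimal sketch (obs_dim and n_actions are placeholders for your own setup):

```python
import numpy as np


def check_transition(obs, action, reward, done, obs_dim, n_actions):
    """Lightweight sanity checks on a single environment transition."""
    obs = np.asarray(obs)
    assert obs.shape == (obs_dim,), f"unexpected observation shape {obs.shape}"
    assert np.all(np.isfinite(obs)), "observation contains NaN or inf"
    assert 0 <= action < n_actions, f"action {action} is outside the action space"
    assert np.isfinite(reward), "reward is NaN or inf"
    assert isinstance(done, (bool, np.bool_)), "done flag should be boolean"
```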

In this post, I chose to focus on the following six topics. Below, you will find a section dedicated to each of them.

  1. Make sure the environment poses a viable reinforcement learning problem.
  2. Standardize your data.
  3. Debug the optimization process.
  4. Use metrics to examine the model’s performance.
  5. Be mindful of parameter tuning.
  6. Ensure your model’s stability and robustness.

1. Environment Design

When approaching a new task or developing a customized learning environment, one must ensure the task is compatible with reinforcement learning. In other words, you should be able to formulate it as a Markov decision process (in practice, usually a POMDP).

Try answering these questions to get an intuition about whether your task is Markovian.

  • Can the task occasionally be solved by a random policy? If so, an agent is likely to succeed: its optimization core will find and reinforce that good behavior.
  • Are the observations given to your agent clear enough so that you generally understand them? Could there be another way to represent them?

Try to simplify the environment: reducing the task while capturing its essence allows you to eliminate potentially problematic moving parts.

  • Approach 1: Simplify the input feature space. Provide exact measurements rather than letting your agent approximate them. This lets the agent focus on learning the task itself; you can make things harder later by also asking it to interpret raw observations.
  • Approach 2: Simplify the reward function. (1) Design dense, continuous feedback so the agent receives rewards quickly; if the reward is too sparse, the agent may fail to encounter it at all, making it harder to inspect the agent’s learning. (2) If the reward is stochastic, start with a deterministic one, which provides a less confusing learning environment. A wrapper sketch illustrating this follows the list.
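
As an illustration of the second approach, a thin wrapper can temporarily swap a sparse or stochastic reward for a dense, deterministic one while you debug. A minimal sketch, assuming the classic Gym 4-tuple step API and a hypothetical distance_to_goal attribute on your environment:

```python
class DenseRewardWrapper:
    """Debugging-only wrapper that substitutes a dense, deterministic reward.

    It lets the agent receive feedback on every step instead of waiting for a
    rare terminal reward. Assumes a Gym-style env with a 4-tuple step API and
    a (hypothetical) distance_to_goal attribute.
    """

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        reward = -self.env.distance_to_goal  # dense, deterministic feedback
        return obs, reward, done, info
```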

2. Data Standardization

Unstandardized data may impair the model’s stability and can dramatically hinder learning. You can easily inspect the data’s statistics by plotting a histogram of it.

  • Observations- A good rule of thumb is to z-transform the observations, which keeps outliers in check. It is crucial to standardize consistently over the whole dataset; otherwise, you effectively change the model’s objective and introduce more instability. A further step you can take is to clip the standardized observations; however, be careful of clipping too aggressively, which tends to slow down learning. A sketch of this follows the list.
  • Rewards- Scale, but do not shift. The agent’s will to live depends on the reward mean; thus, shifting it may interfere with learning.
  • Value function- Standardizing the model’s prediction targets can further support learning stability (though it may be harder to do consistently over the entire dataset).
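
A minimal sketch of these ideas: a running z-score normalizer with clipping for observations, plus scale-only normalization for rewards (the clipping threshold and the reward-std estimate are placeholders for whatever your pipeline tracks):

```python
import numpy as np


class RunningNormalizer:
    """Tracks running mean/variance and z-transforms observations, with clipping."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip
        self.eps = eps

    def update(self, batch):
        """Incrementally update the statistics from a batch of shape (n, *shape)."""
        batch_mean, batch_var, n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        z = (x - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(z, -self.clip, self.clip)  # clip, but not too aggressively


def scale_rewards(rewards, return_std, eps=1e-8):
    """Scale rewards by a (running) return std, but do not shift the mean."""
    return np.asarray(rewards) / (return_std + eps)
```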

3. Optimization Design

Deep reinforcement learning algorithms often represent the policy (or other learned control functions) as a neural network. Thus, debugging the neural network is essential in optimizing the agent’s control.

If you have picked up on the theme here, you know that my first piece of advice will be: simplify. Minimize the number of layers, and avoid fancy activations. Add them later, as you see fit.

Now, let’s debug.

First, and most importantly, double-check the neural network’s architecture: inspect the shape of your input and output and the activation of your output layer. Different action-selection strategies require different activations; ensure your network outputs the correct statistics.
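
For a discrete action space, for instance, you would typically expect the policy head to output one value per action and turn them into a probability distribution. A minimal PyTorch sketch of such a check (the layer sizes and dimensions are placeholders):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # placeholders for your own environment

policy = nn.Sequential(
    nn.Linear(obs_dim, 32),
    nn.ReLU(),
    nn.Linear(32, n_actions),
    nn.Softmax(dim=-1),  # discrete actions: the output is a probability distribution
)

dummy_obs = torch.zeros(1, obs_dim)
probs = policy(dummy_obs)
assert probs.shape == (1, n_actions), "output shape must match the action space"
assert torch.allclose(probs.sum(dim=-1), torch.ones(1)), "probabilities must sum to 1"
```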

Second, review the loss function. The loss is the model’s primary way of evaluating its performance and, in turn, drives the parameter updates. Ensure that the loss function is appropriate for the environment, compatible with the action-selection method, and matches the neural network’s output.
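
For instance, if actions are sampled from a categorical policy, a REINFORCE-style loss should be built from the log-probabilities of that same distribution. A sketch of this consistency, not tied to any particular algorithm:

```python
import torch
from torch.distributions import Categorical


def policy_gradient_loss(logits, actions, returns):
    """REINFORCE-style loss: -mean(log pi(a|s) * return).

    Categorical(logits=...) expects raw, pre-softmax outputs; if your network
    already ends in a softmax, pass probs=... instead. Keeping the loss and
    the network output consistent is exactly the point of this check.
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    return -(log_probs * returns).mean()
```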

Third, inspect the optimizer. Frankly, any stable, relatively new optimizer should do fine. But, even a good optimizer could underperform if either the learning rate or the batch size is not tuned properly.

  • A hint that your learning rate is too low is rapidly shrinking gradient values; this can result in slow (or no) convergence, or in getting stuck in local minima. A learning rate that is too large, on the other hand, can cause divergence and yield an unstable learning curve.
  • A common mistake is using too small a batch size. A small batch may converge quickly, but at the cost of noisy gradient estimates, resulting in optimization difficulties. Too large a batch size, on the other hand, tends to prefer sharp minima and results in low robustness.

Lastly, if the neural network is not learning, there could be underlying issues with its structure. A few common causes are listed below:

  • Dying ReLUs: a ReLU neuron might get stuck on the negative side and always output 0. Due to ReLU’s shape, once a neuron goes negative it is unlikely to recover. This problem is likely to occur when the learning rate is too high or when the neuron has a large negative bias.
  • Vanishing or exploding gradients: very large or very small gradient updates can indicate a learning problem. Besides inspecting the gradient values themselves, there are a few possible indicators: (1) NaN loss values or (2) extreme weight values (i.e., very close to 0 or very large). Gradient clipping may help with exploding gradients, while reducing the number of layers can help with vanishing gradients.
  • Vanishing or exploding activations: a good standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside this range may indicate vanishing or exploding activations, which in turn may cause problems with the gradients. Try layer or batch normalization to keep the activations’ distribution under control. The sketch below shows one way to monitor these quantities.
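
A small helper that reports the global gradient norm, the fraction of dead ReLU units, and the activation spread after each backward pass can make these failure modes visible early. A PyTorch sketch, assuming you capture a hidden layer's post-ReLU activations (e.g., with a forward hook):

```python
import torch


def training_health_report(model, activations):
    """Simple diagnostics for gradients and activations."""
    grads = [p.grad.detach() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) if grads else torch.tensor(0.0)
    # For exploding gradients, torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # can be applied right after the backward pass.
    return {
        "grad_norm": grad_norm.item(),
        "dead_relu_fraction": (activations <= 0).float().mean().item(),
        "activation_std": activations.std().item(),  # roughly 0.5-2.0 is healthy
    }
```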

4. Useful Metrics

We tend to repeatedly inspect our implementation (which is great and important) BUT forget to look at the data itself. Examining the agent’s behavior and its evolution over time can provide insight into the learning dynamics and help identify issues.

  • The episode value can be measured via the episode return, but also via the episode length or the solve rate. If you have yet to see a change in rewards, a longer life or a faster solution can hint at improvement. Inspect the mean and standard deviation, alongside the minimum and maximum values, for a better picture of the extreme cases.
  • The loss value can behave differently than you would expect from classic supervised learning. Since the data distribution depends on the policy and changes with training, the loss does not have to decrease monotonically for training to proceed. In other words, take the loss curve with a grain of salt; if the environment is non-stationary, the loss may even increase over time.
  • In policy-based methods, the action-space entropy conveys the model’s exploration tendency (i.e., how random the policy is). Read more about entropy here. If the entropy decreases too fast, the policy is becoming deterministic and will cease to explore; if it does not decrease at all, the policy remains random. A KL penalty or an entropy bonus can help steer it.
  • Another policy-specific metric is the KL divergence between consecutive action distributions, which measures the size of the policy update. Read more about KL divergence here. Spikes in the KL divergence, or generally large values, reflect a big policy update, which often comes with a drop in performance. Too high a learning rate is one common cause of large policy updates. The sketch below computes these metrics from a batch of rollouts.
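
A sketch of how these quantities might be computed from a batch of rollouts for a discrete-action policy; all names are illustrative, and the logits are assumed to come from the policy before and after the latest update:

```python
import numpy as np
import torch
from torch.distributions import Categorical


def rollout_metrics(episode_returns, episode_lengths, old_logits, new_logits):
    """Per-iteration diagnostics: return statistics, entropy, and KL divergence."""
    old_dist, new_dist = Categorical(logits=old_logits), Categorical(logits=new_logits)
    return {
        "return_mean": float(np.mean(episode_returns)),
        "return_std": float(np.std(episode_returns)),
        "return_min": float(np.min(episode_returns)),
        "return_max": float(np.max(episode_returns)),
        "episode_len_mean": float(np.mean(episode_lengths)),
        "entropy": new_dist.entropy().mean().item(),        # how random the policy is
        "kl_old_new": torch.distributions.kl_divergence(    # how big the update was
            old_dist, new_dist).mean().item(),
    }
```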

5. General Guidelines for Parameter Tuning

  • The model’s discount factor determines its effective horizon, i.e., how far rewards are propagated back when assigning credit. A good thought experiment is to think about that horizon: should the agent be biased toward short-term gains, or should it weigh long-term returns as well? The ‘correct’ answer depends on the task and the method.
  • The time discretization determines how far the agent’s (roughly Brownian) exploration can reach. Your agent does not act in truly continuous time, so the action frequency you choose relative to the environment’s time scale affects the agent’s exploration and ‘reaction time.’
  • If using an (epsilon-)greedy policy, try introducing annealing: maximize exploration initially and shrink it over time.
  • The balance between the learning rate and the batch size impacts the optimizer’s convergence rate and stability. Consider annealing the learning rate (see more here) or the batch size; a simple linear schedule is sketched after this list.
  • If using a replay memory, the buffer may need to be larger than you expect, especially in value-based methods.
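
As an example for the exploration and learning-rate bullets above, a simple linear schedule can anneal epsilon (or, with the same shape, the learning rate) over training; the start and end values below are placeholders:

```python
def linear_anneal(step, total_steps, start=1.0, end=0.05):
    """Linearly anneal a value (e.g. epsilon or the learning rate) over training."""
    fraction = min(step / total_steps, 1.0)
    return start + fraction * (end - start)


# Example: epsilon-greedy exploration shrinking over one million steps.
print([linear_anneal(t, total_steps=1_000_000) for t in (0, 500_000, 1_000_000)])
# -> [1.0, 0.525, 0.05]
```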

6. Model Robustness and Generalizability

Whether developing a new algorithm or implementing a common one, there are a few helpful tips that can get you going:

1. Simplify the problem your new algorithm is facing by using a low-dimensional state space environment.

  • Why? Most importantly, it is faster, so you can perform a quick hyperparameter search. Additionally, it is easier to diagnose the learning dynamics when the environment is well understood and easy to unravel; the Pendulum problem, for example, has a 2D state space.

2. Construct your own toy environment to allow a thorough examination of your algorithm in a customized setting; a minimal example is sketched after this list.

  • When building your own toy environment, think of extreme cases and situations where you can easily predict the behavior.
  • Be careful not to overfit! Remember, you are not optimizing your model to the toy environment — rather using it to test and diagnose.

3. Explore the sensitivity of the parameters. If the algorithm is sensitive to small changes in the parameters, you may have gotten lucky, but your model is not robust and is not likely to generalize across problems.

4. A good ‘health’ indicator of the model is its gradient and value function performance (see above) rather than the final return.

5. Use multiple random seeds to avoid overfitting or leaning on noise.
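
As promised in point 2 above, here is a minimal sketch of such a toy environment: a one-dimensional corridor whose optimal policy and return are known in advance, so any deviation is easy to spot. It follows a Gym-style reset/step convention but does not depend on any library:

```python
import numpy as np


class CorridorEnv:
    """Hypothetical toy environment: walk right along a line of length n.

    Action 1 moves right, action 0 moves left; the episode ends at the right
    end. The optimal return is known in advance, which makes diagnosis easy.
    """

    def __init__(self, n=10):
        self.n = n
        self.pos = 0

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos = min(max(self.pos + (1 if action == 1 else -1), 0), self.n)
        done = self.pos == self.n
        reward = 1.0 if done else -0.01  # small step penalty, terminal bonus
        return np.array([self.pos], dtype=np.float32), reward, done, {}
```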

Last but not least, automate your experiments. Yes, it is worth your time.
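
A minimal sketch of such automation, combining the last two points: run the same (hypothetical) train_fn across several random seeds and report aggregate statistics, so that no single lucky seed drives your conclusions:

```python
import numpy as np


def run_experiment(train_fn, seeds=(0, 1, 2, 3, 4)):
    """Run the same training function across several random seeds.

    train_fn(seed) is assumed to be your own training entry point, returning
    the final (or best) evaluation return for that run.
    """
    returns = [train_fn(seed) for seed in seeds]
    print(f"mean return {np.mean(returns):.2f} +/- {np.std(returns):.2f} "
          f"over {len(seeds)} seeds")
    return returns
```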

I hope this post helps you start building a sound toolkit for debugging deep RL. To summarize, here is a quick recap:

  1. Ensure you frame the task as a reinforcement learning problem.
  2. Standardize your data.
  3. Focus on the optimizer.
  4. Use adequate metrics to inspect the agent’s behavior.
  5. Be intentional with parameter tuning.
  6. Pay attention to the model’s stability and robustness.

Found this post useful? Think it’s missing something? Comment below with your feedback or questions :)
