Test-Driven Reinforcement Learning Development — Deep Deterministic Policy Gradient

Will Guedes
7 min read · Jun 10, 2018


After reading the post’s title you’re probably asking yourself: WTF?
This post is about one thing: Implementing a Deep Deterministic Policy Gradient (DDPG) algorithm. What makes it different from other blog posts and tutorials is how we’re going to approach the implementation.

Before we start
This post is a work in progress. I decided to publish it before it was complete in the hope of getting feedback that will help me reshape its structure. Please let me know how I can change things up to make this post better.

Most RL tutorials focus strictly on the algorithm — and they’re great! They’re concise, with just one or two files and few lines of code. I’d like to be less concise in this post. The algorithm will “feel” secondary. Yet, I hope you’ll end up with a good intuition around DDPG when you’re done reading. Apologies in advance, it’s going to be a long post.

If at any point the code snippets stop making sense, you can fall back to the full source code on GitHub.

Reinforcement Learning

I’m going to assume we have just a little bit of prior knowledge about RL. If that’s not the case, there’s a truly phenomenal tutorial by @awjuliani that should give you enough background.

The Deep Deterministic Policy Gradient (DDPG) Algorithm

DDPG is a policy gradient algorithm that typically leverages neural networks that serve as function approximators to help estimate the best action for a given state (the actor) and how good an action-state pair is (the critic). Think of the actor as a soccer player making a decision to kick or pass the ball while the critic is the coach, which will later tell the player how good the decision to kick (or pass) was so the player can choose more wisely next time.

As I mentioned earlier, our focus is not necessarily to implement DDPG. Implementing the algorithm will serve as a platform to show what I believe to be good software engineering practices when developing maintainable infrastructure. For a tutorial more focused on the core DDPG algorithm, please see this awesome write-up by Patrick Emami.

The diagram below describes what happens in DDPG. Let’s ignore all the tiny words and focus on the core components of the algorithm: The environment, actor, critic, replay buffer, target actor and target critic.

We’re going to keep everything very high level. No need to talk about function approximators, neural networks, replay buffers, TensorFlow, etc. We’re going to talk about “thingys” (plural for thing.)

Environment
A thingy that allows you to use an action to step from the current state into the next state.

Actor
A thingy that given a state, predicts the best action to take.

Critic
A thingy that given a state and the predicted action, returns how “good” that action is.

The Algorithm (“Agent”)
1) The environment gives the agent its current state.
2) The agent asks the actor what action it should take given the current state.
3) The agent uses that action to interact with the environment. This interaction yields a reward, the next state, and whether the next state is terminal.
4) The agent uses the reward to train the critic.
5) The trained critic can tell the actor how to select even better actions in the future. This is done by looking at how the “goodness” value returned by the critic changes as we fiddle (increase/decrease) with the action (the critic’s gradient).
6) asse

Believe it or not, we have enough information to implement the “pipes” of DDPG! So let’s start by implementing the agent. If you’re going to code along, you can set up the required dependencies by running: pip install gym pytest mock in your virtualenv.

The agent does exactly what we’ve described under the “The Algorithm” section above. To make it easier to follow, the comments in the code are exactly what I wrote in that section.
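As a rough sketch of those steps, an agent along these lines would do. Only the call self._critic.train(self._state, action, reward) comes from this post. The method names predict and gradients, the actor’s train signature, and the classic Gym return value of step, (next_state, reward, done, info), are assumptions made for illustration.

```python
class Agent(object):
    """Wires the environment, actor and critic together (sketch only)."""

    def __init__(self, env, actor, critic):
        self._env = env
        self._actor = actor
        self._critic = critic
        # 1) The environment gives the agent its current state.
        self._state = self._env.reset()

    def run(self):
        # 2) The agent asks the actor what action it should take
        #    given the current state. (`predict` is an assumed name.)
        action = self._actor.predict(self._state)

        # 3) The agent uses that action to interact with the environment.
        #    Assumes the classic 4-tuple Gym API.
        next_state, reward, done, _ = self._env.step(action)

        # 4) The agent uses the reward to train the critic.
        self._critic.train(self._state, action, reward)

        # 5) The trained critic tells the actor how to select better
        #    actions. (`gradients` is an assumed name.)
        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)

        # Move on to the next state.
        self._state = next_state
```

Nothing here knows about neural networks or TensorFlow: any three objects with these methods will do, which is what makes the agent easy to test later.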

As you may have already noticed, the implementation of the agent above is quite incomplete. In the next few sections, we’re going to make the modifications necessary to build a working DDPG agent.

** We have now reached what I really want to talk about in this post! **

Anytime you modify code, there’s a very good chance you’ll introduce bugs. Bugs are especially difficult to track down when building machine learning models. Here’s how it usually goes:

While sipping your coffee, you implement an RL algorithm by following someone else’s public GitHub work. You’d like to really learn, so you avoid the copyin’ and pastin’. When you try to run the code for the first time, you run into syntax/interpreter/compilation errors. These are super easy to fix: you just follow what the error message says. After the 100th attempt, your code runs! You now patiently wait 5 minutes, 10 minutes, 15 minutes for your model to start yielding reasonable rewards. It doesn’t. You can now do one of three things: 1) Cry (I’ve tried, and it doesn’t make your model converge any faster). 2) Clone the reference repo, run it, see the model successfully converge, and then go back to crying. 3) Start over, but this time go much more slowly and use split-screen: your code on the left, the reference code on the right.

I believe we can reduce how often the situation I described above occurs by writing tests alongside our model implementation.

So far we’ve written a basic (incomplete) agent. Let’s write some tests to make sure our code behaves properly as we make modifications to the agent.
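Here is a minimal sketch of what such tests could look like, written as pytest-style functions. It uses unittest.mock from the standard library (which offers the same API as the mock package) to stand in for the environment, actor and critic. The compact Agent class is only a placeholder so the snippet runs on its own; in a real project you would import it from agent.py. The method names predict and gradients, and the string values used as states and actions, are assumptions.

```python
from unittest import mock  # stdlib counterpart of the `mock` package


class Agent(object):
    """Compact stand-in for the agent under test (assumed interface)."""

    def __init__(self, env, actor, critic):
        self._env, self._actor, self._critic = env, actor, critic
        self._state = env.reset()

    def run(self):
        action = self._actor.predict(self._state)
        next_state, reward, done, _ = self._env.step(action)
        self._critic.train(self._state, action, reward)
        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)
        self._state = next_state


def make_agent():
    """Build an agent whose collaborators are all mocks."""
    env, actor, critic = mock.Mock(), mock.Mock(), mock.Mock()
    env.reset.return_value = "state_0"
    env.step.return_value = ("state_1", 1.0, False, {})
    actor.predict.return_value = "action_0"
    critic.gradients.return_value = "grads_0"
    return Agent(env, actor, critic), env, actor, critic


def test_run_steps_environment_with_predicted_action():
    agent, env, actor, _ = make_agent()
    agent.run()
    actor.predict.assert_called_once_with("state_0")
    env.step.assert_called_once_with("action_0")


def test_run_trains_critic_and_actor():
    agent, _, actor, critic = make_agent()
    agent.run()
    critic.train.assert_called_once_with("state_0", "action_0", 1.0)
    actor.train.assert_called_once_with("state_0", "grads_0")
```

Because every collaborator is a mock, these tests run in milliseconds and check the wiring only: who gets called, and with what. They say nothing about whether a model learns.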

We’ve actually left out many things that make DDPG truly great: the replay buffer, offline target networks, hyper-parameters such as gamma, etc. However, with the pipes we’ve just created, adding these new components won’t be too difficult, and we have tests in place to make sure our changes don’t break things! Let’s add the remaining components by first writing tests that fail and then making them pass by properly implementing DDPG.

Episodes
The run method of our agent interacts with the environment only once. RL agents need to interact with the environment thousands or millions of times. These interactions happen in episodes. Each episode has a max length: the maximum number of times the agent can interact with the environment during that episode. One episode (many interactions) is also not enough; we need many episodes. The code below shows how we modify our current agent (and test) to support this episodic behavior.

Add the following test to agent_test.py

If you run agent_test.py, it will fail. We have yet to implement support for the episodes and episode_max_length arguments (see code below).

Note that in the code below, we’ve renamed the run method to _run_helper and created a new run method which calls _run_helper.
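A minimal sketch of that change, under a few assumptions: the episodes and episode_max_length arguments and the run/_run_helper split come from this post, while resetting the environment at the start of each episode, the default argument values, and the predict and gradients method names are my guesses for illustration.

```python
from unittest import mock  # only needed for the stand-in demo at the bottom


class Agent(object):
    def __init__(self, env, actor, critic):
        self._env = env
        self._actor = actor
        self._critic = critic
        self._state = None

    def run(self, episodes=1, episode_max_length=1):
        for _ in range(episodes):
            # Assumption: every episode starts from a fresh state.
            self._state = self._env.reset()
            for _ in range(episode_max_length):
                self._run_helper()

    def _run_helper(self):
        """A single interaction: what `run` used to do."""
        action = self._actor.predict(self._state)
        next_state, reward, done, _ = self._env.step(action)
        self._critic.train(self._state, action, reward)
        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)
        self._state = next_state


# Stand-in objects are enough to check the loop structure.
env, actor, critic = mock.Mock(), mock.Mock(), mock.Mock()
env.step.return_value = ("next_state", 1.0, False, {})
agent = Agent(env, actor, critic)
agent.run(episodes=3, episode_max_length=5)
print(env.step.call_count)  # 3 episodes x 5 interactions = 15
```

Counting calls to env.step is a cheap way for a test to pin down the episodic behavior without caring what the actor or critic do.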

If you re-run agent_test.py, it should succeed.

There’s a problem though. Right now, if our agent interacts with the environment and the environment returns a terminal state in the middle of an episode, the agent will ignore it and continue interacting as normal. Let’s modify the code (and test) to account for that scenario.

The call to _run_helper now returns whether the new state is terminal. If it is, we stop the current episode and move on to the next one.
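One way this could look is sketched below. The early exit on a terminal state follows the description above; the environment reset per episode, the predict and gradients method names, and the 4-tuple returned by step are assumptions.

```python
from unittest import mock  # only needed for the stand-in demo at the bottom


class Agent(object):
    def __init__(self, env, actor, critic):
        self._env = env
        self._actor = actor
        self._critic = critic
        self._state = None

    def run(self, episodes=1, episode_max_length=1):
        for _ in range(episodes):
            self._state = self._env.reset()
            for _ in range(episode_max_length):
                # Stop this episode as soon as a terminal state is reached.
                if self._run_helper():
                    break

    def _run_helper(self):
        """One interaction; returns True if the new state is terminal."""
        action = self._actor.predict(self._state)
        next_state, reward, done, _ = self._env.step(action)
        self._critic.train(self._state, action, reward)
        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)
        self._state = next_state
        return done


# Stand-in environment whose every second step is terminal.
env, actor, critic = mock.Mock(), mock.Mock(), mock.Mock()
env.step.side_effect = [
    ("s1", 0.0, False, {}), ("s2", 1.0, True, {}),  # episode 1
    ("s1", 0.0, False, {}), ("s2", 1.0, True, {}),  # episode 2
]
agent = Agent(env, actor, critic)
agent.run(episodes=2, episode_max_length=5)
print(env.step.call_count)  # 4, not 2 x 5 = 10
```

If the agent ignored the terminal flag, the stand-in environment would run out of scripted steps and the demo would raise, so the same setup doubles as a regression test.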

Now the agent no longer interacts with the environment during an episode when a terminal state is reached.

Target Actor & Target Critic
So far we’ve been using the reward given by the environment to directly train the critic: self._critic.train(self._state, action, reward).

This is flawed. What we really want to do is combine the reward with an estimate of how good it is to be in the next state (the Q value of the next state): self._critic.train(self._state, action, reward + next_state_goodness). How do we estimate how good it is to be in the next state? First, we ask the actor for the best action given the next state. Then, we ask the critic how “good” it estimates that action to be in the next state.

Below we show how we’d change the _run_helper method to use reward + next_state_q_value to train the critic (don’t use the code below):
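A sketch of that naive version, for illustration only: it queries the same actor and critic for both the current and the next state. The name next_state_q_value comes from this post; the critic’s predict method and the other method names are assumptions. It also mirrors the text in adding the raw next-state estimate to the reward, whereas full DDPG discounts that estimate by gamma and drops it for terminal states.

```python
from unittest import mock  # only needed for the stand-in demo at the bottom


class Agent(object):
    def __init__(self, env, actor, critic):
        self._env = env
        self._actor = actor
        self._critic = critic
        self._state = env.reset()

    def _run_helper(self):
        action = self._actor.predict(self._state)
        next_state, reward, done, _ = self._env.step(action)

        # Estimate how good the next state is using the SAME actor and
        # critic that we are training (`critic.predict` is an assumed name).
        next_action = self._actor.predict(next_state)
        next_state_q_value = self._critic.predict(next_state, next_action)
        self._critic.train(self._state, action, reward + next_state_q_value)

        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)
        self._state = next_state
        return done


# Stand-in objects: the critic thinks every next state is worth 0.5.
env, actor, critic = mock.Mock(), mock.Mock(), mock.Mock()
env.reset.return_value = "s0"
env.step.return_value = ("s1", 1.0, False, {})
actor.predict.side_effect = lambda state: "action_for_" + state
critic.predict.return_value = 0.5
agent = Agent(env, actor, critic)
agent._run_helper()
critic.train.assert_called_once_with("s0", "action_for_s0", 1.5)
```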

However, some very smart folks figured out that using the same actor and critic to estimate the best action and Q value for both the current state and the next state doesn’t work very well. To make predictions about the next state, we want to use an offline target network for the actor (the same applies to the critic network). You can find out more about why that’s the case by reading this amazing paper: Continuous Control with Deep Reinforcement Learning. For now, let’s just believe that doing so increases the algorithm’s stability and chances of converging.

Below we update our code (and test) to use these “target” actor and critic objects.

As usual, the test will fail if you run it before modifying agent.py as indicated below.
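As a hedged sketch, the agent could take the two target objects as extra constructor arguments and use them only for predictions about the next state. The argument names target_actor and target_critic, and the predict and gradients methods, are assumptions for illustration. How the targets are kept in sync with the main actor and critic is deliberately left out here.

```python
from unittest import mock  # only needed for the stand-in demo at the bottom


class Agent(object):
    def __init__(self, env, actor, critic, target_actor, target_critic):
        self._env = env
        self._actor = actor
        self._critic = critic
        self._target_actor = target_actor
        self._target_critic = target_critic
        self._state = env.reset()

    def _run_helper(self):
        # The main actor still picks the action for the current state.
        action = self._actor.predict(self._state)
        next_state, reward, done, _ = self._env.step(action)

        # Only the target objects make predictions about the next state.
        next_action = self._target_actor.predict(next_state)
        next_state_q_value = self._target_critic.predict(
            next_state, next_action)
        self._critic.train(self._state, action, reward + next_state_q_value)

        gradients = self._critic.gradients(self._state, action)
        self._actor.train(self._state, gradients)
        self._state = next_state
        return done


# Stand-in objects make it easy to check who was asked what.
env, actor, critic = mock.Mock(), mock.Mock(), mock.Mock()
target_actor, target_critic = mock.Mock(), mock.Mock()
env.reset.return_value = "s0"
env.step.return_value = ("s1", 1.0, False, {})
actor.predict.return_value = "a0"
target_actor.predict.return_value = "a1"
target_critic.predict.return_value = 0.5
agent = Agent(env, actor, critic, target_actor, target_critic)
agent._run_helper()
target_critic.predict.assert_called_once_with("s1", "a1")
critic.train.assert_called_once_with("s0", "a0", 1.5)
critic.predict.assert_not_called()  # the main critic never scores s1
```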

Recap
If I have not bored you out of your mind by now, you should have the following two files:

We’re close to a working DDPG algorithm. So far I’ve omitted important pieces from the code, such as the replay buffer, the hyperparameters, and the actor and critic implementations themselves! Integrating the missing components will be quite straightforward! We’ll do this in the next post!
