🤯AI: Deep Reinforcement Learning

Zoiner Tejada
9 min read · Mar 17, 2023


A rewarding journey.

[This article is a part of the 🤯AI series]

Let's cut to the chase. What is Reinforcement Learning and what makes it Deep?

Reinforcement Learning is concerned with solving sequential decision-making problems. Deep Learning techniques are used with Reinforcement Learning because the deep neural networks of deep learning are excellent at approximating functions, and reinforcement learning is pretty much all about approximating an objective function (stand by for more on that). The other reason deep learning is mind-blowing in its support of reinforcement learning is that with deep learning, the neural network model does the feature learning as a part of the training process (an expert human doesn't need to specify the features manually, as is typically needed in traditional machine learning).

Comparison of feature learning between Machine Learning and Deep Learning

We have an objective (or more literally an objective function, but we'll come back to that) when solving these problems. We take actions and get feedback from the world about how close we are to achieving the objective. Reaching the goal involves taking many actions in sequence, each action changing the world around us. We observe these changes in the world as well as the feedback we receive before deciding on the next action to take in response.

More formally, we have an Agent that observes the state of the Environment and uses a policy that evaluates the current state to select an action. This action is taken against the Environment, which responds by transitioning to the next state and possibly providing some reward.

I like to think of these agents as glorified cookie monsters, who will do everything they can to eat as many cookies as possible: that is their objective. The cookies are their reward.
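To make that interaction loop concrete, here is a minimal, self-contained sketch in Python. Everything in it (the CookieEnvironment, its reward scheme, the random policy) is invented for illustration and is not from any library:

```python
import random

class CookieEnvironment:
    """A toy environment: cookies sit at positions 0 and 4; the agent starts at position 2.
    Stepping onto a cookie position eats the cookie (+1 reward). The episode lasts 10 steps."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.position = 2
        self.cookies = {0, 4}            # where the cookies are
        self.steps_left = 10
        return self.position             # the observable state

    def step(self, action):
        self.position = max(0, min(4, self.position + (-1 if action == 0 else 1)))
        reward = 1 if self.position in self.cookies else 0
        self.cookies.discard(self.position)
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.position, reward, done

def policy(state):
    """A (terrible) policy: pick left (0) or right (1) at random, ignoring the state."""
    return random.choice([0, 1])

env = CookieEnvironment()
state = env.reset()
done, total_reward = False, 0
while not done:
    action = policy(state)                    # the policy maps state -> action
    state, reward, done = env.step(action)    # the environment transitions and rewards
    total_reward += reward
print("cookies eaten:", total_reward)
```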

Some example problems that can be framed this way:

  • Playing video games
  • Robotics, Autonomous Systems, Driving a Car, Balancing a Pole
  • Algorithmic Trading

Deep Reinforcement Learning is a topic almost as wide as it is deep. Before we dive in, let's cover some interesting results as motivation.

Can you write a program to master the Atari game Breakout?

Screenshot from Breakout

Mnih et al. in 2013 published the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

The method was applied to playing seven Atari 2600 games in the Arcade Learning Environment without any adjustment to the model's architecture between games. The results outperformed all previous approaches on six of the games, and surpassed expert human players on three.

🤯 2013 Paper: "Playing Atari with Deep Reinforcement Learning"

Screenshots from the 2013 paper Playing Atari with Deep Reinforcement Learning

By 2016, most Atari 2600 games had been mastered, except for one: Montezuma's Revenge. The trouble with this game was its brutality: pretty much everything killed you, and quickly, and you only got a reward for completing a level (which was complex on its own). An agent trying to get by on trial and error gets almost no reward signals.

The insight was to enhance the agent with two new signals that gave it intrinsic motivation to explore: novelty (have I seen this before?) and surprise (predict what you expect to see, then compare it to what is actually seen).

🤯 2016 Paper: "Unifying Count-Based Exploration and Intrinsic Motivation"

Illustration from the 2016 paper Unifying Count-Based Exploration and Intrinsic Motivation

Deep Reinforcement Learning was flexible enough to master Atari games over six years ago, and the pace of innovation has only accelerated.

ā€œHELLO WORLDā€ Deep Reinforcement Learning Style

The Hello World scenario of DRL is CartPole. Thousands of papers have been written using it.

Basically, you can visualize the scenario like this: you have a cart that moves on a track, and a pole balanced on the cart. Keep the pole upright for as long as you can. Oh, and you can only do so by forcefully nudging the cart left or right.

Screenshot of CartPole in action

Let's break this down a little further.

  • Objective:
    Keep the pole upright for 500 time steps
  • State:
    [Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]
  • Action:
    0 to move cart fixed distance left OR
    1 to move cart fixed distance right
  • Reward:
    +1 for each time step pole remains upright
  • Termination:
    When pole falls over (> 12° from vertical) OR
    Cart moves off screen OR
    Max time step of 500 reached

Keep this scenario in mind as we proceed (and as you proceed in learning about DRL; it is a surprisingly useful scenario to ground your understanding in).
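To see those pieces in code, here is a minimal sketch of CartPole using the gymnasium package (an assumption on my part; it is the maintained successor to OpenAI Gym and what Stable Baselines 3, used later in this article, builds on). The "agent" here just takes random actions:

```python
import gymnasium as gym  # assumption: the gymnasium package is installed

env = gym.make("CartPole-v1")
print(env.observation_space)  # 4 numbers: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push cart left, 1 = push cart right

obs, info = env.reset(seed=0)
total_reward, done = 0, False
while not done:
    action = env.action_space.sample()                     # random nudges, no learning yet
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                 # +1 for each step the pole stays up
    done = terminated or truncated                         # pole fell / cart left the track, or 500 steps reached
print("episode return:", total_reward)                     # random play usually only lasts a couple dozen steps
env.close()
```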

How do RL Agents Learn?

The agent learns to pick good actions by interacting with the environment in a process of trial and error. The agent uses the reward signals it receives to reinforce good actions.

What are these signals?

  • The signals exchanged are (state, action, reward) often written as (s, a, r)
  • Each (s, a, r) tuple for a single time step is called an experience
  • The time horizon from start to finish is called an episode
  • The sequence of experiences within an episode is called a trajectory (see the sketch after this list)
  • An agent typically needs many episodes to learn a good policy
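As a sketch of that vocabulary (again assuming gymnasium, with random actions standing in for a learned policy), collecting one episode's trajectory of (s, a, r) experiences might look like this:

```python
import gymnasium as gym  # assumption: gymnasium, as above

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

trajectory = []                                   # the sequence of experiences for this episode
done = False
while not done:
    action = env.action_space.sample()            # stand-in for a learned policy
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))      # one (s, a, r) experience per time step
    obs = next_obs
    done = terminated or truncated                # the episode ends here

print("episode length:", len(trajectory))
print("first experience (s, a, r):", trajectory[0])
```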

What can an Agent Learn?

An agent can learn one of these primary functions:

  • A policy
  • A value function
  • A model of the environment

It can also learn both a policy and a value function.

What is a Policy?

The sum of the rewards an agent receives is called the objective. The agent's goal is to maximize the objective. It does this by selecting good actions. The function that an agent uses to decide on an action is called a policy.

The policy is the core item that the agent learns, though not always directly.

A policy π maps a state s to an action a:

a = π(s)

A policy can be stochastic (random) such that it may probabilistically output different actions for the same state.

In this case it is expressed as an action sampled (indicated by ~) from a policy:

a ~ π(s)

Let's go back to CartPole to apply what we just learned.

Example policy for CartPole:
π(s) = always move left

Analysis: Good to correct the initial lean, bad in the long run as it will cause the pole to fall on the right.
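As a rough sketch of what a deterministic and a stochastic policy can look like in code (the weights and the softmax-over-a-linear-score form are arbitrary illustrative choices, not a recommended architecture):

```python
import numpy as np  # assumption: NumPy is available

def deterministic_policy(state):
    """pi(s) = always move left, the (bad) example policy above."""
    return 0                                     # action 0 = push cart left

def stochastic_policy(state, weights):
    """A stochastic policy: score each action linearly, turn the scores into
    probabilities with a softmax, then sample, i.e. a ~ pi(. | s)."""
    logits = weights @ state                     # one score per action (shape: 2)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the two actions
    return np.random.choice([0, 1], p=probs)

state = np.array([0.02, -0.01, 0.03, 0.01])      # [cart pos, cart vel, pole angle, pole angular vel]
weights = np.random.randn(2, 4) * 0.1            # untrained parameters, just for illustration
print(deterministic_policy(state))               # always 0
print(stochastic_policy(state, weights))         # sometimes 0, sometimes 1
```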

How is the reward used?

Assume we have this trajectory of experiences (from first to last) from an episode:

τ = (s0, a0, r0), (s1, a1, r1), (s2, a2, r2), …, (sT, aT, rT)

The return (the total of the rewards the agent sees) is defined as:

R(τ) = r0 + γ·r1 + γ²·r2 + … + γ^T·rT

Which can be compactly re-written as:

R(τ) = Σ γ^t · rt (summing over t from 0 to T)

Let's go back to CartPole to apply what we just learned.

Recall the Reward for CartPole:
+1 for each time step pole remains upright
The undiscounted return of the episode (assuming it ends at s2):

R(τ) = 1 + 1 + 1 = 3

Discounting Rewards over time

In the previous definition of return, you may have wondered what the γ (gamma) term was all about.

Typically, you don't take the full value of the reward at each time step, but rather a sum of the discounted rewards over all time steps.

The γ term basically means that the rewards that come earlier in the trajectory can be weighted more heavily (counted more completely) than those earned later, which can be used to encourage the learning behavior we want (the sketch after this list makes the options concrete). For example, we can:

  • focus only on the immediate reward (set γ to 0.0) or
  • focus more on near-term rewards (set γ closer to 0.0) or
  • focus less on recency by letting rewards that occur much later have an impact (set γ closer to 1.0) or
  • treat all rewards with the same importance (set γ to 1.0)
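Here is that sketch: a few lines of Python computing the discounted return for a short reward sequence under different values of γ (the reward sequence is just the three-step CartPole example from above):

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 1, 1]                       # e.g. three CartPole time steps
print(discounted_return(rewards, 1.0))    # 3.0    -> all rewards count equally (undiscounted)
print(discounted_return(rewards, 0.9))    # ~2.71  -> later rewards count a bit less
print(discounted_return(rewards, 0.0))    # 1.0    -> only the immediate reward counts
```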

If you have a background in a different field, like digital marketing, you might recognize this as a credit assignment problem. It is! The calculation of the return is a way to assign the right "credit" to actions that ultimately led to the desired outcome, and not assign credit to those that did not help (or made things worse).

What is a Value Function?

A value function takes a state or a state-action pair and estimates the expected return of the trajectory that follows:

Vπ(s) = E[ R(τ) | s0 = s ]

or

Qπ(s, a) = E[ R(τ) | s0 = s, a0 = a ]

Value functions help an agent understand how good the states and available actions are in terms of the expected future return. You can think of the state-value function (V) as the value of being in a given state, irrespective of the action taken, and the action-value function (Q) as the value of taking a given action from that state.

Let's go back to CartPole to apply what we just learned.

Recall the State for CartPole:
[Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]

Let's look at some example states and the hypothetical values that could result.

A state where the pole is nearly vertical would have a high value, because the pole is straight up.

A state where the pole has tipped more than 12 degrees from vertical would have a low value, because the pole is going to fall over (and the episode will terminate).

And the action-value of nudging the cart in the direction the pole is leaning would be high, because this will help the pole get back upright.
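This is also where the "Deep" in Deep Reinforcement Learning comes in: a neural network can serve as the function approximator for V (or Q). A minimal sketch, assuming PyTorch (which Stable Baselines 3 is built on); the layer sizes are arbitrary and the network is untrained, so its outputs mean nothing until it is fit against observed returns:

```python
import torch
import torch.nn as nn

# A small neural network approximating the state-value function V(s) for CartPole.
value_net = nn.Sequential(
    nn.Linear(4, 64),    # 4 state features: cart pos, cart vel, pole angle, pole angular vel
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),    # a single number: the estimated expected return from this state
)

state = torch.tensor([0.02, -0.01, 0.03, 0.01])   # an example CartPole state
print(value_net(state).item())                    # V(s) estimate (random until trained)
```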

What is an Environment Model?

The agent can learn the transition function implicitly used by the environment. If the agent can learn this function, it can "predict" the next state to which the environment will transition, enabling it to plan good actions without interacting with the environment.

Think of it like learning the rules of chess. Once you know the rules, you can predict the next state for any action.

This is referred to as Model-Based RL. In many cases the environment model is impossible to learn, so other approaches are taken to learn a Policy without having this model. This is referred to as Model-Free RL.
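A sketch of what a learned environment model could look like for CartPole, again assuming PyTorch; the architecture is an arbitrary choice and the training loop (fitting predicted next states to observed transitions) is omitted:

```python
import torch
import torch.nn as nn

# A learned environment model for CartPole: given the current state and an action,
# predict the next state. With such a model the agent can "imagine" rollouts and plan
# without touching the real environment. Untrained here, so predictions are meaningless.
transition_model = nn.Sequential(
    nn.Linear(4 + 1, 64),   # input: 4 state features + 1 action
    nn.ReLU(),
    nn.Linear(64, 4),       # output: the predicted next state (4 features)
)

state = torch.tensor([0.02, -0.01, 0.03, 0.01])
action = torch.tensor([1.0])                       # push right
predicted_next_state = transition_model(torch.cat([state, action]))
print(predicted_next_state)
```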

😅 Phew! That was a lot of theory. Let's get our hands dirty.

Deep Reinforcement Learning in Azure

Without further ado, let's see how we can run some state-of-the-art deep reinforcement learning algorithms to master CartPole and the Atari game Breakout, and attempt a robotics example: training a Half Cheetah to move.

As usual, we'll run these examples in Azure Machine Learning. As we are just getting started in DRL, rather than write all of our agents from scratch, we'll use the industry-standard implementations from Stable Baselines 3, which provides a set of reliable implementations of reinforcement learning algorithms in PyTorch, and we'll drive those algorithms from the RL Baselines3 Zoo, which provides scripts for training and evaluating agents, tuning hyperparameters, plotting results, and recording videos of agents in action.
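To give a taste of the Stable Baselines 3 API that the Zoo scripts wrap (this is not the notebook's exact code, and the timestep count is an arbitrary choice, so your results will vary):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train a PPO agent on CartPole-v1 with a simple feed-forward ("Mlp") policy network.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=20_000)               # trial-and-error training

# Evaluate the trained agent over a handful of episodes.
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean reward over 10 episodes: {mean_reward:.1f} +/- {std_reward:.1f}")
```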

Grab a copy of this notebook and import it into your Azure Machine Learning environment.

You won't need too powerful a compute instance for this; I used a `STANDARD_DS12_V2`.

Once you've loaded the notebook into your environment and spun up your compute instance, step through the notebook, executing each cell.

Check your understanding!

After you've experimented with the notebook, see if you can answer these questions.

Training CartPole-v1

  • What hyperparameters were loaded? From where?
  • How many environments were created to train your agent?
  • What "policy" was used?
  • What was the average episode reward of your last episode? How long did the episode run for? Why are they the same?

Evaluating Your Trained Agent

  • What was the average reward achieved by your agent? How long did the episode run for to get that reward? Why are they the same?
  • How did this compare to the pre-trained, fully-tuned agent?
  • After training your agent just a little more, what reward were you able to achieve?

Pre-Trained Breakout and Half Cheetah

  • If you watch the breakout gameplay, where are the reward signals coming from? What techniques did the agent learn?
  • If you watch the half cheetah, something is clearly wrong… any thoughts on what you could do to improve it?

Pretty 🤯 stuff, right?!
