How Policies Are Represented By Neural Networks

Jeremi Nuer · Published in Mind Magazines · Jul 6, 2022

It is not enough to understand the theory behind policy-based methods (what policies are, what the policy gradient theorem is, etc.). We must also know how to take our understanding of the theory and translate it into real, applicable code.

Today, we’re going to take our understanding of what policies actually are, and apply it to what the code behind a policy looks like. More specifically, we’re going to be looking at how the output of a policy (i.e., the actions) is represented.


Before we get into that, let’s briefly go over the main types of policies.

Deterministic vs Stochastic Policies

There are, of course, different types of policies, which are represented in different ways. The two most important distinctions are stochastic policies and deterministic policies.

The explanation is simple: deterministic policies are certain; the output of a deterministic policy is the action that the agent will take. Stochastic policies have randomness; they do not output an action, but rather a probability distribution describing how likely the agent is to take each possible action in a given state.

π(s) = a, where ‘a’ is the action, ‘s’ is the state, and ‘π’ is the policy

The image above depicts a deterministic policy. The policy views a state, and outputs the action that the agent will take.

π(a|s): you can read this as “the probability of a given s”, for all possible actions

The image above depicts a stochastic policy. The input is still a state, but instead of the output being an action, it is the probability that the agent takes an action for all possible actions — a probability distribution.

Policy → Neural Network

Now comes the question: how do we represent the policy in code? We have two scenarios — one in which we must create a function that takes in a state and outputs an action, and the other which takes in a state and outputs a probability distribution across all possible actions.
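To make these two scenarios concrete before we get to real networks, here is a toy sketch of the two interfaces; the functions below are dummy stand-ins I made up for illustration, not actual policies:

import numpy as np

def deterministic_policy(state):
    # dummy stand-in: state in, a single action out (e.g. a motor torque)
    return 0.37

def stochastic_policy(state):
    # dummy stand-in: state in, a probability distribution over 4 actions out
    return np.array([0.1, 0.6, 0.2, 0.1])

state = np.zeros(8)                               # dummy state vector
action = deterministic_policy(state)              # scenario 1: the action itself
probs = stochastic_policy(state)                  # scenario 2: a distribution...
action = np.random.choice(len(probs), p=probs)    # ...from which we sample an action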

In both cases, the most versatile and computationally powerful way of representing the policy is through a neural network. I’m going to assume you are familiar with the basics of deep learning and neural networks. Here’s a briefer on neural networks if you need one.

To input the state through our policy, it’s as simple as passing the state through the neural network.

The state could be represented by pixels of an image, or any other data that the environment gives us
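In PyTorch terms, “passing the state through” is a single call once the observation has been converted to a tensor. The tiny network and the observation values below are made-up placeholders:

import torch
import torch.nn as nn

policy_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
observation = [0.02, -0.01, 0.03, 0.0]   # e.g. a 4-number state from the environment
state = torch.as_tensor(observation, dtype=torch.float32)
output = policy_network(state)           # this forward pass is "querying the policy"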

Now for the tricky and confusing part, and the real difficulty of this article: representing the output of our policy.

One of the reasons this is difficult is that some environments are going to have continuous action spaces, while others are going to have discrete action spaces. Depending on which it is, you need to configure your neural network differently.
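If you have Gym installed, you can inspect this distinction directly. The two classic environments below are just illustrative examples (environment IDs can differ across Gym versions):

import gym

# discrete action space: pick one of 2 actions (push cart left or right)
print(gym.make("CartPole-v1").action_space)   # Discrete(2)

# continuous action space: pick a torque anywhere within a range
print(gym.make("Pendulum-v1").action_space)   # e.g. Box(-2.0, 2.0, (1,), float32)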

Deterministic Policies x Continuous/Discrete Actions

As such, we need to deal with four different scenarios: deterministic or stochastic policies, combined with discrete or continuous action spaces.

Let’s start with deterministic policies, and see how to configure a policy neural network for continuous and discrete action spaces.

For deterministic policies, the output layer of the neural network directly gives the action, the output of the policy. We only have one node (assuming a one-dimensional action), and the number it outputs is the action we take.

In this circumstance, it is simple and intuitive to represent our policy. We create the neural network, and we simply return the output as the action.

The only thing we have to worry about is that the output of the neural network might not fall within the action space. If the output of the network is 4.3, but such an action is not possible, we have a problem.

In continuous action spaces, this is not a huge deal. We can simply pass the output through an activation function which will normalize the number to the range that we want.

Let’s say we have a continuous action space of -1.0 to 1.0. All we would need to do is pass the output through a Tanh activation function, and it would normalize the number to that range.

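As a quick sanity check of what that squashing does, here is the 4.3 example from above passed through tanh:

import torch

raw_output = torch.tensor([4.3])   # a raw network output, outside our action range
action = torch.tanh(raw_output)    # ~0.9996, now safely inside (-1.0, 1.0)
print(action)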

The code to implement this is quite simple. Here is an example of how it could look in PyTorch (my preferred ML framework):

def forward(self, state):
    state = F.relu(self.layer1(state))
    state = F.relu(self.layer2(state))
    # tanh squashes the output into the (-1, 1) action range
    output = torch.tanh(self.output(state))
    return output

This block of code only contains the forward pass method. The state is passed through the 2 fully connected layers and the output layer, then passed through the tanh function to transform the data to the correct range.
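For completeness, here is a minimal sketch of the full module such a forward pass might live in. The class name, hidden layer sizes, and one-dimensional action are my own assumptions for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeterministicPolicy(nn.Module):
    def __init__(self, state_dim, action_dim=1, hidden_dim=64):
        super().__init__()
        self.layer1 = nn.Linear(state_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        state = F.relu(self.layer1(state))
        state = F.relu(self.layer2(state))
        return torch.tanh(self.output(state))   # squash the action into (-1, 1)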

When it comes to discrete action spaces, things get blurry. Yes, we could round numbers to whole integers, or use some other way to mathematically transform the output to a discrete number.

But there aren’t many reasons to do so. At this point, you would be better off using DQN to solve your problem. Policy-based methods are meant more for stochastic policies or continuous action spaces. Value-based methods, where the policy is implicit, would work better in this scenario.

My answer for you in this scenario is: formulate the problem in a different way. Create a stochastic policy instead, or use a value-based method.

Stochastic Policies x Discrete Actions

Now we enter the realm of stochastic policies, where the output of our neural network must represent a probability distribution. For discrete actions, this is not an issue.

For each possible action we can take, we have a corresponding node in the output layer which represents that action. The number outputted by that node is the probability that we take the corresponding action.

Let’s say there are four possible actions in our environment. Then, we can configure our neural network to have four nodes in the output layer, each outputting the probability of their own action.

The only issue with this is that the output of a neural network is not automatically constrained to be between 0 and 1. Furthermore, even if it were, the sum of all the output nodes must equal 1, or else they do not represent a probability distribution.

Fortunately, there is a simple function which can normalize the outputs so that they represent a probability distribution (each value lies between 0 and 1, and all of them sum to 1). This function is the Softmax function.

The output layer (second layer) is transformed into probabilities via the Softmax function
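To see what the Softmax function actually does to the raw numbers, here is a small standalone example with four made-up output values:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])   # raw output-layer values for 4 actions
probs = F.softmax(logits, dim=-1)              # each value now lies in (0, 1)...
print(probs, probs.sum())                      # ...and they sum to 1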

The PyTorch code for this is, once again, quite simple:

def forward(self, state):
    state = F.relu(self.layer1(state))
    state = F.relu(self.layer2(state))
    # softmax normalizes the raw outputs into probabilities that sum to 1
    output = F.softmax(self.output(state), dim=-1)
    return output

Once again, I only included the forward pass method. The code here looks very similar to what we wrote when constructing a network for deterministic policies in continuous action spaces. The key difference, which is not visible in this snippet, is that in the deterministic policy case we had only one output node, which represented the action we would take. Here, we have as many output nodes as there are discrete actions, together representing a probability distribution.

As we can see, the state is passed through two fully connected layers, as is normal. After being passed through the output layer of the neural network, the numbers are passed through the Softmax function, and we return a distribution of probabilities.
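Once we have that probability vector, one common way to actually pick an action is to build a Categorical distribution from it and sample. The probabilities below are just an illustrative example:

import torch

probs = torch.tensor([0.1, 0.6, 0.2, 0.1])    # output of the stochastic policy
dist = torch.distributions.Categorical(probs)
action = dist.sample()                        # index of the chosen action, e.g. tensor(1)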

Stochastic Policies x Continuous Actions

This is where things start to get tricky. We can’t directly represent the output of our policy (the probability distribution) through the output layer.

With continuous action spaces, it’s impossible to have a node represent each action; we would need an infinite number of nodes, one for every possible action.

Instead, we use the output of the nodes to create a distribution which we can then sample from. What do I mean? Well, let’s look at the Gaussian distribution, the most common way of representing continuous action spaces.

“Gaussian” and “Normal” distributions are different words for the exact same thing.

An important thing to note here is that this distribution is described by a probability density function: it does not directly model the probability of a single action, but rather the probability of an action falling within a certain interval (found by integrating the density over that interval).
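To make the “probability over an interval” idea concrete, here is a small sketch using a standard normal distribution:

import torch

dist = torch.distributions.Normal(loc=0.0, scale=1.0)
# the density at a single point is not a probability by itself...
density = torch.exp(dist.log_prob(torch.tensor(0.0)))            # ~0.3989
# ...but integrating the density over an interval gives one
p = dist.cdf(torch.tensor(0.5)) - dist.cdf(torch.tensor(-0.5))   # ~0.383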

Regardless, we can use the Gaussian distribution to represent the output of our policy. In other words, it represents:

π(a|s): the probability of a given s, for all possible actions

The question is, how do we create the Gaussian distribution from the output of our neural network? First, we must understand how the Gaussian distribution is formed.

Here is the equation for the Gaussian distribution’s probability density:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))

‘x’, the input, represents an action. The output represents the probability density of that action

This equation looks very confusing, but we don’t need to understand the ins and outs of it. Instead, I want to point your attention to the two variables µ (the mean) and σ (the standard deviation).

These are the two variables which define the Gaussian distribution: the mean and the standard deviation. With just these two numbers, the entire distribution can be created.

As such, we can formulate a policy neural network to have two output nodes — one which represents the mean, and the other which represents the standard deviation.

From there, we can use those two values to create our distribution, and then sample an action from that distribution.

There’s only one catch: the standard deviation has to be positive. As we all know, neural network outputs are not constrained to a positive range. Therefore, we exponentiate the output node which represents the SD (e^x, where x is the value of that output node).

‘nn’ stands for neural network

The forward pass code would look basically identical to the code we’ve written before. The state would be passed through both layers, but in this case there would be no activation function after the output layer.

def forward(self, state):
    state = F.relu(self.layer1(state))
    state = F.relu(self.layer2(state))
    # no activation after the output layer: we return the raw values
    output = self.output(state)
    return output
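For context, here is one way the module behind that forward pass might be defined. The names and sizes are my own assumptions; the key detail is that the output layer has two units, which we will interpret as the mean and the (raw) standard deviation:

import torch.nn as nn
import torch.nn.functional as F

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.layer1 = nn.Linear(state_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # two output units: one for the mean, one for the raw standard deviation
        self.output = nn.Linear(hidden_dim, 2)

    def forward(self, state):
        state = F.relu(self.layer1(state))
        state = F.relu(self.layer2(state))
        return self.output(state)   # no activation: raw (mean, raw_sd) values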

The main difference is after we call the forward method. There are several steps we need to take after receiving the output of the neural network in order to choose our action.

# passing the state through the network; for a single (unbatched) state,
# the two output units unpack directly into the mean and raw standard deviation
mean, sd = policy_network(state)
# exponentiate the standard deviation so it's positive
sd = torch.exp(sd)
# sample an action from our Gaussian distribution
action = torch.distributions.Normal(mean, sd).sample()

Overall, the steps aren’t too complicated! But it is important that we have an intuitive understanding of what’s going on.

Let’s discuss the implications of this for a second. In Gaussian policies where the mean and standard deviation are the two parameterized values, they can be interpreted as follows: the mean is the action the policy currently believes to be best, and the standard deviation reflects the policy’s confidence in that action (the larger the SD, the lower the confidence).

Concluding Thoughts

Oftentimes, we (myself included) take the workings of policies for granted without thinking through exactly how we would represent them in code. Even a concept as simple as a probability distribution being the output of a policy has more nuance in its implementation than we might initially perceive.

As such, I hope this article has shed some light on how we can formulate neural networks to represent policies according to the different circumstances you will find yourself in.

If you have any questions, always reach out to me! All my socials: @jereminuer

Until next time friends💫🗺
