A link between Cross-Entropy loss and Policy-Gradient expression

Dhanoop Karunakaran
Intro to Artificial Intelligence
6 min read · Jun 9, 2020
Source: [2]

Cross-Entropy loss

Cross-Entropy loss is widely used in machine learning to optimise classification models. Cross-entropy, H, measures the distance between the true probability distribution and the predicted probability distribution.

Source: [1]
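To make the definition concrete, here is a minimal NumPy sketch of the cross-entropy H(p, q) = −Σ p(x) log q(x); the two distributions are illustrative, not taken from the figure above.

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps avoids log(0)."""
    q_pred = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log(q_pred))

p = np.array([0.0, 0.0, 1.0])   # illustrative true distribution
q = np.array([0.2, 0.3, 0.5])   # illustrative predicted distribution
print(cross_entropy(p, q))      # ~0.693 = -log(0.5)
```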

There are mainly two types of cross-entropy loss: Categorical Cross-Entropy and Binary Cross-Entropy.

Categorical Cross-Entropy loss

In multiclass classification, there are M classes or labels and we want to choose exactly one class out of the M classes. Categorical Cross-Entropy loss is mainly used in multiclass classification.

Source: [2]

Categorical cross-entropy combines a softmax layer with the cross-entropy loss. The softmax function converts all the outputs of a neural network into the range [0, 1] such that the values of all outputs add up to 1.

For example, we have 3 classes: label 1, label 2, and label 3 and their softmax outputs add up to 1. This is our predicted class probability distribution.

Predicted class probability distribution
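As a sketch of how a predicted distribution like this is produced, the softmax below turns raw network outputs (the logits are made up for illustration) into probabilities that lie in [0, 1] and sum to 1.

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalise so outputs lie in [0, 1] and sum to 1."""
    z = logits - np.max(logits)   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([1.0, 2.0, 3.0])   # illustrative raw outputs for label 1, 2, 3
probs = softmax(logits)
print(probs, probs.sum())            # ~[0.09 0.24 0.67], sums to 1.0
```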

In multiclass classification, only one class is true and all other classes are false. In this case, the ground truth, or true class probability distribution, is represented as below.

True class probability distribution

The loss can be calculated by computing the cross-entropy between the predicted class probability distribution from the softmax output and the true probability distribution.

Let’s apply the cross-entropy loss formula to the example above:

If we look at this closely, the loss value is determined only by the one class whose true probability is 1. In this case, only label 3 has the value 1 and the rest are zero. So we can rewrite the loss function L as below:

where y is the class whose true probability is 1 while the rest are zero.
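A small sketch of this simplification, using made-up softmax outputs with label 3 as the true class: with a one-hot true distribution, the full cross-entropy sum collapses to −log of the probability assigned to the true class.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0])      # one-hot: label 3 is the true class
y_pred = np.array([0.09, 0.24, 0.67])   # illustrative softmax output

full_sum = -np.sum(y_true * np.log(y_pred))   # full cross-entropy over all classes
simplified = -np.log(y_pred[2])               # -log p_y for the single true class y
print(full_sum, simplified)                   # both ~0.40
```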

When we use cross-entropy as a loss function, the idea is to minimise the computed cross-entropy during the training of the neural network.

Binary Cross-Entropy loss

This type of loss function is mainly used in multilabel classification. In multilabel classification, more than one label or class can be true. For example, an image can have more than one true label, as shown below:

Example of a multilabel classification: Source: [2]

This means we need to compute the loss of each output unit of the neural network independently of the other output units' results. In this type of classification problem, we cannot feed the softmax output to the cross-entropy, as softmax converts the outputs in such a way that their values add up to 1. Instead, we use the sigmoid, which converts each output unit of the NN into a value between 0 and 1 that is independent of the other output units.

Source: [2]

When we use the sigmoid with the cross-entropy loss, it is called binary cross-entropy or sigmoid cross-entropy. Instead of considering the probability distribution across all outputs (like softmax cross-entropy), binary cross-entropy treats the problem as a separate binary classification problem for every label. Here is the sigmoid cross-entropy loss equation, and we will discuss it through an example.

Sigmoid cross-entropy loss equation

For example, consider a neural network which has three output units for the labels: dog, cat, and aeroplane. This is a multilabel classification problem where the input image can be assigned more than one label. The sigmoid function squashes the value from each output unit into the range [0, 1] independently of the other output units.

The sigmoid result over outputs
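For reference, here is how the sigmoid is applied to each output unit independently; the raw outputs below are illustrative, not the values in the figure.

```python
import numpy as np

def sigmoid(x):
    """Squash each output unit into (0, 1) independently of the others."""
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, -1.0, 0.5])   # illustrative raw outputs: dog, cat, aeroplane
print(sigmoid(logits))                # ~[0.88 0.27 0.62]; note they need not sum to 1
```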

In the binary cross-entropy loss, we also have to calculate the implicit probability of each output, i.e. the probability that the corresponding label is absent. The implicit probability of each output can be calculated using the following formula.

Implicit probability
Implicit probability over outputs
The true probability values of the labels

Now it is time to compute the cross-entropy loss; let's discuss the equation.

Sigmoid cross-entropy loss equation

where M is the number of labels and 1 − yⱼ is the implicit probability of label j.

We can compute the loss from the above example as shown below:

Sigmoid cross-entropy loss equation from the example above

That gives the value 0.049, and the idea here is to minimise the computed cross-entropy loss during the training of the neural network.
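Here is a minimal sketch of that computation; the sigmoid outputs and true labels below are placeholders rather than the figure's values (so the result differs from 0.049), but the formula is the same: the mean over labels of −[t·log(y) + (1 − t)·log(1 − y)].

```python
import numpy as np

def binary_cross_entropy(t_true, y_pred, eps=1e-12):
    """Mean over labels of -[t*log(y) + (1-t)*log(1-y)]."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_label = -(t_true * np.log(y_pred) + (1.0 - t_true) * np.log(1.0 - y_pred))
    return per_label.mean()

# Placeholder values for dog, cat, aeroplane (not the article's figure values)
t = np.array([1.0, 1.0, 0.0])      # dog and cat are present, aeroplane is absent
y = np.array([0.88, 0.27, 0.62])   # sigmoid outputs from the sketch above
print(binary_cross_entropy(t, y))  # ~0.80 with these placeholder values
```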

REINFORCE — Policy Gradient and categorical cross-entropy loss

REINFORCE is the Monte-Carlo variant of policy gradient methods. That means the RL agent samples complete trajectories, from the starting state to the goal state, directly from the environment, rather than bootstrapping as other methods such as Temporal Difference Learning and Dynamic Programming do.

The idea here is to sample a trajectory by following the policy π, compute the gradient of the objective function, and then update the policy parameters in the direction of the gradient. Repeat these steps until we get an optimal policy.

The gradient expression of the objective function

If you want to know how this equation is derived from the expectation of the total return of the trajectory, please have a look at this article.

The whole motivation here is that R(τ) alone is not differentiable with respect to the policy parameters, so it cannot enable gradient-based learning on its own. By multiplying R(τ) with a differentiable expression, the reward can have an impact on the learning.

By using the cross-entropy loss and its gradient, we can enable gradient-based learning for this type of reinforcement learning. Thus the name policy gradient.

A stochastic policy π is a probability distribution over the actions for a given state.

Stochastic policy — a probability distribution over the action given state: Source: [3]

Where the policy is parameterised using θ.

When we apply the softmax function to the neural network output, we get a probability distribution over the outputs. This can be considered as the action distribution generated by the policy π. The idea here is to adjust the policy parameters θ, i.e. the weights of the NN, to find the optimal policy that maximises the return.

The log expression of the policy gradient shown below is equivalent to the log expression of the categorical cross-entropy loss. πθ(at, st) gives the probability of taking action at in the given state st at time step t, which moves the agent to the next state st+1.

Here is the explanation of that:

This enables us to use the cross-entropy loss in policy gradient algorithms. By taking the gradient ∇ of the equation, we can facilitate the gradient-descent approach, the same as in any supervised learning.

Finally, the return R(τ) in the policy gradient expression determines how good the trajectory was under the current policy π. As we mentioned earlier, R alone is not differentiable, so by multiplying R with the log expression, the reward can have an impact on the learning.
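To make the link concrete, here is a minimal sketch in PyTorch (the toy policy network, states, actions, and returns are made up for illustration): treating the sampled actions as class labels, the REINFORCE loss becomes a categorical cross-entropy weighted by the return R(τ).

```python
import torch
import torch.nn.functional as F

policy = torch.nn.Linear(4, 2)   # toy policy: 4-dim state -> logits over 2 actions

# One sampled trajectory (illustrative values)
states = torch.randn(5, 4)                          # states visited along the trajectory
actions = torch.tensor([0, 1, 1, 0, 1])             # actions sampled from the policy
returns = torch.tensor([1.0, 1.0, 0.5, 0.5, 0.2])   # return following each step

logits = policy(states)
# -log pi_theta(a_t | s_t): the per-step categorical cross-entropy with the
# sampled action as the "true" label
neg_log_prob = F.cross_entropy(logits, actions, reduction="none")
loss = (neg_log_prob * returns).mean()   # return-weighted cross-entropy

loss.backward()   # gradient of the loss, used for the policy-gradient update
```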

If you like my write-up, follow me on GitHub, LinkedIn, and/or my Medium profile.
