Mathematical introduction to Policy Gradient using TED-Ed’s Ruby Riddle.

Ben Thayer

In this post, I’ll go over using Policy Gradient and Reinforcement Learning to solve one of TED-Ed’s riddles. If you’re interested in solving the riddle for yourself, watch TED-Ed’s video first before scrolling down.

First, what is Policy Gradient? What is a policy? What is Reinforcement Learning? Reinforcement Learning considers an agent acting in an environment. Every time the agent acts, it receives a reward. Our goal is to maximize the reward the agent receives from the environment. Policy Gradient attempts to do this by updating the way the agent acts, known as its policy, in the direction that produces the most reward.

Below, I explain all of the logic and math for using policy gradient to solve this problem. In my opinion, the math is the hardest part of learning how to use policy gradient, and I think most resources don’t make it easy if the math is totally new to you. Sergey Levine’s slides on policy gradient are good if you’re interested in seeing even more of the math involved, but I try to explain some of the math in the most thorough way possible. If you want to see the code, check out the Jupyter Notebook on my GitHub.

So how do we turn this into a reinforcement learning problem? The definition of the problem is as follows:

  1. There are 30 rubies divided between three boxes.
  2. Each box must have at least two rubies, and one box has six more than one of the others.
  3. You must write down a number between 1 and 30 for each box. These boxes will then be opened simultaneously.
  4. For each of the boxes, if the number you wrote down is less than or equal to the number of rubies in the box, you may confiscate that many rubies. If it’s greater, the merchant keeps the entire box.
  5. The merchant knows the rules ahead of time and will do his best to maximize his chances of keeping as many rubies as possible.
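Before formalizing anything, it helps to see the rules as code. Below is a minimal sketch; the helper names (valid_placements, score) and the assumption that the three boxes are distinguishable (so ordered triples count as distinct placements) are mine, not taken from the post or the notebook.

```python
from itertools import permutations

def valid_placements(total=30, min_per_box=2, gap=6):
    """Ordered triples (a, b, c) summing to `total`, each box holding at least
    `min_per_box` rubies, with some box holding exactly `gap` more than another."""
    placements = []
    for a in range(min_per_box, total + 1):
        for b in range(min_per_box, total - a + 1):
            c = total - a - b
            if c < min_per_box:
                continue
            if any(x - y == gap for x, y in permutations((a, b, c), 2)):
                placements.append((a, b, c))
    return placements

def score(guess, placement):
    """Rubies the guesser wins: a box pays out only if the guess doesn't exceed its contents."""
    return sum(g for g, p in zip(guess, placement) if g <= p)

print(len(valid_placements()))
print(score((13, 10, 4), (15, 6, 9)))  # 17, the worked example used later in the post
```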

In this situation, there are two agents: the “guesser” and the “placer”. For simplicity’s sake, imagine that the placer always places the rubies according to the same policy. In this world, the placer is part of the environment, and we can update the guesser’s policy according to whatever is best for the guesser. The same argument goes for the placer: if we think of the guesser’s policy as constant, we can update the placer’s policy to perform best against that version of the guesser’s policy.

Combining these two ideas, we can argue that if we update the policies slowly enough, both the guesser and the placer are acting in an approximately constant environment, and at each step each policy will be slightly better than the last. Of course, the two are competing, so who actually comes out ahead depends on the specific updates made.

Now we need to pin down exactly what a policy is. A policy simply assigns different probabilities to different actions. In a machine learning setting, the typical way of doing this is with what’s called a softmax. A softmax takes a weight for each action, which can be any number, positive or negative, and assigns higher probabilities to actions with higher weights. The exact formula is as follows:

$$\pi(a) = \frac{e^{w_a}}{\sum_{a' \in A} e^{w_{a'}}}$$

Above, $\pi$ is the policy, $A$ is the action space, and $w_a$ is the weight of action $a$. For the guess policy, $A$ is the set of possible guesses (combinations of three numbers to write down), and for the placement policy, $A$ is the set of different ways 30 rubies can be placed in 3 boxes that fit the problem’s constraints.
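As a quick illustration (not code from the notebook), here is the softmax as a numpy function; subtracting the maximum weight before exponentiating is a standard numerical-stability trick I’ve added:

```python
import numpy as np

def softmax(weights):
    """Convert arbitrary real-valued weights into action probabilities."""
    w = np.asarray(weights, dtype=float)
    e = np.exp(w - w.max())   # subtracting the max doesn't change the result, but avoids overflow
    return e / e.sum()

pi = softmax([1.0, 2.0, 0.5])
print(pi, pi.sum())  # higher weights get higher probabilities; the probabilities sum to 1
```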

We also want to maximize the reward, so what exactly is the reward? If the boxes contained 15, 6 and 9 rubies, and I respectively guessed 13, 10 and 4, then I would get 13 rubies from the first box, 0 rubies from the second box, and 4 rubies from the last box, a total of 17 rubies. My score is 17 rubies, and since I’m taking them from the placer, the placer’s score is -17 rubies. For each combination of a guess and a placement, we can calculate the score. To find the expected score of a given action, we multiply the probability that the opponent picks each of their actions by the score that the two actions together produce, and sum. If the placer has a 50% chance of picking 2, 11, 7 and a 50% chance of picking 16, 10, 4, then for the guesser, the action 5, 10, 9 has an expected score of 0.5 (0 + 10 + 0) + 0.5 (5 + 10 + 0) = 5 + 7.5 = 12.5. More generally, if the scores for all combinations are stored in a matrix O with dimensions num_guesses × num_placements, we can treat the policies as vectors of probabilities and use matrix multiplication to find

$$R_g = O \pi_p$$

and

$$R_p = -O^T \pi_g$$

where R represents the expected rewards for each action and the subscripts g and p refer to the guesser and the placer. For the placer’s reward, we negate the score because O is written in terms of the guesser’s reward.
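In numpy this is just two matrix-vector products. The sketch below uses a made-up O purely to show the shapes; in the real problem, O would be filled in with the score of every guess against every placement:

```python
import numpy as np

rng = np.random.default_rng(0)
num_guesses, num_placements = 4, 3

# Stand-in score matrix: O[i, j] = rubies the guesser wins with guess i against placement j.
O = rng.integers(0, 31, size=(num_guesses, num_placements)).astype(float)

pi_g = np.full(num_guesses, 1.0 / num_guesses)        # guesser's action probabilities
pi_p = np.full(num_placements, 1.0 / num_placements)  # placer's action probabilities

R_g = O @ pi_p        # expected reward of each guess against the placer's policy
R_p = -O.T @ pi_g     # expected reward of each placement against the guesser's policy
print(R_g, R_p)
```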

If we want to get the value of the entire policy, we just multiply the probability of each action by the reward of that action and sum. Written in matrix form, we get:

$$J = \pi^T R$$

so for the guesser, $J_g = \pi_g^T R_g = \pi_g^T O \pi_p$.
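Here is a tiny numerical check of this, with made-up numbers rather than the riddle’s real score matrix:

```python
import numpy as np

O = np.array([[10.0, 15.0],
              [12.0,  5.0]])   # toy guesser scores for 2 guesses x 2 placements
pi_g = np.array([0.5, 0.5])
pi_p = np.array([0.5, 0.5])

J_g = pi_g @ O @ pi_p          # value of the guesser's policy (10.5 here)
print(J_g, pi_g @ (O @ pi_p))  # same number computed as pi_g . R_g
```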
Great, now we know how to compute the value of a given policy. Remember, our goal is to improve the policy, so what we need to do is take the gradient of J with respect to the parameters of our policy, the weights we’re using in the softmax. Mathematically, this is

$$\nabla_w J = \nabla_w \left( \pi^T R \right)$$
For this problem, we can actually calculate this analytically, so we’ll do that first, then move on to the more general form that policy gradient is based on.

First, we can find the gradient analytically by taking the partial derivative with respect to the weight of a given action and remembering the quotient rule:

$$\frac{\partial J}{\partial w_a} = \frac{\partial}{\partial w_a} \sum_{a'} \pi(a')\, R(a') = \pi(a)\, R(a) - \pi(a) \sum_{a'} \pi(a')\, R(a') = \pi(a)\,\big(R(a) - J\big)$$

This derivative has the same form for each weight, so in vector form we get

$$\nabla_w J = \pi \odot (R - J)$$

where we perform an element-wise multiplication and subtract the scalar J from every element of R.
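In code, the analytic gradient is one line. The helper below is my own sketch of it (the notebook’s implementation may differ):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def analytic_gradient(w, R):
    """Exact gradient of J = pi . R with respect to the softmax weights:
    grad_i = pi_i * (R_i - J)."""
    pi = softmax(w)
    J = pi @ R
    return pi * (R - J)

w = np.zeros(3)
R = np.array([1.0, 2.0, 3.0])
print(analytic_gradient(w, R))  # [-1/3, 0, 1/3]: weight shifts toward the highest-reward action
```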

This is the gradient for this specific problem. Because we can compute R exactly, we can take the derivative exactly. In most cases, though, we can’t do this, and the best we can do is estimate the expected reward from many trials, or episodes, of experience. Since we can only estimate the expected reward, we need to estimate the expected gradient too.

To see how this is done, I’ll first show the math, then break it down.

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[r(\tau)\big] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

There’s a lot to address if this is the first time you’ve seen this expression, so read the whole explanation before staring at it and wondering what certain parts mean.

An expectation is the sum, over every possible outcome, of the probability of that outcome multiplied by the value of whatever function is inside the expectation. Sums and integrals express the same idea, but integrals also work for continuous cases, so it makes sense to use integral notation so that our result applies in general. Additionally, it looks nice.

The variable we are integrating over, $\tau$, represents a whole sequence of states and actions, corresponding to an episode, or rollout, of the policy. We integrate over entire episodes because actions can earn future rewards within an episode, but rewards do not carry over between episodes, so we conveniently define

$$\tau = (s_1, a_1, s_2, a_2, \dots, s_T, a_T)$$

to reflect that. Correspondingly, $r(\tau)$ returns the total reward for a given episode, and $\pi_\theta(\tau)$ returns the probability that that specific episode (i.e. that combination of states and actions) would happen.

It is important to notice that $\pi_\theta(\tau)$ is a single probability rather than a probability distribution, because we specify exactly which states and actions we want the probability of through the input $\tau$. In our case though, a single episode consists of only one state and one action. For the guesser, the state is the placement and the action is the guess; for the placer, the opposite is true: the state is the guess and the action is the placement. Initially, the state is hidden from the actors, but for the purposes of the expectation this is okay.

The next step is to take the gradient. To do this, we take the gradient at every single episode (state–action combination) and sum them up (integrate):

$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$

There could be an enormous number of possible episodes, though, so evaluating this integral directly would be prohibitively difficult. To fix this, we want to turn the expression back into an expectation so we can approximate it by sampling. To do that, we need the term $\pi_\theta(\tau)$ outside of the gradient operator, so we use the identity

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$

to get

$$\nabla_\theta J(\theta) = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$

Although this is not initially obvious, by looking at the identity in reverse we can see that it is just an application of the chain rule:

$$\pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau).$$

Now, we can convert the integral back into an expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\big]$$

Now that we have an expectation again, we can simply run our policy and compute

$$\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$$

for each episode we encounter. The expectation is approximated by the simple average of those terms.
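Concretely, the estimator looks like the sketch below. `play_episode` is a hypothetical stand-in for one round of the game (sample the opponent’s hidden choice, apply the rules, return the reward), and the gradient of the log-probability uses the softmax result derived in the next section:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def estimated_gradient(w, play_episode, n_episodes=1000, seed=0):
    """Monte Carlo estimate of the policy gradient: the average of
    grad log pi(a) * r over sampled episodes."""
    rng = np.random.default_rng(seed)
    pi = softmax(w)
    grad = np.zeros_like(w)
    for _ in range(n_episodes):
        a = rng.choice(len(pi), p=pi)   # sample an action from the current policy
        r = play_episode(a)             # hypothetical helper: play one round, return the reward
        g_log = -pi.copy()              # grad of log pi(a), derived in the next section
        g_log[a] += 1.0
        grad += g_log * r
    return grad / n_episodes
```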

The last step before coding this up is to explicitly calculate

$$\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$$

for our problem. Typically, whatever framework you’re using would have auto-differentiation and you wouldn’t need to worry about this, but I’m doing it manually so we can code up the result in numpy.

First, we don’t need to worry about $r(\tau)$, because it has no parameters we can take the gradient of; we can just compute it using the rules of the game. We can simplify the other part, $\nabla_\theta \log \pi_\theta(\tau)$, so that it depends only on the action and not also on the state: the episode is a single state and action, and the state does not depend on our parameters, so $\nabla_\theta \log \pi_\theta(\tau) = \nabla_w \log \pi(a)$.

Our policy is defined as

$$\pi(a) = \frac{e^{w_a}}{\sum_{a' \in A} e^{w_{a'}}}$$

To compute the gradient, we take the partial derivative with respect to each of the parameters of the policy. If we set

$$S = \sum_{a' \in A} e^{w_{a'}}$$

and use the quotient rule like before, we get

$$\frac{\partial \pi(a)}{\partial w_a} = \frac{e^{w_a} S - e^{w_a} e^{w_a}}{S^2} = \pi(a)\,\big(1 - \pi(a)\big)$$

and

$$\frac{\partial \pi(a)}{\partial w_b} = \frac{0 \cdot S - e^{w_a} e^{w_b}}{S^2} = -\pi(a)\,\pi(b)$$

where $a$ is the action that occurred and $b$ is any other action.

Finally, multiplying each result by $\frac{1}{\pi(a)}$ (because $\nabla_w \log \pi(a) = \frac{\nabla_w \pi(a)}{\pi(a)}$), we get

$$\frac{\partial \log \pi(a)}{\partial w_a} = 1 - \pi(a)$$

and

$$\frac{\partial \log \pi(a)}{\partial w_b} = -\pi(b)$$
Now all that’s left is to code up the math and see that it works! We pick a learning rate, then at each time step we compute the gradient and update. I wrote two bits of code — one calculates the gradient using the analytical result, and the other calculates the gradient by sampling actions and using the expectation. The rest of the code is the same. Check out the Jupyter Notebook I posted on GitHub for the implementation and some more details.
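To give a feel for the structure (the actual implementation lives in the notebook), here is a minimal sketch of the alternating updates using the analytic gradient; the score matrix O here is a small made-up stand-in rather than the riddle’s full num_guesses × num_placements matrix:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

O = np.array([[10.0, 15.0,  3.0],      # toy stand-in for the full score matrix
              [12.0,  5.0,  9.0]])
w_g = np.zeros(O.shape[0])             # guesser's softmax weights
w_p = np.zeros(O.shape[1])             # placer's softmax weights
lr = 0.01                              # learning rate

for step in range(10_000):
    pi_g, pi_p = softmax(w_g), softmax(w_p)
    R_g, R_p = O @ pi_p, -O.T @ pi_g               # expected reward of each action
    w_g += lr * pi_g * (R_g - pi_g @ R_g)          # analytic policy gradient step
    w_p += lr * pi_p * (R_p - pi_p @ R_p)

print(softmax(w_g), softmax(w_p))                  # the policies the two players settle on
```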

I hope you found this mathematical introduction to Policy Gradient helpful! Policy Gradient underlies some of the most powerful Reinforcement Learning algorithms we have to date, so I hope this tutorial helps you down the path of discovery! Don’t forget to check out Sergey Levine’s slides for more information; I couldn’t have written this post without them!

Originally published on October 30, 2018.
