On-Policy v. Off-Policy Reinforcement Learning Explained

Jeremi Nuer
5 min read · Jun 9, 2022


In the large and complex field of Deep Reinforcement Learning, it is easy to get lost.

As you traverse this field, you will be met with nuanced terminology and classifications that are difficult to tell apart. What are policy-based and value-based methods? What about model-free versus model-based? Are these things exclusive to one another?

It is important that, as the brave explorer you are, you are equipped with an understanding of the different sectors of the field and of how the methods differ from one another. And I am here to help you!

Of course, there is no better place to start than the distinction between off-policy and on-policy methods. These two terms describe, in a general sense, how the agent learns and behaves during the training phase, and they capture the two main ways an agent can go about learning and behaving.

But first, it is important to ask: What even is a policy?

In the simplest terms, a policy specifies which actions the agent should take in every possible state.

You can think of the policy as a probability distribution. For every state, the policy gives the probability of taking each of the possible actions at that state. The policy might determine that at a certain state, there is a 60% chance of taking action 1, a 30% chance of taking action 2, and a 10% chance of taking action 3. Only, the policy doesn’t determine the probabilities. It is the probabilities.

Quite literally, the policy is the probability of each action a ∈ A being taken at each state s ∈ S. The policy is denoted by π, and the probability of taking action a in state s is written π(a|s).
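
To make that concrete, here is a minimal sketch in Python of a policy stored as an explicit table of probabilities. The state names, action names, and numbers are made up purely for illustration:

```python
import random

# A toy policy: for each state, a probability distribution over actions.
# The states and actions here are placeholders, not from any real environment.
policy = {
    "s0": {"action_1": 0.6, "action_2": 0.3, "action_3": 0.1},
    "s1": {"action_1": 0.2, "action_2": 0.5, "action_3": 0.3},
}

def sample_action(policy, state):
    """Sample an action according to the policy's probabilities for this state."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))  # prints "action_1" roughly 60% of the time
```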

At any given point, our agent follows a policy πₖ. This policy is constantly changing and improving as our agent goes through training. Every time the policy changes slightly (we go through a gradient step and the neural network’s weights are different), the policy has changed to πₖ₊₁. Even if the only difference between the two policies is that a single action is now slightly more likely to be chosen, they are two entirely different policies.

Everything still clear? Good.

The next important clarification is the distinction between the behavior policy and the update policy (the latter is more commonly called the target policy in the literature).

The behavior policy is quite simple. It is exactly what I described a policy to be in the paragraphs above. It is how an agent will act in any given state (the probabilities etc.).

The update policy is how the agent imagines it will act while calculating the value of a state-action pair. To better understand this, let’s think about how an agent calculates the value of a state-action pair (the value of taking a specific action at a specific state) in the first place.

The value of a state-action pair is the expected return. The expected return is what the agent believes the overall reward will be after taking said action at said state, and continuing onwards until the environment is reset. The agent is thinking “well, if I take this action at this state, I’m probably gonna end up in state s′. And if I’m in state s′, I’ll probably take action a′, and then I’ll probably get reward r and end up in some new state s′′.” The agent follows this line of thinking on and on.

But here’s a question: how does the agent know what action it will take once it is in state s′? You see, the expected return is calculated assuming that the agent follows a specific policy from that point onwards. This policy is the update policy.

When an agent calculates the value of a state-action pair, it is thinking “what will be the cumulative reward if I take this action and then follow the update policy π onwards?”
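
Here is a rough sketch of that line of thinking in code: a Monte Carlo estimate of the value of a state-action pair, where we take the action, then follow the update policy onwards, and average the discounted returns. The tiny environment, the update policy, and the discount factor are all made up for illustration; this is not any real library’s API.

```python
import random

GAMMA = 0.99  # discount factor (placeholder value)

# A tiny made-up deterministic environment:
# (state, action) -> (reward, next_state); next_state None means the episode ends.
TOY_MDP = {
    ("s0", "a0"): (1.0, "s1"),
    ("s0", "a1"): (0.0, "s1"),
    ("s1", "a0"): (0.0, None),
    ("s1", "a1"): (5.0, None),
}

def example_update_policy(state):
    # A made-up stochastic update policy: pick "a1" 80% of the time.
    return "a1" if random.random() < 0.8 else "a0"

def q_estimate(state, action, update_policy, num_rollouts=10_000):
    """Estimate Q(s, a): take `action` in `state`, then follow `update_policy`
    until the episode ends, and average the discounted returns."""
    total = 0.0
    for _ in range(num_rollouts):
        ret, discount = 0.0, 1.0
        s, a = state, action
        while s is not None:
            reward, s = TOY_MDP[(s, a)]
            ret += discount * reward
            discount *= GAMMA
            if s is not None:
                a = update_policy(s)  # "what would I do next, according to π?"
        total += ret
    return total / num_rollouts

print(q_estimate("s0", "a0", example_update_policy))  # ≈ 1.0 + 0.99 * (0.8 * 5.0) ≈ 4.96
```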

With that explained, we can finally get to the purpose of this article. Here enters the distinction between on-policy and off-policy methods.

In on-policy methods, the behavior policy and the update policy are one and the same. In off-policy methods, they are different.

In on-policy methods, the value of a state-action pair is calculated assuming that the agent will follow the current behavior policy onwards. As such, the value is defined only in terms of this specific policy.

In off-policy methods, that is not the case. Let’s use Q-Learning, an off-policy method, to show what this would look like.

In Q-Learning, it is common to use a type of policy called Epsilon-Greedy. With this policy, the agent’s actions start off completely random and slowly become greedier and more exploitative over time. This would be the behavior policy.
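
A minimal sketch of what epsilon-greedy action selection might look like (the Q-table, the action list, and the decay schedule here are simplified placeholders):

```python
import random

def epsilon_greedy_action(q_values, state, actions, epsilon):
    """Behavior policy: with probability epsilon explore randomly, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: q_values[(state, a)])       # exploit

# epsilon typically starts near 1.0 (fully random) and is decayed over training,
# e.g. epsilon = max(0.05, epsilon * 0.995) after every episode.
```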

Only, when we’re calculating the value of a state-action pair in Q-Learning, we assume that at each state, we’re choosing the action that maximizes the future reward we think we can get. Our update policy is a greedy policy.

In Q-Learning, our update policy assumes that we’re following a greedy policy, that we are choosing the actions which we believe will net us the most reward.

In actuality, we are acting semi-randomly, and not being greedy all the time. Our behavior policy is not completely greedy.

Because of this, our update policy (a greedy policy) is different from our behavior policy (an epsilon-greedy policy), making Q-Learning an off-policy method.
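
Here is a rough sketch of where that difference shows up in the update rule. The Q-Learning target bootstraps from the greedy (max) action in the next state, no matter what the epsilon-greedy behavior policy actually does next; an on-policy method such as SARSA would instead use the action the behavior policy actually took. The variable names and learning-rate values below are placeholders, not any particular library’s API.

```python
ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount factor (placeholder values)

def q_learning_update(q_values, s, a, reward, s_next, actions):
    """Off-policy: the target assumes a greedy update policy (max over next actions),
    even though actions are actually being chosen epsilon-greedily."""
    best_next = max(q_values[(s_next, a_next)] for a_next in actions)
    target = reward + GAMMA * best_next
    q_values[(s, a)] += ALPHA * (target - q_values[(s, a)])

def sarsa_update(q_values, s, a, reward, s_next, a_next):
    """On-policy (SARSA): the target uses a_next, the action the behavior policy
    actually took next, so the update policy and the behavior policy coincide."""
    target = reward + GAMMA * q_values[(s_next, a_next)]
    q_values[(s, a)] += ALPHA * (target - q_values[(s, a)])
```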

These distinctions are important to know, and I’m sure you will continue exploring the field of RL, equipped with this new knowledge and ready to forge ahead!

Btw, if you still feel unclear about this topic, shoot me a DM on twitter @jereminuer and I’ll be happy to explain more!
