Reinforcement Learning in the context of LLM

VJ Anand
9 min read · Jul 6, 2023


Introduction

Reinforcement Learning (RL) is a set of machine learning techniques that allows a system to incorporate stimuli from an external environment. RL is commonly used in video games and robotics, where the agent (a robot or a character in a video game) has to respond to events taking place in its environment. A typical RL system is characterized by an agent (actor), an environment, states, actions, and a reward. The state represents the current attribute(s) of the agent as it interacts with the environment. An action is the function that changes the agent's situation in the environment, which leads to the agent updating its state. The set of actions is pre-defined by the environment in which the agent is acting. States can be continuous or discrete; we will typically refer to discrete states. The reward is the external stimulus: it signals to the agent whether the action it performed was good or bad. Based on this signal the agent can learn and improve, and its goal is to maximize the cumulative reward.
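To make this loop concrete, here is a minimal Python sketch of the agent/environment interaction described above, using the Gymnasium API; the "agent" here just samples random actions, whereas a real agent would choose actions based on the state and learn from the reward signal.

```python
import gymnasium as gym

# One episode of the agent/environment loop: observe state, act, receive reward.
env = gym.make("CartPole-v1")
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()        # a real agent would pick an action based on the state
    state, reward, terminated, truncated, _ = env.step(action)  # environment returns reward and next state
    total_reward += reward                    # the agent's goal is to maximize this cumulative reward
    done = terminated or truncated
print("episode return:", total_reward)
```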

Brief overview of RL approaches

Value-based methods. In value-based methods we learn an optimal value function, which can be thought of as the optimal reward that can be obtained. Here the objective is to minimize the loss between predicted and target values in order to approximate the true action-value function. The policy is implicit: we are not learning an optimal policy directly, but one can be obtained from the learned values, for example with a greedy rule. At every state, pick the action with the highest estimated value, execute it to reach the next state, and repeat until the episode completes; the sequence of state-action pairs that results defines the policy.
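As a concrete illustration, here is a small sketch (tabular case) of extracting a greedy policy from a learned action-value table; the table values are made up for the example.

```python
import numpy as np

def greedy_policy(Q: np.ndarray) -> np.ndarray:
    """Given an action-value table Q[state, action], pick the best action in each state."""
    return np.argmax(Q, axis=1)

Q = np.array([[0.1, 0.9],   # state 0: action 1 has the higher estimated value
              [0.7, 0.2]])  # state 1: action 0 has the higher estimated value
print(greedy_policy(Q))     # -> [1 0]
```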

Policy-based methods. In policy-based methods we optimize a policy function directly; there is no value function used as a proxy. The idea is to parameterize the policy, for instance with a neural network πθ; the policy then outputs a probability distribution over actions (a stochastic policy).

Figure 1: A represents a vector of action probabilities in a given state, produced by a function that estimates the cumulative reward of taking each action in that state.
Figure 2: An example neural network that takes as input a state described by 5 features and outputs a probability distribution over actions.

Figure 2 shows what a parameterized stochastic policy would look like: a neural network that is trained to output the probabilities of the actions to take in a given state. The goal of training is to maximize returns by making good actions be sampled more frequently. A critical aspect of policy-gradient methods is that we don't have an explicit reward function; actions are rank-ordered by the expected cumulative reward the policy would generate if that action were chosen in the given state.
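Here is a minimal PyTorch sketch of the network in Figure 2; the hidden layer size and the number of actions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Stochastic policy: maps a state (5 features) to a probability distribution over actions."""
    def __init__(self, n_features: int = 5, n_actions: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),   # pi_theta(a | s)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

policy = PolicyNet()
probs = policy(torch.rand(5))     # action probabilities for one example state
```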

Policy methods also fall into two broad categories: online and offline. In the former, the agent learns as it interacts with the environment; in the latter, the agent learns from data gathered beforehand. For LLM fine-tuning we will use the offline approach: the algorithm performs offline policy updates, where we first collect these experiences (also called rollouts) and then update the policy.

Mathematical Interpretation

Mathematically, we can set up an objective function for finding the optimal policy as shown below.
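In the standard formulation, the objective is the expected return over trajectories sampled from the policy:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R(\tau) \,\big]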

Here a trajectory τ represents the sequence of states and actions generated by a given policy. The objective is to find the policy that maximizes the expected cumulative reward R(τ). The above equation can be further rewritten as
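a weighted sum over all possible trajectories:

J(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau)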

Note:

  • R(τ): the return from an arbitrary trajectory. To use this quantity to compute the expected return, we need to weight it by the probability of each possible trajectory.
  • P(τ;θ): the probability of each possible trajectory τ. This probability depends on θ, since θ defines the policy used to select the actions along the trajectory, which in turn affects the states visited.

Estimating this expectation directly requires knowing the distribution of trajectories in the environment, which is hard to estimate and therefore not easy to differentiate. Instead, we approximate it by sampling: with enough sampled trajectories, the estimate approaches the true expectation. The trick is to rewrite the objective in terms of the policy function, from which trajectories can be sampled easily. We will not go through the derivation in rigor, but the links in the references provide more insight. The resulting gradient can then be written as:
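In the standard REINFORCE form, approximated with m sampled trajectories:

\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)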

Here is high-level pseudocode for this:

Figure 3: High-level training loop for the policy-gradient method
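A rough Python rendering of this loop (a sketch, assuming a Gymnasium-style environment with discrete actions and the PolicyNet defined earlier) might look like:

```python
import torch
import torch.optim as optim
from torch.distributions import Categorical

def run_episode(env, policy):
    """Roll out one episode; return the log-probabilities of chosen actions and the rewards."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()                          # sample an action from pi_theta(. | s)
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    return log_probs, rewards

def train(env, policy, episodes=1000, gamma=0.99, lr=1e-3):
    opt = optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = run_episode(env, policy)
        # Discounted return G_t at each step, computed from the last step back to the first.
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # Gradient ascent on E[log pi(a|s) * G], i.e. descent on its negative.
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
```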

Proximal Policy Optimization

A drawback of policy-gradient methods is their high variance: large policy updates can occur, making it hard for the system to converge to a stable state. Proximal Policy Optimization (PPO) was developed to address this. The idea is to clip the policy updates to a range, allowing the system to settle down and converge. The equation below is the updated objective function. The important component of this modified objective, apart from the clipping function, is the ratio: the reward term is now weighted by the ratio between the probability of taking the action in the current state under the new policy versus the old policy. This ratio is kept within an acceptable range enforced by the clipping function, and the clipping threshold epsilon is a hyperparameter.
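For reference, the clipped surrogate objective from the PPO paper has the form below, where r_t(θ) is the probability ratio described above, Â_t is the advantage estimate, and ε is the clipping threshold:

L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}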

Aligning LLM with Reinforcement Learning

Given the above background, we will now show how RL is used to align LLMs using human feedback.

Given the large size and the complexity of training large language models from scratch, one would like to fine-tune them for downstream tasks instead of training a model for each such task. For example, fine-tuning can be used to ensure generated content is not toxic or hateful towards any community. How can one achieve this without incurring the cost and complexity of training these LLMs from scratch? In this blog we will restrict our discussion to using RL; other techniques such as LoRA (low-rank adaptation of the weight matrices) can be used in combination to improve efficiency and lower cost. Furthermore, one can train many such “adapters” that can be dynamically swapped in to serve different tasks.

Applying RL to LLMs is not straightforward: it is not immediately intuitive how policy, action, reward and environment map onto language modeling. Moreover, what is the reward in this case? Generally, the application of RL to LLMs is task specific. For example, one would align an LLM to ensure the text it generates is not toxic or racist, if one is fine-tuning for such a purpose. This task-based alignment can be developed using human feedback, which can be treated as a reward and used to fine-tune the LLM.

Let’s first formulate this fine-tuning task as an RL problem. The policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens in the vocabulary of the language model (often on the order of 50k tokens), and the observation space (environment) is the distribution of possible input token sequences, which is very large compared with previous uses of RL (its size is roughly the vocabulary size raised to the power of the input sequence length).
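To make the mapping concrete, here is a small sketch that treats a causal language model as a stochastic policy: the state is the prompt (plus any tokens generated so far) and an action is the next token sampled from the model's output distribution. The model name and prompt are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The movie was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # the "state"

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]        # scores over the whole vocabulary (~50k actions)
probs = torch.softmax(logits, dim=-1)                 # pi(action | state)
next_token = torch.multinomial(probs, num_samples=1)  # sample one "action"
print(tokenizer.decode(next_token[0]))
```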

There are various architectures for doing this; here is a quick summary of the most popular one, which has been reported to be used by ChatGPT. A base LLM such as GPT-3 or GPT-4 (or another variant) is first pre-trained. The second step is to fine-tune this base model for the target task using a labelled dataset; this fine-tuning is supervised. In the case of ChatGPT the supervised fine-tuned model is called InstructGPT, and the task can be summarization, Q&A, or whatever the specific need is. This fine-tuned model (InstructGPT) is the reference model; a copy of it is used for policy training and another for reward training. In some cases the reward model is based on a different LLM, but it has to be trained on the same datasets (both pre-training and fine-tuning).

For training the reward model, a copy of the InstructGPT model is used, with the model's head replaced by a classification head (an MLP with two outputs), turning it into a classifier. A dataset of prompts, completions and human feedback is collected: each prompt is passed through the reference model, and its completions are ranked/labeled by humans. This dataset is then used to train the reward model in a supervised setting. A high-level sequence of these steps is illustrated below.
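As an illustration, here is a sketch of the pairwise ranking loss commonly used for this step; it assumes a hypothetical `reward_model` that returns a single scalar score per prompt/completion pair, which is a simplification of the two-output head described above.

```python
import torch.nn.functional as F

def reward_loss(reward_model, prompt_ids, preferred_ids, non_preferred_ids):
    """Push the score of the human-preferred completion above the non-preferred one."""
    r_preferred = reward_model(prompt_ids, preferred_ids)          # scalar score per example
    r_non_preferred = reward_model(prompt_ids, non_preferred_ids)
    # Pairwise loss: -log sigmoid(r_preferred - r_non_preferred)
    return -F.logsigmoid(r_preferred - r_non_preferred).mean()
```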

Finally, a copy of the reference model (called the active model) is fine-tuned, or aligned. This is the final step, shown as step 3. Prompts are assembled and passed through the active model; its completions are then scored by the reward model to generate rewards, and these rewards are used to optimize the active model.
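A high-level sketch of that loop is below; `active_model`, `ref_model`, `reward_model` and the helper functions are placeholder names for illustration, not a real API, and the penalty against the reference model anticipates the trust-region constraint discussed in the next section.

```python
# Illustrative pseudocode for the RLHF fine-tuning loop (step 3).
for prompts in prompt_batches:
    # 1. Rollout: the active model generates completions for a batch of prompts.
    completions, log_probs = active_model.generate_with_logprobs(prompts)

    # 2. Score: the reward model scores each (prompt, completion) pair.
    rewards = reward_model.score(prompts, completions)

    # 3. Constrain: penalize drift away from the reference model so the aligned
    #    model does not move too far from the original distribution.
    ref_log_probs = ref_model.logprobs(prompts, completions)
    rewards = rewards - kl_coeff * (log_probs - ref_log_probs)

    # 4. Update: run a PPO step on the active model using these rewards.
    ppo_update(active_model, prompts, completions, rewards)
```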

Mathematical Interpretation for updating the Policy

What we have shown so far is how to structure and engineer RLHF for aligning an LLM. In this short section I want to touch upon the various loss functions and how to interpret them.

Equ. (1) (L^policy) is the loss function that combines the reward with a constraint on the policy shift of the active (preferred) model. The ratio in Equ. (1) ensures that policy updates to the active model stay within the trust region, i.e. they don't deviate too much from the reference model. Note that in Equ. (1), Ā is the advantage estimate of taking an action in the given state. The algorithm used to estimate this quantity is known as Generalized Advantage Estimation (GAE): it walks backwards from the last time-step to the first, computing the quantity at each step for the action that was taken (the full method is not covered here; the references include some excellent resources). In the case of language models, once we have a completion we can unroll from the last generated token back to the first and compute this quantity.
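A minimal sketch of that backward pass, assuming per-token rewards and value estimates are available for one completion, with the usual γ and λ hyperparameters:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, computed from the last token back to the first."""
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]    # TD error at step t
        next_advantage = delta + gamma * lam * next_advantage  # discounted sum of TD errors
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```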

Another observed challenge is that the agent/policy tends to keep choosing the same action/token to maximize the reward. To mitigate this, a regularization term is added, shown in Equ. (3): it computes the entropy over the probability distribution of the tokens at a given state. The entropy acts as a regularizer that encourages exploration; its mean over the tokens/actions in the batch is added to the loss so that the model explores other actions/tokens in a given state.
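A standard form of this term is the entropy of the action/token distribution at a state s, averaged over the states in the batch:

H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_{a} \pi_\theta(a \mid s)\, \log \pi_\theta(a \mid s)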

Finally, Equ. (2) is the loss function for training the reward model; the subscripts preferred and non-preferred refer to the human-labeled completions. In Equ. (4) the policy loss is combined with the output of the reward model and the entropy regularization to train the active LLM model.

References

Fine-tuning Language Models from Human Preferences

Illustrating Reinforcement Learning from Human Feedback
