Preparing RL for Alignment in Large Language Models: Experience of the YandexGPT Team

Pavel Temirchev
Yandex
Jul 30, 2024

Recently, a new YandexGPT 3 Lite model has become available via API. One of the key phases of its training, as in the case of other recent models, was the alignment phase, which also includes the reinforcement learning (RL) stage. Without this phase, we wouldn’t have achieved enough quality growth to launch new features and services (for example, Neuro). That’s why we decided to write a whole article about the specifics of model alignment.

There are already lots of articles about alignment and RL. Every ML engineer has probably heard about or come across them, one way or another. For this reason, although we will recall some basic information, our focus will be on those implementation details that are not widely known.

Short Intro

Today we will discuss one of the phases in training assistant language models. This phase is called alignment and follows the pre-training phase.

There will be virtually no mention of neural network architectures, because all the neural networks in this article are transformers. All we need to know is that language models receive text in the form of tokens as input and return the probability of encountering a specific token next. If you don’t know what a “token” is, just assume that tokens are words. Simply put, the neural network examines the input text and returns the probability of each word appearing next.

In the pre-training phase, the language model gathers knowledge from a vast amount of texts available on the internet. It learns the fundamental rules for constructing sentences and acquires general information that can be found online. After this phase, the model can continue text in the way it learned from the training dataset.

If we want to create an assistant language model, we can’t use the pre-trained model as is. Although the model contains knowledge of the entire internet, it won’t be able to answer a user query — instead, the model will attempt to continue this query.

The process of transforming a simply smart model into an assistant model is called alignment. We will try to “align” the model’s answers with our human expectations. The Anthropic team gave a good description of the properties that the answers of such a model must have. We’re talking about the three H’s:

● Helpful — the answer must solve the user’s problem.

● Harmless — the answer must not harm the user.

● Honest — the answer must be truthful (factually correct).

Typically, we achieve this behavior in two stages of alignment:

1. Supervised learning that uses datasets collected by people.

2. Training the model further by means of reinforcement learning (RLHF), which not only teaches the model a certain behavior but also maximizes the user’s satisfaction from communicating with the model.

In the article, we’ll discuss these two stages in detail.

A Few More Words About SFT

At the supervised fine-tuning (SFT) stage, we fine-tune the pre-trained model using “request-response” pairs. This helps the model transform into an assistant. The main challenges and nuances of this stage lie in gathering a high-quality dataset for training. For the SFT stage, we need a diverse sample of requests that might come from users, as well as correct responses to each of these requests.

The situation with requests (queries) is somewhat simpler than with responses (answers): a human can just write them or we can collect them from the internet in a semi-automatic mode. We used internal contests to collect some requests for the first phases of YandexGPT training, asking enthusiasts to figure out how and where to collect a large number of possible requests (or even create them from scratch). The results were pretty good.

As for correct answers, the process here is much more mundane — we have humans label them. For this, we need a comprehensive statement outlining the criteria for a helpful, harmless, and honest answer. Next, we categorize requests and assign the queries from each category to experienced specialists who can provide a correct answer.

Then we train the pre-trained model on the dataset obtained this way and get the so-called SFT model. To help the model understand where the user’s query stops and the desired response starts, we use a special token to separate the query from the response. At inference time, the model receives the query followed by the special token — this way, it understands that it should generate a response starting from the next token.

We need to say a couple of words about how we choose a model. Normally, we opt for the best model based on the target metric we use for validation. When it comes to aligning models, it’s challenging to come up with an automated metric because most of them are unrepresentative. That is why the choice of a model is also typically made by humans. On a not too large holdout dataset of requests, we make the trained model and the baseline model generate their responses. Then people decide which answer was better and determine the so-called side-by-side (sbs) superiority. Since the datasets we use for comparison are not so large, we need to ensure that the increase is statistically significant. For example, we can use the binomial test.
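For example, a minimal sketch of such a significance check in Python (the win/loss counts here are hypothetical, and ties are simply discarded):

# Check whether the new model's sbs win rate is significantly above 50%.
from scipy.stats import binomtest

wins, losses = 130, 98  # hypothetical counts of "new model better" vs "baseline better"
test = binomtest(wins, n=wins + losses, p=0.5, alternative="greater")
print(f"win rate: {wins / (wins + losses):.1%}, p-value: {test.pvalue:.4f}")
# Adopt the new model only if the p-value is below the chosen significance level (e.g., 0.05).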

Answers must be written and evaluated by experts, because it’s really hard to distinguish nuances in the model’s answers on complex topics. Here we get help from AI trainers who are experts in specific fields and can produce a high-quality text.

It is believed that during the alignment phase, SFT in particular, the language model doesn’t acquire new knowledge — it simply learns to utilize the existing knowledge, searching through the pool of previously learned information and conveying it accurately. Studies have shown that using a small dataset with high-quality labeling at this stage leads to better results compared to using large datasets of questionable quality (LIMA: Less Is More for Alignment). Moreover, a study by Google Research shows that incorporating new knowledge into a model during the SFT stage may result in hallucinations — these are situations where the model confidently produces statements that are completely inaccurate but often sound quite plausible (Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?).

What’s next? Usually, after the SFT stage, the model is smart and aligned enough to serve as an assistant. But there’s room for improvement: the data used for SFT training is always limited, especially if we remove “dirty” data.

It turns out that we can align the model with our expectations even better without labeling — that is, without having people write answers. This stage is called RL from human feedback.

More About RLHF

Figure 1: A robot interacts with a human. The robot receives a request from the human, sends a response, and receives feedback

There is an area of machine learning called reinforcement learning (RL). It’s somewhat similar to supervised learning, but instead of learning from the right answer, the model learns from a punishment or reward it receives.

You can think of reinforcement learning as training dogs. A dog receives a command from its trainer, but it doesn’t get direct instructions on how exactly to move its paws to perform the command. Instead, the dog explores what it could do, and if its attempt is successful, it receives reinforcement — a treat.

Formally, RL experts often talk about Markov decision processes (MDPs), which describe how a student (agent) interacts with an environment that provides the training signal. In an MDP, the agent observes the current state of the environment s and selects an action a based on its behavior strategy π(a|s), called the policy. The environment reacts to the action by changing its state to s’ and returning a reward r to the agent, which serves as the training signal in the MDP. The Markov property here means that if the agent knows the current state of the environment, it doesn’t need the interaction history to choose an appropriate action.

We can also examine the interaction between a user and an AI assistant in the context of the Markov decision process. The environment state s represents the current text in a conversation that starts with a user query (at the beginning of the interaction, this state is simply the user’s initial query). The agent’s action is, of course, the response from the language model. What’s interesting in this scenario is the reward the agent gets — we will discuss this in more detail in the next chapter.

For simplicity, the alignment task usually focuses on a single interaction step: the agent observes the user’s query (or a dialogue between the user and the AI that ends with the user’s latest message) and attempts to answer in a way that achieves the greatest reward.

There are many RL algorithms that differ in certain features. Later in this article, we’ll explore the three algorithms that are best suited for solving the alignment task.

While supervised learning is more like “pouring” expert knowledge into the model, at the reinforcement learning stage, the model learns from its own experience in generating texts. At this stage, the model can “re-learn” the statements introduced during the SFT stage in a more favorable way. The effect of hallucinations that can happen during SFT is reduced at the RL stage, because the model quickly understands that the answer “I don’t know” yields a greater reward compared to an incorrect answer.

Who Does the Evaluation?

If you have an ideal reward function, the RL formalism allows you to find optimal generative models without the need for direct labeling (that is, the answers written by people). All you need is a pool of good queries. However, we would be lying if we said that RL doesn’t need any labeling — after all, we don’t have a perfect reward model.

Where do we get this reward in the first place? Once again, the first thing that comes to mind is to make people label data, but this time in a less harsh way. Let’s take a large dataset with requests and ask our SFT model from the previous stage to generate an answer for each of them. Then we will ask people to rate the answers on a scale from one to five, depending on how smart they find the answer. After that, we’ll take the same SFT model and replace its “head” with an untrained linear layer. The neural network is designed in such a way that its last layer returns a probability distribution over tokens. We replace it with a regular linear layer that returns a single number. This will be our reward model. We’ll feed it a query + answer and then train it so that its output matches the average rating that people give to such an answer.
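A minimal PyTorch sketch of such a head replacement (the backbone interface here is an assumption — any transformer that returns per-token hidden states will do):

import torch.nn as nn

class RewardModel(nn.Module):
    """An SFT transformer body whose LM head is replaced by a scalar 'reward head'."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # transformer without the LM head
        self.reward_head = nn.Linear(hidden_size, 1)  # untrained linear layer returning one number

    def forward(self, input_ids, attention_mask):
        # hidden: (batch, seq_len, hidden_size) representations of "query + answer"
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]                 # assumes the answer's final token sits at the last position
        return self.reward_head(last_token).squeeze(-1)  # one reward per sequence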

Despite its straightforwardness, this approach is not perfect:

- AI trainers often find it challenging to choose the correct grade out of five options.

- At the design stage, it’s hard to define what constitutes “excellent”, “good”, and so on in the rating system. If you make a mistake, it might happen that almost all of the grades are, for example, good. Then the model won’t be able to learn what exactly makes these answers good, and the labeling will have to be done again, which is costly.

Although this approach is used in the industry, many prefer a more complex modification, which simplifies labeling for AI trainers: instead of grading an answer, it’s much easier to compare two answers and choose the best one.

Here’s what we’ll do: for each query in a sample, we generate two answers and ask AI trainers to mark a better answer in each pair. This approach makes labeling easier and helps us avoid the problem of the model not knowing which answer is better. If the answers are equally good, the trainer can also specify this in their labeling.

If you intend to do labeling for your own LLM assistant, you should be aware of the baby duck syndrome. Just like ducklings consider the first suitable object they see to be their mother, AI trainers also tend to prefer the first answer they read, all other factors being equal. Make sure to randomly shuffle the order of answers before showing them to AI trainers — otherwise, this position bias will leak into your labels and lead you astray.

Figure 2: An AI trainer selects the response. There are three boxes: the first box contains a request, the second contains a wrong answer and is highlighted red, and the third box contains the right answer and is highlighted green

Here’s a more complex question: how can we train the reward model with such a dataset? These two options are the most popular.

Option 1. The Bradley–Terry model

We will train the model that returns a single number (reward) for the query + answer, but this time in a new way. Now we don’t have any explicit labeling that indicates which reward corresponds to which answer. But we can instruct the model to assign a higher score to a good answer and a lower score to a worse answer. To do this, we introduce a probabilistic model where the probability (the confidence of the model) that the answer a is better than the answer b is represented by the following formula:

P(a > b | s) = σ(r_ψ(s, a) − r_ψ(s, b))

Here, s is the query being answered, r is the neural reward model we are training with parameters ψ, and σ is the sigmoid function, which maps the unbounded reward difference into the interval (0, 1). This is necessary because a probability must lie in this interval:

Figure 3: The graph of the sigmoid function σ: the horizontal axis (abscissa) shows the reward difference r_ψ(s, a) − r_ψ(s, b), and the vertical axis (ordinate) shows the sigmoid value P(a > b).

This way, the model’s confidence that the answer a is better than the answer b increases in proportion to the disparity in the reward it assigns to these two answers.

The model is trained using maximum likelihood estimation: we select the parameters ψ so that the probability of observing the dataset collected by AI trainers is maximized. To avoid numerical problems, we work with the logarithm of the likelihood, which leads us to the following optimization problem:

max_ψ E_{(s, a_good, a_bad)} log σ(r_ψ(s, a_good) − r_ψ(s, a_bad)),

where the averaging is over the labeled comparison pairs: s is the query, a_good is the preferred answer, and a_bad is the rejected one.

In other words, the training is reduced to basic supervised learning but with a specific loss function.
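In PyTorch-like code, the loss from the formula above boils down to a negative log-sigmoid of the reward difference. A sketch, assuming the scalar-head reward model from the previous section:

import torch.nn.functional as F

def bradley_terry_loss(reward_model, good_batch, bad_batch):
    # good_batch / bad_batch: tokenized "query + better answer" and "query + worse answer"
    r_good = reward_model(**good_batch)  # shape: (batch,)
    r_bad = reward_model(**bad_batch)
    # maximizing log σ(r_good − r_bad) is the same as minimizing −logsigmoid(r_good − r_bad)
    return -F.logsigmoid(r_good - r_bad).mean()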

During our experiments, we came up with a pipeline where the reward model learns in two stages:

  1. First, we train the model on “dirty” data collected in a semi-automatic mode.
  2. Then we train the reward further based on the clean data obtained from AI trainers.

The second stage must be pretty clear by now, but the first one still raises questions regarding which data we could use for it.

It turns out that you can find quite a lot of pairs in the form of (query) + (better answer, worse answer) on the internet for free.

Many online services allow you to rank user answers to queries. For instance, Medium lets users rate comments, which gives us the option to literally pick a (good comment, bad comment) pair and use it as an answer to the (query), which is the post itself. The same principle applies to, let’s say, Stack Overflow.

Another thing you can do is use problem books that give you a list of possible answers to choose from, and the correct answer is known in advance. This method also makes it easy to collect a sample with two answers to a query, where one answer is better than the other.

With a little thought, you can parse a large sample of low-quality comparisons and use it at the first stage of training the reward model.

Sometimes you may run into the problem of having a very large dataset of pairwise labels where there are many trivial examples, and the reward model can easily determine which answer is better. This can negatively impact the model quality, because during training, it’ll tend to focus on the numerous simple pairs and won’t do as well with the more difficult pairs. To avoid this, we can employ data filtering. We train the reward model using all the available data and then exclude the “easy” pairs from the dataset (those where the model is most confident), keeping the most difficult ones. We then use this filtered dataset to train a new reward model. This approach requires caution: if you use a dataset that is too small or not representative, this can have a negative impact on the reward quality. Moreover, besides the difficult examples, the filtered sample may also include extreme outliers.
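A rough sketch of such filtering, assuming the reward model and tokenized pairs from before (the 0.9 confidence threshold is an arbitrary example):

import torch

@torch.no_grad()
def keep_hard_pairs(reward_model, pairs, confidence_threshold=0.9):
    """Keep only the pairs where the first-stage reward model isn't already very confident."""
    hard_pairs = []
    for good_batch, bad_batch in pairs:
        p_good = torch.sigmoid(reward_model(**good_batch) - reward_model(**bad_batch))
        if p_good.item() < confidence_threshold:  # "easy" pairs with higher confidence are dropped
            hard_pairs.append((good_batch, bad_batch))
    return hard_pairs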

Option 2. A model that digests two answers at once

The Bradley–Terry model introduces certain limitations on what kind of reward can be trained. One of the main limitations is the assumption that the reward is transitive. Simply put, there are no preference cycles in the Bradley–Terry model: for answer evaluations, we can’t get a > b, b > c, but c > a.

Allowing the reward model to form preference cycles may seem ridiculous, but in fact, social choice theory states that this is exactly how people make choices — we do this in a non-transitive way.

Let’s assume there is a group of AI trainers and a set of different answers to a query. Each AI trainer can rank the answers according to their preference (for example, a > b > c). For each individual AI trainer, transitivity is observed. After we aggregate the answers from different trainers, we may get a cycle of a > b, b > c but c > a. This phenomenon, called the Condorcet paradox, is often discussed in the context of democratic elections.

In fact, it isn’t difficult to introduce some flexibility into the model, allowing it to reproduce intransitivities that may occur in the choices made by AI trainers. What you need to do here is train the neural network r(s, a, b), which would accept not one but two answers along with the query and return the model confidence that a > b. To train such a model, we already have the appropriate sample where better answers are labeled. We use it to train the model to solve a standard classification problem.

This model, however, has a significant drawback: you can only use it for comparisons, and it doesn’t return the absolute value of the reward. Also, when training a model like this, you need to somehow handle the asymmetry problem: in general, r(s, a, b) ≠ 1 − r(s, b, a).

In addition, this reward model is simply more expensive to infer, since it is fed two answers at once instead of one.

When developing YandexGPT, we experimented with a potentially intransitive reward model. However, we couldn’t prove that it was better than the Bradley–Terry model: they just worked equally well. Because the intransitive model was too difficult to use, we settled on the classical Bradley–Terry model.

What (Else) is the Reward Model Used For?

Finally, we have a measure of success — the quality estimator for the generative model answers. You surely have the general idea of how the model can be used, and it’s trivial: just use it to run reinforcement learning to fine-tune the SFT model and thus maximize the reward. And we’ll definitely discuss this approach later.

However, there are other applications of the reward model besides reinforcement learning. Let’s say you have a reward model and an SFT model. You have collected new data and plan to retrain the SFT model on the new dataset. During the training, it’s a good idea to take the reward into account in addition to the validation loss charts. To do this, during the validation process, you need to use the model that is being trained to generate answers to the queries from the validation dataset and calculate the average reward for these answers.

It often turns out that early stopping on the reward occurs at a different iteration compared to the early stopping on the loss. We use this simple trick in YandexGPT and consistently get higher-quality SFT models.
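A minimal sketch of such reward-based validation (generate_answer is a hypothetical helper that samples a completion from the model being trained):

import torch

@torch.no_grad()
def average_validation_reward(policy, reward_model, tokenizer, val_queries):
    """Average reward of the model's own answers on a held-out set of queries."""
    total_reward = 0.0
    for query in val_queries:
        answer = generate_answer(policy, tokenizer, query)  # hypothetical helper
        batch = tokenizer(query + answer, return_tensors="pt")
        total_reward += reward_model(**batch).item()
    return total_reward / len(val_queries)
# Track this metric alongside the validation loss and keep the checkpoint with the higher reward.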

Here and later in the article, we’ll talk more about generating answers for the purpose of validation and training. Speed-efficient generation is the key to success. There are a number of libraries that allow you to compile your model into a format suitable for fast execution — for example, TensorRT, which is supported in Torch. At Yandex, we use our own framework for fast execution of models.

Converting a model to a “fast” format is not always worth the effort. For example, if you plan to generate texts at each validation iteration, the conversion may take even more time than the inefficient generation, since you’ll have to convert the model again each time. Here you can use the following rule: if the model weights change rarely or not at all, you can convert. If the weights change frequently, it’s better to perform inference using the standard Torch. Because of this, we do not convert the model at the training stage when it’s generating answers for validation. However, we do keep the reward model in a “fast” format (because it does not change).

“Poor Man’s RL”: The Cross-Entropy Method

Finally, we can talk about the main application for the reward model — training the generative model that would collect a lot of reward.

Let’s start with the simplest algorithm, which our team refers to as “poor man’s reinforcement learning”. The approach is very easy to implement. To do it, we need a sample of relevant queries:

- The SFT model generates N answers for each query.

- Then we choose the best answer based on the reward for each query.

- After that, we train the SFT model further based on the obtained query + best answer pairs.

Despite its simplicity, this method proves to be really helpful. Here, it makes sense to use the generative model and the reward model, both converted to the “fast” format, for generating and evaluating answers.
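A minimal sketch of one iteration of this procedure (generate_answers and the reward model call are assumed helpers wrapping the “fast” inference format):

def best_of_n_dataset(policy, reward_model, queries, n=16):
    """Build (query, best answer) pairs for the 'poor man's RL' fine-tuning step."""
    dataset = []
    for query in queries:
        answers = generate_answers(policy, query, num_samples=n)  # assumed helper: sample N answers
        rewards = [reward_model(query, answer) for answer in answers]
        best_answer = answers[max(range(n), key=lambda i: rewards[i])]
        dataset.append((query, best_answer))
    return dataset
# The SFT model is then fine-tuned further on `dataset` with the usual supervised objective.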

Among RL specialists, this approach is more commonly known as the cross-entropy method (CEM or CE-RL). Applying CEM to RL is actually a specific case. CEM is a general optimization algorithm that belongs to the family of genetic algorithms. In a nutshell, CEM can be described as follows: we generate many points, select the best ones among them, and then move the point generator towards those best points.

It’s important to understand that CEM only works effectively for models that produce varied answers. If the same answer is given every time, picking the best one is pointless.

In theory, you can do several consecutive iterations of this approach to further improve the model. In practice, however, it wouldn’t work very well, because after the first iteration, the model loses much of the original variation in responses, so selecting the best answer no longer helps improve the model.

The idea of distilling good answers and feeding them back into the model is actually quite general. However, smart answers don’t need to be obtained by randomly generating a variety of answers. You can get good answers by using the Chain-of-Thought method, where the model generates the answer in several guided steps by “saying its thoughts out loud”. Alternatively, you can bring out the big guns and use the Monte Carlo tree search (MCTS) for a direct search through the tree of answers with a high reward.

Almost every RL course will tell you there is no overfitting in RL. This means that if an agent has learned to get many rewards for a task, it doesn’t need to be validated in a special way, because it already solves the reward maximization problem. However, this rule doesn’t work in language modeling, because the agent sees only a subset of possible queries. That’s why it is crucial to use a holdout dataset of queries to validate a language model trained with RL.

“Rich Man’s RL”: Proximal Policy Optimization

Moving from simple to complex: although the “poor man’s RL” is easy to implement, it’s quite challenging to make it find an optimal model in terms of reward. There are two main reasons for this:

- It takes a lot of time and resources to generate N hypotheses for each query in a relatively large sample.

- If you don’t have a clever control method, the model quickly loses the ability to produce varied answers. As a result, subsequent iterations of the method do not lead to improvements.

This is where more theoretically advanced reinforcement learning algorithms come in useful. For the alignment task, using algorithms from the policy gradient family is common practice.

The idea is simple: we want to maximize the average reward that the agent earns. Let’s calculate the gradient of this average reward and change the model weights in the direction of this vector — in other words, use the most basic gradient ascent method (the method is similar to gradient descent but is used for maximizing functions, not minimizing them).

Formally, the average reward for the agent can be expressed as follows:

J(θ) = E_{s ∼ 𝒟} E_{a ∼ π_θ(a|s)} [r(s, a)]

You’re right, E here stands for the expected value. But don’t worry! For the moment, we can just say it is used for the simple averaging of a large (infinite) number of examples. For instance, E with the subscript s ∼ 𝒟 is the averaging over a large number of s queries from the 𝒟 sample, and E with the subscript a ∼ π(a| s) is the averaging over a large number of answers to the s query. Here, π refers to the generative language model that is being trained and has the θ parameters. Consequently, π(a | s) is the probability of giving the answer a to the query s. These are exactly the probabilities that language models produce.

The tricky part of this method is to find the derivative of the whole thing with respect to the parameters θ. The θ sits under the expected value here, and that isn’t nice. But there are (almost) no unsolvable problems, and so with a bit of effort, we can get a nicer formula for the derivative:

∇_θ J(θ) = E_{s ∼ 𝒟} E_{a ∼ π_θ(a|s)} [∇_θ log π_θ(a|s) · r(s, a)]

*More about the gradient formula: Click here to see the appendix

In general, from this formula alone, we can figure out a working algorithm for maximizing the average reward. But let’s take a moment to consider what this formula tells us. If you examine it carefully, you can see that this gradient implies a change in the probability of the answer a in proportion to its reward r(s, a). If the reward is negative, we decrease the probability. If it’s positive, we increase it. And if the reward is very positive, we increase the probability significantly.

As long as we use expected values (that is, the averages over an infinite number of examples), everything works as expected. But in practice, we’ll inevitably move to averaging over a finite, small number of answers produced by the model. And that’s where problems may arise — for example, the reward model may learn to assign positive scores to all responses. If the model assigns small positive values to bad examples and larger values to good ones, this gradient pushes us to increase the probabilities for all answers generated by the model.

To solve this problem, a modified formula is commonly used (which is also theoretically justified):

∇_θ J(θ) = E_{s ∼ 𝒟} E_{a ∼ π_θ(a|s)} [∇_θ log π_θ(a|s) · (r(s, a) − V(s))]

Here, V(s) = E_{a ∼ π(a|s)} r(s, a) is the average reward the agent receives when generating answers for the query s (the so-called value function). Typically, we train another neural network for this: it receives the query and returns a single number. In other words, the idea is that we increase the probability of only those answers that are better than average. This neural network learns to minimize the mean squared deviation of its outputs from the reward for the answer:

L_V(ϕ) = E_{s ∼ 𝒟} E_{a ∼ π(a|s)} (V_ϕ(s) − r(s, a))²

Thanks to the mean squared deviation, the model learns exactly the average we need it to predict. In this case, we don’t get any gradient issues, because the distribution under the expectation doesn’t depend on the parameters ϕ.

So what do we get in the end? Let’s put this in the form of pseudo-code:

Algorithm 1

1. Input: multiple queries for training D

2. Initialize the policy with SFT model: pi <- SFT

3. Initialize the value V with the reward model: V <- RM

4. Repeat until convergence occurs:

4.1. Select a batch of queries B from D

4.2. Calculate the value V(s) for all s in B

4.3. Generate one answer a for each query s from B
It's important that they're generated by the model being trained — pi

4.4. Calculate the reward r(s, a) for all pairs (s, a)

4.5. Calculate the loss L_a for the agent

4.6. Calculate the loss L_v for the value function V

4.7. (L_a + L_v).backward()

4.8. optimizer.step()

The algorithm above is called Advantage Actor Critic (A2C), and it already works pretty well for maximizing the reward. Before we delve into the details, let’s take a look at its main drawback: training time. The algorithm requires you to generate an answer to each batch element using the model with the current version of the weights. You can’t efficiently generate answers for the entire sample in advance, as we would in the “poor man’s RL” method. Here, you may need to generate answers using the standard Torch model, which is less efficient.
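Steps 4.5–4.6 amount to two simple loss terms. A rough sketch, assuming log π(a|s) of each generated answer, its reward, and V(s) have already been computed for the batch:

def a2c_losses(logprob_answer, reward, value):
    # logprob_answer: log π(a|s) summed over the answer's tokens, shape (batch,)
    # reward: r(s, a) from the reward model; value: V(s) from the value network
    advantage = (reward - value).detach()                 # don't backprop into V through the agent loss
    loss_agent = -(logprob_answer * advantage).mean()     # push up answers that beat the average
    loss_value = (value - reward.detach()).pow(2).mean()  # mean squared deviation for the V-function
    return loss_agent, loss_value
# (loss_agent + loss_value).backward(); optimizer.step()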

If you’ve come across alignment before, you’ve probably heard of a method called PPO (Proximal Policy Optimization). Essentially, it is A2C with a few workaround techniques, such as importance sampling and gradient clipping — they allow performing more than one step of the optimizer on the same answers. That’s probably all you need to know about PPO. If you want to learn more, consider taking the reinforcement learning course offered by the Yandex School of Data Analysis (course materials on GitHub).

PPO partially solves one of the problems of A2C — the need to generate a large number of texts. However, it doesn’t really address other issues and also introduces new ones:

- You need to store three models in memory: policy, value, and reward.

- You still need to write an efficient generation code with Torch.

- PPO has many hyperparameters, and it’s quite sensitive to them — you’ll have to perform grid searches for hyperparameters.

Some books recommend that you don’t create a separate model for the V-function but instead use a two-headed model. In this model, two heads protrude from the same body: the policy and the V-head. With YandexGPT, we noticed that this architecture is very hard to train, because policy and value losses need to be mixed with some coefficient that’s difficult to find, and we would also like to change it during the training process. If the chosen coefficient is incorrect, it can lead to the V-model either not learning or consuming all the optimizer resources, which means the agent doesn’t turn out the way we expect it to be.

There are some studies that propose completely abandoning the V-model and using a pre-calculated constant instead. We experimented with this approach in YandexGPT and found that it works, but not 100% of the time. In 50% of launches, if the hyperparameters weren’t set up perfectly, the training failed terribly. So while getting rid of the V-model accelerates training, this approach is generally unstable and therefore unreliable.

If you want one killer feature to make your PPO work, it’s probably advantage normalization. In reinforcement learning, the advantage is the difference r(s, a) − V(s) between the reward for an answer and the average reward, which is used in training the agent. If you normalize it to zero mean and unit variance within each batch, you’ll lose the theoretical guarantees of the algorithm on the one hand, but on the other, you’ll get much more stable convergence. So we recommend doing it.
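A one-liner sketch of this normalization over a batch of advantages (assuming a tensor input):

def normalize_advantage(advantage, eps=1e-8):
    # Zero-mean, unit-variance normalization of the advantages within a batch
    return (advantage - advantage.mean()) / (advantage.std() + eps)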

Domain Shift aka Goodhart’s Law

So you’ve learned to maximize the reward. But is it always a good thing? It seems like this is what we’ve been aiming for — good answers that get lots of reward. Though unfortunately, a high reward doesn’t always mean the answer itself is good, because the reward is obtained from the neural network that’s been trained on a finite dataset, and it has its weaknesses.

The situation becomes worse when you realize that RL modifies the generator’s answers exactly in the direction where the reward model performs poorly. See this effect visualized in the illustration below.

Imagine that your agent is standing at the foot of a mountain and receives a reward that is proportional to its y coordinate — the higher, the better. The agent observes its environment and identifies the pattern for increasing the reward: if you move to the right in the image, the reward gets greater. Next, the algorithm for finding the optimal policy (RL) is launched. Naturally, this algorithm suggests you need to move right at all times — that’s the optimal choice according to the reward evaluation. Sooner or later, the agent following this reward function will fall off the cliff.

Figure 4: Illustration of Goodhart’s law in RL: a coordinate grid with an x-axis and a y-axis is shown, depicting a mountain with a gradual rise on the left and a cliff on the right. A stylized robot is at the foot of the mountain on the left

The problem arises because optimizing the metric leads us away from the domain where this metric works well. The effect we observe is called Goodhart’s law. It states: “When a measure becomes a target, it ceases to be a good measure”. The principle applies to companies’ KPIs, macroeconomic indicators of entire states, and, of course, the proxy reward in our alignment problem.

Developing YandexGPT, we noticed a fascinating phenomenon at one of the training stages: the reward model preferred a bad answer formatted beautifully to a good answer without any formatting. This happened because the reward model’s dataset included examples where a nicely formatted answer was rated better than an answer without formatting, and there were no reverse examples at all. And that resulted in a classic case of Goodhart’s law.

There are two main ways to solve this problem.

The first way is to prevent the model from going too far from its initialization (that is, from the SFT model). If your agent doesn’t stray far from the foot of the mountain, it probably won’t fall off the cliff. To achieve this, we penalize the reward for the KL divergence between the policy being trained and the SFT model:

r̃(s, a) = r(s, a) − β · KL(π_θ(· | s) ‖ π_SFT(· | s))

Here, β is the coefficient that determines the strength of the penalty.

*How to calculate the KL penalty: Click here to see the appendix

The second way is to continuously train the reward model further on the answers that RL leads to. You train the PPO model, make it generate an answer to each query from a dataset, and then ask your AI trainers to decide which model gave a better answer: SFT or PPO. After the labeling is done, you add it to the dataset for fine-tuning the reward model and repeat the whole cycle.

The first way is often an inevitable evil. If you can do without it, that’s great: it means your reward model is good and stable. But most of the time, you can’t. The second way is a long and arduous journey where you have to do many iterations of fine-tuning the reward model. But, unlike the first way, it actually solves the problem rather than conceals it.

There are also more exotic ways to solve the problem that aren’t very popular yet. One example is using epistemic uncertainty (the uncertainty caused by insufficient training data) when evaluating rewards. One way to estimate this uncertainty is to train an ensemble of reward models on slightly different data, starting from slightly different weight initializations. If all members of the ensemble agree on the reward value, it means they’re confident in it. If they disagree, we should be cautious about the estimate. For example, you can subtract the variance of the ensemble’s predictions from the agent’s reward.
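One possible version of such a penalty, sketched under the assumption that several reward models share the same input interface:

import torch

def pessimistic_reward(reward_models, batch, penalty_weight=1.0):
    """Mean ensemble reward minus a penalty proportional to the ensemble's disagreement."""
    rewards = torch.stack([rm(**batch) for rm in reward_models])  # shape: (ensemble_size, batch)
    return rewards.mean(dim=0) - penalty_weight * rewards.var(dim=0)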

Cutting Costs: Direct Preference Optimization

PPO is an expensive algorithm. It’s difficult to implement, unstable in training (which also takes a lot of time), and requires careful hyperparameter selection and a bit of luck to work properly. But if the stars align, this algorithm achieves high rewards. Can we do the same thing, but at a lower cost? Yes, we can! There’s another method that isn’t exactly equivalent but delivers very similar quality, performing better in some aspects and worse in others compared to PPO.

Direct Preference Optimization (DPO) is a relatively new algorithm that solves the problem of maximizing the reward in RLHF. It doesn’t require either generating lots of text or training the reward model. At the core, the idea is simple: we collect a dataset with answers compared against each other in pairs — just like we did before for the reward model. Then we go right in and somehow train the generator on a contrastive loss so that it produces good answers more often and bad answers less often.

DPO stands out among many other contrastive methods because of its rigorous theoretical foundation, and we’ll try to get deeper into it. There’s a fascinating, well-known fact about how the optimal policy and the reward are related if we use the KL penalty mentioned in the previous chapter. We can literally write down the analytical formula for the optimal policy through the reward:

π*(a | s) = (1 / Z(s)) · π_SFT(a | s) · exp(r(s, a) / β)

The problem is that this formula virtually can’t be used in real life: Z(s) here is the normalizing constant of the probability distribution. To find this constant, you need to sum over all possible answers (the SFT probabilities weighted by exponentiated rewards). And because there are infinitely many possible answers, this task is simply unfeasible. Alright, we’ll get rid of this Z a bit later.

*Proof of expressing the policy through the reward: Click here to see the appendix

We’re interested in the inverted version of this formula: let’s express the reward through the policy.

r(s, a) = β · log(π*(a | s) / π_SFT(a | s)) + β · log Z(s)

No magic here, just an inverted expression. But what happens if we substitute an arbitrary policy π instead of the optimal policy π* into this formula? A policy that’s not optimal for your task might be optimal for another task. With this formula, we may calculate the reward function in which π is optimal.

Let’s write this down explicitly:

r_θ(s, a) = β · log(π_θ(a | s) / π_SFT(a | s)) + β · log Z(s)

The difference is that the policy is arbitrary and not optimal, and the reward is expressed with the index θ since it’s now explicitly parameterized through the policy parameters. That’s the reward where π is optimal.

This equation has another interesting property: in general, the optimal policy doesn’t change if a constant that doesn’t depend on the answer is added to the reward. In other words, if you simply drop β log Z(s) from the formula, the reward function we get still has the same optimal policy:

r_θ(s, a) = β · log(π_θ(a | s) / π_SFT(a | s))

Now here’s the tricky part: let’s select the parameters θ in such a way that the reward r_θ is as plausible as possible according to the Bradley–Terry model (refer to the part that talks about the reward model):

max_θ E_{(s, a_good, a_bad)} log σ(r_θ(s, a_good) − r_θ(s, a_bad))

Note that we won’t use a separate neural network for the reward because we know how to express the reward through the policy. We’ll need a neural network for the policy, and that’s it. To avoid mentioning the reward in the optimized functional, let’s substitute the reward expressed through the policy:

max_θ E_{(s, a_good, a_bad)} log σ(β · log(π_θ(a_good | s) / π_SFT(a_good | s)) − β · log(π_θ(a_bad | s) / π_SFT(a_bad | s)))

Summing up what we’ve done:

- Expressed the reward function through the policy.

- Substituted this expression into the reward loss function.

So now we solve two problems at once: when training the reward model, we also train the generator at the same time. It turns out π is the model that maximizes r. And we only needed one neural network to achieve this — the policy. If, for some reason, we want to calculate the reward, we have an analytical formula for this.

The DPO method requires you to collect the same training data as for the reward model. The policy is trained through standard supervised fine-tuning, but on a rather specific loss. Unlike PPO, DPO doesn’t require you to generate data during training. You also don’t need to train a separate reward model if you use this method. In terms of computational complexity, DPO is still more expensive than plain supervised learning, because the loss function includes probabilities of both the model being trained as well as the SFT — the latter also needs to be stored in memory. While working on YandexGPT, we observed that DPO is not very sensitive to selected hyperparameters (in particular, β). This allowed us to perform model alignment without much pain and spend more time experimenting with the data rather than searching for the right hyperparameters to ensure the training works.
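A rough PyTorch sketch of the loss derived above, assuming the per-answer log-probabilities (summed over tokens) have already been computed for both the policy being trained and the frozen SFT model:

import torch.nn.functional as F

def dpo_loss(logp_good, logp_bad, sft_logp_good, sft_logp_bad, beta=0.1):
    # log π_θ(a|s) and log π_SFT(a|s) for the better (good) and worse (bad) answers
    good_term = beta * (logp_good - sft_logp_good)  # β·log(π_θ / π_SFT) for the better answer
    bad_term = beta * (logp_bad - sft_logp_bad)     # the same for the worse answer
    return -F.logsigmoid(good_term - bad_term).mean()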

If you are training DPO and already have a trained reward model, you can use it as well. DPO is often run on top of a synthetic sample of comparisons: we generate two answers for each query and select the best one using the reward model. The obtained sample of comparisons is used to train the DPO. Of course, if you have real-world data, the best thing to do is add them and train the model in two stages. This may seem like a redundant step, but this strategy has proven useful in practice.

So Which Method is Better?

In truth, the scientific community still hasn’t come to a definite answer. Most experts say it’s between DPO and PPO, claiming that CEM achieves worse results. However, we don’t know of any serious studies where training using CEM was carried out in several stages with entropy control.

The choice between DPO and PPO isn’t all that clear either: there’s a general assumption that making PPO work is harder than DPO, but you can get a better model with the former. Still, we don’t have definitive evidence yet to confirm this thesis.

As for Yandex’s basic alignment team, we continue our research on all fronts.

Appendix

More About the Gradient Formula

To understand where this formula for the gradient comes from, we have to remember what the expected value is. Let’s expand the expectation over the answers in the functional that we’re going to differentiate:

J(θ) = E_{s ∼ 𝒟} Σ_a π_θ(a | s) · r(s, a)

Summation here occurs over an infinite (countable) set of all possible answers. Let’s try to take the derivative with respect to θ:

∇_θ J(θ) = E_{s ∼ 𝒟} Σ_a ∇_θ π_θ(a | s) · r(s, a)

There is a problem with this expression: previously, we had the expectation over the answers, but now we have a sum. This sum is not an expected value, because an expectation requires probabilities under the sum, and here we have gradients of probabilities. Expected values are better than sums, because we can estimate them using the Monte Carlo method (through averaging, in fact), but we can’t estimate sums this way. To return to the expected value, we’ll use the so-called log-derivative trick:

∇_θ π_θ(a | s) = π_θ(a | s) · ∇_θ log π_θ(a | s)

To get this formula, just expand the derivative of the logarithm. And in fact, that’s exactly what you need to substitute into the formula for the derivative of the optimized functional:

∇_θ J(θ) = E_{s ∼ 𝒟} Σ_a π_θ(a | s) · ∇_θ log π_θ(a | s) · r(s, a)

Now that there are probabilities under the sum, we can put the expected value back in, the way we wanted:

∇_θ J(θ) = E_{s ∼ 𝒟} E_{a ∼ π_θ(a|s)} [∇_θ log π_θ(a | s) · r(s, a)]

How to Calculate the KL Penalty

The penalty for deviation from the SFT model is traditionally introduced through the KL divergence, which is a measure of the distance between two probability distributions. By definition:

KL(π_θ(· | s) ‖ π_SFT(· | s)) = E_{a ∼ π_θ(a|s)} log(π_θ(a | s) / π_SFT(a | s))

A fair KL penalty is rather difficult to calculate, because you have to compute the expected value. In practice, a Monte Carlo estimate is used. The most common solution is to generate one answer a with π — the model being trained — and estimate the KL based on it (in the case of PPO, you reuse the answer you’re already generating for the algorithm):

KL(π_θ(· | s) ‖ π_SFT(· | s)) ≈ log(π_θ(a | s) / π_SFT(a | s)), where a ∼ π_θ(a | s)

Since transformers provide a probability for each token rather than for the entire answer, the final formula looks something like this:

KL ≈ Σ_t log(π_θ(a_t | s, a_<t) / π_SFT(a_t | s, a_<t))

In YandexGPT, we use a more accurate estimate of the expected value, calculating the sum of exact per-token KL divergences:

KL ≈ Σ_t Σ_{w ∈ vocab} π_θ(w | s, a_<t) · log(π_θ(w | s, a_<t) / π_SFT(w | s, a_<t))

In other words, for each token, we calculate not just the logarithm of the probability ratio but the exact expected value at the token level, averaging it over all the tokens in the dictionary with the corresponding probabilities.
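A sketch of this token-level estimate, assuming both models return logits over the vocabulary for every position of the generated answer:

import torch

def token_level_kl(policy_logits, sft_logits):
    # policy_logits, sft_logits: (answer_len, vocab_size) logits for the generated answer
    logp = torch.log_softmax(policy_logits, dim=-1)
    sft_logp = torch.log_softmax(sft_logits, dim=-1)
    kl_per_token = (logp.exp() * (logp - sft_logp)).sum(dim=-1)  # exact KL at each position
    return kl_per_token.sum()                                    # summed over the answer's tokens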

Proof of Expressing the Policy Through the Reward

So this formula connecting the optimal policy and the reward — where did it come from in the first place? You can easily get it from the reward maximization problem with the KL penalty. Let’s write down this problem for an arbitrary user query s ∈ 𝒟:

max_π E_{a ∼ π(a|s)} r(s, a) − β · KL(π(· | s) ‖ π_SFT(· | s))

Next, we expand the KL divergence by definition:

max_π E_{a ∼ π(a|s)} r(s, a) − β · E_{a ∼ π(a|s)} log(π(a | s) / π_SFT(a | s))

and combine the expected values:

max_π E_{a ∼ π(a|s)} [r(s, a) − β · log(π(a | s) / π_SFT(a | s))]

Here’s what we’re trying to do here: in this maximization problem, we want to recognize the problem of minimizing another KL divergence. To make the transition transparent, we divide the expression by −β. Because we divide by a negative number, the maximization problem turns into a minimization problem:

min_π E_{a ∼ π(a|s)} [log(π(a | s) / π_SFT(a | s)) − r(s, a) / β]

Now we get to the last step required to see the KL divergence in this expression: applying the identity transformation r(s, a) / β = log exp(r(s, a) / β):

min_π E_{a ∼ π(a|s)} [log π(a | s) − log π_SFT(a | s) − log exp(r(s, a) / β)]

After that, apply one of the logarithmic properties right away: the difference of logarithms is the logarithm of the ratio.

min_π E_{a ∼ π(a|s)} log( π(a | s) / (π_SFT(a | s) · exp(r(s, a) / β)) )

The expression we have here is simply the KL divergence between the policy being trained and a non-normalized distribution:

min_π KL( π(· | s) ‖ π_SFT(· | s) · exp(r(s, ·) / β) )

We know that the minimum of the KL divergence is achieved when the two distributions are identical. Taking the normalizing constant into account, here’s what we get:

π*(a | s) = π_SFT(a | s) · exp(r(s, a) / β) / Z(s), where Z(s) = Σ_a π_SFT(a | s) · exp(r(s, a) / β)
