Align LLMs with Reinforcement Learning from Human Feedback

AI Learns Best with Rewards

Maddie Lupu

Large Language Models (LLMs) may behave badly. They can generate toxic, harmful or misleading information. This stems from the massive datasets they’re trained on, which may contain biases or imperfections.

Fine-Tuning LLMs with Reinforcement Learning acts as an etiquette coach, gently nudging LLMs towards human preferences and responsible information sharing.

LLMs with a moral compass?

Not built-in, but definitely trainable!

Helpful, Honest and Harmless (HHH) is a set of principles researchers apply to align LLMs with human values.

While regular Fine-Tuning aims to boost model performance, natural-sounding outputs, or prompt comprehension, Fine-Tuning with Reinforcement Learning takes a different approach. Here, the goal is to align the LLM with human preferences. This can mean detoxifying its responses, removing aggressive language, or preventing the spread of dangerous information.

What does Reinforcement Learning have to do with dogs?

Meet Caspy, my sister’s dog! 🐶

While mischievous and spoiled, he will sometimes deign to do a trick for you. When Caspy does a good trick, like fetching a toy, we give him a treat as a reward. That makes him want to do the trick again! Reinforcement learning is like that, but for computers. When a computer program makes a good choice, it gets a reward. This helps the program learn which choices you value the most.

How does Reinforcement Learning work in practice?

Let’s see how Reinforcement Learning works in practice with a familiar example. Since I’ve been battling Tetris relentlessly lately, imagine we’re training the ultimate enemy — the CPU itself.

adapted by the author based on deeplearning.ai diagram

Meet Feli, my worthy Tetris rival. For those playing Tetris, you know Feli’s a total badass, but for the sake of this training exercise, let’s imagine she needs a little extra polish.

The objective in the illustrated diagram is for the agent Feli to win the Tetris game.

In this scenario, Feli interacts with the Tetris game as the environment. Every possible move Feli can make is considered an action. The current arrangement of blocks on the screen defines the state. Feli attempts different moves and receives rewards based on their effectiveness: clearing lines and avoiding stacking chaos. The series of actions and states Feli goes through is called a playout or rollout. Through trial and error (dropping those Tetriminos), Feli learns a winning strategy.

The goal of Reinforcement Learning is for the agent to learn the optimal policy within a specific environment. This policy is discovered iteratively through trial and error, so as to maximise the agent’s cumulative reward.
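To make the jargon concrete, here’s a minimal sketch of that agent-environment loop in Python. The ToyTetrisEnv, its reward rule and the random policy are all invented stand-ins for illustration; a real Tetris environment would track the full board, and a trained policy would do far better than random play.

```python
import random

# A toy stand-in for the Tetris environment: states, actions and rewards are
# simplified so the agent-environment loop itself is easy to see.
class ToyTetrisEnv:
    def __init__(self):
        self.state = 0      # e.g. an encoded arrangement of blocks
        self.steps = 0

    def step(self, action):
        self.steps += 1
        # Made-up reward rule: +1 for a "good" placement, -1 otherwise.
        reward = 1 if action == self.state % 4 else -1
        self.state = (self.state + action) % 10   # the next block arrangement
        done = self.steps >= 20                   # the game ends
        return self.state, reward, done

def random_policy(state):
    # The policy maps a state to an action; here it is pure trial and error.
    return random.choice(range(4))

# One playout (rollout): the agent acts, the environment answers with a new
# state and a reward, and the total reward measures how good the strategy was.
env = ToyTetrisEnv()
state, done, total_reward = env.state, False, 0
while not done:
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward

print(f"Total reward for this rollout: {total_reward}")
```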

Now that we got the Reinforcement Learning jargon out of the way, let’s see this diagram adjusted for LLMs.

Fine-Tuning LLMs with Reinforcement Learning from Human Feedback (RLHF)

adapted by the author based on deeplearning.ai diagram

Let’s pick an Instruct model to fine-tune with RLHF, as these models already do well at general tasks. The agent, whose policy guides the actions, is the LLM. Its objective is to generate text that’s aligned with human values (e.g., not toxic, informative, helpful).

An action is a word, sentence or chunk of text the LLM generates in response to a prompt.

The environment is the token vocabulary, meaning all the possible words and sentences the model can choose from to complete the text.

The current context, i.e. the state, is the prompt text together with the LLM’s internal state: the probability it assigns to each token in the vocabulary. This influences the next action the LLM takes.

Rewards are assigned based on how closely the completions align with human preferences. Unlike simpler tasks such as Tetris, where a cleared line is an obvious reward signal, judging a text completion is more complex.
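To recap the mapping, here’s a rough Python sketch. Both helpers and their return values are hypothetical placeholders rather than a real LLM or reward model; the point is only which piece plays which RL role.

```python
# A rough map of the RL pieces onto an LLM. Both helpers below are
# hypothetical placeholders, not a real model or reward model.

def generate_completion(prompt: str) -> str:
    # Action: the LLM (the policy) picks tokens from its vocabulary
    # (the environment) to extend the current context (the state).
    return "UK weather tends to be rainy, so pack a raincoat."  # placeholder

def reward_model_score(prompt: str, completion: str) -> float:
    # Reward: how well the completion matches the human-preference metric
    # (e.g. helpfulness or non-toxicity), returned as a single number.
    return 0.7  # placeholder value

prompt = "What should I pack for a trip to the UK?"
completion = generate_completion(prompt)          # the action taken in this state
reward = reward_model_score(prompt, completion)   # the feedback used to update the LLM
print(completion, reward)
```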

In the diagram above, you probably noticed a reward model and a little lady in the loop.

Ladies first, so let’s talk about the method for improving LLMs that involves human evaluation.

Here, humans assess LLM outputs based on specific predefined metrics, such as non-toxicity. They can assign a simple binary score (0 or 1), a thumbs up or thumbs down, or ranks to indicate how good or problematic an output is. This feedback is then used to iteratively adjust the LLM’s internal parameters. Collecting rank feedback is richer and offers more nuanced data than collecting thumbs up or thumbs down alone. It can also balance out bad labelling when someone in the loop misunderstands the task.

In the above example we have three human labellers evaluating the model’s responses based on helpfulness, where 1 is the best rank and 3 the worst. The same completion is assigned to multiple human labellers to establish consensus and collect robust evaluations. The first completion, “UK weather tends to be rainy”, gets a 3 from the first and second labellers because it’s not that helpful, and the last completion gets a 1 from both because it suggests some alternatives and acknowledges the fact that the weather is bad.

The third labeller looks like they misunderstood the task, as the ranks they assigned seem to be the other way around. In the labelling process, it’s important that labellers get clear instructions on how to evaluate completions, to ensure high data quality.
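As a small illustration of what this feedback becomes, here’s a sketch (with invented completions and ranks echoing the example) of how ranked feedback is commonly converted into pairwise chosen-versus-rejected comparisons, the format typically used to train the reward model introduced below.

```python
from itertools import combinations

# Hypothetical labelling data for one prompt: three completions with their
# consensus ranks (1 = best, 3 = worst), echoing the example above.
completions = [
    "UK weather tends to be rainy",                       # ranked 3
    "Expect rain, so pack an umbrella",                   # ranked 2
    "It may rain; here are some indoor alternatives...",  # ranked 1
]
ranks = [3, 2, 1]

# Rank feedback is usually converted into pairwise (chosen vs. rejected)
# comparisons before it is used to train a reward model.
pairs = []
for i, j in combinations(range(len(completions)), 2):
    chosen, rejected = (i, j) if ranks[i] < ranks[j] else (j, i)
    pairs.append({"chosen": completions[chosen], "rejected": completions[rejected]})

for pair in pairs:
    print(pair)
```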

While gathering extensive human feedback can be a bottleneck, there’s a powerful alternative: reward models. These models can analyse LLM outputs and assess their alignment with human preferences, effectively removing the need for large-scale human evaluation.

Reinforcement Learning with a Reward Model

The reward model becomes a central component of the Reinforcement Learning process. It encodes all of the preferences that are learned from human feedback, and it plays a crucial role in how the model updates its weights over many iterations.

In this scenario, you still need some data: you can start with a few human-labelled examples and train a secondary model using supervised ML to classify the outputs. Once trained, you can use this reward model to assess the output of the LLM and assign a reward value, which in turn is used to update the weights of the LLM and train a human-aligned version.
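Here’s a minimal PyTorch sketch of that supervised step, assuming pairwise chosen-versus-rejected data like the pairs above. The tiny encoder and the random feature tensors are placeholders for a real pretrained model and tokenised text; the pairwise loss is the part that matters: the human-preferred completion should get the higher reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny reward model: an encoder (a placeholder for a pretrained LLM) with a
# scalar "reward head" on top.
class TinyRewardModel(nn.Module):
    def __init__(self, feature_size: int = 8, hidden_size: int = 16):
        super().__init__()
        self.encoder = nn.Linear(feature_size, hidden_size)
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.reward_head(torch.relu(self.encoder(features))).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random tensors standing in for a batch of (chosen, rejected) completions.
chosen_feats = torch.randn(4, 8)
rejected_feats = torch.randn(4, 8)

# Pairwise loss: push the reward of the human-preferred completion above the
# reward of the rejected one, i.e. -log sigmoid(r_chosen - r_rejected).
r_chosen = model(chosen_feats)
r_rejected = model(rejected_feats)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Pairwise reward loss: {loss.item():.3f}")
```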

Let’s get a glimpse of what that looks like.

adapted by the author based on deeplearning.ai diagram

The objective in the above example is for the LLM to generate aligned text. The metric is helpfulness, and each completion gets a reward value. The completion “the book was… a waste of time” gets a negative reward, as it’s not that helpful, while “intriguing but slow” gets a positive one. The model’s weights are then updated by a Reinforcement Learning algorithm to reflect the preferred response; let’s call this model an RL-updated LLM. The aim is to iterate over this process, and if it’s going well, you should see the reward values given by the reward model increase.

You iterate over this process until you reach a pre-defined reward threshold or a maximum number of iterations.
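Here’s a high-level sketch of that loop with stubbed-out helpers; generate(), reward_model_score() and rl_update() are hypothetical placeholders, and in practice a library such as Hugging Face’s trl wires these steps together.

```python
import random

# Everything here is a stub: generate() stands in for the LLM,
# reward_model_score() for the reward model, and rl_update() for the
# weight update (e.g. one PPO step).

def generate(prompt: str) -> str:
    return prompt + " ... some completion"

def reward_model_score(prompt: str, completion: str) -> float:
    return random.uniform(-1.0, 3.0)

def rl_update(prompts, completions, rewards) -> None:
    # In a real setup this is where the LLM's weights change so that
    # high-reward completions become more likely.
    pass

MAX_ITERATIONS = 100
REWARD_THRESHOLD = 2.0
prompts = ["The book was", "The weather in the UK is"]

for iteration in range(MAX_ITERATIONS):
    completions = [generate(p) for p in prompts]                     # the policy acts
    rewards = [reward_model_score(p, c) for p, c in zip(prompts, completions)]
    rl_update(prompts, completions, rewards)                         # RL-updated LLM

    mean_reward = sum(rewards) / len(rewards)
    if mean_reward >= REWARD_THRESHOLD:   # stop at the pre-defined threshold
        print(f"Stopped after {iteration + 1} iterations, mean reward {mean_reward:.2f}")
        break
```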

adapted by the author based on deeplearning.ai diagram

The above shows that, in this case, the best acceptable answer has been reached. For “fascinating, starts slow but with an excellent plot twist” the model got a reward of 2.3, and the LLM parameters are updated to reflect this preference. The higher the reward, the better the completion is aligned with the defined metric. Proximal Policy Optimisation (PPO) is a popular, if complex, algorithm that can be used to update the model’s weights. Read more about PPO here. Now you have a Human-aligned LLM.🎉
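To give a flavour of what PPO optimises, here’s a sketch of its clipped surrogate objective with made-up numbers; in a real run the log-probabilities come from the LLM before and after the update, and the advantages are derived from the reward model’s scores.

```python
import torch

# PPO's clipped surrogate objective with made-up numbers. The clipping keeps
# each update from moving the policy too far from the previous one.
log_probs_new = torch.tensor([-1.1, -0.7, -2.0])   # from the RL-updated LLM
log_probs_old = torch.tensor([-1.3, -0.9, -1.5])   # from the LLM before the update
advantages = torch.tensor([2.3, 0.5, -1.0])        # derived from the reward values

epsilon = 0.2
ratio = torch.exp(log_probs_new - log_probs_old)        # how much the policy changed
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
print(f"PPO clipped loss: {ppo_loss.item():.3f}")
```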

Closing words

In this article we explored how RLHF is used to align LLMs with human preferences. This involves guiding LLMs according to the HHH principles: Helpful, Honest and Harmless. Just as dogs get treats when they do a good trick, LLMs are rewarded for generating more aligned responses. Reward models offer a solution to the bottleneck of collecting human feedback, efficiently updating LLMs to reflect preferred responses. The potential for RLHF extends beyond LLM alignment. One application that I find particularly exciting is personalised training materials or even AI-powered tutors tailored to individual needs.

Thank you for reading!

PS: If you enjoyed the read, show some love with a few claps below 👏👏👏

Find me on LinkedIn!

References

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Reinforcement Learning with Human Feedback
