Responsible AI: Align LLMs with Human Values Using RLHF

Ashish Kumar Jain
7 min read · Oct 16, 2023


While building applications with Generative AI, we need to consider how to build and use generative AI models responsibly. One of the major risks with LLMs is that they sometimes behave badly, producing toxic language, false or incomplete information ("hallucination"), or even dangerous information. Hallucination is a phenomenon where an LLM produces text that may be grammatically correct but contains false information.

Ideally, a model should respond in natural-sounding language while aligning its responses with human values. The important human values of helpfulness, honesty, and harmlessness are sometimes collectively called HHH. These are a set of principles that guide developers in the responsible use of AI.

These problems exist because large models are trained on vast amounts of text data from the Internet, where the data may contain inaccurate information or toxic language. Another cause can be overfitting, where the training data leads a model to produce outputs that mirror the training set but are misaligned with new or different inputs.

Examples of HHH

Below are some examples in which an LLM's response is dishonest, harmful, or unhelpful.

Reinforcement Learning from Human Feedback (RLHF)

A popular technique for addressing HHH is fine-tuning the large language model with human feedback, called Reinforcement Learning from Human Feedback (RLHF). Once we have selected or fine-tuned an LLM for a specific task, we can fine-tune it again with human feedback for a specific value (HHH). This human feedback is fed into the model using a reinforcement learning algorithm. The value can be any of HHH, so the model can maximize helpfulness, minimize harm, or avoid dangerous responses. We can also cover an organization-specific value if needed.

Researchers have found that a model fine-tuned with human feedback produced better responses than a pretrained base model, an instruct fine-tuned model (a model fine-tuned for a particular task), and even the reference human baseline responses.

Reinforcement Learning

Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing their results. For each good action the agent receives positive feedback, and for each bad action it receives negative feedback or a penalty. The agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximizing some notion of cumulative reward. By iterating through this learning process, the agent gradually refines its strategy, or policy, to make better decisions and increase its chances of success.
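
To make this concrete, below is a minimal sketch of the agent-environment loop in Python. The `env` and `agent` objects are hypothetical placeholders standing in for any environment and policy, not a specific library.

```python
# Minimal sketch of the reinforcement learning loop described above.
# `env` and `agent` are hypothetical placeholders, not a specific library.

def run_episode(env, agent):
    state = env.reset()                                   # start from an initial state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                         # the policy picks an action
        next_state, reward, done = env.step(action)       # the environment returns feedback
        agent.learn(state, action, reward, next_state)    # positive/negative feedback refines the policy
        total_reward += reward                            # cumulative reward the agent tries to maximize
        state = next_state
    return total_reward
```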

RLHF using Human Feedback

To keep things simple, we can understand RLHF at a high level. We pass a series of inputs to our fine-tuned model (an instruct model, fine-tuned for a specific task), and humans evaluate all of the model's completions against some guideline, such as whether the generated text is toxic or non-toxic. This feedback can be represented as a scalar value, either a zero or a one. We can then use a reinforcement learning algorithm to update the model weights based on the human feedback. The LLM weights are updated iteratively to maximize the reward obtained from the human classifier, enabling the model to generate non-toxic completions.
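
As a toy illustration, the human judgments can be collected as scalar rewards, 1 for a completion that follows the guideline and 0 for one that does not. The helper below is hypothetical and only shows the shape of the signal the RL algorithm consumes.

```python
# Hypothetical sketch: binary human judgments become the scalar reward signal.
def collect_human_rewards(completions, human_labels):
    """human_labels[i] is 1 if completion i follows the guideline (e.g. non-toxic), else 0."""
    return [float(label) for label in human_labels]

# Two completions: the first judged non-toxic (1), the second toxic (0).
rewards = collect_human_rewards(["completion A", "completion B"], [1, 0])
print(rewards)  # [1.0, 0.0] -> fed to the RL algorithm to update the LLM weights
```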

Reward Model

Obtaining human feedback at this scale is time-consuming and expensive. As an alternative, we can train another model, called a reward model, on the human feedback using traditional supervised learning methods. This reward model can be trained for a specific value, such as toxic versus non-toxic, or for a combination of values (HHH) depending on the requirements, and it outputs a numerical reward score representing the degree of alignment with the human value. Training the reward model is a complex topic that we can cover in a future blog.

Once trained, we can use this reward model to assess the output of the LLM and return a reward value. This reward value is used by the RL algorithm, which in turn updates the weights of the LLM to train a new, human-aligned version. How the LLM weights are updated depends on the reinforcement learning algorithm.

RLHF process using reward model

Assume we have an instruct model that is already fine-tuned for a specific task and performing well, and we want to align it with the human value "toxic" versus "non-toxic" using a reward model. We pass the first prompt from our prompt dataset to the instruct model, which generates a completion. We then pass this prompt-completion pair to the reward model. The reward model evaluates the pair based on the human feedback it was trained on and returns a reward score between 0 and 1, where a higher value represents a more aligned response. We then pass the prompt-completion pair and the reward score to the reinforcement learning algorithm, which updates the weights of the instruct model, moving it toward generating more aligned responses.

This full set of actions forms one iteration of the RLHF process. We continue these iterations for a given number of epochs, similar to other types of fine-tuning. With each iteration the reward score should increase, and the model should produce more human-aligned responses. We continue this iterative process until we reach a maximum number of steps or a threshold value of the reward score, and we repeat it over many prompts to update the model weights. Let's refer to this final fine-tuned model as the human-aligned LLM.
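
The walkthrough above boils down to a short loop. The sketch below is conceptual: `generate()` and `rl_update()` are hypothetical placeholders for the instruct model's generation step and the RL weight update, not a particular library's API.

```python
# Conceptual sketch of the RLHF loop described above.
# generate() and rl_update() are hypothetical placeholders for the real components.

def rlhf_fine_tune(instruct_llm, prompt_dataset, reward_model,
                   max_steps=1000, reward_threshold=0.95):
    for step, prompt in enumerate(prompt_dataset):
        completion = generate(instruct_llm, prompt)           # 1. instruct model produces a completion
        reward = reward_model(prompt, completion)             # 2. reward model scores the pair (0..1)
        instruct_llm = rl_update(instruct_llm, prompt,        # 3. RL algorithm (e.g. PPO) nudges the weights
                                 completion, reward)          #    toward higher-reward completions
        if step + 1 >= max_steps or reward >= reward_threshold:
            break                                             # stop at max steps or once alignment is high enough
    return instruct_llm                                       # the human-aligned LLM
```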

Hugging Face hosts various pretrained reward models. One of them is Meta AI's RoBERTa-based hate speech model, which returns a score for a prompt-completion pair indicating hate versus non-hate. This model can be used to detoxify our fine-tuned model.
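
For example, a completion can be scored with such a model through the transformers library. This is a minimal sketch assuming the facebook/roberta-hate-speech-dynabench-r4-target checkpoint on the Hugging Face Hub; check the model card for the exact label order before using the score as a reward.

```python
# Sketch: scoring a prompt-completion pair with a RoBERTa-based hate speech reward model.
# The checkpoint name and label order below are assumptions; verify them on the model card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Prompt text followed by the model's completion."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits          # one logit per class, e.g. [not_hate, hate]

# Use the probability of the "not hate" class as the reward signal.
not_hate_prob = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"reward (probability of non-toxic text): {not_hate_prob:.3f}")
```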

Proximal Policy Optimization (RL Algorithm)

The question arises: what kind of RL algorithm can we use here? The RL algorithm plays a central role: it takes the output of the reward model and uses it to update the LLM weights so that the reward score increases over time. There are several algorithms we can use for this part of the RLHF process. A popular choice is Proximal Policy Optimization (PPO). PPO is a fairly complicated algorithm; if you want the details, you can refer to the paper Proximal Policy Optimization Algorithms. Hugging Face provides a PPO implementation called PPOTrainer for training language models on any reward signal with RL. This trainer is heavily inspired by OpenAI's Learning to Summarize from Human Feedback work. The reward signal can come from a handcrafted rule, a metric, or preference data via a reward model.
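
Below is a minimal sketch of a single PPO step based on the classic trl PPOTrainer API (roughly trl 0.7). The interface has changed across trl releases, so treat the exact calls as illustrative rather than definitive.

```python
# Minimal single-step sketch using the classic trl PPOTrainer API (roughly trl 0.7);
# newer trl releases have changed this interface, so treat it as illustrative.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)      # policy being tuned
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: generate a completion, score it, update the policy.
query = tokenizer.encode("I think working from home is", return_tensors="pt").squeeze()
output = ppo_trainer.generate(query, max_new_tokens=20, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id).squeeze()
response = output[len(query):]            # keep only the generated tokens

reward = [torch.tensor(0.92)]             # e.g. the "not hate" probability from the reward model
stats = ppo_trainer.step([query], [response], reward)
```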

Reward Hacking

Fine-tuning an LLM with RLHF may introduce a new problem called reward hacking. The RL algorithm (the agent) can update the LLM weights in such a way that it obtains a maximum reward score from the reward model even though the LLM's completions are not well aligned with the original objective. Reward hacking can introduce words or phrases into completions that score highly for the value being aligned but reduce the overall quality of the language.

Assume we are detoxifying our instruct model using a reward model, and a completion contains words like "garbage" and other toxic terms. The reward model returns a low reward score for such completions, which is passed to the PPO algorithm, which in turn updates the weights of the LLM. As we iterate, the RLHF process updates the LLM to produce less toxic responses. However, in optimizing the reward, PPO may diverge too far from the initial language model. The model may then start generating completions it has learned will earn very low toxicity scores, for example by including phrases like "most awesome" or "most incredible". This language sounds exaggerated and deviates from the original goal of the model. This is reward hacking.

To prevent reward hacking, we compare each completion with the completion from the original model. For this we keep a copy of the original model as a frozen reference model. During training, each prompt is passed to both models, generating one completion from the reference LLM and one from the intermediate, RL-updated LLM. At this point we can compare the two completions and calculate a value called the Kullback-Leibler divergence (KL divergence). KL divergence is a statistical measure of how different two probability distributions are; using it, we can find out how far the updated model's output has drifted from the reference model. Once we have calculated the KL divergence between the two models' completions, we add it as a penalty term to the reward calculation. This penalizes the RL-updated model if it shifts too far from the original reference model. The Hugging Face PPOTrainer class already implements this KL penalty, which we can use during training.
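
Conceptually, the penalized reward looks like the sketch below, where `beta` is an illustrative penalty coefficient and the KL term is approximated from the token log-probabilities of the updated and reference models.

```python
# Conceptual sketch of the KL-penalized reward; beta and the inputs are illustrative.
def penalized_reward(reward_score, logprobs_updated, logprobs_reference, beta=0.2):
    # Approximate KL divergence from the token log-probabilities of the two models
    kl_estimate = sum(lp_u - lp_r for lp_u, lp_r in zip(logprobs_updated, logprobs_reference))
    # Drifting too far from the reference model reduces the effective reward
    return reward_score - beta * kl_estimate
```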

Thanks for reading this blog. I have kept it as simple as I can so that even readers without a strong technical background can understand it. I will try to cover more of the technical background, with an implementation, in the next blog. Interested readers can go through the Hugging Face TRL blog, which covers the implementation in detail.

References

1. Generative AI with LLMs (Coursera): https://www.coursera.org/learn/generative-ai-with-llms/

2. Hugging Face: https://huggingface.co/

3. Learning to Summarize from Human Feedback (paper): https://arxiv.org/pdf/2009.01325.pdf

4. Proximal Policy Optimization Algorithms (paper): https://arxiv.org/pdf/1707.06347.pdf

