Fine-tuning Large Language Models: A Brief Introduction

Supervised Fine-tuning, Reinforcement Learning from Human Feedback and the latest SteerLM

Research Graph
5 min read · May 7, 2024
Source: Generated using Google’s Gemini

Author

· Xuzeng He (ORCID: 0009-0005-7317-7426)

Introduction

Large Language Models (LLMs), usually trained on extensive text data, demonstrate remarkable capabilities across a wide range of tasks with state-of-the-art performance. However, people nowadays often want something more personalised than a general-purpose solution. For example, one user may want an LLM to assist with code writing, while another may seek a model specialised in medical knowledge. In such cases, to better align LLMs with human preferences, we can fine-tune a pre-trained model so that it specialises in knowledge from a specific domain.

In this post, we introduce three different algorithms for fine-tuning your LLMs, including SteerLM, the latest fine-tuning method proposed by NVIDIA.

Supervised Fine-tuning (SFT)

Supervised Fine-tuning (SFT) is the most common approach to adapt a pre-trained model to a specific task. The model is trained on a labelled dataset and learns to predict the correct label for each input. It usually consists of 3 steps:

  1. Pre-train the model: The base model should be pre-trained beforehand to give it a basic understanding of language.
  2. Label the dataset: Each data point in the task-specific training dataset must be labelled, since SFT is a supervised learning algorithm and therefore requires labelled examples.
  3. Fine-tune the model: The model's parameters are adjusted to improve its performance on the given task, using the loss between the prediction and the label for each data point (a minimal sketch of this loss follows the figure below).
Supervised Fine-Tuning process flow
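
To make step 3 concrete, here is a minimal sketch in PyTorch of the next-token cross-entropy loss typically minimised during SFT. The random tensors are stand-ins for a real model's output logits and the labelled answer tokens.

import torch
import torch.nn.functional as F

vocab_size = 32000
# Stand-ins for the model's output logits and the labelled answer tokens
logits = torch.randn(1, 5, vocab_size, requires_grad=True)  # (batch, sequence, vocab)
labels = torch.randint(0, vocab_size, (1, 5))               # token ids of the reference answer

# Standard next-token cross-entropy between predictions and labels
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()  # in real fine-tuning, an optimiser step would follow
print(loss.item())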

For some actual practice, one can check the SFTTrainer class from the TRL library (developed by Hugging Face), which is designed to facilitate the SFT process. This class takes a dataset column (or a formatting function) that combines the system instruction, question, and answer into a single prompt string used for training.
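
As an illustration, here is a minimal sketch of how SFTTrainer is typically used. The model and dataset below are placeholders, and argument names have shifted between TRL releases (recent versions expect dataset_text_field and max_seq_length inside an SFTConfig instead).

from datasets import load_dataset
from trl import SFTTrainer

# A tiny slice of a public dataset with a single text column, purely as a stand-in
dataset = load_dataset("imdb", split="train[:1%]")

trainer = SFTTrainer(
    model="facebook/opt-350m",    # base model to fine-tune (placeholder choice)
    train_dataset=dataset,
    dataset_text_field="text",    # column holding the formatted prompt text
    max_seq_length=512,
)
trainer.train()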

Reinforcement Learning from Human Feedback (RLHF)

Since SFT is relatively simple, we now move to a more involved algorithm: Reinforcement Learning from Human Feedback (RLHF). As its name suggests, RLHF uses reinforcement learning to directly optimise a language model with human feedback. It has enabled language models to be trained to align with different sets of complex human values. It mainly includes three core steps:

  1. Pre-training the model
  2. Gathering data and training a reward model
  3. Fine-tuning the LLM with reinforcement learning

As with SFT, RLHF starts from an LLM that has already been pre-trained, so this first step can be skipped if a pre-trained model is available.

Next, with the LLM, one needs to generate data to train a Reward Model so that human preferences can be integrated into the algorithm. The goal is to obtain a model or system that takes a sequence of text as input and outputs a scalar reward that numerically represents human preference.
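
The reward model is commonly trained on pairs of responses where humans preferred one over the other. Below is a toy sketch of this pairwise (Bradley-Terry style) loss; the random vectors stand in for the text representations a real LLM backbone would produce.

import torch
import torch.nn.functional as F

# Toy stand-in for a reward model: a linear head over some text representation.
# In practice, the representation comes from the LLM's final hidden state.
reward_head = torch.nn.Linear(768, 1)

chosen_repr = torch.randn(4, 768)    # representations of responses humans preferred
rejected_repr = torch.randn(4, 768)  # representations of responses humans rejected

r_chosen = reward_head(chosen_repr)      # scalar reward per preferred response
r_rejected = reward_head(rejected_repr)  # scalar reward per rejected response

# Pairwise loss: push the preferred response's reward above the rejected one's
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()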

Eventually, reinforcement learning is applied to fine-tune the LLM, typically with a policy-gradient RL algorithm called Proximal Policy Optimization (PPO). The model is fine-tuned using the reward value output by the reward model together with an additional penalty term, a scaled Kullback–Leibler (KL) divergence between the fine-tuned policy and the initial pre-trained model. This penalty discourages the fine-tuned model from drifting too far from the original model, so that it keeps producing reasonably coherent text.
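
The combined signal optimised by PPO can be sketched as the reward model's score minus the scaled KL term. The snippet below is a simplified illustration: the per-token KL is approximated by the difference in log-probabilities, and all numbers are made up.

import torch

def kl_penalised_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    # Approximate per-token KL between the fine-tuned policy and the frozen initial model
    kl = policy_logprobs - ref_logprobs
    # Reward model score minus the scaled KL penalty, used as the RL reward signal
    return reward - beta * kl.sum()

# Toy example: log-probabilities of a 6-token response under both models
policy_lp = torch.tensor([-1.2, -0.8, -2.0, -0.5, -1.1, -0.9])
ref_lp = torch.tensor([-1.4, -0.9, -1.8, -0.6, -1.0, -1.2])
print(kl_penalised_reward(reward=1.5, policy_logprobs=policy_lp, ref_logprobs=ref_lp))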

RLHF Process flow. Source: Demystifying ChatGPT: A Deep Dive into Reinforcement Learning with Human Feedback

There are already a few active repositories for RLHF in PyTorch. The primary ones are Transformers Reinforcement Learning (TRL), TRLX (which originated as a fork of TRL), and Reinforcement Learning for Language Models (RL4LMs).

SteerLM

Apart from SFT and RLHF, a novel approach called SteerLM was recently proposed by NVIDIA to overcome some limitations of conventional SFT and RLHF methods. Similar to RLHF, SteerLM incorporates additional reward signals by leveraging attributes annotated for each response in the Open-Assistant dataset (e.g., quality, humour, toxicity). It generally comprises four steps:

  1. Attribute Prediction Model: The base language model is trained as an Attribute Prediction Model to assess the quality of responses by predicting attribute values.
  2. Annotating Datasets using Attribute Prediction Model: The attribute prediction model is used to annotate response quality across diverse datasets.
  3. Attribute Conditioned SFT: Given a prompt and desired attribute values, a new base model is fine-tuned to generate responses that align with the specified attributes (see the sketch after the figure below).
  4. Bootstrapping with High Quality Samples: Multiple responses are sampled from the model fine-tuned in the previous step, with the quality attribute set to its maximum value. The sampled responses are then evaluated by the trained Attribute Prediction Model, leading to another round of fine-tuning.
SteerLM Process Flow. Source: Nvidia Docs Hub
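
To illustrate step 3, the sketch below builds a hypothetical attribute-conditioned prompt. The exact template and attribute labels used by NVIDIA's implementation differ; this only shows the idea of steering generation by appending desired attribute values to the input.

def build_steerlm_style_prompt(user_prompt, attributes):
    # Render the desired attribute values as a single conditioning string,
    # e.g. "quality:4,toxicity:0,humor:0" (illustrative format, not NVIDIA's exact template)
    attribute_string = ",".join(f"{name}:{value}" for name, value in attributes.items())
    # During Attribute Conditioned SFT, the model learns to produce responses
    # consistent with these labels
    return f"<prompt>{user_prompt}</prompt>\n<attributes>{attribute_string}</attributes>\n"

print(build_steerlm_style_prompt(
    "Explain RLHF in one paragraph.",
    {"quality": 4, "toxicity": 0, "humor": 0},
))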

For some actual practice, one can refer to this post officially written by NVIDIA for a complete tutorial. Note that the official tooling targets NVIDIA GPUs, so AMD GPUs are currently not supported.

Conclusion

The use of Large Language Models has seen significant advancement in multiple directions, and there is a rising trend of users seeking task-specific models. In this post, we introduced three different algorithms for fine-tuning LLMs: SFT, RLHF and SteerLM. Through continuous investigation and refinement, we believe that Large Language Models can open up exciting opportunities in the future.

References

  • Lambert, N., Castricato, L., von Werra, L., & Havrilla, A. (2022). Illustrating Reinforcement Learning from Human Feedback (RLHF). Hugging Face Blog. https://huggingface.co/blog/rlhf
  • Dong, Y., Wang, Z., Sreedhar, M. N., Wu, X., & Kuchaiev, O. (2023). SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF. arXiv. https://doi.org/10.48550/ARXIV.2310.05344
