The Role of Fine-tuning Large Language Models For Improved Functionality

Charleneh
Data Science Student Society @ UC San Diego
8 min read · Mar 11, 2024

Abstract

The use of AI, and specifically of Large Language Models (LLMs), has become widespread, with users relying on these models daily to generate new content from existing information. The alignment and data collection processes behind LLMs play a key role in turning them into tools that provide accurate and helpful responses. Through a case study of the alignment process behind InstructGPT, we see how fine-tuning has become a template that future Large Language Models can follow and adapt to fit both creators' purposes and users' needs. Once an LLM has been aligned and refined, users still need to supply context for the model to produce targeted responses, which is examined further through context injection in a second case study of Notion AI.

Introduction

Large Language Models, also known as LLMs, have become increasingly common tools for understanding information and generating content. These models can perform numerous tasks, such as summarizing notes, carrying out action tasks (like analyzing course notes and generating new content from them), correcting spelling and grammar in a user's writing, and even rewriting or creating new content from a user's input. Large Language Models are trained on extensive datasets consisting of millions of data points in order to grasp the structure of human language. However, LLMs can still generate false information or inaccurate, biased, or unethical assumptions for various reasons, for example, a lack of specificity in the context a user provides. To mimic human language and generate text as accurately as possible, these language models undergo a series of training stages, followed by supervised fine-tuning and reinforcement learning from human feedback (RLHF). This sequence of modeling has become the basis for how future LLMs evolve from a basic language model into AI systems built for specific purposes and applications.

Case Study #1 — InstructGPT

The first case study evaluates InstructGPT, a Large Language Model that has been fine-tuned on datasets of text and code covering human language and further reinforced with learning from human feedback and interaction.

Large Language Models do not take data directly from users; they are trained on extensive datasets of text, code, and the patterns of human language. Training LLMs on larger datasets and larger models does not necessarily make them more accurate or efficient. Furthermore, LLMs can easily generate content that is false, biased, or diverted from the user's initial prompt. To combat these drawbacks, InstructGPT was further developed with learning from human feedback and compared against other models; its improvements in accuracy and alignment with users' intents produced outputs that human evaluators preferred over the outputs of other GPT models.

According to the study, InstructGPT accomplishes this by training its language models toward three objectives: helpfulness (accomplishing the prompted task), honesty (accuracy), and harmlessness (ethical behavior). The fine-tuning methodology starts from GPT-3, a pre-trained LLM, and applies reinforcement learning from human feedback, using human preferences as the basis for a reward signal, as detailed in Figure 1 below.

Figure 1: The step-by-step process by which GPT-3 is fine-tuned through a reward model (RM) to produce InstructGPT, with outputs reinforced through human feedback. This process involved hiring a team of 40 contractors to collect demonstration and feedback data, which is then used to train the InstructGPT models (as depicted by the blue arrows).

Dataset Collection:

The datasets that InstructGPT was trained on consisted primarily of text prompts submitted to a commercial language model API. These prompts covered natural language tasks including generation, dialogue, extraction, and summarization, and 96% of them were in English. From these, three datasets were curated and used in the fine-tuning process shown in Figure 1: (1) the SFT (supervised fine-tuning) dataset, comprising prompts with labeler demonstrations, (2) the RM (reward model) dataset, consisting of prompts with labeler rankings of model outputs for RM training, and (3) the PPO (proximal policy optimization) dataset, containing prompts without human labels (used for RLHF fine-tuning). A sketch of what one record in each dataset might look like follows below.
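To make these three dataset types concrete, the sketch below shows what a single record of each might look like. The field names and example prompts are purely illustrative assumptions, not the actual format used by OpenAI.

```python
# Illustrative sketch only: one example record for each of the three dataset
# types described above (field names and prompts are made up for clarity).

sft_example = {
    "prompt": "Summarize the following lecture notes in three bullet points: ...",
    "demonstration": "• Point one ...\n• Point two ...\n• Point three ...",  # written by a human labeler
}

rm_example = {
    "prompt": "Explain overfitting to a first-year student.",
    # Several model outputs for the same prompt, ranked by a labeler (best first).
    "ranked_outputs": [
        "Overfitting happens when a model memorizes its training data ...",
        "Overfitting is when the model is bad.",
    ],
}

ppo_example = {
    # No human labels: the reward model scores whatever the policy generates.
    "prompt": "Draft a polite email asking for an extension on a homework deadline.",
}
```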

Basic Models of Training General Large Language Models

  1. Supervised Fine-tuning (SFT):

The SFT model fine-tunes GPT or a previous base language model as the first step in the alignment process. To begin, datasets are assembled from users' prompts to GPT together with demonstrations of how the LLM should respond to them correctly. The term "supervised" comes from how these datasets are specifically chosen: they show the model examples of what curators want the output to look like so that it matches users' desires. For SFT to give the best results, highly dependable datasets need to be assembled that reflect everything users intend their outputs to include. Because of this dependence on high-quality supervised data, the model learns to reproduce responses that are more accurate and in line with the examples it saw during the SFT process. A minimal training sketch follows below.
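To make the SFT step concrete, here is a minimal sketch using the Hugging Face Transformers library, with GPT-2 standing in for GPT-3. The tiny example list and the hyperparameters are placeholders for illustration, not InstructGPT's actual setup.

```python
# Minimal SFT sketch: train a causal language model on (prompt, demonstration) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # GPT-2 as a stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each training example concatenates the prompt with the labeler's demonstration;
# the model is trained to predict the next token over the whole sequence.
examples = [
    ("Summarize: The mitochondria is ...", "It is the powerhouse of the cell."),
]

model.train()
for prompt, demonstration in examples:
    batch = tokenizer(prompt + " " + demonstration, return_tensors="pt")
    # Setting labels = input_ids makes the model compute the standard causal LM loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```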

The following are Steps 2 and 3 as presented in Figure 1, both of which fall under reinforcement learning from human feedback (RLHF) within the model alignment process.

2. Reward Model (RM):

The RM is important in the alignment process because it acts as an additional source of feedback that lets the LLM learn which kinds of outputs benefit users, improving helpfulness and accuracy while guarding against biased or toxic responses. The RM does so by mimicking human judgments of desirability through a reward training system. The first step in training an RM is data collection: for each prompt, several model outputs are gathered and human labelers rank them by preferability. To train on these comparisons, the model needs a neural network structure (transforming the inputted data into mathematical computations) that produces a single scalar reward score for each prompt-response pair. Lastly, a loss function is designed over these scalar scores so that the network learns to assign higher rewards to the responses labelers preferred.

Using InstructGPT as an example, the loss function looks like the following:

loss(θ) = −1/(K choose 2) · E(x, yw, yl)∼D [ log( σ( rθ(x, yw) − rθ(x, yl) ) ) ]

where rθ(x, yw) is the scalar reward score for prompt x and output yw, yw is the output that labelers ranked above yl, K is the number of responses ranked for each prompt, σ is the sigmoid function, and D is the dataset of human comparisons.
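To make the formula concrete, here is a minimal PyTorch sketch of the pairwise comparison loss. It assumes the scores rθ(x, yw) and rθ(x, yl) have already been produced elsewhere by a transformer with a scalar output head, which is not shown.

```python
# Pairwise reward-model loss: -log(sigmoid(r_w - r_l)), averaged over comparisons.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen  = r_theta(x, y_w): scores for the preferred outputs.
       reward_rejected = r_theta(x, y_l): scores for the less-preferred outputs."""
    # The loss shrinks as the preferred output's score rises above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar scores for three (y_w, y_l) comparisons.
r_w = torch.tensor([1.2, 0.7, 2.0])
r_l = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_rm_loss(r_w, r_l))
```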

With the neural network structure and the loss function designed, the model can now be trained and evaluated on a reward ranking system grounded in users' preferences.

3. Proximal Policy Optimization (PPO):

The PPO stage of RLHF, which is used to fine-tune the SFT model, adds a per-token Kullback-Leibler (KL) penalty to prevent over-optimization against the RM when the model is given a prompt and generates a response. The KL divergence measures how one distribution diverges from another, quantifying the difference between the fine-tuned policy's outputs and the SFT model's; the KL penalty then penalizes the policy if that difference becomes too extreme. PPO mitigates over-optimization and balances exploration (testing new responses to discover what works) against exploitation (favoring actions that yield higher rewards from the RM). In the InstructGPT case study, the authors refer to their variant as "PPO-ptx," where pretraining gradients are mixed into the PPO gradients to reduce performance regressions on public NLP datasets. The use of PPO brings improved reliability, increased data efficiency, and greater simplicity and applicability. A sketch of the per-token KL penalty appears below.
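To illustrate the per-token KL penalty, here is a simplified PyTorch sketch of how the reward passed to PPO could be shaped. The log-probability tensors and the coefficient kl_coef are assumed inputs chosen for illustration, not values from the InstructGPT paper.

```python
# Shape per-token rewards: penalize KL drift from the SFT model at every token,
# and add the reward model's scalar score at the final token of the response.
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   log_probs_policy: torch.Tensor,
                   log_probs_sft: torch.Tensor,
                   kl_coef: float = 0.2) -> torch.Tensor:
    # Per-token KL estimate: how far the fine-tuned policy has drifted from SFT.
    kl_per_token = log_probs_policy - log_probs_sft
    rewards = -kl_coef * kl_per_token          # penalty applied at each token
    rewards[-1] = rewards[-1] + rm_score       # RM score added at the last token
    return rewards

# Toy usage for a 4-token response that the reward model scored 1.5.
print(shaped_rewards(torch.tensor(1.5),
                     torch.tensor([-1.0, -0.8, -1.2, -0.5]),
                     torch.tensor([-1.1, -0.9, -1.0, -0.6])))
```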

This three-part alignment process has proven efficient and accurate in the case of InstructGPT, helping standardize SFT followed by RLHF as the recipe for aligning LLMs. With a better understanding of how LLMs are aligned to generate output that matches user intentions, we can now look at how Notion AI builds on this fine-tuning to let users generate more intuitive and desired outputs.

Case Study #2 — How Users Can Incorporate Data into Notion AI

Notion is growing as a widely used workspace among college students, many of whom have turned to it instead of other productivity applications. Developed by Notion Labs Inc., Notion blends organizational tools into one application, allowing users to personalize it to their needs, whether that is time management, to-do lists, project organization, note-taking, or more. To further improve content-creation efficiency, Notion implemented an AI trained as a language model for dialogue applications, releasing it as Notion AI. Notion AI is advertised as different from other AI tools and applications because it operates within an application where users already keep their workspaces, allowing it to be more customizable and personalized to user content.

Notion AI and InstructGPT are both built on LLMs developed by OpenAI, essentially GPT-3, so both share similar structures and alignment processes based on advanced natural language processing (NLP). This is reflected in how Notion AI collects and uses data in users' workspaces to improve its interactions with each user, allowing its reliability to improve and its responses to become more personal to the specific user and their workspace's purpose.

To incorporate Notion AI into their workspaces and adapt it to their data, users can guide the model through context injection. Context injection refers to how users interact with the model by simply providing Notion AI with more context, for example by summarizing relevant information or writing more detailed prompts. It is one of the main ways Notion AI can access users' data directly, apart from when users direct Notion AI to act on a specific part of the workspace, for example by summarizing a page. A limitation of context injection is the token limit: an LLM can only process a certain number of tokens per prompt, and Notion AI's limit is 4,000 tokens, estimated to be around 3,200 words. A rough sketch of staying within this budget follows below.
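As a rough illustration of working within a token budget, the sketch below trims injected context using the article's 4,000-token / 3,200-word estimate. The word-per-token ratio and the fit_context helper are assumptions made for illustration, since Notion AI's actual tokenizer is not public.

```python
# Trim injected context so prompt + context stay under an estimated token budget.
TOKEN_LIMIT = 4000
WORDS_PER_TOKEN = 0.8   # rough estimate: 4000 tokens ~ 3200 words

def fit_context(prompt: str, context: str, token_limit: int = TOKEN_LIMIT) -> str:
    budget_words = int(token_limit * WORDS_PER_TOKEN)
    prompt_words = prompt.split()
    context_words = context.split()
    # Keep the full prompt, then as much context as the remaining budget allows.
    remaining = max(budget_words - len(prompt_words), 0)
    trimmed_context = " ".join(context_words[:remaining])
    return f"{trimmed_context}\n\n{prompt}"

combined = fit_context(
    prompt="Summarize the key deadlines from the notes above in a bulleted list.",
    context="(paste of relevant workspace notes here ...)",
)
print(combined)
```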

Since a clear prompt is essential for Notion AI to give its best feedback, users need to be as clear and specific as possible; the more specific the prompt, the better the feedback. Avoiding complicated sentences and using clear, concise wording works best for LLMs, as it prevents the model from misreading the prompt's target. Users can also conduct their own A/B testing (comparing two variants and their outcomes to find the one that performs better) to get the most personalized feedback from the AI, as in the sketch below. By curating prompts that clearly describe their end goals, users can steer Large Language Models to provide the best outcomes for them.
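As a sketch of what informal A/B testing of prompts might look like, the snippet below compares two prompt variants side by side. The generate function is a hypothetical placeholder, not Notion AI's actual API.

```python
# Compare two prompt variants and review their outputs side by side.
def generate(prompt: str) -> str:
    """Placeholder for a call to the AI tool being tested (e.g. Notion AI)."""
    return f"(model output for: {prompt})"

prompt_a = "Rewrite my notes as a concise study guide."
prompt_b = "Rewrite my notes as a concise study guide with headings and 5 bullets per topic."

results = [(label, generate(prompt)) for label, prompt in [("A", prompt_a), ("B", prompt_b)]]

# Keep the prompt whose output you prefer for future use.
for label, output in results:
    print(f"--- Variant {label} ---\n{output}\n")
```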

Conclusion & Key Takeaways

The process of training these models mainly incorporates supervised fine-tuning (SFT), reward modeling (RM), and proximal policy optimization (PPO). Through this, LLMs can produce output that is helpful, harmless, and honest, and, most importantly, that adjusts to users' intentions, preferences, and data. As Large Language Models gain access to more data through the growth of the Internet and of the industries that use them, they will become more proficient at generating fluent text that mimics human language and at surfacing information instantaneously, which could revolutionize healthcare, education, and many other industries. With the increasing popularity and applicability of Large Language Models, many tech companies are competing to create and expand their own LLMs, driving rapid innovation and growth in the LLM and AI industry.
