Introduction to LLMs and Generative AI: Part 3 — Fine-Tuning LLMs with Instructions and Evaluation Benchmarks

Yash Bhaskar
9 min read · Jul 18, 2023

In the world of artificial intelligence, large language models (LLMs) play a crucial role in various applications. In the last article, we explored transformer networks, which form the foundation of LLMs, and the generative AI project life cycle. In this article, we delve deeper by focusing on instruction tuning and efficient fine-tuning techniques.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

Instruction Tuning and Its Significance

While pre-training LLMs on vast amounts of data allows them to learn about the world, they often struggle to respond to specific prompts or instructions. This is where instruction fine-tuning comes into play. By training the model to change its behavior and respond more effectively to instructions, we can enhance its performance and usefulness.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

The Breakthrough of Instruction Fine-Tuning

Instruction fine-tuning represents a major breakthrough in the history of large language models. Unlike pre-training, where models learn to predict the next word based on general text, fine-tuning allows us to train the model on a smaller dataset specifically tailored to following instructions. This fine-tuning process enables the model to adapt and excel at specific tasks.

Source : https://www.coursera.org/learn/generative-ai-with-llms/
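
To make this concrete, here is a minimal sketch of what a single instruction-tuning example might look like. The template wording, field names, and the helper function are illustrative assumptions, not taken from any particular dataset or library.

```python
# A minimal sketch of assembling one training example for instruction fine-tuning.
# The template wording and field names are illustrative, not from a specific dataset.

def build_instruction_example(dialogue: str, summary: str) -> dict:
    """Turn a raw (dialogue, summary) pair into a prompt/completion pair."""
    prompt = (
        "Summarize the following conversation.\n\n"
        f"{dialogue}\n\n"
        "Summary:"
    )
    # During fine-tuning the model learns to generate `completion` given `prompt`,
    # using the usual next-token prediction loss.
    return {"prompt": prompt, "completion": " " + summary}

example = build_instruction_example(
    dialogue="Tom: Are we still on for lunch?\nAnna: Yes, 12:30 at the cafe.",
    summary="Tom and Anna confirm lunch at 12:30 at the cafe.",
)
print(example["prompt"])
print(example["completion"])
```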

Challenges and Solutions

One challenge in fine-tuning is catastrophic forgetting, where the model forgets previously learned information when trained on new data. To combat this, instruction fine-tuning can be performed across a wide range of instruction types. By broadening the scope of training, we can mitigate the risk of catastrophic forgetting.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

Efficient Fine-Tuning with Parameter-Efficient Fine-Tuning (PEFT)

Fine-tuning a large model can be computationally and memory intensive. To address this, parameter-efficient fine-tuning (PEFT) offers methods that achieve comparable performance with far fewer resources. Techniques like LoRA (Low-Rank Adaptation) have gained popularity because they deliver impressive performance while minimizing compute and memory usage.
This will be covered in detail in the next part of this article.

Choosing the Right Approach

Developers often start with prompting, which can provide satisfactory results. However, when prompting reaches its limits, fine-tuning with techniques like PEFT becomes crucial to unlock further performance improvements. Developers also weigh the cost-effectiveness of fine-tuning a smaller model against the benefits and constraints of relying on a giant one.

Implications and Benefits

Parameter efficient fine-tuning makes it possible for everyday users to fine-tune generative AI models without incurring prohibitive costs. These techniques empower developers to fine-tune models for specific tasks, domains, and applications, while optimizing memory footprint and resource usage. Furthermore, having models of appropriate sizes ensures control over data and addresses privacy concerns.

Fine-tuning large language models through instruction prompts is a powerful technique to enhance their performance for specific tasks. By leveraging instruction fine-tuning and techniques like PEFT, developers can achieve significant improvements while managing computational and memory constraints.

Challenges

One of the challenges in fine-tuning large language models (LLMs) for specific tasks is catastrophic forgetting. This phenomenon occurs when the fine-tuning process modifies the weights of the original LLM, improving performance on the targeted task but degrading performance on other tasks. To address this, developers have two options. First, they can focus solely on the specific task and prioritize reliable performance in that area, without concern for the model’s ability to generalize to other tasks. For those who require multitask capabilities, however, parameter-efficient fine-tuning (PEFT) offers a solution. PEFT preserves most of the pre-trained weights and trains only a small number of task-specific adapter layers and parameters. This approach significantly reduces the risk of catastrophic forgetting and helps the model maintain its generalized multitask capabilities. Later in this article, we will explore PEFT and its applications in fine-tuning LLMs in more depth.
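
As a preview, here is a hedged sketch of how adding LoRA adapters looks with the Hugging Face peft and transformers libraries. The checkpoint name and hyperparameters (rank, alpha, dropout) are illustrative assumptions; the next part of this series covers PEFT properly.

```python
# A minimal sketch of parameter-efficient fine-tuning setup with LoRA adapters,
# using the Hugging Face `peft` and `transformers` libraries. The checkpoint and
# hyperparameters are illustrative assumptions, not prescriptions.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # sequence-to-sequence fine-tuning
    r=8,                              # rank of the low-rank update matrices
    lora_alpha=32,                    # scaling factor for the adapter output
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, lora_config)

# Only the small adapter matrices are trainable; the original weights stay frozen,
# which is what limits catastrophic forgetting and keeps memory use low.
peft_model.print_trainable_parameters()
```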

Fine-Tuning Language Models for Multitask Learning: Improving Performance Across Tasks

Source : https://www.coursera.org/learn/generative-ai-with-llms/

Multitask fine-tuning, an extension of single-task fine-tuning, has emerged as a powerful technique for improving the performance of language models across several tasks simultaneously. Training the model on a diverse dataset that mixes examples from multiple tasks, such as summarization, review rating, code translation, and entity recognition, helps avoid catastrophic forgetting. Although it requires a substantial amount of data, on the order of 50,000 to 100,000 examples, the benefits of multitask fine-tuning are significant and well worth the effort.
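
A rough sketch of how such a multitask mixture could be assembled with the Hugging Face datasets library is shown below. The dataset identifiers, prompt phrasings, and 50/50 mixing proportions are assumptions for illustration; in practice you would mix many more tasks and may need extra dependencies to download these particular datasets.

```python
# Illustrative sketch: build a small multitask instruction-tuning mixture.
# Dataset names and proportions are assumptions, not a recipe.
from datasets import load_dataset, interleave_datasets

def to_prompt_completion(batch, instruction):
    """Map raw dialogue/summary pairs into a shared prompt/completion schema."""
    prompts = [f"{instruction}\n\n{d}\n\nSummary:" for d in batch["dialogue"]]
    return {"prompt": prompts, "completion": batch["summary"]}

samsum = load_dataset("samsum", split="train")                 # assumed identifier
dialogsum = load_dataset("knkarthick/dialogsum", split="train")  # assumed identifier

# Convert both datasets to the same columns so they can be mixed.
samsum = samsum.map(
    lambda b: to_prompt_completion(b, "Summarize this conversation."),
    batched=True, remove_columns=samsum.column_names,
)
dialogsum = dialogsum.map(
    lambda b: to_prompt_completion(b, "Briefly summarize the dialogue."),
    batched=True, remove_columns=dialogsum.column_names,
)

# Interleave so training batches contain a mix of tasks and instruction phrasings,
# which is the core idea behind multitask instruction fine-tuning.
mixture = interleave_datasets([samsum, dialogsum], probabilities=[0.5, 0.5], seed=42)
print(mixture)
```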

Introducing the FLAN Family of Models

Source : https://www.coursera.org/learn/generative-ai-with-llms/

The FLAN (Fine-tuned LAnguage Net) family of models exemplifies the success of multitask instruction fine-tuning. Models such as FLAN-T5, derived from the T5 foundation model, and FLAN-PaLM, based on the PaLM foundation model, have been trained on 473 datasets spanning 146 task categories. This comprehensive training enables these models to exhibit remarkable performance and versatility across a wide range of tasks.

Unleashing the Power of Dialogue Summarization with FLAN-T5

Among the many tasks, dialogue summarization stands out as a crucial capability for language models. The SAMSum dataset, part of the Muffin collection, provides a training ground for language models to summarize dialogues. Crafted by linguists, SAMSum consists of 16,000 messenger-like conversations annotated with linguistically curated summaries. By using prompt templates that instruct the model to summarize the dialogue with various phrasings, the FLAN-T5 model achieves impressive results.

Source : https://www.coursera.org/learn/generative-ai-with-llms/
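
The sketch below shows what zero-shot dialogue summarization with FLAN-T5 might look like through the transformers library. The checkpoint name, the dialogue, and the instruction phrasing are all illustrative assumptions; FLAN-style templates vary the wording of the instruction.

```python
# A small sketch of zero-shot dialogue summarization with FLAN-T5.
# Checkpoint and prompt wording are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dialogue = (
    "Agent: Hi, how can I help you today?\n"
    "Customer: My order arrived damaged and I'd like a replacement.\n"
    "Agent: I'm sorry about that. I've arranged a replacement to ship tomorrow."
)

# One of several possible instruction phrasings, in the spirit of FLAN templates.
prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```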

Customizing FLAN-T5 for Domain-Specific Dialogue Summarization

While FLAN-T5 demonstrates competence across multiple tasks, specific use cases may demand further improvement. Suppose you are a data scientist developing an application to support a customer service team. In that case, you require accurate summaries of chat conversations to identify key actions and determine appropriate responses. While the SAMSum dataset covers diverse conversation topics, it may not align precisely with the language structure of customer service chats.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

Fine-Tuning FLAN-T5 with DialogSum

Source : https://www.coursera.org/learn/generative-ai-with-llms/

To address this challenge, additional fine-tuning using a domain-specific summarization dataset like DialogSum can enhance FLAN-T5’s summarization capabilities. DialogSum comprises over 13,000 support chat dialogues and summaries, providing a closer match to the conversations encountered by your chatbot. Through this fine-tuning process, the model gains a deeper understanding of the nuances specific to your customer service domain, enabling more accurate and contextually relevant summaries.

Source : https://www.coursera.org/learn/generative-ai-with-llms/
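
A condensed sketch of this further fine-tuning step is shown below, using the Seq2SeqTrainer from transformers. The dataset identifier points to a public mirror of DialogSum and is an assumption, as are the prompt wording and training hyperparameters; in practice you would swap in your own domain-specific conversations.

```python
# Illustrative sketch: further fine-tune FLAN-T5 on a dialogue-summarization dataset.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("knkarthick/dialogsum")  # assumed public mirror of DialogSum

def preprocess(batch):
    # Wrap each dialogue in an instruction prompt; tokenize the summary as the label.
    prompts = [f"Summarize the following conversation.\n\n{d}\n\nSummary:"
               for d in batch["dialogue"]]
    model_inputs = tokenizer(prompts, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="flan-t5-dialogsum",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```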

Optimizing Model Performance through Internal Data

While the DialogSum dataset serves as an illustrative example, the true power of fine-tuning lies in using your company’s internal data. By leveraging actual support chat conversations from your customer support application, the model can learn the intricacies of your company’s preferred summarization style and deliver more tailored and valuable summaries to your customer service colleagues.

Evaluating Fine-Tuned Models

When embarking on fine-tuning, it becomes crucial to assess the quality of the model’s completions. In the next section, we look at the metrics and benchmarks used to evaluate fine-tuned models, so you can measure the improvements achieved and gauge how much better your fine-tuned version is than the base model.

Assessing Performance and Comparing Large Language Models: Metrics for Evaluation

When it comes to evaluating the performance of large language models, traditional machine learning metrics like accuracy fall short due to the non-deterministic and language-based nature of their output. To address this challenge, developers of these models have devised metrics such as ROUGE and BLEU. Let’s delve into these metrics and understand how they can be used to assess model performance.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

Source : https://www.coursera.org/learn/generative-ai-with-llms/

ROUGE is primarily employed to evaluate the quality of automatically generated summaries by comparing them to human-generated reference summaries. The metrics within ROUGE, including ROUGE-1 and ROUGE-2, focus on word-level matches between the reference and generated sentences. ROUGE-1 measures the recall, precision, and F1 scores based on individual word matches, while ROUGE-2 considers bigram matches.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

However, ROUGE scores alone may not capture the complete context and ordering of words, leading to potentially deceptive results. To overcome this limitation, the ROUGE-L score calculates the longest common subsequence between the reference and generated outputs, giving a more comprehensive evaluation. It takes into account the ordering of words and provides a more accurate assessment.

Source : https://www.coursera.org/learn/generative-ai-with-llms/
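
In code, ROUGE scores can be computed with the Hugging Face evaluate library (which wraps the rouge_score package). This is a minimal sketch with made-up example texts.

```python
# Compute ROUGE-1, ROUGE-2, and ROUGE-L for a generated summary vs. a reference.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["Tom and Anna agree to meet for lunch at 12:30."]
references = ["Tom and Anna confirm lunch at 12:30 at the cafe."]

# Returns rouge1, rouge2, rougeL (and rougeLsum) F-measures by default.
scores = rouge.compute(predictions=predictions, references=references)
print(scores)
```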

BLEU: Bilingual Evaluation Understudy

BLEU, originally designed for evaluating machine-translated text, measures the quality of translations by comparing n-gram matches between the machine-generated translation and the reference translation. BLEU scores are calculated for a range of n-gram sizes, and the average precision across these sizes is used to determine the BLEU score. It provides a measure of how closely the generated output matches the reference translation.
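
A minimal sketch of computing a BLEU score with the same evaluate library follows; the sentences are purely illustrative, and each prediction may be compared against several reference translations.

```python
# Compute a BLEU score: modified n-gram precisions (up to 4-grams) with a brevity penalty.
import evaluate

bleu = evaluate.load("bleu")

predictions = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # list of reference sets, one per prediction

result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"], result["precisions"])
```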

Both ROUGE and BLEU scores have their advantages and limitations. While they offer quick and simple evaluation methods, they should not be used as the sole basis for assessing the final performance of a language model. For a more comprehensive evaluation, researchers have developed specific evaluation benchmarks tailored to different tasks.

In conclusion, ROUGE and BLEU metrics serve as valuable tools for evaluating the performance of large language models in tasks like summarization and translation. These metrics provide an automated and structured way to measure the quality and similarity of generated outputs compared to human references. However, for a more comprehensive evaluation, it is essential to consider task-specific evaluation benchmarks developed by researchers in the field.

By incorporating these metrics and evaluation benchmarks, developers can gain insights into the capabilities and improvements of their models, enabling them to make informed decisions during the fine-tuning process and effectively compare their models with others in the field.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

Benchmarking Language Models

Evaluating the true capabilities of large language models (LLMs) requires a comprehensive and holistic approach. Because simple metrics like ROUGE and BLEU scores offer only limited insight, selecting the right evaluation dataset becomes vital. Researchers have established benchmarks that measure and compare LLM performance accurately, focusing on specific model skills, potential risks, and performance on unseen data. Let’s explore some noteworthy benchmarks in the field.

1. GLUE and SuperGLUE: GLUE (General Language Understanding Evaluation), introduced in 2018, comprises diverse natural language tasks such as sentiment analysis and question answering. SuperGLUE, introduced in 2019, addresses GLUE’s limitations and features more challenging tasks, including multi-sentence reasoning and reading comprehension. Both benchmarks provide leaderboards for comparing model performance and tracking progress (a minimal loading sketch follows this list).
Source : https://www.coursera.org/learn/generative-ai-with-llms/

2. The Holistic Evaluation of Language Models (HELM): The HELM framework aims to enhance model transparency and guide performance assessment. It employs a multimetric approach, measuring seven metrics across 16 core scenarios, going beyond basic accuracy measures. HELM also includes metrics for fairness, bias, and toxicity, which are increasingly important as LLMs become more capable of human-like generation and, with it, potential harm.

Source : https://www.coursera.org/learn/generative-ai-with-llms/

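As referenced in the GLUE item above, here is a brief sketch of pulling one GLUE task (SST-2 sentiment classification) and its matching metric with the datasets and evaluate libraries. The predictions passed to the metric are dummy values; running a full GLUE or HELM evaluation involves far more than this.

```python
# Load one GLUE task and its metric as a starting point for benchmark evaluation.
from datasets import load_dataset
import evaluate

sst2 = load_dataset("glue", "sst2")
print(sst2["validation"][0])  # {'sentence': ..., 'label': ..., 'idx': ...}

metric = evaluate.load("glue", "sst2")
# Dummy predictions purely to show the call signature; real use compares model outputs.
print(metric.compute(predictions=[1, 0], references=[1, 1]))
```
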

By utilizing these benchmarks, researchers can gain a comprehensive understanding of LLM capabilities while addressing various dimensions of language understanding. As LLMs advance, it remains crucial to evolve benchmarks continuously and assess emerging risks, ensuring responsible development and deployment of these powerful language models.

Connect with me : https://www.linkedin.com/in/yash-bhaskar/
More Articles like this: https://medium.com/@yash9439
