The Hitchhiker’s Guide to Instruction Tuning Large Language Models

Viraj Shah
7 min read · Nov 14, 2023


All the code in this article can be found on my GitHub here.

King Llama

What is Instruction Tuning?

Instruction tuning represents a specialized form of fine-tuning in which a model is trained using pairs of input-output instructions, enabling it to learn specific tasks guided by these instructions.

Consider the following input and output:

  • Input: “Provide a list of the most spoken languages”
  • Output: “English, French”

Providing this input and output trains the model to comprehend and execute tasks based on given instructions. These instructions span a wide array of text types, from composing emails to sentence editing, fostering the model’s adaptability across various instruction-driven tasks. By exposing the model to diverse instructions, it gains robust generalization skills, enhancing its ability to generate accurate responses aligned with human-like instruction formats.

This training method significantly improves language models’ responses, aligning them better with how humans typically communicate instructions. While instruction tuning demands substantial computational infrastructure due to the vast number of model parameters involved, this initial investment can yield operational cost reductions by decreasing expenses related to resource-intensive API calls during inference.

We aim to fine-tune the model specifically for solving science multiple-choice questions (MCQs) by configuring it to generate a single-token output representing the precise answer to each question.

The dataset

The original inspiration for this was a Kaggle competition, so we will be using the dataset from here.

Making this work with colab

A primary challenge in fine-tuning large language models (LLMs) is the substantial computational resources they demand. However, the script we’ll be using can run on Google Colab’s free tier. How long training takes depends on how long you choose to train (the number of steps or epochs). The GPU’s VRAM is typically the binding constraint, and the batch size is the key hyperparameter governing how much of it is used. Essentially, the batch size determines how many examples the model processes at any given instant during training.

Let’s compare two batch sizes over the same number of epochs to make this clearer:

  1. Epoch = 1, Batch Size = 32
  2. Epoch = 1, Batch Size = 64

Option 1 will run more slowly than Option 2 because it processes less data at a time, but it also requires less VRAM. These numbers are only examples to illustrate the trade-off.
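If your GPU cannot fit the batch size you want, gradient accumulation lets you trade time for memory: the model runs several small batches and only updates the weights once their gradients have been summed. A minimal sketch of the arithmetic (the variable names mirror the TrainingArguments we set later in this article, but the snippet itself is only illustrative):

per_device_batch_size = 4        # what actually fits in VRAM at once
gradient_accumulation_steps = 4  # how many small batches to accumulate before one weight update

# The optimizer effectively sees this many examples per update:
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16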

Parameter-Efficient Fine-Tuning (PEFT)

PEFT, or Parameter-Efficient Fine-Tuning, represents a strategy used to adeptly tailor pre-trained language models (PLMs) for various applications without the need to fine-tune every parameter within the model. The conventional approach of fine-tuning large-scale PLMs can be excessively expensive in terms of computation and storage. However, PEFT methods take a more focused approach by selectively fine-tuning only a small subset of additional model parameters. This targeted adjustment significantly reduces the computational and storage costs involved. Impressively, recent cutting-edge PEFT techniques have demonstrated performance levels comparable to those achieved through full fine-tuning methods, showcasing their efficiency and effectiveness in adapting PLMs for diverse applications.

PEFT encompasses a wide range of methods aimed at achieving a common goal by updating only a portion of the model’s weights. However, due to this selective updating, the training process typically requires more time to achieve results comparable to those obtained through other fine-tuning approaches. This extended training duration is necessary to ensure the model adjusts effectively despite only updating specific segments of its parameters.

Model Quantisation

Given our VRAM limitations, we’ll use a technique called quantisation, as we lack the computational resources to load the entire model at full precision. Understanding quantisation requires familiarity with data types. Model weights are tensors whose values are stored in a specific data type, usually FP16 or FP32 by default. To simplify, FP16 refers to Floating Point 16, which has a 10-bit fraction, while FP32 has a 23-bit fraction. Representing the same number in FP16 is therefore less precise than in FP32, because fewer bits are allocated to the fraction. I would suggest this quick read about these data types.
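A quick way to see the precision difference is to store the same value in both data types with PyTorch (a toy illustration, not part of the training script):

import torch

x32 = torch.tensor(0.123456789, dtype=torch.float32)
x16 = x32.to(torch.float16)

print(x32.item())  # ~0.123456791 (23-bit fraction, tiny rounding error)
print(x16.item())  # ~0.1235      (10-bit fraction, noticeably coarser)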

Quantisation refers to reducing the precision of those weights so the model fits within our VRAM constraints. We will be using bitsandbytes for the quantisation. It’s crucial to remember that quantisation can influence the model’s output quality, so it’s essential to strike a balance and avoid reducing precision too aggressively.

Low-Rank Adaptation (LoRA)

LoRA, or Low-Rank Adaptation, is an approach aimed at streamlining the fine-tuning process in machine learning. It operates by representing weight updates using two smaller matrices via low-rank decomposition. These matrices, termed update matrices, are trained to incorporate new data while minimizing overall changes. Notably, LoRA maintains the original weight matrix unchanged, separating it from further adjustments. The final outcomes are generated by combining both the original and adapted weights, showcasing LoRA’s methodology of optimizing fine-tuning efficiency through a dual representation approach.
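To make this concrete, here is a toy sketch of the decomposition: instead of learning a full d x d weight update, LoRA learns two thin matrices whose product has the same shape but far fewer parameters (the sizes below are illustrative and not tied to any particular model):

import torch

d, r = 4096, 64                  # layer width and LoRA rank

W = torch.randn(d, d)            # frozen pre-trained weight, never updated
A = torch.randn(r, d) * 0.01     # trainable "down" projection
B = torch.zeros(d, r)            # trainable "up" projection, initialised to zero

delta_W = B @ A                  # low-rank update with the same shape as W
W_adapted = W + delta_W          # what the layer effectively uses after adaptation

print(W.numel())                 # 16,777,216 parameters in the full matrix
print(A.numel() + B.numel())     # 524,288 trainable parameters, roughly 3% of the full matrix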

This approach, LoRA, boasts several advantages:

  1. Enhanced Efficiency: LoRA significantly reduces the count of trainable parameters, streamlining the fine-tuning process.
  2. Maintained Original Weights: The original pre-trained weights remain fixed, enabling the creation of multiple lightweight and portable LoRA models, tailored for various downstream tasks.
  3. Compatibility: LoRA is compatible and complementary to many other parameter-efficient methods, offering flexibility for combination with diverse techniques.
  4. Comparable Performance: Models fine-tuned using LoRA demonstrate performance akin to that of fully fine-tuned models.
  5. Minimal Inference Impact: LoRA doesn’t introduce additional inference latency, because the adapter weights can be merged into the base model (see the sketch after this list).
  6. Extensibility: LoRA integrates with the transformers library without the need for any third-party tools, and it works with bitsandbytes out of the box.
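As a concrete example of point 5, the peft library can fold a trained adapter back into the base model so that inference uses a single set of weights. A minimal sketch, assuming the adapter has been saved to a local "outputs" directory as we do later in this article:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, attach the trained LoRA adapter, then merge the two.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
adapted = PeftModel.from_pretrained(base, "outputs")  # "outputs" is where the adapter is saved below
merged = adapted.merge_and_unload()                   # adapter weights are folded into the base weights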

The Code

Importing all the libraries we will be using:

from huggingface_hub import login
# login(<add token>)
import pandas as pd
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")

TRL stands for Transformer Reinforcement Learning, and SFTTrainer is its Supervised Fine-Tuning Trainer.

We need to build a training prompt for each example before handing the dataset to the trainer.

df = pd.read_csv("/content/train(1).csv")

# Build one prompt per row: the question, the five options, the instruction,
# and finally the correct answer letter (the training target).
df["question"] = (
    df["prompt"]
    + "\n A)" + df["A"]
    + "\n B)" + df["B"]
    + "\n C)" + df["C"]
    + "\n D)" + df["D"]
    + "\n E)" + df["E"]
    + "\n"
    + "You must only answer with the options and nothing else. I do not want an explanation, only the three options that you think are most likely the answer. The answer to this question is "
    + df["answer"]
)

custom_ds = pd.DataFrame()
custom_ds["prompt"] = df["question"]
dataset = Dataset.from_pandas(custom_ds)
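It is worth printing one constructed prompt to confirm the format looks the way you expect (the exact text naturally depends on your copy of the dataset):

# Sanity check: inspect the first constructed training prompt.
print(dataset["prompt"][0])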

Loading our Model and Tokenizer

model_name = "meta-llama/Llama-2-7b-hf"

# Load the model in 4-bit NF4 precision so it fits within Colab's VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, trust_remote_code=True
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

LoRA alpha: From my interpretation, adjusting alpha while keeping the rank fixed results in a situation where scaling alpha by a factor of k, dividing the learning rate by k, and modifying the LoRA initialization weights by sqrt(k) yields no change in the outcome. Effectively, scaling alpha by k equates to simultaneously scaling the learning rate by k and adjusting the initialization weights by sqrt(k). Consequently, rather than focusing on tuning alpha directly, the emphasis should be on tuning the learning rate and initialization parameters.

LoRA dropout: This works like dropout in general, wherein the contributions of some nodes in a neural network are purposely ‘dropped’ to reduce the chances of overfitting.
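One concrete detail worth knowing: in the peft implementation the LoRA update is scaled by lora_alpha / r before being added to the frozen weights, so with the values above the adapter’s contribution is scaled down by a factor of four:

# LoRA scaling factor applied to the adapter output (lora_alpha / r).
scaling = lora_alpha / lora_r
print(scaling)  # 0.25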

Defining the Training Arguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 200
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 300
warmup_ratio = 0.03
lr_scheduler_type = "constant"
max_seq_length = 512  # maximum tokenized prompt length for SFTTrainer (illustrative value)

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

I highly suggest experimenting with these hyperparameters to get the best result. The most systematic way to do this is Optuna, which uses Bayesian optimisation (this might be what the next article is about).
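If you do want to automate the search, a bare-bones Optuna loop looks something like the sketch below; the search space is illustrative, and the objective is a placeholder you would replace with a real training run that returns a validation metric:

import optuna

def objective(trial):
    # Illustrative search space; feed these values into TrainingArguments / LoraConfig and retrain.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    lora_r = trial.suggest_categorical("lora_r", [16, 32, 64])
    # Placeholder: run training with these values and return a held-out validation score.
    return 0.0

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)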

Training the model

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="prompt",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

# Cast the normalisation layers to fp32 for more stable training with quantised weights.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

trainer.train()

Saving our model

model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

We can now test the model

# Re-create the LoRA-wrapped model from the saved config. In a fresh session you may
# prefer peft's PeftModel.from_pretrained to load the trained adapter weights from "outputs".
lora_config = LoraConfig.from_pretrained("outputs")
model = get_peft_model(model, lora_config)

text = dataset["prompt"][0]
device = "cuda:0"
preds = []

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    return_dict_in_generate=True,
    output_scores=True,
)

# Logits for the very first generated token, which should be the answer letter.
first_token_probs = outputs.scores[0][0]

# Token ids of the answer letters in the Llama-2 tokenizer.
option_scores = (
    first_token_probs[[319, 350, 315, 360, 382]].float().cpu().numpy()
)  # ABCDE

# Keep the three highest-scoring options, best first.
pred = np.array(["A", "B", "C", "D", "E"])[np.argsort(option_scores)[::-1][:3]]
pred = " ".join(pred)
preds.append(pred)
# print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

After achieving the desired performance levels with your model, you can proceed to upload it to the Hugging Face Hub.

model.push_to_hub("Veer15/llama2-science-mcq-solver", create_pr=1)

Hey! Look, you made it to the end 🎉

Thank you for your time!
