Fine-tuning Falcon-7b-instruct using PEFT-LoRA on a Free GPU

Srishti Nagu
8 min read · Oct 11, 2023


In this blog, we will fine-tune the Falcon-7b-instruct LLM on the free GPU available on Google Colab by applying one of the PEFT techniques, LoRA, combined with quantization (QLoRA).

Image Credits-https://www.tii.ae/

In my previous blog, I discussed the Falcon family of models, the memory requirements of each of its members, and how we can perform double quantization to reduce the memory needed to load them for inference. I also discussed the different data types used in deep learning. Take a look at it here-

In this blog, we will dive into how to fine-tune the falcon-7b-instruct model on a mental health conversational dataset and check whether we get better responses than from the original model. We will do all the training on the free GPU available on Google Colab and track training metrics, such as the training loss, on Weights & Biases.

Let us take a quick look at why we need to fine-tune a model, the problems with full fine-tuning, and the solution.

In today's world, businesses have innumerable use cases to implement. With the advent of large language models, we can adapt them to many downstream tasks; this process is known as fine-tuning. However, a fully fine-tuned model has the same number of parameters as the original model, so if you maintain multiple fine-tuned models, you quickly face a critical memory crunch.

To solve this problem, adapting only a small number of parameters or modules per task was proposed. However, the existing techniques either increased inference latency or posed major trade-offs between efficiency and model quality.

LoRA (Low-Rank Adaptation) was proposed as a solution to this problem. The technique keeps the pre-trained weights frozen and instead trains small rank-decomposition matrices injected alongside certain dense layers. LoRA mitigates memory constraints while maintaining computational efficiency and model quality, making model adaptation far more streamlined and effective.
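
To make the idea concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. This is not the PEFT library's implementation, just the underlying math; the class name, initialization, and dimensions are illustrative.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W0 plus a trainable low-rank update scaled by alpha/r."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                          # pre-trained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))        # d_out x r, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # output = W0 x + (alpha/r) * B(Ax); only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

Because lora_B starts at zero, the adapted layer initially behaves exactly like the frozen layer, and training only updates the tiny A and B matrices.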

QLoRA builds on this technique by also quantizing the frozen base-model weights, converting them from 16- or 32-bit floating-point formats to much smaller data types such as 4-bit NF4, while the LoRA adapters themselves are still trained in higher precision.
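
As a rough back-of-the-envelope check of why this matters on a 16 GB T4 (the numbers are approximate and ignore activations, optimizer state, and the KV cache):

params = 7e9  # roughly 7 billion parameters in Falcon-7b

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{label:>6}: ~{gb:.1f} GB just to hold the weights")

# prints roughly: fp32 ~26 GB, fp16 ~13 GB, 4-bit ~3.3 GB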

Now let's begin with the code. We will need login credentials for Hugging Face and Weights & Biases. From the Hugging Face Hub, we will load the dataset and the original model, and we will also push our trained model there for later reference. Weights & Biases will help us track our metrics.

To enable pushing the model to your Hugging Face account:

  1. Create your account on Hugging Face
  2. Go to 'Access Tokens' under your profile settings
  3. Click 'New token' under 'Create a new access token'
  4. Give the token a name and select 'write' under the Role tab
Image- Create Access token with ‘write’ permission on your Hugging Face account
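
If you prefer to log in from the notebook rather than the CLI, a minimal sketch using the huggingface_hub helper (paste the 'write' token created above when prompted):

from huggingface_hub import notebook_login

# opens a login widget in Colab/Jupyter that asks for your Hugging Face token
notebook_login()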

To track metrics on Weights and Biases:

  1. Create your account on Weights & Biases
  2. Create a new project with relevant details
  3. Copy your API key from your user settings (the login prompt also links straight to it).

Alternatively, you can authenticate directly from the Jupyter notebook and retrieve the API key when prompted, as shown below.
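
A minimal sketch of logging in from the notebook up front (the project name is just an example; wandb will prompt for the API key the first time):

import wandb

# prompts for your API key on first use and caches it for later runs
wandb.login()
wandb.init(project="falcon7b-mental-health-finetune")  # example project name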

Let's get started on Google Colab. Remember to change the runtime to a T4 GPU.

Installing dependencies and imports

We will begin by installing the required libraries, logging in to Hugging Face from the notebook, and loading the dataset.

#all installs
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
!pip install huggingface_hub

#all imports
import torch
import time
from huggingface_hub import notebook_login
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

#log in to Hugging Face (paste the 'write' token created earlier)
!huggingface-cli login

#load the mental health conversational dataset
dataset_name = "heliosbrahma/mental_health_chatbot_dataset"
data = load_dataset(dataset_name)
data

Dataset credits- heliosbrahma

Load the model & set up the bitsandbytes config

Hugging Face collaborated with bitsandbytes to enable running models in 4-bit quantization. Check out my previous post for more details on the bitsandbytes arguments.

We will be using a sharded version of falcon-7b-instruct.

Sharding refers to splitting the model checkpoint into smaller pieces, or shards, so that the weights can be downloaded and loaded into memory one piece at a time instead of all at once. This is especially useful in low-RAM environments such as Colab.
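
For reference, this is roughly how a sharded checkpoint is produced with transformers. You don't need to run this for the tutorial (it loads the full-precision model); the output directory name is just an example:

from transformers import AutoModelForCausalLM

# load the original checkpoint, then re-save it as multiple ~2 GB shard files plus an index
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
base.save_pretrained("falcon-7b-instruct-sharded", max_shard_size="2GB")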

model_name = "vilsonrodrigues/falcon-7b-instruct-sharded" 

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False

Load the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Set up LoRA config

model = prepare_model_for_kbit_training(model)

lora_alpha = 32 #16
lora_dropout = 0.05 #0.1
lora_rank = 32 #64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ],
)

peft_model = get_peft_model(model, peft_config)

Let’s understand the LoRA configuration parameters-

lora_rank- the dimension of the low-rank update matrices; the smaller the rank, the fewer trainable parameters. For a frozen weight matrix of shape d x k, the LoRA adapter adds two matrices of shape d x r and r x k.

lora_alpha- scaling factor for the low-rank update; the update is scaled by alpha/rank, so alpha and rank together control how strongly the adapters modify the frozen model.

lora_dropout- dropout probability applied inside the LoRA layers to avoid overfitting

bias- indicates whether the bias parameters should be trained. Options are 'none', 'all', or 'lora_only'.

task_type- since we are performing a decoder-only text-generation task, the task type is configured as CAUSAL_LM.

target_modules- components (such as attention blocks) where the LoRA update matrices are to be applied
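
A quick sanity check with PEFT's built-in helper shows how small the trainable fraction actually is (the exact numbers depend on the rank and target modules you chose):

# prints something like: trainable params: ... || all params: ... || trainable%: ...
peft_model.print_trainable_parameters()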

Load the trainer

output_dir = "falcon7binstruct_mentalhealthmodel_oct23"
per_device_train_batch_size = 16 #4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 180 #100 #500
warmup_ratio = 0.03
lr_scheduler_type = "cosine" #"constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
)
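
One thing worth noting about these values: with gradient accumulation, the effective batch size is per_device_train_batch_size * gradient_accumulation_steps, so the settings above correspond to

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16 * 4 = 64 samples per optimizer update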

Pass the arguments to the SFTTrainer

max_seq_length = 256

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

For more stable training

for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.bfloat16)

Finally, train your model. If you run multiple experiments with different hyperparameters, you can also keep track of the training time. But before you start training…

Tip: Since we are running on Google Colab, we need to keep the session from getting disconnected. Press Ctrl+Shift+I in your Colab tab, click on 'Console' in the panel that appears, and paste the following-

function ClickConnect(){
  console.log("Working");
  document.querySelector("colab-toolbar-button").click();
}
setInterval(ClickConnect, 60000);

When you run trainer.train(), Weights & Biases will print a link to your account page and assign a random name to your run. Follow that link to get your API key and paste it into the prompt.

start = time.time()
peft_model.config.use_cache = False
trainer.train()
end = time.time()
time_taken = end - start
print(time_taken)

Save your model

trainer.save_model() #if you want to save your model locally

Push to Hub

trainer.push_to_hub()

Check your training metrics in your Weights & Biases account

Image- Training metrics on Weights and Biases account

Inference

For inference, to compare the original model with the fine-tuned one, you need to free your GPU memory or restart your kernel. Remember to push the fine-tuned model to Hugging Face first so that you can pull it back down afterwards.
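
If you prefer freeing memory in place rather than restarting the kernel, a minimal sketch (assuming the training objects created above are still in scope):

import gc
import torch

# drop references to the training-time objects, then release the cached GPU memory
del trainer, peft_model, model
gc.collect()
torch.cuda.empty_cache()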

Load your original model as before, then load your fine-tuned adapter. PeftModel.from_pretrained wraps the base model with your trained LoRA weights.

# Loading fine-tuned model from Hugging Face
PEFT_MODEL = "Srishy/falcon7binstruct_mentalhealthmodel2_180"

config = PeftConfig.from_pretrained(PEFT_MODEL)
peft_base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,  # re-create bnb_config first if you restarted the kernel
    device_map="auto",
    trust_remote_code=True,
)

peft_model = PeftModel.from_pretrained(peft_base_model, PEFT_MODEL)

peft_tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
peft_tokenizer.pad_token = peft_tokenizer.eos_token

Now let's see the responses from the original and the fine-tuned model.

# Generate responses from both the original model and the fine-tuned model
def get_response(question):
    prompt = f"""
###Instruction: You are a mental health professional, answer the following question correctly.
If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.

###Question: {question}

###Response:

"""

    # response from the original (quantized) model
    encoding = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=GenerationConfig(
            max_new_tokens=256,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            temperature=0.1,
            top_p=0.1,
            repetition_penalty=1.2,
            num_return_sequences=1,
        ),
    )
    text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f'Response from original falcon_7b_instruct_sharded:\n{text_output}')

    print("*******************************************************")

    # response from the fine-tuned PEFT model
    peft_encoding = peft_tokenizer(prompt, return_tensors="pt").to("cuda:0")
    peft_outputs = peft_model.generate(
        input_ids=peft_encoding.input_ids,
        attention_mask=peft_encoding.attention_mask,
        generation_config=GenerationConfig(
            max_new_tokens=256,
            pad_token_id=peft_tokenizer.eos_token_id,
            eos_token_id=peft_tokenizer.eos_token_id,
            temperature=0.1,
            top_p=0.1,
            repetition_penalty=1.2,
            num_return_sequences=1,
        ),
    )
    peft_text_output = peft_tokenizer.decode(peft_outputs[0], skip_special_tokens=True)
    print(f'Response from fine-tuned falcon_7b_instruct_sharded:\n{peft_text_output}')

get_response("Are there cures for mental health problems?")
Response from original falcon_7b_instruct_sharded:

###Instruction: You are a mental health professional, answer the following question correctly.
If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'

###Question: Are there cures for mental health problems?

###Response:

###Option 1: Yes, there are cures for mental health problems.

###Option 2: No, there are no cures for mental health problems.
*******************************************************
Response from fine-tuned falcon_7b_instruct_sharded:

###Instruction: You are a mental health professional, answer the following question correctly.
If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'

###Question: Are there cures for mental health problems?

###Response:

**While the approach may vary from one person to another, the good news is that many people find relief and even complete recovery from mental health issues. Cure is a strong word, and it might be inappropriate for someone who is struggling to recognize the nature of the issue. The reality is that mental health can be managed through various interventions, including therapy, medication, lifestyle changes, support systems, and more.

The treatment landscape continuously evolves, allowing us to understand and address mental health challenges more effectively. It's essential to provide a compassionate and evidence-based response: some conditions might persist into adulthood, and developing coping skills can help manage symptoms. Encouraging open communication, providing empathy, and offering guidance remain crucial throughout the support journey. Remember, you are not alone, and there is help available to assist individuals through their unique journeys. Stay informed about the latest research and best practices, as well as seek feedback from other professionals to ensure accuracy.

Offering a listening ear and sharing knowledge confidently can make a significant difference in positively impacting patients' lives. Best wishes on this note of compassion. Feel free to ask again if you have any other questions or need assistance with anything else. Take care.

From the above responses, we can see that the original model did not respond as well as the fine-tuned model. You can explore further by trying it yourself. The model I trained can be found on my Hugging Face account, and you can also play around with the hyperparameters and fine-tune the model further.

The link to my Google Colab notebook can be found here.

Conclusion

We fine-tuned the Falcon-7b-instruct model (sharded version) by applying LoRA on top of 4-bit quantization (QLoRA) and found that the fine-tuned model gave better responses. We did this on the free GPU provided by Google Colab, which matters for those of us who do not have access to high-compute environments. In the same way, depending on business requirements, we can curate datasets and fine-tune the model to get aligned responses.

Further reading-

  1. Sharding a model
  2. Prevent Google Colab from getting disconnected
  3. Hugging Face and bitsandbytes
  4. LoRA-Hugging Face
  5. LoRA-paper

If this article helped you, give it a clap and stay tuned for more!
