Falcon 180B Finetuning using 🤗 PEFT and DeepSpeed

Sourab Mangrulkar
8 min read · Oct 2, 2023


Chatbot using the Falcon 180B chat & instruct model fine-tuned with 🤗 PEFT and DeepSpeed.

In this blog post, we will look at how you can fine-tune humongous models like Falcon 180B using Hugging Face’s PEFT, DeepSpeed ZeRO-3, Flash Attention and gradient checkpointing on just 16 A100 80GB GPUs, a fraction of the 1024 GPUs typically used for such tasks. What makes this even more compelling is that the resulting model not only consumes significantly fewer resources but also outperforms the official Falcon-180B and FalconChat-180B models on the Open LLM Leaderboard by a remarkable 3%. So, not only does this save on computational power, it also delivers superior results. And the cherry on top? The cost of training this model comes in at just $864 (36 hrs * $24/hr), a mere fraction of what it would take to fine-tune the chat version of Falcon-180B. Let’s dive in and discover how to achieve this impressive feat!

Introduction

DeepSpeed Zero Redundancy Optimizer (ZeRO)

It is a paradigm in which the optimizer states, gradients and model parameters are sharded across devices. This results in huge memory savings, allowing training to scale to large LLMs and higher batch sizes.

(Source: link) DeepSpeed ZeRO

Stage 1: Shards optimizer states across data parallel workers/GPUs

Stage 2: Shards optimizer states + gradients across data parallel workers/GPUs

Stage 3: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs

Optimizer Offload: Offloads the gradients + optimizer states to CPU/Disk, building on top of ZeRO Stage 2

Param Offload: Offloads the model parameters to CPU/Disk, building on top of ZeRO Stage 3
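
To make the stages concrete, below is a minimal, illustrative ZeRO-3 configuration expressed as a Python dict. It is a sketch rather than the exact config used later in this post, and could be passed to the 🤗 Trainer via TrainingArguments(deepspeed=ds_config) or to deepspeed.initialize(config=ds_config).

# Illustrative sketch of a DeepSpeed ZeRO-3 config (not the exact one used in this post).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # 1: optimizer states, 2: + gradients, 3: + model parameters
        "offload_optimizer": {"device": "none"},  # set to "cpu"/"nvme" for Optimizer Offload
        "offload_param": {"device": "none"},      # set to "cpu"/"nvme" for Param Offload
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}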

Parameter Efficient Fine-Tuning (PEFT) and Low Rank Adaptation (LoRA)

As models get larger and larger, full fine-tuning becomes infeasible on consumer hardware. In addition, storing and deploying fine-tuned models independently for each downstream task becomes very expensive, because the fine-tuned models are the same size as the original pretrained model.

Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to fine-tune a large pretrained model on a specific downstream task while requiring significantly fewer parameters than full fine-tuning. The goal is to achieve comparable or even better performance than full fine-tuning, while requiring less computation and memory resources.

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLM, thereby greatly decreasing the computational and storage costs. This also overcomes the issue of catastrophic forgetting, a behaviour observed during full fine-tuning of LLMs. PEFT approaches have also been shown to be better than full fine-tuning in low-data regimes and to generalize better to out-of-domain scenarios. They can be applied to various modalities, e.g., image classification and Stable Diffusion DreamBooth.

It also helps with portability: users can tune models using PEFT methods to get tiny checkpoints worth a few MBs, compared to the large checkpoints of full fine-tuning. For example, bigscience/mt0-xxl takes up 40GB of storage, and full fine-tuning leads to a 40GB checkpoint for each downstream dataset, whereas with PEFT methods it is just a few MBs per downstream dataset while achieving performance comparable to full fine-tuning. The small trained weights from PEFT approaches are added on top of the pretrained LLM, so the same LLM can be used for multiple tasks by adding small weights without having to replace the entire model.
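
As a rough sketch of this portability (the adapter repository names below are hypothetical placeholders), the same frozen base model can serve multiple tasks by loading and switching between tiny adapter checkpoints:

# Sketch only: the adapter repo names are hypothetical placeholders.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xxl")  # ~40GB of base weights

# Each adapter checkpoint is only a few MBs, trained on a different downstream dataset.
model = PeftModel.from_pretrained(base, "my-org/mt0-xxl-lora-summarization", adapter_name="summarize")
model.load_adapter("my-org/mt0-xxl-lora-classification", adapter_name="classify")

model.set_adapter("summarize")  # switch tasks without touching the base weights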

🤗 PEFT library provides the latest Parameter-Efficient Fine-tuning techniques seamlessly integrated with 🤗 Transformers and 🤗 Accelerate. This enables the use of the most popular and performant models from Transformers coupled with the simplicity and scalability of Accelerate.

For further information, please go through the blog post below.

LoRA is a PEFT method that employs a memory-efficient reparametrization trick: it adds small additional trainable parameters to target modules (usually the query and value layers in transformer attention blocks), thereby drastically reducing the number of trainable parameters. A nice feature of LoRA is that there is no added latency during inference, as the additional trainable parameters can be merged back into the original weights. This method achieves performance comparable to full fine-tuning, which makes its usage widespread across the community.

LoRA implementation. Figure 1 from the LoRA paper (Hu et al.)
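
As a minimal sketch of what attaching LoRA looks like with 🤗 PEFT (the rank, alpha, dropout and target modules mirror the launch command later in this post):

# Minimal sketch: wrap the base model with LoRA adapters using 🤗 PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-180B", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 180B parameters is trainable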

Flash Attention

Flash Attention and gradient checkpointing are used to speed up training and reduce VRAM usage, which makes the fine-tuning feasible and saves compute costs. The codebase currently enables Flash Attention via monkey patching, and the implementation is at DHS-LLM-Workshop/chat_assistant/training/falcon_flash_attn_monkey_patch.py
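
As a rough sketch of switching these two savings on (note: this post's codebase enables Flash Attention for Falcon through the monkey patch above; the attn_implementation argument shown here is the route recent Transformers versions expose and is given as an alternative, not the method used in the post):

# Sketch: enable Flash Attention and gradient checkpointing.
# Recent Transformers releases expose Flash Attention via attn_implementation,
# which replaces the monkey-patch route used in this post's codebase.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
model.gradient_checkpointing_enable()  # trade recomputation for lower activation memory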

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness introduces a way to compute exact attention while being faster and more memory-efficient by leveraging knowledge of the memory hierarchy of the underlying hardware/GPUs: the higher the bandwidth/speed of a memory tier, the smaller its capacity, as it becomes more expensive.

If we follow the blog Making Deep Learning Go Brrrr From First Principles, we can figure out that the attention module is memory-bound/bandwidth-bound on current hardware. The reason is that attention mostly consists of elementwise operations, as shown below on the left-hand side. We can observe that the masking, softmax and dropout operations take up the bulk of the time, rather than the matrix multiplications, which account for the bulk of the FLOPs.

(Source: link)

This is precisely the problem that Flash Attention addresses. The idea is to remove redundant HBM reads/writes. It does so by loading the inputs into SRAM, performing all the intermediate steps there, and only then writing the final result back to HBM; this is also known as kernel fusion. Below is an illustration of how this overcomes the memory-bound bottleneck.

(Source: link)

Tiling is used during the forward and backward passes to chunk the NxN softmax/score computation into blocks, overcoming the limitation of the SRAM size. To enable tiling, the online softmax algorithm is used. Recomputation is used during the backward pass to avoid storing the entire NxN softmax/score matrix from the forward pass. This greatly reduces memory consumption.
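
For intuition, below is a toy Python illustration of the online softmax idea (a plain reference sketch, not the fused CUDA kernel): the scores are scanned block by block while a running max and running normalizer are maintained, so the softmax statistics never require materializing a full row at once.

# Toy illustration of online softmax: compute the row max and normalizer block by
# block, then normalize. Reference sketch only, not the fused FlashAttention kernel.
import torch

def online_softmax(scores: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    n = scores.shape[-1]
    running_max = torch.full(scores.shape[:-1], float("-inf"))
    running_sum = torch.zeros(scores.shape[:-1])
    for start in range(0, n, block_size):
        block = scores[..., start:start + block_size]
        block_max = block.max(dim=-1).values
        new_max = torch.maximum(running_max, block_max)
        # Rescale the old normalizer to the new max, then add this block's contribution.
        running_sum = running_sum * torch.exp(running_max - new_max) + torch.exp(
            block - new_max.unsqueeze(-1)
        ).sum(dim=-1)
        running_max = new_max
    return torch.exp(scores - running_max.unsqueeze(-1)) / running_sum.unsqueeze(-1)

x = torch.randn(4, 1024)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=-1), atol=1e-6)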

For both a simplified and an in-depth understanding of Flash Attention, please refer to the blog posts ELI5: FlashAttention and Making Deep Learning Go Brrrr From First Principles, along with the original paper FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Hardware and Environment

Number of nodes: 2
Number of GPUs per node: 8
GPU type: A100
GPU memory: 80GB
Intra-node connection: NVLink
RAM per node: 1TB
CPU cores per node: 96
Inter-node connection: Elastic Fabric Adapter

Dataset

The fine-tuning dataset is focused on improving the LLM’s logical reasoning and conversation skills. It comprises the following datasets and can be found on the Hub at smangrul/chat-instruct-mixer

Chat-Instruct-Mixer Dataset Constituents
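
A quick sketch of pulling the mixture down for inspection (the "train" split name is an assumption; the "content" text field matches the --dataset_text_field flag used in the launch command below):

# Sketch: load and peek at the fine-tuning mixture.
from datasets import load_dataset

dataset = load_dataset("smangrul/chat-instruct-mixer", split="train")  # split name assumed
print(dataset)                      # features and number of rows
print(dataset[0]["content"][:500])  # "content" matches --dataset_text_field below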

Fine-Tuning

Below is the command showcasing how to use the Accelerate launcher to run the training. Notice that we are overriding the main_process_ip, main_process_port, machine_rank, num_processes and num_machines values of the deepspeed_config.yaml. Another important point to note is that the storage is shared across all the nodes. The training code can be found at DHS-LLM-Workshop/chat_assistant/sft/training/train.py


accelerate launch \
--config_file configs/deepspeed_config.yaml \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank $SLURM_PROCID \
--num_processes 16 \
--num_machines 2 \
train.py \
--model_name "tiiuae/falcon-180B" \
--dataset_name "smangrul/chat-instruct-mixer" \
--max_seq_len 2048 \
--bf16 True \
--max_steps 5000 \
--logging_steps 25 \
--eval_steps 1000 \
--save_steps 100 \
--packing True \
--output_dir "./experiments/falcon-180B-chat-asst-ds-lora-v3" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--lr_scheduler_type "cosine" \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--use_flash_attn True \
--use_gradient_checkpointing True \
--use_peft_lora True \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--lora_target_modules "query_key_value,dense,dense_h_to_4h,dense_4h_to_h"

The DeepSpeed config is available at DHS-LLM-Workshop/chat_assistant/training/configs/deepspeed_config.yaml and also given below:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Fine-tuning plots displaying the eval loss, train loss and learning rate schedule, respectively.

Above are the plots displaying the eval loss, train loss and learning rate schedule, respectively. We can observe that the training progressed as expected.

Notice from the plots that the training time is above 30 hours; as per the logs, it took 36 hours in total. Therefore, the cost of training this model comes in at just $864 (36 hrs * $24/hr), a mere fraction of what it would take to fine-tune the chat version of Falcon-180B. The price is based on Lambda GPU Cloud | VM Pricing and Specs (lambdalabs.com).

The trained model checkpoint can be found here: smangrul/falcon-180B-chat-asst-ds-lora
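
Below is a minimal inference sketch for attaching the trained adapter to the base model (the prompt is purely illustrative; refer to the model card for the exact chat format, and note that loading the 180B base still requires multiple 80GB GPUs):

# Sketch: attach the trained LoRA adapter to the frozen Falcon-180B base for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-180B")
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", torch_dtype=torch.bfloat16, device_map="auto"  # shard across available GPUs
)
model = PeftModel.from_pretrained(base, "smangrul/falcon-180B-chat-asst-ds-lora")

prompt = "Write a short, upbeat product description for a solar-powered desk lamp."  # illustrative
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))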

Evaluation

We evaluate the model using the benchmarks from the Open LLM Leaderboard (a Hugging Face Space by HuggingFaceH4), and below are the results. We can observe that our fine-tuned model outperforms the official Falcon-180B and FalconChat-180B models by a remarkable 3% (relative gains).

Open LLM Leaderboard metrics of various Falcon 180B variants. PEFT+DeepSpeed finetuned model achieves the best performance with ~3% improvement over the previous best.

Qualitative Evaluation

Let’s take a look at a few examples to gauge the model’s capabilities qualitatively:

Helping with Marketing Copy
Talking to a persona
Coding

Conclusion

We successfully fine-tuned the Falcon-180B model using LoRA and DeepSpeed ZeRO-3 in a multi-node, multi-GPU setting. We gave a brief overview of DeepSpeed ZeRO, PEFT methods and Flash Attention, followed by a description of the fine-tuning dataset, the fine-tuning codebase, and the launch command with the related hyperparameters. We then looked at the cost and performance of the resulting model, which can be found here: smangrul/falcon-180B-chat-asst-ds-lora. The fine-tuned model outperformed the original models on the Open LLM Leaderboard by 3% while being extremely computationally efficient, costing only $864. Therefore, fine-tuning LLMs using PEFT and DeepSpeed is a good alternative to full fine-tuning in computationally resource-constrained scenarios.
