Multi-GPU Training of 70B LLMs with DeepSpeed and FSDP+QLoRA

Zain ul Abideen
5 min read · Mar 14, 2024


Train 70–120B LLMs on 4x A100s and 2x RTX 3090s (consumer-grade GPUs)


I have been working with bigger models like Mixtral 8x7B, Qwen-120B, and Miqu-70B recently. The most important consideration when playing with models of this size is the amount of compute they require during training. I have been using DeepSpeed for multi-GPU training and looking at what each ZeRO stage (ZeRO-1, ZeRO-2, ZeRO-3) brings to the table. I will also focus on a recent technique (FSDP+QLoRA) for training larger models on consumer-grade GPUs. A few details regarding my recent experiments:

Liberated Miqu 70B

With the release of the new SystemChat dataset from Abacus AI, I fine-tuned Miqu-70B on it using 2x A100s and DeepSpeed ZeRO-2. I also tried DeepSpeed ZeRO-3, but after running into multiple issues in Axolotl around quantization and out-of-memory (OOM) errors, I went back to ZeRO-2. The key difference is that ZeRO-2 only partitions the optimizer states and gradients across GPUs while the model parameters are replicated on each GPU, whereas ZeRO-3 shards the model weights across all GPUs as well. Liberated Miqu 70B is a totally uncensored model, so be careful what you use it for. I trained the model for 1 epoch using QLoRA with Axolotl. A rough sketch of the ZeRO-2 vs. ZeRO-3 memory trade-off is shown below, followed by the Axolotl configuration for this experiment.
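
To make that trade-off concrete, here is a back-of-the-envelope estimate of the per-GPU footprint of parameters, gradients, and optimizer states at each ZeRO stage. It assumes plain mixed-precision training (bf16 weights and gradients, fp32 Adam states) rather than the QLoRA setup actually used in this run, so the absolute numbers are only illustrative; the point is how the sharding pattern changes.

def zero_per_gpu_gb(n_params_billion: float, n_gpus: int, stage: int) -> float:
    """Rough per-GPU memory (GB) for params, grads, and Adam states under ZeRO."""
    params = 2.0 * n_params_billion   # bf16 weights: 2 bytes per parameter
    grads = 2.0 * n_params_billion    # bf16 gradients
    optim = 12.0 * n_params_billion   # fp32 master weights + Adam m and v: ~12 bytes per parameter
    if stage == 1:                    # ZeRO-1: shard optimizer states only
        return params + grads + optim / n_gpus
    if stage == 2:                    # ZeRO-2: shard optimizer states and gradients
        return params + (grads + optim) / n_gpus
    if stage == 3:                    # ZeRO-3: shard the parameters as well
        return (params + grads + optim) / n_gpus
    raise ValueError("stage must be 1, 2, or 3")

for stage in (1, 2, 3):
    print(f"70B model on 2 GPUs, ZeRO-{stage}: ~{zero_per_gpu_gb(70, 2, stage):,.0f} GB per GPU")

Whatever the exact numbers, the replicated bf16 parameters (about 140 GB for a 70B model) are what ZeRO-3 removes from each GPU, and what QLoRA's 4-bit quantization shrinks in the ZeRO-2 run described here.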

base_model: 152334H/miqu-1-70b-sf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: abacusai/SystemChat
    type: sharegpt
dataset_prepared_path:
val_set_size: 0
output_dir: /workspace/miqu-systemchat
resume_from_checkpoint:
hf_use_auth_token:
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
- embed_tokens
- lm_head
wandb_project: Miqu-Systemchat-multiGPU
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps:
save_steps: 2000
save_total_limit: 2
eval_sample_packing:
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
tokens:
trust_remote_code: true

🤗 Liberated-Miqu-70B: https://huggingface.co/abideen/Liberated-Miqu-70B

FSDP+QLoRA

Answer.ai released a new technique for training bigger models on consumer-grade GPUs (RTX 3090s or 4090s) by combining FSDP and QLoRA. Two types of hardware are normally used: data-center-class hardware such as H100s and A100s, and desktop machines with gaming GPUs such as dual 4090s and 3090s. The idea is simple: figure out how to use these 10x cheaper GPUs to train the best available open-source models, and this is where Answer.ai's FSDP+QLoRA comes in handy. I gave FSDP+QLoRA a shot with Mixtral 8x7B on 2x 3090s. The technique has also been integrated into the Axolotl library on an experimental basis. Answer.ai's blog did not mention anything about speed and time constraints on consumer-grade GPUs; I set out to train Mixtral for only 100 steps just to try things out, but the time required for that alone was around 70 hours, which is huge. Since the experiment was taking so long, it was not feasible for me to complete it, so for now I am moving back to A100s until the technique becomes more efficient. Still, it is a great effort by Jeremy Howard and his team to bring training of larger models to consumer-grade GPUs with limited VRAM. A rough sketch of the underlying idea follows, and the Axolotl config file for this experiment is given after that.
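
For intuition, here is a minimal sketch of how the pieces fit together in plain PyTorch/Hugging Face code: the base model is loaded frozen in 4-bit NF4 (the QLoRA part), small LoRA adapters are attached, and FSDP shards the quantized weights across the GPUs instead of replicating them. This is only an illustration of the concept, not Answer.ai's or Axolotl's actual implementation; among other things, their code avoids materializing the full model on every GPU during loading, which this simplified version does not, and it needs recent transformers/peft/bitsandbytes versions that support bnb_4bit_quant_storage.

# Minimal FSDP+QLoRA sketch (launch with: torchrun --nproc_per_node=2 sketch.py).
# Illustrative only; not the Answer.ai/Axolotl implementation.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer
from peft import LoraConfig, get_peft_model

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# QLoRA part: frozen base model in 4-bit NF4. bnb_4bit_quant_storage=bfloat16 is the
# key setting that lets FSDP shard the quantized weights like ordinary bf16 tensors.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map={"": local_rank},  # simplification: loads the whole 4-bit model on each GPU
)

# LoRA part: only these small adapter matrices receive gradients.
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# FSDP part: shard the (quantized) base weights across GPUs, wrapping per decoder layer.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={MixtralDecoderLayer}
    ),
    use_orig_params=True,  # lets frozen 4-bit weights and trainable LoRA params coexist
    device_id=torch.cuda.current_device(),
)
# From here, a normal training loop (forward, loss, backward, optimizer.step on the
# LoRA parameters) proceeds as usual.

Note that the Axolotl config below wraps FSDP units at MixtralSparseMoeBlock rather than the decoder layer used in this sketch; the wrap class only controls the granularity at which FSDP shards and gathers weights.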

base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
  - path: cognitivecomputations/WizardLM_evol_instruct_V2_196k_unfiltered_merged_split
    type: sharegpt
    conversation: chatml
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./qlora-out
model_config:
  output_router_logits: true
adapter: qlora
lora_model_dir:
sequence_len: 1024
sample_packing: false
pad_to_sequence_len: false
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: fsdp
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
max_steps: 100
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
  - full_shard
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
special_tokens:

MegaQwen-120B

I also tried out the interleaving technique on Qwen-70B to create MegaQwen-120B, inspired by Venus-120B. Since a 120B model requires an insane amount of VRAM for training, I learned the hard way that you should fine-tune the 70B model first and then interleave it, thereby bypassing the memory constraints. I did it the other way around, interleaving first and then fine-tuning the massive 120B model, which ended in OOM. My reasoning was that a 120B-parameter model needs about 240 GB of VRAM in 16-bit, or roughly 68 GB in 4-bit, so throwing 4x A100s (320 GB of VRAM in total) at it should have worked. It did not. The main reason is that ZeRO-2 keeps a full copy of the model parameters on every GPU, and PyTorch overhead was taking up another ~12 GB, which together pushed each 80 GB A100 into OOM, so adding more A100s made no difference. ZeRO-3 (which shards the model parameters) was not an option either, due to the errors it kept throwing. I noted these OOM errors and will keep track of memory constraints more vigilantly in the future. The arithmetic is worked through below, and the Axolotl config for this experiment follows.
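
Worked through with the numbers from the paragraph above (the roughly 12 GB of PyTorch/CUDA overhead is what I observed; activations make things even tighter), the arithmetic looks like this:

# Why 4x A100 80 GB still hit OOM with ZeRO-2 on a 4-bit 120B model.
GPU_VRAM_GB = 80
N_GPUS = 4
WEIGHTS_4BIT_GB = 68        # estimate for the 120B model quantized to 4-bit (from above)
TORCH_OVERHEAD_GB = 12      # CUDA context, allocator caches, etc. (observed, approximate)

# ZeRO-2 replicates the (quantized) model weights on every GPU; only the gradients and
# optimizer states of the small QLoRA adapters are sharded, which is negligible here.
zero2_per_gpu = WEIGHTS_4BIT_GB + TORCH_OVERHEAD_GB
print(f"ZeRO-2: ~{zero2_per_gpu} GB per GPU before activations on an {GPU_VRAM_GB} GB A100 -> OOM")

# ZeRO-3 (or FSDP) would shard the weights instead of replicating them:
zero3_per_gpu = WEIGHTS_4BIT_GB / N_GPUS + TORCH_OVERHEAD_GB
print(f"ZeRO-3: ~{zero3_per_gpu:.0f} GB per GPU, which is why parameter sharding matters at this scale")

Adding more GPUs under ZeRO-2 does not change the per-GPU weight footprint, which is why throwing extra A100s at the problem made no difference.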

base_model: abideen/Qwen-120B
model_type: Qwen2ForCausalLM
tokenizer_type: Qwen2Tokenizer
load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: abacusai/SystemChat
    type: sharegpt
dataset_prepared_path:
val_set_size: 0
output_dir: /workspace/Qwen-120b-systemchat
resume_from_checkpoint:
hf_use_auth_token:
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
- embed_tokens
- lm_head
wandb_project: Qwen-Systemchat-multiGPU
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps:
save_steps: 2000
save_total_limit: 2
eval_sample_packing:
debug:
deepspeed:
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"
trust_remote_code: true

💥 MegaQwen-120B: https://huggingface.co/abideen/MegaQwen-120B

Conclusion

All in all, it was a good experience for me to try out multi-GPU training on some of the best available open-source models. I worked my way through many different errors, but some remained unresolved; I will try to solve them in future experiments.

Special thanks to QueryLoopAI for sponsoring the compute for these experiments.

Also, feel free to drop me a message or:

  1. Connect and reach me on LinkedIn and Twitter
  2. Follow me on 📚 Medium
  3. Subscribe to my 📢 weekly AI newsletter!
  4. Check out my 🤗 Hugging Face
