Setting Up Cloud-Based Distributed Training to Fine-Tune LLMs

Fine-Tuning the nanoGPT Model for Language Tasks

Benjamin Consolvo
Intel Analytics Software
Mar 29, 2024 · 4 min read


Image generated with Stability AI’s Stable Diffusion 2-1 using the prompt “2 chat robots AI in future on mars with nice background of black hole” (https://huggingface.co/spaces/stabilityai/stable-diffusion).

Fine-tuning large language models (LLMs) is in the spotlight thanks to rising interest in building applications for specific language tasks like text generation, code generation, chatbots, and retrieval-augmented generation. A foundation model like OpenAI’s GPT-4 was trained on a massive compute cluster of around 25,000 GPUs over roughly 100 days (source). Fine-tuning a model for a downstream language task, by contrast, can often be accomplished on a single GPU, and sometimes even a single CPU. Training on a single node can be slow, however, so there is often a need to leverage multiple nodes. In this article, I give a high-level overview of how to fine-tune nanoGPT in parallel over several CPUs on Google Cloud Platform (GCP). Even though the implementation here is specific to nanoGPT, CPUs, and GCP, the same principles apply to fine-tuning other LLMs, to training on GPUs, and to other cloud service providers. For larger models, memory and networking requirements must also be taken into account.

NanoGPT Model

The nanoGPT model is a 124-million-parameter model that aims to reproduce OpenAI’s GPT-2. I show fine-tuning on the OpenWebText dataset, distributed over three 4th Gen Intel Xeon CPUs. The objective here is not to arrive at a ChatGPT-like AI model, but rather to understand how to set up distributed training to fine-tune for a specific language objective. The result of this training is a base LLM that can generate words, or tokens, and that can be adapted to a downstream language task with further fine-tuning.
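
For context, the sketch below shows roughly how the OpenWebText data is turned into the binary token files that nanoGPT’s training script reads. It follows the same approach as the prepare.py script shipped with nanoGPT (GPT-2 BPE tokenization via tiktoken, token IDs written to a memory-mapped uint16 file); the 1% slice and the output path here are illustrative only.

# Minimal data-preparation sketch (nanoGPT ships a complete prepare.py that does this at full scale)
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair encoding

# Load a small slice of OpenWebText for illustration; the real run uses the full split
ds = load_dataset("openwebtext", split="train[:1%]", trust_remote_code=True)

def tokenize(example):
    ids = enc.encode_ordinary(example["text"])  # ignore special tokens in the raw text
    ids.append(enc.eot_token)                   # delimit documents with <|endoftext|>
    return {"ids": ids, "len": len(ids)}

tokenized = ds.map(tokenize, remove_columns=["text"])

# Concatenate all token IDs into a single uint16 memory-mapped file for training
total_len = int(np.sum(tokenized["len"]))
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(total_len,))
idx = 0
for ids in tokenized["ids"]:
    arr[idx : idx + len(ids)] = ids
    idx += len(ids)
arr.flush()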

Cloud Solution Architecture

Our cluster consists of GCP virtual machines from the C3 series. To enable seamless communication between the instances, each machine is attached to the same virtual network, and a permissive firewall rule allows all traffic between nodes within the cluster. The raw dataset is downloaded from Hugging Face, and once the model has been trained, the weights are saved to disk on the virtual machines.

Reference architecture for the Intel Optimized Cloud Modules for GCP: nanoGPT Distributed Training (Source: GitHub).

Code Highlights

The Intel Extension for PyTorch elevates PyTorch performance on Intel hardware by adding the newest features and optimizations before they are incorporated into open source PyTorch. The extension efficiently uses Intel hardware capabilities such as Intel Advanced Matrix Extensions (Intel AMX). Enabling these optimizations is straightforward: wrap the model and optimizer objects with ipex.optimize:

# Assumes: import torch; import intel_extension_for_pytorch as ipex

# Set up CPU autocast with the bfloat16 dtype
dtype = torch.bfloat16
self.autocast_ctx_manager = torch.cpu.amp.autocast(
    cache_enabled=True, dtype=dtype
)

# Wrap both the PyTorch model and the optimizer
self.model, self.optimizer = ipex.optimize(
    self.model, optimizer=self.optimizer,
    dtype=dtype, inplace=True, level="O1",
)

The Accelerate library from Hugging Face streamlines gradient accumulation. It abstracts away the complexity of supporting multiple CPUs or GPUs and provides an intuitive API, making gradient accumulation and clipping hassle-free during training:

# Assumes: from accelerate import Accelerator

# Initialize the Accelerator object
self.accelerator = Accelerator(
    gradient_accumulation_steps=gradient_accumulation_steps,
    cpu=True,
)

# Gradient accumulation
with self.accelerator.accumulate(self.model):
    with self.autocast_ctx_manager:
        _, loss = self.model(X, Y)
    self.accelerator.backward(loss)
    loss = loss.detach() / gradient_accumulation_steps

# Gradient clipping
self.accelerator.clip_grad_norm_(
    self.model.parameters(), self.trainer_config.grad_clip
)

For distributed training, I used the Intel oneAPI Collective Communications Library (oneCCL). With optimized communication patterns, oneCCL enables developers and researchers to train newer and deeper models more quickly across multiple nodes. The accompanying Intel MPI library provides the mpirun launcher, which lets you start distributed training workloads from the command line:

# Generate the multi-CPU Accelerate config
accelerate config --config_file ./multi_config.yaml

# Launch the distributed training job across 3 nodes (1 process per node)
mpirun -f ~/hosts -n 3 -ppn 1 \
  -genv LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc.so" \
  accelerate launch --config_file ./multi_config.yaml main.py
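
After launching, a quick way to confirm that all three nodes have joined the run is to print each process’s rank from inside the training script. The snippet below is a small sanity check of my own, not part of the module:

# Sanity check: each of the 3 launched processes should report a distinct rank
from accelerate import Accelerator

accelerator = Accelerator(cpu=True)
print(
    f"process {accelerator.process_index + 1} of {accelerator.num_processes} "
    f"(main process: {accelerator.is_main_process})"
)
accelerator.wait_for_everyone()  # barrier: wait until every rank reaches this point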

Conclusions

This module demonstrates how to transform a standard single-node PyTorch LLM training scenario into a high-performance distributed training scenario across multiple CPUs. To fully capitalize on Intel hardware and further optimize fine-tuning, the module integrates the Intel Extension for PyTorch and the Intel oneAPI Collective Communications Library (oneCCL). It serves as a guide to setting up a cluster for distributed training while showcasing a complete project for fine-tuning LLMs.

Additional Resources

  1. Check out a very detailed walkthrough on GitHub
  2. Post your model to the new Hugging Face Powered-by-Intel LLM Leaderboard
  3. Check out the full suite of Intel Optimized Cloud Modules
  4. Register for office hours for implementation support from Intel engineers
  5. Come chat with us on our DevHub Discord server to keep interacting with other developers
