Distributed Open-Source LLM Fine-Tuning with LLaMA-Factory on GKE

Rick(Rugui) Chen
Google Cloud - Community
7 min read · Jul 8, 2024

TLDR:

In this blog post, we explore the exciting potential of distributed fine-tuning for Large Language Models (LLMs) using open-source tools like LLaMA-Factory on Google Kubernetes Engine (GKE). LLMs have revolutionized natural language processing, but their vast size often demands extensive computational resources for customization. We’ll delve into how LLaMA-Factory, combined with GKE’s scalability, provides a streamlined solution for adapting these powerful models to specific tasks.

Key Takeaways:

  • Open-Source Power: LLaMA-Factory empowers researchers and developers to leverage pre-trained LLaMA models and efficiently fine-tune them on their own datasets. This democratization of LLM fine-tuning opens doors to a wide range of applications.
  • Distributed Efficiency: We’ll discuss how to distribute the fine-tuning process across multiple nodes on GKE, significantly accelerating model adaptation and reducing resource constraints.
  • GKE’s Scalability: GKE’s ability to dynamically allocate resources ensures optimal utilization during training, making it a cost-effective platform for even the most demanding LLM projects.

About LLaMA-Factory:

LLaMA-Factory is an open-source tool (with both a CLI and a web user interface) for efficient fine-tuning of large language models. It supports a wide range of models, training approaches, and datasets, with features including fine-tuning, quantization, and instruction tuning. LLaMA-Factory is open source and available on GitHub.

Its unique capability is integrating training, fine-tuning, and inference in a single tool, and it also works with other frameworks such as Accelerate, bitsandbytes, DeepSpeed, vLLM, FlashAttention, and Weights & Biases. It has quickly become one of the most popular open-source ML tools, with over 25K stars on GitHub.

To start fine-tuning and running inference with LLaMA-Factory on a stand-alone local machine, you can use its WebUI by following the provided Colab notebook, or you can use the CLI commands from the project's quick start.
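
For reference, a minimal local quick start with the CLI looks roughly like the following sketch (run from inside a clone of the LLaMA-Factory repo; the extras and example config path may differ between releases):

pip install -e ".[torch,metrics]"   # install LLaMA-Factory and its common extras from source

# Launch the Gradio WebUI for interactive fine-tuning
llamafactory-cli webui

# Or run a LoRA SFT fine-tuning job directly from the CLI using one of the example configs
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml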

While the WebUI or CLI commands are convenient for running fine-tuning on a stand-alone local machine, this blog goes one step deeper and explores how enterprises can leverage the same tool for distributed open-source LLM fine-tuning across multiple GKE nodes with NVIDIA L4 accelerators.

1. Prerequisites:

Access to a Google Cloud project with L4 GPUs available and sufficient quota in the region you select.

A computer terminal with kubectl and the Google Cloud SDK installed. From the console of the GCP project you'll be working with, you may want to use the included Cloud Shell, as it already has the required tools installed.
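
If you are not using Cloud Shell, you can quickly confirm both tools are available from your terminal:

gcloud version            # Google Cloud SDK
kubectl version --client  # kubectl client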

Some models, such as Llama 3, will need a Hugging Face API token to download the model files.

Meta access request: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ (you need to register an email address to get download access).

Go to Hugging Face and create an account with the same email address you registered in the Meta request. Then find the Llama 3 8B model and fill out the access request: https://huggingface.co/meta-llama/Llama-3-8b.

Get a Hugging Face access token from your Hugging Face account profile settings; you will need it in the next steps.

Set up the project environment

From your console, select the Google Cloud region and project, checking that L4 GPUs are available in the region you end up selecting. The one used in this tutorial is us-central1, where at the time of writing there was availability for L4 GPUs (alternatively, you can choose another region with a different GPU accelerator type available):

export PROJECT_ID=<your-project-id>
export REGION=us-central1
export ZONE_1=${REGION}-a # You may want to change the zone letter based on the region you selected above
export ZONE_2=${REGION}-b # You may want to change the zone letter based on the region you selected above
export CLUSTER_NAME=fine-tuning-cluster
gcloud config set project "$PROJECT_ID"
gcloud config set compute/region "$REGION"
gcloud config set compute/zone "$ZONE_1"
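
If you want to double-check which zones in your chosen region actually offer L4 GPUs, one way is to filter the accelerator-types listing (the filter expression below is just one possible approach):

# List zones in the selected region that offer the NVIDIA L4 accelerator
gcloud compute accelerator-types list \
  --filter="name=nvidia-l4 AND zone~^${REGION}" \
  --format="value(zone)"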

Then, enable the required APIs to create a GKE cluster:

gcloud services enable compute.googleapis.com container.googleapis.com

Now, download the source code repo provided for this exercise:

git clone https://github.com/llm-on-gke/LLaMA-Factory
cd LLaMA-Factory
export WORK_DIR=$(pwd)

In this exercise, you will be using the default service account to create the cluster. You need to grant it the required permissions to store metrics and logs in Cloud Monitoring, which you will use later on:

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do
  gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role}
done
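
You can confirm the role bindings were applied by inspecting the project's IAM policy for the default compute service account:

# Show the roles currently granted to the default compute service account
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${GCE_SA}" \
  --format="table(bindings.role)"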

2. Create GKE Cluster and Nodepools

Create a GKE Cluster

Now, create a GKE cluster with a minimal default node pool, as you will be adding node pools with L4 GPUs later on:

gcloud container clusters create $CLUSTER_NAME \
--location "$REGION" \
--workload-pool "${PROJECT_ID}.svc.id.goog" \
--enable-image-streaming --enable-shielded-nodes \
--shielded-secure-boot --shielded-integrity-monitoring \
--enable-ip-alias \
--node-locations="$ZONE_1" \
--addons GcsFuseCsiDriver \
--no-enable-master-authorized-networks \
--machine-type n2d-standard-4 \
--enable-autoscaling \
--num-nodes 1 --min-nodes 1 --max-nodes 3 \
--ephemeral-storage-local-ssd=count=1
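
You can confirm the cluster finished provisioning before moving on (it should report RUNNING):

# Check the cluster status
gcloud container clusters describe $CLUSTER_NAME --region $REGION --format="value(status)"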

Create 2 GKE Nodepools

Create 2 additional empty Spot node pools (we use Spot VMs to illustrate; regular on-demand nodes work as well): the first node pool has 1 L4 GPU per node, while the second has 2 L4 GPUs per node.

gcloud container node-pools create l4-node-pool --cluster $CLUSTER_NAME \
--accelerator type=nvidia-l4,count=1,gpu-driver-version=latest --machine-type g2-standard-8 \
--ephemeral-storage-local-ssd=count=1 --enable-autoscaling --enable-image-streaming --num-nodes=0 --min-nodes=0 --max-nodes=3 \
--shielded-secure-boot --shielded-integrity-monitoring --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot

gcloud container node-pools create l4-2-node-pool --cluster $CLUSTER_NAME \
--accelerator type=nvidia-l4,count=2,gpu-driver-version=latest --machine-type g2-standard-24 \
--ephemeral-storage-local-ssd=count=0 --enable-autoscaling --enable-image-streaming --num-nodes=0 --min-nodes=0 --max-nodes=3 \
--shielded-secure-boot --shielded-integrity-monitoring --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot

After a few minutes, check that the node pools were created correctly:

gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
gcloud container node-pools list --region $REGION --cluster $CLUSTER_NAME

Also, create a Kubernetes secret to store the Hugging Face API token:

export HF_TOKEN=<paste-your-own-token>
kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN"
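
You can verify the secret was created; the token value itself is stored base64-encoded and is not printed:

kubectl get secret huggingface
kubectl describe secret huggingface   # shows the HF_TOKEN key and its size, not the value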

3. Build LLaMA-Factory container image:

A cloudbuild.yaml and Dockerfile are included in the downloaded repo to build the container image. Run the following to kick off the container image build process:

WORK_DIR=$(pwd)
gcloud artifacts repositories create gke-llm --repository-format=docker --location=$REGION
gcloud auth configure-docker $REGION-docker.pkg.dev
gcloud builds submit . --region=$REGION

The build may take around 9–10 minutes.
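
Once the build finishes, confirm the image was pushed to the gke-llm repository (the llama-factory image name matches the manifests used later in this post):

# List images in the Artifact Registry repository
gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/gke-llm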

4. Deploy sample LLM Fine-tuning jobs

In this blog, we simply use JobSet (a Kubernetes SIGs project) in GKE to manage the distributed workload.

A basic PyTorch-based workload would look similar to the following:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: pytorch
              image: gcr.io/k8s-staging-jobset/pytorch-resnet:latest
              ports:
              - containerPort: 3389
              env:
              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"
              - name: MASTER_PORT
                value: "3389"
              command:
              - bash
              - -xc
              - |
                torchrun --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT resnet.py --backend=gloo

Let's install the JobSet controller in GKE with the following commands:

VERSION=v0.5.2
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
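
Give it a short while, then confirm the controller is running and the JobSet CRD is registered (jobset-system is the namespace used by the upstream manifests):

kubectl get pods -n jobset-system        # the jobset controller manager pod should be Running
kubectl get crd jobsets.jobset.x-k8s.io  # the JobSet custom resource definition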

Sample distributed fine-tuning jobs are included under the gke folder in the downloaded source repo:

  1. gke/fine-tune-2x1-l4.yaml: total of 2 L4 GPUs (2 nodes, 1 L4 per node)
  2. gke/fine-tune-2x2-l4.yaml: total of 4 L4 GPUs (2 nodes, 2 L4s per node)
  3. gke/fine-tune-3x1-l4.yaml: total of 3 L4 GPUs (3 nodes, 1 L4 per node)

Update line 42 of each manifest (the container image path) to point to your own project and Artifact Registry repo:

image: us-east1-docker.pkg.dev/rick-vertex-ai/gke-llm/llama-factory:latest

Also, update line 28 to point to an existing Cloud Storage bucket for saving data and output:

bucketName: "mlops-repo"
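
For context, with the GCS Fuse CSI driver (enabled earlier through the GcsFuseCsiDriver addon), the bucketName setting typically lives inside a CSI volume definition along the lines of the sketch below; the volume name is illustrative and the actual manifest in the repo may differ:

volumes:
- name: training-output
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: "mlops-repo"   # replace with your own Cloud Storage bucket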

When you are ready, kick off the fine-tuning jobs one at a time (delete the previous JobSet before applying the next):

kubectl delete jobset pytorch
kubectl apply -f gke/fine-tune-2x1-l4.yaml

kubectl delete jobset pytorch
kubectl apply -f gke/fine-tune-2x2-l4.yaml

kubectl delete jobset pytorch
kubectl apply -f gke/fine-tune-3x1-l4.yaml
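
While a job runs, you can watch its progress with standard kubectl commands (replace the placeholder with an actual worker pod name from the listing):

kubectl get jobset pytorch          # overall JobSet status
kubectl get pods                    # worker pods created by the JobSet
kubectl logs -f <worker-pod-name>   # stream fine-tuning logs from one worker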

The following is a summary of the three fine-tuning jobs:

LLM LoRA fine-tuning settings:

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
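
For completeness, the remaining sections of a LLaMA-Factory LoRA SFT config typically describe the training method, output location, and hyperparameters, roughly along these lines (the values shown here are illustrative defaults, not necessarily the exact ones used in the repo's manifests):

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true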

Conclusion

This post demonstrates that performing distributed fine-tuning of open-source LLMs such as Llama 3 8B using LLaMA-Factory on GKE is flexible and straightforward.

LLaMA-Factory offers an enterprise-scalable solution for fine-tuning a wide range of LLMs on custom datasets. By leveraging the power of Google Kubernetes Engine (GKE), businesses can distribute fine-tuning jobs across a cluster of machines, enabling them to train models on massive datasets. This empowers enterprises to unlock the full potential of LLMs and custom data, leading to improved performance and new business opportunities.

Don't forget to check out other GKE-related resources on AI/ML infrastructure offered by Google Cloud, and check the resources included in the AI/ML orchestration on GKE documentation.

For your reference, the code snippets listed in this blog can be found in this source code repo.
