How to do Fine-Tuning and Inference for LLMs in Scalable Settings

Manish
Techsalo Infotech
Aug 19, 2023

In this guide we will walk through the steps for fine-tuning a Hugging Face Transformer model on the Alpaca dataset. This includes data preprocessing, and finally we will run batch inference on our fine-tuned model. We have divided this guide into two parts: Part 1 covers data preprocessing and fine-tuning; Part 2 covers batch inference. This is Part 1.

Prerequisites:
Python Programming Knowledge.

Knowledge of Ray AI Runtime (AIR) Libraries.

A standalone or Kubernetes cluster with Ray installed and GPU support enabled.

Workflow of the Process:
We will be using the same tools used by the OpenAI team for their ChatGPT models (of course without thousands of GPUs to spare), i.e. the open-source Ray AIR libraries. We will use Google's open-source model from the paper "Scaling Instruction-Finetuned Language Models", i.e. FLAN-T5, as our base model, and further fine-tune it on the Alpaca dataset (Alpaca: A Strong, Replicable Instruction-Following Model) containing 52K instruction sets. The Alpaca dataset is designed for instruction-tuning pretrained language models.

The steps include distributed data preprocessing, model fine-tuning, and finally inference with the trained model. Our base model, google/flan-t5-base, and the dataset are both easily accessible from the Hugging Face Hub, so we will use Ray's integration with the Hugging Face Hub to bring the base model and the Alpaca dataset to our cluster. Once you get the gist of the whole process, it can easily be adapted to other LLMs as required.

Higher Level Workflow

Step 1 — Initialization of Process:

We will start with a simple warm-up step and import our basic (& favourite) Python libraries -

import random
import torch
import transformers
import warnings

import numpy as np
import pandas as pd

from IPython.display import display, HTML
from typing import Any, Dict, List, Optional

transformers.set_seed(42)
warnings.simplefilter("ignore")

Next we initialize our Ray cluster and access the Ray Dashboard UI. A crucial aspect of Ray is its dashboard for monitoring our cluster, jobs, and serving applications. You may also have enabled the integration with Prometheus and Grafana for detailed Ray metrics, but we will not be focusing on that in this guide.

import ray
ray.init()

2023-08-13 21:58:47,164 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

If all goes well, the Ray Dashboard UI will be up and running on port 8265 of your host machine.

Out[4]:

Python version:  3.11.4
Ray version:     2.6.2
Dashboard:       http://127.0.0.1:8265

Dashboard UI:

Ray Dashboard UI
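If you are running against an existing standalone or Kubernetes Ray cluster (as listed in the prerequisites) rather than a local instance, you would typically point ray.init at that cluster instead. A minimal sketch, where the addresses shown are illustrative:

# Connect to an existing Ray cluster instead of starting a local instance.
# "auto" attaches to the cluster this node already belongs to; alternatively,
# pass an explicit address such as "ray://<head-node-host>:10001" (illustrative).
ray.init(address="auto", ignore_reinit_error=True)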

Step 2 — Data Ingestion:

In this step we bring the Alpaca dataset from the Hugging Face Hub to fine-tune the question-answering and text-generation abilities of the original base model. For this it is important to understand the dataset and its format, because the base model FLAN-T5 expects the input data in a certain predefined format.

Here we also do the train-test split in 80:20 proportion on our huggingface dataset.

from datasets import load_dataset

hf_dataset = load_dataset("tatsu-lab/alpaca", split="train").train_test_split(
    test_size=0.2, seed=57
)
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 41601
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 10401
    })
})

Note that the Alpaca dataset has four features; let's take a look at what they contain. For this we define a simple utility function to grab a random subset of the dataset -

def get_random_elements(dataset: List, num_examples: int = 2) -> pd.DataFrame:
    """
    Picks a random subset of elements from the given dataset and returns them
    as a Pandas DataFrame.

    Args:
        dataset: A list-like dataset to choose from.
        num_examples: The number of elements to choose. Defaults to 2.

    Raises:
        ValueError: If `num_examples` is greater than the length of `dataset`.

    Returns:
        pd.DataFrame: The randomly selected rows.
    """
    if num_examples > len(dataset):
        raise ValueError("Can't pick more elements than there are in the dataset.")

    picks = random.sample(range(len(dataset)), k=num_examples)
    return pd.DataFrame(dataset[picks])


df = get_random_elements(dataset=hf_dataset["train"], num_examples=3)
display(HTML(df.to_html()))
ALPACA Dataset Sample

Let's go through each column one by one; for this it's important to understand that the Alpaca dataset was created using OpenAI's text-davinci-003 -

"instruction" - the original query or prompt given to the model.

"input" - additional information, if any, provided along with the instruction to enhance the context for the model.

"output" - the output generated by OpenAI's text-davinci-003 model.

"text" - the combined values of the other three columns, prefixed with instructional context for the model.

Read more about training ALPACA here — https://github.com/tatsu-lab/stanford_alpaca#data-release
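For illustration, the "text" column roughly follows the prompt template from the Stanford Alpaca repository (check the link above for the exact wording). A hypothetical helper that rebuilds it from the other three columns could look like this:

def build_prompt(row: Dict[str, str]) -> str:
    """Recreate an Alpaca-style 'text' field from instruction/input/output (approximate template)."""
    if row["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{row['instruction']}\n\n"
            f"### Input:\n{row['input']}\n\n"
            f"### Response:\n{row['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{row['instruction']}\n\n"
        f"### Response:\n{row['output']}"
    )

# Compare against the stored "text" field of a sample row:
print(build_prompt(hf_dataset["train"][0]))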

Next we convert our dataset to Ray Datasets and define the train and validation datasets. Ray Datasets have built-in capabilities for distributed processing and integrate easily with the other Ray libraries; ray.data.from_huggingface converts a Hugging Face dataset into standard Ray Datasets.

ray_dataset = ray.data.from_huggingface(hf_dataset)
ray_dataset

Output:

{'train': MaterializedDataset(
    num_blocks=1,
    num_rows=41601,
    schema={instruction: string, input: string, output: string, text: string}
 ),
 'test': MaterializedDataset(
    num_blocks=1,
    num_rows=10401,
    schema={instruction: string, input: string, output: string, text: string}
 )}

train_dataset = ray_dataset["train"]
validation_dataset = ray_dataset["test"]

Step 3 — Distributed Data Preprocessing:

In this step we define the preprocessing function for our dataset. This function converts each batch of our Ray dataset (the Alpaca data) into a format acceptable to the base model FLAN-T5.

We will use the Hugging Face T5Tokenizer, which converts natural language into tokens with the padding and truncation required for the training process ahead.

from transformers import T5Tokenizer, T5ForConditionalGeneration


def preprocess_function(batch: Dict[str, Any]) -> Dict[str, Any]:
    """
    Tokenizes the instruction and input pairs in a batch using the T5 tokenizer
    from the google/flan-t5-base model, and returns a dictionary containing the
    encoded inputs and labels.

    Args:
        batch: A dictionary containing at least two keys, "instruction" and
            "input", whose values are lists of strings.

    Returns:
        A dictionary containing the encoded inputs and labels, as returned by
        the T5 tokenizer.
    """
    model_name = "google/flan-t5-base"
    tokenizer = T5Tokenizer.from_pretrained(model_name)

    encoded_inputs = tokenizer(
        list(batch["instruction"]),
        list(batch["input"]),
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )

    encoded_inputs["labels"] = encoded_inputs["input_ids"].copy()

    return dict(encoded_inputs)

Next we wrap our preprocessing function in a Ray BatchMapper so that it is applied to each incoming batch of the Alpaca dataset. BatchMapper transforms whole batches instead of individual records, which allows efficient vectorized transformations.

from ray.data.preprocessors import BatchMapper
batch_preprocessor = BatchMapper(preprocess_function, batch_format="pandas", batch_size=4096)
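Before handing this preprocessor to the trainer, it can be useful to sanity-check the tokenization on a tiny pandas batch pulled from the Ray dataset. A minimal sketch; the shape printed is what you would expect with max_length padding for this tokenizer, but it may differ in your setup:

# Optional sanity check: run the preprocessing function on a small pandas batch.
sample_batch = train_dataset.limit(2).to_pandas()
encoded = preprocess_function(sample_batch)
print(encoded["input_ids"].shape)  # e.g. (2, 512) with max_length padding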

Step 4 — Ray Distributed Fine Tuning:

Next we will use the Ray AI Runtime component HuggingFaceTrainer, which distributes our base model across the worker nodes. Each worker node gets a copy of our batch preprocessor function and an initialized base model, and both are applied to each incoming batch of the Alpaca dataset.

from transformers import TrainingArguments, Trainer

batch_size = 2
use_gpu = True


def trainer_init_per_worker(
    train_dataset: ray.data.Dataset,
    eval_dataset: Optional[ray.data.Dataset] = None,
    **config,
) -> Trainer:
    """
    Initializes a Hugging Face Trainer for training a T5 text generation model.

    Args:
        train_dataset (ray.data.Dataset): The dataset for training the model.
        eval_dataset (ray.data.Dataset, optional): The dataset for evaluating
            the model. Defaults to None.
        config: Additional arguments to configure the Trainer.

    Returns:
        Trainer: A Hugging Face Trainer for training the T5 model.
    """
    device = torch.device("cuda" if use_gpu and torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    model_name = "google/flan-t5-base"

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    training_args = TrainingArguments(
        "flan-t5-base-finetuned-alpaca",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=config.get("learning_rate", 2e-5),
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=config.get("epochs", 4),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        disable_tqdm=True,
    )

    hf_trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )

    print("Starting training...")
    return hf_trainer

The above function initializes a Hugging Face Transformers Trainer on each worker node with our base model from the Hugging Face Hub. Each worker node holds the same model but operates on different incoming batches of data.

We then distribute this trainer using the Ray AIR library, which internally relies on the PyTorch Distributed Data Parallel backend. For each batch of data, our BatchMapper and the trainer are applied, and at the end of each step the model weights are kept in sync across all workers. This results in a fine-tuned model.

Ray Distributed Training Workflow
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from ray.train.huggingface import HuggingFaceTrainer

Depending on the number of GPUs you have available, define the number of worker nodes -

num_workers = 2

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": train_dataset,
        "evaluation": validation_dataset,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)

The Ray AIR library integrates with the Hugging Face Transformers library; here we combined all our previous definitions into a Ray trainer, HuggingFaceTrainer.
Its arguments are:

trainer_init_per_worker : our user-defined function that initializes the base model on each worker node.

scaling_config : hardware specification matching the GPUs available (see the sketch after this list for some variants).

datasets : our train and validation datasets in Ray dataset format.

run_config : model checkpointing behaviour.

preprocessor : our user-defined preprocessing function wrapped in a Ray BatchMapper, which tokenizes natural language into token format.
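A minimal sketch of some ScalingConfig variants; the worker counts, resource values, and variable names below are illustrative rather than recommendations:

from ray.air.config import ScalingConfig

# Four GPU workers, one GPU each.
gpu_scaling = ScalingConfig(num_workers=4, use_gpu=True, resources_per_worker={"GPU": 1})

# CPU-only fallback for a quick smoke test (training will be very slow).
cpu_scaling = ScalingConfig(num_workers=2, use_gpu=False)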

Next we simply call trainer.fit() and Ray will fine-tune the model.

result = trainer.fit()

You can check the Ray Dashboard UI for the status of the Ray actors, i.e. the Ray Train workers and the HuggingFaceTrainer.

Ray Actors from Dashboard UI
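Once training finishes, the Result object returned by trainer.fit() exposes the final metrics and the best checkpoint. A quick sketch; the exact metric keys depend on your Transformers version:

# Inspect the training result returned by trainer.fit().
print(result.metrics.get("eval_loss"))  # validation loss reported for the last epoch
print(result.checkpoint)                # best checkpoint as per CheckpointConfig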

We can now check the results of the fine-tuned model -

model_name = "google/flan-t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
checkpoint = result.checkpoint
finetuned_model = checkpoint.get_model(model)


instruction = "how many colors are there in a rainbow?" # Enter your own instruction here.
input_query = (
"rainbow has 7 colors" # Write additional context for the model here.
)

inputs = tokenizer(instruction, input_query, return_tensors="pt")
outputs = finetuned_model.generate(**inputs)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Output:

['7']
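For longer answers you can pass generation parameters explicitly; a small sketch, with values that are illustrative rather than tuned:

# Generate a longer answer with explicit decoding parameters (illustrative values).
outputs = finetuned_model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))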

Note that the checkpoint object requires our base model to be passed in so that it knows what kind of model to reconstruct.
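Optionally, you can persist the fine-tuned weights and tokenizer to disk so they can be reloaded later without going through the Ray checkpoint object. A minimal sketch, where the output directory name is illustrative:

# Save the fine-tuned model and tokenizer locally (directory name is illustrative).
output_dir = "flan-t5-base-finetuned-alpaca-final"
finetuned_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)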

Stay tuned for Part 2, where we will use the Ray "Batch Predictor" to implement batch inference on our fine-tuned model.

Note: We at Techsalo Infotech are a team of engineers solving complex data engineering and machine learning problems. Please reach out to us at sales@techsalo.com for any queries on how to build these systems at scale and in the cloud.
