Fine-Tuning Large Language Models For Downstream Tasks

Pivithuru Amarasinghe
Jun 11, 2023


Large Language Models, commonly known as LLMs, have become a popular topic in the AI world over the last few years. With the introduction of LLMs like ChatGPT, people have realized that AI is no longer science fiction. The common question these days is whether we can build an LLM like ChatGPT ourselves, or at least train an existing LLM to perform our own tasks.

The simple answer is that it’s not practical to build an LLM from scratch, because doing so requires billions of training examples and enormous amounts of compute. It is, however, possible to train/fine-tune an existing LLM to perform our tasks. This practice is known as fine-tuning for a downstream task. By doing this, we carry the power that LLMs gained from training on large amounts of data over to our own custom tasks.

Today, I am going to discuss how to fine-tune an LLM for a custom task. Thanks to the techniques used in modern LLMs, we can adapt one to a custom task with only a few hundred or a few thousand data points. This is quite significant, because LLMs are originally trained on billions of examples. As the base model, I am going to use a pre-trained model available on HuggingFace. HuggingFace is an open platform that provides free models, datasets, metrics, etc. for training purposes.

Prerequisites

import numpy as np
import pandas as pd

from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Trainer,
)

Note: when you run this for the first time, you may see errors asking you to install the necessary libraries. As a one-time operation, you can install them using pip or conda. Ex: !pip install transformers

Loading Data

dataset = load_dataset("csv", data_files=<file_name>)
dataset

Note: “dataset” is a data type in HuggingFace, and HuggingFace provides loaders for reading raw files into one. Here I am loading my CSV data into a dataset.
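
As a quick sanity check, you can inspect what was loaded. A minimal sketch, assuming the CSV has prompt and completion columns (adjust to your own column names):

# A CSV loaded this way ends up under the "train" split of a DatasetDict
print(dataset)              # splits, column names, and row counts
print(dataset["train"][0])  # first row, e.g. {"prompt": "...", "completion": "..."}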

Loading Tokenizer

tokenizer = AutoTokenizer.from_pretrained(<tokenizer_name>)

You can provide your own tokenizer name, e.g. “Salesforce/codet5-large-ntp-py”. The exact name of the model is listed on HuggingFace; you can check the “Use in Transformers” tab on your model’s page.

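To check that the tokenizer works as expected, you can tokenize a short string and decode it back. A minimal sketch (the example string is only an illustration):

# Quick round-trip check on the tokenizer
sample = tokenizer("def add(a, b): return a + b")
print(sample["input_ids"])                    # token ids
print(tokenizer.decode(sample["input_ids"]))  # decoded back to text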

Tokenizing Data

def tokenize_function(examples):
    # Tokenize the prompts (model inputs)
    model_inputs = tokenizer(examples[<prompt_column_name>], padding="max_length", truncation=True)
    # Tokenize the completions (targets/labels)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples[<completion_column_name>], padding="max_length", truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Provide the prompt column name and completion column name from your dataset, so the tokenizer tokenizes the correct columns.

Once you have tokenized your dataset, pick its “train” part for model training.

small_train_dataset = tokenized_datasets["train"]
small_train_dataset
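
If your CSV produces only a single “train” split and you also want a held-out set for evaluation later, you can split it yourself. A minimal sketch using the datasets API (the 10% test size is arbitrary):

# Optionally carve out a small evaluation set from the training data
split = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)
small_train_dataset = split["train"]
small_eval_dataset = split["test"]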

Loading Base Model

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py").to("cuda")

Normally we need a GPU to train an LLM. In most cases, Google Colab is an option for that. If a GPU is not available, you can load the model without it.

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")
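
If you want the same code to work with or without a GPU, you can pick the device at runtime instead. A minimal sketch, assuming PyTorch is installed:

import torch

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py").to(device)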

Model Training

To train the model, we have to use a Trainer. HuggingFace provides different trainers, and based on your model you have to pick the relevant one.

training_args = Seq2SeqTrainingArguments(
    output_dir="/model",
    save_steps=50,
    save_strategy="steps",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    save_total_limit=3,
    num_train_epochs=4
)

The parameters you set for the Trainer affect your model’s training time, accuracy, and even out-of-memory behaviour. A smaller per_device_train_batch_size uses less GPU memory per step but lengthens training time. A larger num_train_epochs can give better accuracy, again at the cost of a longer training time. Finally, setting gradient_checkpointing=True helps avoid out-of-memory issues during training.
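
One caveat: with gradient checkpointing enabled, the decoder cache is usually disabled during training (Transformers will otherwise warn and disable it for you). If you want to set this explicitly, a minimal sketch:

# Gradient checkpointing is incompatible with the decoder cache during training
model.config.use_cache = False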

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    tokenizer=tokenizer
)
result = trainer.train()
print(result)

While training, you can check your memory and GPU status with the following command.

!nvidia-smi
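
One detail to be aware of before moving on to inference: with the arguments above, the Trainer writes intermediate checkpoints to subfolders such as /model/checkpoint-50 rather than to /model itself. If you want to load the final weights directly from the output directory, as in the next section, you can save them explicitly after training. A short sketch:

# Save the final model weights (and tokenizer) to the output directory
trainer.save_model("/model")
tokenizer.save_pretrained("/model")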

Model Inference

trained_model = T5ForConditionalGeneration.from_pretrained("/model").to("cuda")

If you are running on the CPU, you can load the model without moving it to CUDA.

trained_model = T5ForConditionalGeneration.from_pretrained("/model")

To try a sample inference, you can use the following code.

input_ids = (tokenizer(<prompt>, return_tensors="pt").input_ids).to("cuda")
outputs = trained_model.generate(input_ids, max_length=600)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)

If you are running on the CPU, use the following instead.

input_ids = (tokenizer(<prompt>, return_tensors="pt").input_ids)
outputs = trained_model.generate(input_ids, max_length=600)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
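
For illustration, here is what an end-to-end inference call might look like. The prompt below is a made-up example; use whatever format matches your fine-tuning data:

# Hypothetical prompt; replace it with one in the same format as your training data
prompt = "# Write a Python function that reverses a string\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(trained_model.device)
outputs = trained_model.generate(input_ids, max_length=600)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))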

Conclusion

In this article, we discussed the technicalities of fine-tuning a Large Language Model for a downstream task. For the example, we used HuggingFace and its related libraries.
