Parameter Efficient Finetuning (PEFT) of LLMs

Dhanoop Karunakaran
Intro to Artificial Intelligence
4 min read · Oct 23, 2023
The PEFT approach only trains a subset of existing or new trainable parameters (weights). Source: [1]

This article explains the PEFT technique for finetuning LLMs and how we can implement it. To learn more about the Transformer and its different variants, please have a look at my previous article.

Theory

Full finetuning vs PEFT

High-level overview of PEFT and full finetuning. The approach on the left is PEFT; the one on the right is full finetuning. Source: [5]

Full finetuning is time-consuming and expensive due to its computation and memory requirements. In the PEFT approach, we train only a small subset of parameters while keeping most of the LLM's weights frozen, which keeps the memory requirements manageable. Because only a small number of existing or newly added parameters are updated, PEFT is also far less prone to catastrophic forgetting, an issue full finetuning struggles with. There are three approaches in PEFT: selective, reparameterisation, and additive. In this article, we will discuss a reparameterisation approach called low-rank adaptation (LoRA).

LoRA

LoRA approach. Source: [6]

LoRA is a parameter-efficient finetuning technique for LLMs (large language models). It freezes the pre-trained LLM weights and injects trainable rank-decomposition matrices, so only a small number of parameters need to be updated during finetuning. This can significantly reduce the computational cost and memory requirements.

Low-rank decomposition matrices.

As shown in the figure above, a low-rank decomposition expresses a large matrix (approximately) as the product of two matrices with much smaller dimensions. This is done by finding the best rank-r approximation to the original matrix, where r is much smaller than the original rank. Let's say we have a pre-trained dense layer (or weight matrix) called W0, which has dimensions n x n. We then initialize two new dense layers, A and B, with dimensions n x rank and rank x n, respectively. Here, rank is much smaller than n [3].
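
To get a feel for the savings, here's a small NumPy sketch. The sizes are assumptions for illustration (n = 768, GPT-2 base's hidden size, and rank r = 4):

import numpy as np

n, r = 768, 4  # assumed sizes: GPT-2 base hidden size, a typical LoRA rank

W0 = np.random.randn(n, n)  # frozen pre-trained weight matrix, n x n
A = np.random.randn(n, r)   # down-projection, n x rank
B = np.zeros((r, n))        # up-projection, rank x n (initialised to zero in LoRA)

full_params = W0.size          # n * n = 589,824
lora_params = A.size + B.size  # 2 * n * r = 6,144
print(f"LoRA trains {100 * lora_params / full_params:.2f}% of the full matrix")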

LoRA training and inferencing process. Source: [4]

In LoRA finetuning, two low-rank decomposition matrices are injected into the LLM, and only the weights of these smaller matrices are trained. To make the model ready for inference, we then merge the original weights with the product of the two small matrices, as shown in the figure above.
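
To see why the merge is valid, here's a tiny NumPy continuation of the sketch above (B is randomised here to stand in for trained weights; the alpha / r scaling is the convention from the LoRA paper): the frozen path and the low-rank path computed separately during training give exactly the same output as the single merged matrix used at inference.

x = np.random.randn(1, n)  # one input row vector
B = np.random.randn(r, n)  # pretend B has been trained
scale = 32.0 / r           # alpha / rank

y_train = x @ W0 + (x @ A @ B) * scale  # training-time: two separate paths
W_merged = W0 + (A @ B) * scale         # inference: fold the update into W0
y_infer = x @ W_merged

assert np.allclose(y_train, y_infer)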

Where should we add small matrices?

Now the question is where we should inject the decomposition matrices. Adding matrices to the self-attention layers alone is often enough to finetune an LLM for a specific task, and it can improve the model's performance on that task, as the model becomes better able to capture the context and nuances of the text. That said, LoRA can also be applied to other layers, such as the feed-forward layers.

Effectiveness

LoRA can reduce the number of trainable parameters in a finetuning task by more than 80%. This means a much smaller memory footprint, enabling us to finetune on a single GPU.
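
As a back-of-the-envelope check (assuming GPT-2 base: 12 transformer layers, hidden size 768, roughly 124M parameters, with rank-4 LoRA on the query and value projections, as in the implementation below):

layers, hidden, rank = 12, 768, 4
total_params = 124_000_000               # GPT-2 base, approximately

per_projection = 2 * hidden * rank       # one A and one B matrix: 6,144
trainable = layers * 2 * per_projection  # query + value in every layer: 147,456

print(f"{100 * trainable / total_params:.3f}% of parameters are trainable")
# -> 0.119% of parameters are trainable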

Implementation in TensorFlow

I’ve implemented sample code for the algorithm, adapted from [3], and published it in the GitHub repo. Please have a look if you want to play with it. Here are the main steps for implementing the algorithm.

1. Download the dataset from TensorFlow Datasets.

import tensorflow_datasets as tfds

# Download the dataset.
reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

# Print a sample document and title.
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

2. The data is already in the tf.data.Dataset format, so we just take a subset of it for the purpose of this example.

import tensorflow as tf

BATCH_SIZE = 32   # constants as used in [3]
NUM_BATCHES = 500

# Keep only the document text, then batch, cache, and prefetch.
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)
train_ds = train_ds.take(NUM_BATCHES)

3. Load the GPT-2 model for finetuning.

import keras_nlp

# The preprocessor tokenizes raw strings and packs them to a fixed length.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_model = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en",
    preprocessor=preprocessor,
)

gpt2_model.summary()

4. Define the LoRA layer.
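
The LoraLayer in [3] wraps the EinsumDense projections that KerasNLP uses inside attention; the sketch below is a simplified version that assumes a plain keras.layers.Dense, which is enough to show the structure: a frozen original layer plus a trainable low-rank bypass that starts at zero.

import keras

class LoraLayer(keras.layers.Layer):
    def __init__(self, original_layer, rank=4, alpha=32.0, trainable=False, **kwargs):
        super().__init__(trainable=trainable, **kwargs)
        self.scale = alpha / rank  # LoRA scaling factor

        # Freeze the pre-trained layer; only A and B are updated.
        self.original_layer = original_layer
        self.original_layer.trainable = False

        # A projects down to `rank`, B projects back up. B starts at zero,
        # so initially the wrapper behaves exactly like the original layer.
        self.A = keras.layers.Dense(
            rank, use_bias=False, trainable=trainable, name="lora_A"
        )
        self.B = keras.layers.Dense(
            original_layer.units, use_bias=False,
            kernel_initializer="zeros", trainable=trainable, name="lora_B",
        )

    def call(self, inputs):
        return self.original_layer(inputs) + self.scale * self.B(self.A(inputs))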

5. Replace the original query and value projection matrices with the new custom LoRA layer we defined earlier.

RANK = 4      # rank of the decomposition, as in [3]
ALPHA = 32.0  # scaling factor

# `lora_model` is a fresh copy of the GPT-2 model, loaded as in step 3.
for layer_idx in range(lora_model.backbone.num_layers):
    decoder_layer = lora_model.backbone.get_layer(f"transformer_layer_{layer_idx}")
    self_attention_layer = decoder_layer._self_attention_layer

    # Allow mutation of the Keras layer state.
    self_attention_layer._tracker.locked = False

    # Change query dense layer.
    self_attention_layer._query_dense = LoraLayer(
        self_attention_layer._query_dense,
        rank=RANK,
        alpha=ALPHA,
        trainable=True,
    )

    # Change value dense layer.
    self_attention_layer._value_dense = LoraLayer(
        self_attention_layer._value_dense,
        rank=RANK,
        alpha=ALPHA,
        trainable=True,
    )

6. Freeze the entire LLM; only the LoRA layers should be trainable.

for layer in lora_model._flatten_layers():
    lst_of_sublayers = list(layer._flatten_layers())

    if len(lst_of_sublayers) == 1:  # "leaves of the model"
        if layer.name in ["lora_A", "lora_B"]:
            layer.trainable = True
        else:
            layer.trainable = False

7. Train the model.
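
A minimal sketch of this step; the optimizer settings here are assumptions loosely following [3], which trains with AdamW for a single epoch:

optimizer = keras.optimizers.AdamW(learning_rate=5e-5, weight_decay=0.01)
# Bias and layernorm terms are conventionally excluded from weight decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "gamma", "beta"])

lora_model.compile(
    optimizer=optimizer,
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    weighted_metrics=["accuracy"],
)
lora_model.fit(train_ds, epochs=1)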

8. Before text generation, we need to merge the original weights with the product of the two trained matrices. The merged model can then be used to produce text similar to the dataset.
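
Here's a hedged sketch of the merge, matching the simplified Dense-based LoraLayer above ([3] does the equivalent with einsum kernels): fold each trained low-rank product into the frozen kernel, put the original layers back in place so the bypass isn't applied twice, and then generate.

for layer_idx in range(lora_model.backbone.num_layers):
    self_attention_layer = lora_model.backbone.get_layer(
        f"transformer_layer_{layer_idx}"
    )._self_attention_layer

    for name in ("_query_dense", "_value_dense"):
        lora_layer = getattr(self_attention_layer, name)

        # W_merged = W0 + scale * (A @ B); Dense kernels are (input_dim, units).
        delta = tf.matmul(lora_layer.A.kernel, lora_layer.B.kernel) * lora_layer.scale
        lora_layer.original_layer.kernel.assign_add(delta)

        # Swap the updated original layer back in, removing the LoRA wrapper.
        setattr(self_attention_layer, name, lora_layer.original_layer)

# The merged model now generates text with no LoRA overhead.
print(lora_model.generate("what should i do when", max_length=128))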

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.

Other articles in the finetuning LLMs series

  1. Fine-tuning Large Language Models series: Internal mechanism of LLMs: explains the underlying technologies that power LLMs and the different finetuning approaches.

References

  1. https://pub.aimind.so/llm-and-fine-tuning-9fed9ce21825
  2. https://lightning.ai/pages/community/article/lora-llm/
  3. https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/
  4. https://medium.com/@kanikaadik07/peft-parameter-efficient-fine-tuning-55e32c60c799
  5. https://www.analyticsvidhya.com/blog/2023/08/fine-tuning-large-language-models/
  6. https://bdtechtalks.com/2023/05/22/what-is-lora/
