LoRA Parameter-Efficient Tuning with Understanding of Self-Attention

Adarsh Shrivastav
Published in DataDreamers
8 min read · Aug 17, 2023

Large language models (LLMs) have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, they are often computationally expensive to train and fine-tune. Parameter-efficient fine-tuning (PEFT) is a set of techniques that can be used to reduce the computational cost of fine-tuning LLMs.

One of the most popular PEFT techniques is Low-Rank Adaptation (LoRA). LoRA works by decomposing the weight matrices of the LLM into products of low-rank matrices. This reduces the number of parameters that need to be trained, while still maintaining the performance of the original model.

To understand LoRA, it is important to understand self-attention. Self-attention is a neural network mechanism that allows an LLM to attend to different parts of its input sequence. This allows the LLM to learn long-range dependencies between words in the sequence. Self-attention is a key component of many LLMs, including Transformer-based models.

In practice, LoRA is most often applied to the self-attention layers of the LLM: the weight matrices of these layers are decomposed into low-rank matrices. This reduces the number of parameters that need to be trained, while still maintaining the performance of the original model.

Let’s understand the Encoder Architecture at a high level

The operational mechanics of the Transformer model comprise a sequence of stages that synergistically bring about its remarkable capacity to comprehend and generate human language. Let’s delve into the working mechanism with more depth, shedding light on each stage’s functionality and significance.

Encoding the Input

At the inception, the input sentence undergoes two crucial transformations. Firstly, the sentence is tokenized, segmenting it into discrete units of meaning. These tokens are then transformed into dense vectors known as word embeddings. Additionally, positional encodings are introduced to ensure that the model grasps the sequence’s word order. These word embeddings, enriched with positional encodings, constitute the initial input fed into the Transformer.
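To make the encoding step concrete, here is a minimal NumPy sketch of it. The vocabulary, embedding table, and dimensions are toy stand-ins (real models use learned sub-word tokenizers and far larger embeddings); the positional encoding follows the standard sinusoidal formulation.

```python
import numpy as np

# Toy vocabulary and sentence (illustrative; real models use learned sub-word tokenizers)
vocab = {"the": 0, "movie": 1, "was": 2, "absolutely": 3, "fantastic": 4}
tokens = [vocab[w] for w in "the movie was absolutely fantastic".split()]

d_model = 8                                               # tiny embedding dimension for readability
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # stand-in for learned word embeddings

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, so the model can tell word order apart."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

word_embeddings = embedding_table[tokens]                          # (seq_len, d_model)
x = word_embeddings + positional_encoding(len(tokens), d_model)    # initial input to the Transformer
print(x.shape)  # (5, 8)
```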

Self-Attention Layers

Self-attention, a fundamental component in modern natural language processing models like transformers, operates on input sentences by assigning importance scores to each word based on their contextual relationships.

At its core, self-attention processes every word in the input sentence simultaneously. For each word, it computes a weighted sum over all the words in the sentence, where the weights signify the relevance or importance of those words to the current one. This is achieved through a series of matrix multiplications and softmax operations.

The attention mechanism learns these weights by comparing the current word with every other word in terms of their semantic similarity and relevance within the given context. Words that are more contextually relevant receive higher attention scores, while irrelevant words receive lower scores.

The resulting attention scores create a new representation of the input sentence, emphasizing the most critical words and their relationships. This enriched representation captures the contextual nuances and dependencies within the sentence, enabling the model to understand and generate more accurate outputs for various natural language processing tasks. Please read The Illustrated Transformer if you want to dive deeper into the detailed functioning of self-attention.
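The description above fits in a few lines of code. Below is a minimal single-head, scaled dot-product self-attention sketch in NumPy; the projection matrices W_q, W_k, and W_v are random stand-ins for weights that a real model would learn.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise relevance of every word to every other word
    weights = softmax(scores, axis=-1)           # attention scores: one distribution per word
    return weights @ V, weights                  # weighted sum of values = enriched representation

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))          # embeddings + positional encodings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(x, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (5, 8) (5, 5): each word attends to all 5 words
```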

Figure: Self-attention internal mechanism

Let’s understand Self-Attention functionality at a high level in the context of sentiment understanding

Sentence: “The movie was absolutely fantastic, and the acting was superb.”

  1. “Fantastic” and “superb”: These positive sentiment words would receive high attention scores, as they directly convey positive emotions.
  2. “Movie” and “acting”: These are contextually important words that help the model understand the subject of the review. They receive moderate attention scores.
  3. “Absolutely”: While not sentiment-bearing itself, it intensifies the positivity of the review, so it would also get some attention.
  4. “The” and “was”: Common stop words that provide structure but carry little sentiment, hence receiving low attention scores.

Self-attention allows the model to focus on words carrying sentiment and context, helping it discern the overall positive sentiment in the review by assigning higher importance to words like “fantastic” and “superb.”

Now, what is Parameter-Efficient Fine-Tuning (PEFT), and how is it different from Fine-Tuning?

Fine-tuning means taking a pre-trained model and training it further on a new task with new data. Usually, this means training the entire pre-trained model, including all of its layers and parameters. This can take a lot of compute and time, especially for big models.

Parameter-efficient fine-tuning, on the other hand, is a way to fine-tune by updating only a small subset of the pre-trained model’s parameters. It focuses on the parameters that matter most for the new task and only changes those during training. This makes PEFT much faster and cheaper, because it doesn’t have to update all the parameters of the model.
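As a rough illustration of the difference, the PyTorch sketch below contrasts the number of trainable parameters in full fine-tuning with a simplified PEFT-style setup in which most parameters are frozen. The stand-in model is a toy; real PEFT targets billion-parameter LLMs.

```python
import torch.nn as nn

# A stand-in "pre-trained" model (toy size, purely illustrative).
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512), nn.Linear(512, 2))

# Full fine-tuning: every parameter stays trainable.
full = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Simplified PEFT-style setup: freeze everything except the last layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

peft = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(full, peft)   # 526338 vs 1026 trainable parameters
```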

There are different types of PEFT techniques:

  1. Adapter
  2. LoRA
  3. Prefix Tuning
  4. Prompt Tuning

Now let’s take a deep dive into LoRA

It is a technique for parameter-efficient fine-tuning of large language models (LLMs). LoRA works by approximating the model’s weight matrices with low-rank matrices. This reduces the number of parameters that need to be fine-tuned, which makes the fine-tuning process faster and more efficient.

LoRA was first proposed by researchers at Microsoft in 2021. They showed that LoRA can be used to fine-tune LLMs with billions of parameters on a variety of tasks, including natural language inference, question answering, and text summarization. LoRA has been shown to be more efficient than traditional fine-tuning methods, while still achieving comparable performance.

The Hypothesis behind LoRA

It is based on the hypothesis that the intrinsic rank of the weight matrices in a large language model is low. Researchers have shown that low-rank approximations of the weight matrices in large language models can achieve comparable performance to the original weight matrices on a variety of tasks. This suggests that most of the information in the weight matrices can be captured by a small number of parameters.

How LoRA reduces trainable parameters without compromising performance on new tasks

Let’s say we have a 100x100 (dxd) matrix called A. This matrix represents the weight matrix of a layer in a large language model. We want to use LoRA to fine-tune this layer on a new task.

LoRA works by approximating the weight matrix A with two matrices of size 100x5 (dxr) and 5x100 (rxd). These matrices are called A_low_rank and B_low_rank.

A_low_rank is formed by taking the first 5 left singular vectors of A, each scaled by its corresponding singular value. B_low_rank is formed from the first 5 right singular vectors of A, transposed so that it has shape 5x100.

The number of parameters in A is 100x100 = 10,000. The number of parameters in A_low_rank and B_low_rank is 5x100 + 100x5 = 1000.

This means that we can fine-tune the layer with only 1000 parameters, instead of the original 10,000 parameters. This is a significant reduction in the number of parameters, which can make the fine-tuning process faster and more efficient.
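Here is a small NumPy sketch of the arithmetic above, using a truncated SVD to build the two rank-5 factors. The matrix A is random and purely illustrative; the names mirror the ones used in the text.

```python
import numpy as np

d, r = 100, 5
rng = np.random.default_rng(0)
A = rng.normal(size=(d, d))                  # stand-in for the weight matrix of one layer

# Truncated SVD: keep only the top-r singular values/vectors.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_low_rank = U[:, :r] * S[:r]                # (100, 5): left singular vectors scaled by singular values
B_low_rank = Vt[:r, :]                       # (5, 100): top right singular vectors

A_approx = A_low_rank @ B_low_rank           # rank-5 approximation of A

print(A.size)                                # 10000 parameters in the full matrix
print(A_low_rank.size + B_low_rank.size)     # 1000 parameters in the two low-rank factors
```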

Training on Low-Rank Matrices

In the context of LoRA, A_low_rank and B_low_rank are the low-rank approximations of the weight matrix A. They are used to fine-tune the layer on the new task. The original weight matrix A is not fine-tuned.

In the example above, we only used the first 5 singular values and the corresponding singular vectors to form A_low_rank and B_low_rank. The researchers' hypothesis is that this is sufficient to capture most of the information in the original weight matrix A.
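In code, the training setup described here looks roughly like the following PyTorch sketch: the original weight stays frozen, only the two low-rank factors receive gradients, and (as in the standard LoRA formulation) the low-rank product is added to the frozen layer's output rather than replacing it. The class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update of rank r."""
    def __init__(self, d_in, d_out, r=5, alpha=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable (r x d_in)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))         # trainable (d_out x r), starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        frozen = x @ self.weight.T                         # original layer, unchanged
        update = (x @ self.lora_A.T) @ self.lora_B.T       # low-rank adaptation
        return frozen + self.scaling * update

layer = LoRALinear(100, 100, r=5)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 1000: only the two low-rank factors are trained
```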

Now, the next question that comes to mind is where we should inject these low-rank trainable matrices. In general, it is a good idea to inject them at the attention layers of the transformer architecture. This is because:

  • Task-Specific Features: Self-attention captures intricate relationships between words, making it crucial for understanding context and nuances in various tasks. Fine-tuning in this layer enables the model to adapt its attention patterns to specific tasks.
  • Semantic Understanding: Self-attention aids in capturing semantic relationships between words, which is vital for tasks like sentiment analysis, question-answering, and summarization. Fine-tuning here helps the model to better understand the semantics relevant to the task.

However, it is also possible to inject the low-rank trainable matrices at other layers in the transformer architecture. For example, we could inject them at the feed-forward layers as well. This gives the model more capacity to adapt to the new task, but it also increases the number of trainable parameters and the computational cost of the fine-tuning process.

The decision of where to inject the low-rank trainable matrices depends on the specific application. If the application requires a fast fine-tuning process, then it may be better to inject them only at the attention layers (we saw why the attention layers matter in the earlier section of this article). However, if the application requires a higher-accuracy model, then it may be better to inject them at multiple layers in the transformer architecture.

Ultimately, the best way to determine where to inject the low-rank trainable matrices is to experiment with different settings and see what works best for your specific application.
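In practice, libraries such as Hugging Face's peft let you choose the injection points through a configuration object. The sketch below assumes a RoBERTa-style encoder whose attention projections are named "query" and "value"; module names differ between architectures, so check your model before reusing these values.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

config = LoraConfig(
    r=8,                                 # rank of the low-rank matrices
    lora_alpha=16,                       # scaling factor
    target_modules=["query", "value"],   # inject LoRA only into the attention projections
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()       # only a small fraction of parameters remains trainable
```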

The flexibility that LoRA brings to downstream tasks

LoRA can be used to fine-tune an LLM for multiple tasks on multiple data sets separately. This is because LoRA only fine-tunes a small number of parameters, which makes it possible to fine-tune the model on multiple tasks without significantly increasing the model size or the computational cost of the fine-tuning process.

Once the LLM has been fine-tuned for multiple tasks separately, any set of fine-tuned LoRA weights can be merged into the frozen weights at inference time. This is possible because the low-rank trainable matrices are stored separately from the pre-trained model parameters, which means we can load the low-rank matrices for the desired task and swap them in on the fly.
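The NumPy sketch below illustrates this idea: one frozen weight matrix is shared, task-specific low-rank factors are stored separately, and the update for the chosen task is folded into the weight at inference time. The adapter names and random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 100, 5
W_frozen = rng.normal(size=(d, d))               # shared pre-trained weight, never modified

# Separately stored LoRA factors for two different downstream tasks (illustrative).
adapters = {
    "sentiment": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "summarization": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def merged_weight(task, scaling=1.0):
    """Fold the task-specific low-rank update into the frozen weight for inference."""
    B, A = adapters[task]
    return W_frozen + scaling * (B @ A)

W_sentiment = merged_weight("sentiment")         # swap adapters on the fly, no extra inference latency
print(W_sentiment.shape)                         # (100, 100)
```

The Hugging Face peft library exposes a similar folding step for its LoRA models, e.g. via merge_and_unload() on the adapted model.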

In this article, we have discussed the LoRA PEFT technique for parameter-efficient fine-tuning of large language models. We have seen how LoRA PEFT works by decomposing the weight matrices of the self-attention layers into low-rank matrices. We have also seen how this can help to reduce the computational cost of fine-tuning while still maintaining the performance of the original model.

We have also discussed the importance of understanding self-attention for understanding LoRA PEFT. Self-attention is a neural network mechanism that allows an LLM to attend to different parts of its input sequence. This allows the LLM to learn long-range dependencies between words in the sequence. LoRA PEFT relies on self-attention to learn these long-range dependencies on new downstream tasks, so it is important to have an understanding of self-attention in order to apply LoRA PEFT.

Overall, LoRA PEFT is a promising technique for reducing the computational cost of fine-tuning large language models. It can be a valuable tool for researchers and developers who are working with LLMs.
