Fine-Tuning Embedding Model with PEFT and LoRA

Kelvin Lu
8 min read · Aug 1, 2023


In our previous discussion, we explored the evaluation of embedding models and the potential benefits of hosting these models to achieve better results.

The E5-large-v2, a slim, versatile open-source model developed by Microsoft researchers, was the leader on the MTEB leaderboard when I wrote that post. However, just 7 days later it was overtaken by GTE-large, a model only half the size of E5-large-v2 that outperforms it by a clear margin. GTE-base, which is only about one-sixth the size of E5-large-v2, also outperformed it.

MTEB Leaderboard as of 31 Jul 2023

A cursory glance at the model sizes and performance metrics reveals that larger models do not necessarily perform better. In fact, the success of the E5 and GTE models suggests that smaller models still have great potential: text-embedding-ada-002, for example, was beaten by gte-small, a model with only 70 million parameters. That is a remarkable achievement.

This new development is of paramount importance. It is not only a significant technical breakthrough; it also opens up new possibilities for enterprise applications, which are my main interest.

In the future, enterprises will likely use a diverse set of large and small language models to address a wide range of needs. Some models will be trained to handle common tasks, such as security and PII detection. Others will be trained to address more specific needs, such as enterprise search, log analysis, and project-specific tasks. Additionally, smaller models can be bundled together to form an expert board that can provide more comprehensive and nuanced insights.

All of this requires models that are small and efficient, and that we can tune to our needs. In this article, I'm going to explore how to fine-tune embedding models and demonstrate the process on an E5 model.

Challenge of Language Model Fine-Tuning

For a typical computer vision model, the structure is normally a stack of fully connected, convolution, pooling, and activation layers. Fine-tuning is fairly simple: freeze all the other layers and retrain only the top one or two.

Classical AlexNet for Digit Recognition
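
As a minimal PyTorch sketch of that recipe (using torchvision's AlexNet; the layer choice and class count are arbitrary):

import torch.nn as nn
from torchvision import models

# Load a pretrained AlexNet and freeze every parameter.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier layer; only this new layer will be trained.
model.classifier[6] = nn.Linear(4096, 10)  # 10 is an arbitrary number of classes
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))   # only the new head's parameters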

Compared to a computer vision model, the structure of a language model is very different. A typical language model is a stack of Transformer layers, and its input is built by combining three embedding vectors: token embeddings, segment embeddings, and position embeddings. Each of these embeddings has the following shape:

(input length, embedding size)
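
A conceptual sketch of how those three embeddings combine, with sizes chosen to roughly match a small BERT-style encoder (this is illustrative, not the model's actual code):

import torch
import torch.nn as nn

seq_len, hidden = 70, 384                  # arbitrary input length and embedding size
token_emb = nn.Embedding(30522, hidden)    # vocabulary size of a BERT-style tokenizer
segment_emb = nn.Embedding(2, hidden)      # sentence A / sentence B
position_emb = nn.Embedding(512, hidden)   # one vector per position

token_ids = torch.randint(0, 30522, (seq_len,))
segment_ids = torch.zeros(seq_len, dtype=torch.long)
positions = torch.arange(seq_len)

# Each embedding has shape (input length, embedding size); they are summed element-wise.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([70, 384])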

If we follow the old wisdom and fine-tune only the last layer, we will find that it is still a very costly process: it takes a large amount of training data, a beefy machine, and long hours of training. Worse, after fine-tuning, the model tends to forget the knowledge it gained during pre-training. This is called catastrophic forgetting. It happens because that knowledge is spread across the model's weights; when those weights are updated during fine-tuning, the knowledge they encoded is lost.


The second way of fine-tuning a language model is to add an additional layer on top. This can prevent catastrophic forgetting; however, because the model has to process information layer by layer, the extra layer slows down the model's response time. And it is still costly to fine-tune.
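
A hedged sketch of this second approach: freeze the whole pretrained encoder and train only a small head stacked on top (the pooling and head size here are arbitrary choices for illustration):

import torch
import torch.nn as nn
from transformers import AutoModel

class EncoderWithHead(nn.Module):
    """Frozen pretrained encoder plus a small trainable projection head."""

    def __init__(self, name="intfloat/e5-small-v2", out_dim=384):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        for p in self.encoder.parameters():
            p.requires_grad = False              # the base model keeps its pretrained knowledge
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden, out_dim)   # only this layer is trained

    def forward(self, **inputs):
        hidden_states = self.encoder(**inputs).last_hidden_state
        pooled = hidden_states.mean(dim=1)       # simple mean pooling
        return self.head(pooled)                 # the extra layer adds inference latency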

LoRA and PEFT

The third way of fine-tuning a language model is to create a set of matrices with the same shapes as the original weight matrices, freeze the original weights, and keep all of the updates learned during fine-tuning in the new matrices. The next clever move is to decompose each of those matrices into two low-rank matrices. At inference time, the product of the low-rank matrices reconstructs the weight update, which is added to the original model weights to produce the final weights.

This technique is known as Low-Rank Adaptation of Large Language Models, or LoRA for short.

LoRA Structure (image from the original paper)
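
The idea can be sketched in a few lines of PyTorch; this is a conceptual illustration, not the PEFT library's implementation:

import torch

d, r = 1024, 8                      # original dimension and LoRA rank
W = torch.randn(d, d)               # frozen pretrained weight: ~1M parameters
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)  # trainable, initialized small
B = torch.nn.Parameter(torch.zeros(d, r))         # trainable, starts at zero so the update starts at zero

# Trainable parameters: 2 * 8 * 1024 = 16,384 vs. 1,048,576 in the original matrix.
print(A.numel() + B.numel(), W.numel())

# During training only A and B receive gradients. At inference time the
# low-rank update is merged back into the frozen weight.
W_merged = W + B @ A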

Adopting LoRA has three major benefits. First, the decomposition reduces the number of trainable parameters. An original (1024, 1024) matrix has roughly one million trainable parameters; if we decompose it into two rank-8 matrices, the number of trainable parameters becomes:
2 * 8 * 1024 = 16,384, or about 1/64 of the original size.

The second benefit is that because the pretrained weights are frozen and the additional weight matrices are decomposed into lower ranks, catastrophic forgetting is less likely to happen.

And lastly, because the LoRA weights are much smaller, it is possible to host a single base model with multiple specially fine-tuned LoRA weights to serve different purposes.

In addition to LoRA, there is another technique for reducing the cost of model training: quantization. (Strictly speaking, Parameter-Efficient Fine-Tuning, or PEFT, is the umbrella term for methods such as LoRA; the Hugging Face PEFT library is what pairs them with the quantization tricks described here.) Researchers noticed that deep learning weights are normally stored as 32-bit floats, which is unnecessary in most cases, so they down-cast them to lower-precision data types. Down-casting data types to make model training faster is not a new idea.

However, the approach was not widely adopted, because lower-precision data types cannot represent all the weights faithfully, and naive down-casting causes significant model performance degradation. That changed when researchers observed that, in a matrix multiplication, the outlier values matter far more than the rest: the outliers can be kept in FP16, while the non-outliers are stored in a much lower-precision format (8-bit integers, or even 4-bit formats such as FP4). The two partial results are then combined to recover the full result in FP16.
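
Here is a toy, self-contained illustration of that decomposition idea; it is not the actual bitsandbytes implementation, and the threshold and the simulated int8 quantization are made up for demonstration:

import torch

def mixed_precision_matmul(x, W, outlier_threshold=6.0):
    """Toy sketch of outlier-aware matmul: outlier columns kept at high precision,
    the rest run through a crude simulated int8 quantization."""
    # Columns of x whose largest magnitude exceeds the threshold are treated as outliers.
    outlier_cols = x.abs().max(dim=0).values > outlier_threshold

    # Outlier part: left in full precision here (the real kernels use FP16).
    out_hi = x[:, outlier_cols] @ W[outlier_cols, :]

    # Non-outlier part: symmetric int8 quantization, matmul, then dequantize.
    x_lo, W_lo = x[:, ~outlier_cols], W[~outlier_cols, :]
    sx = x_lo.abs().max() / 127.0 + 1e-12
    sw = W_lo.abs().max() / 127.0 + 1e-12
    xq = (x_lo / sx).round().clamp(-127, 127)
    Wq = (W_lo / sw).round().clamp(-127, 127)
    out_lo = (xq @ Wq) * sx * sw

    # Recombine the two partial results (done in FP16 in the real implementation).
    return out_hi + out_lo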

Experiment Design

In this experiment, I am going to continue using E5-small-v2 as the base model. Even with the help of LoRA and PEFT, the training is still best run on a GPU, so I set up a GCP Compute Engine G2 instance with an NVIDIA L4 GPU, 40 GB of disk space, 4 vCPUs, and 16 GB of memory.

I wanted the fine-tuning to be close to real-world business scenarios, so I chose the Quora duplicate questions open dataset as training data [1]. The dataset has more than 400 thousand pairs of questions, each with a label indicating whether the two questions are duplicates.

Quora duplicate questions dataset

This duplicate-question dataset is close to real project needs. Imagine we have identified a model accuracy issue: one intuitive way to organize the training data is to provide both positive and negative examples, so that we can teach the model how to understand the query correctly. This way of organizing training data is called contrastive training. A variation of contrastive training is triplet training, in which each training example contains both a positive and a negative reference.

However, 400K pairs is more than I need, so I kept only the questions whose length falls between 60 and 268 characters (an arbitrary setting) to compose a smaller dataset of 91K rows. I then randomly split off 20% of the data as the test set and used the rest as the training set.
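
A sketch of how that filtering and splitting step might look, assuming the official Quora TSV dump with question1, question2, and is_duplicate columns; the exact filter and file names used in the repo may differ, and the output path here is simply chosen to match the training command shown later:

import pandas as pd

# Load the raw Quora duplicate-question pairs: question1, question2, is_duplicate.
df = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

# Keep only pairs where both questions fall into the chosen length range.
mask = df["question1"].str.len().between(60, 268) & df["question2"].str.len().between(60, 268)
df = df[mask][["question1", "question2", "is_duplicate"]]

# Random 80/20 train/test split.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)
train.to_csv("../data/quora_dq_train.csv", index=False)
test.to_csv("../data/quora_dq_test.csv", index=False)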

The training process was:

  • generate embeddings for question1 and question2;
  • use the label to compute the training loss (a sketch of the loss follows below).
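
The exact training function lives in the repo's script; here is a minimal sketch of a contrastive loss that is consistent with the behaviour described below, with a simple mean-pooling helper for producing the embeddings (both are illustrative, not the script's code):

import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

def contrastive_loss(q1_emb, q2_emb, labels):
    """labels is a float tensor of 0/1 duplicate flags."""
    cos = F.cosine_similarity(q1_emb, q2_emb)                         # shape: (batch,)
    positive_term = labels * (1.0 - cos) ** 2                         # duplicates: pull cosine toward 1
    negative_term = (1.0 - labels) * torch.clamp(cos, min=0.0) ** 2   # non-duplicates: push cosine toward 0
    return (positive_term + negative_term).mean()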

In other words, it pushes the similarity toward 1 for positive question pairs and toward 0 for negative question pairs.

The Git repo can be found at [2].

Experiment Execution

When the Compute Engine instance is ready, log into the server via SSH and run install.sh to prepare the machine. After that, check out the repo and install the Python dependencies listed in script/requirements.txt.

Now the code is ready to go. Change directory into script/ and run the following command:

~/.local/bin/accelerate launch  \
--mixed_precision="fp16" \
peft_lora_embedding_semantic_search.py \
--dataset_name="../data/quora_dq_train.csv" \
--max_length=70 \
--model_name_or_path="intfloat/e5-small-v2" \
--per_device_train_batch_size=64 \
--per_device_eval_batch_size=128 \
--learning_rate=5e-4 \
--weight_decay=0.0 \
--num_train_epochs 3 \
--gradient_accumulation_steps=1 \
--output_dir="../model/peft_lora_e5" \
--seed=42 \
--with_tracking \
--report_to="wandb" \
--use_peft \
--checkpointing_steps "epoch"

If everything is OK, the run will finish in less than 10 minutes. Pay attention to the lines in the log that report the number of trainable parameters.

LoRA has reduced the trainable parameters to less than 0.5 percent! Without this fantastic technology, we wouldn't be able to run this experiment.
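
For reference, the same report can be reproduced by wrapping the base model in a LoRA configuration and calling print_trainable_parameters(); the hyperparameters below are illustrative and may differ from those used in the script:

from transformers import AutoModel
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModel.from_pretrained("intfloat/e5-small-v2")
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,                                # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections of the BERT-style encoder
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # reports trainable vs. total parameter count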

Here is the chart of training losses:

LoRA on E5-small-v2 training loss

The evaluation results are as follows:

ROC AUC of LoRA on E5-small-v2

Before fine-tuning, the original E5-small-v2 scored an ROC AUC of 0.852 on the test set. After being fine-tuned on the 73K training pairs, it scored 0.955, which is a pretty encouraging improvement.
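
The ROC AUC can be computed by embedding both questions of every test pair, scoring each pair by cosine similarity, and handing the scores to scikit-learn. The sketch below assumes the test CSV produced earlier and simple mean pooling; to score the fine-tuned model, load the saved LoRA adapter on top of the base model instead:

import pandas as pd
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2").eval()  # swap in the fine-tuned checkpoint to compare

def embed(texts):
    # In practice, iterate over the test set in small batches instead of one giant batch.
    batch = tokenizer(list(texts), padding=True, truncation=True, max_length=70, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)           # mean pooling

test = pd.read_csv("../data/quora_dq_test.csv")                   # hypothetical test split from earlier
scores = F.cosine_similarity(embed(test["question1"]), embed(test["question2"]))
print(roc_auc_score(test["is_duplicate"], scores.numpy()))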

Conclusion

This experiment demonstrated that fine-tuning a high-performing embedding model is a viable option in real-world project development. With a bit of effort, we can make the model fit specific requirements more closely, which is impossible for many commercial models.

However, needing 73K training pairs is still not ideal; that is too much for project-based developers to produce. When we provide contrastive examples, we only tell the model to pull the positive pairs closer and push the negative pairs farther apart. We cannot actually guarantee that the model will only move the training target toward the references, rather than moving the references toward the training target.

This is not a major problem when pre-training an LLM, because pre-training requires a huge amount of low-quality data anyway. But it matters in the fine-tuning phase, where we want more precise control over the training result. Moreover, we are able to manually craft small amounts of high-quality data.

I’ll explore options for accurate fine-tuning in the future.
