Optimizing RAG: Fine-Tuning Embedding and Reranking Models with Your Data and LlamaIndex

Deltaaruna · Published in Effectz.AI · Apr 16, 2024

1. Introduction

This article describes a cool trick you can use to improve retrieval performance in your RAG pipelines: fine-tuning the embedding model (used for retrieval) and the cross-encoder (used for reranking) on your own data.

First, we will describe how embedding and reranking work. Then we will show you how to train an embedding model with your own data and plug it into a LlamaIndex pipeline. The complete source code is available on GitHub. If you need an introduction to LlamaIndex, you can refer to this article on our blog. The fine-tuned models can be plugged into any RAG framework; we are simply using LlamaIndex as an example.

Cross-encoders are crucial for reranking but are far too slow for retrieving from large document collections. This fine-tuning technique gives you the speed advantages of a direct embedding lookup together with better accuracy than a non-fine-tuned embedding model.

2. Vector embeddings

LlamaIndex creates vector embeddings from your documents, so let's discuss how to optimize this step first.

2.1 Bi-encoders

A bi-encoder maps a single string to a vector embedding. You can visualize it as follows.
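To make this concrete, here is a minimal sketch (using the same multi-qa-MiniLM-L6-cos-v1 model we use below): the bi-encoder maps each sentence independently to a fixed-size vector, and the vectors are then compared with cosine similarity.

from sentence_transformers import SentenceTransformer, util

# Encode two sentences independently into fixed-size vectors
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
emb_question = bi_encoder.encode("What is the capital of France?", convert_to_tensor=True)
emb_passage = bi_encoder.encode("Paris is the capital and largest city of France.", convert_to_tensor=True)

# Compare the two embeddings with cosine similarity (higher means more similar)
print(util.cos_sim(emb_question, emb_passage))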

Now let’s discuss how a vector DB search works. When you ask a question, the question is converted into a vector embedding, then a similarity search is performed against the vector store to find the documents with the highest similarity, using a measure such as cosine similarity. This involves the following steps.

  1. Create vector embeddings for your data
  2. Convert your questions into vector embeddings
  3. Retrieve the documents with the highest similarity scores

To understand this better, let’s do some coding. Make sure you run the code on a Colab GPU.

!pip install -U sentence-transformers rank_bm25

import json
import gzip
import os
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder, util

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add a GPU to your notebook")

# We use the bi-encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256  # Truncate long passages to 256 tokens
top_k = 32  # Number of passages we want to retrieve with the bi-encoder

# As dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder
wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'
if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        # Add all paragraphs
        # passages.extend(data['paragraphs'])
        # Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)
Here we encode all our data into a vector space. Now let’s do a semantic search over the vector embeddings.
# This function will search all Wikipedia articles for passages that
# answer the query
def search(query):
    print("Input question:", query)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    question_embedding = question_embedding.cuda()
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    # Print the top-3 bi-encoder hits
    for hit in sorted(hits, key=lambda x: x['score'], reverse=True)[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

Top 3 hits are as follows.

Now you can clearly see there is an issue. Bi-encoders, although quite fast, do not always produce accurate results. So what can we do better?

2.2 Cross-encoders

There is a class of encoders called cross-encoders that behaves differently. Unlike bi-encoders, they take two sentences as input and output a similarity score between 0 and 1. You can visualize it as follows.

Unlike bi-encoders, they produce much more accurate results. So the problem is solved, right? For information retrieval we could use a cross-encoder: compare the question against every sentence in the corpus and take the ones with the highest scores. This would certainly give highly accurate results, but it has a serious drawback: running a cross-encoder in this manner is computationally very expensive, because it requires a huge number of pairwise comparisons.
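To make this concrete, here is a minimal sketch of cross-encoder scoring. We assume the publicly available cross-encoder/ms-marco-MiniLM-L-6-v2 reranking model here; the exact model is our choice for illustration, not something this pipeline requires.

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and a candidate passage together
# and outputs one relevance score per pair (higher means more relevant)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is the capital of France?"
candidates = [
    "Paris is the capital and largest city of France.",
    "The Great Wall of China is thousands of kilometres long.",
]

scores = cross_encoder.predict([[query, passage] for passage in candidates])
print(scores)  # the first pair should receive a clearly higher score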

2.3 Cross-encoders vs. bi-encoders

So what can we do? Bi-encoders are fast and computationally inexpensive, but less accurate. Cross-encoders are much more accurate, but expensive. How do we get the best of both worlds? The computational cost of a cross-encoder is directly tied to the number of comparisons it makes, so we need a way to limit that number by excluding unnecessary comparisons. The solution is to use the bi-encoder to retrieve candidates and the cross-encoder to rerank them. The bi-encoder, although less accurate, quickly produces a small but good-enough candidate set, and the cross-encoder can then work on that smaller set and produce very good results while spending far less compute.

# Load a pretrained cross-encoder for re-ranking
# (the SBERT retrieve & re-rank example uses this model)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# This function will search all Wikipedia articles for passages that
# answer the query
def search_with_cross_encoder(query):
    print("Input question:", query)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    question_embedding = question_embedding.cuda()
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross-encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder scores to the hits
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output the top-3 hits from the bi-encoder
    print("\n - - - - - - - - - - - - -\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output the top-3 hits from the re-ranker
    print("\n - - - - - - - - - - - - -\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

Here are the results.

Way better, right?

Now we know how to optimize our RAG retrieval: we can use a cross-encoder for reranking.

3. Augmenting encoders with your data

Using a good embedding model already improves accuracy. If we additionally fine-tune that embedding model on our own data, we can be even more accurate. This is the idea behind the paper "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks".

We can use this method to fine-tune the embeddings, making the final results more accurate.

The pipeline is as follows.

  1. Train your cross-encoder with your dataset (let's call it the "Golden-Dataset", shown as Dataset A in the image). For example, this could be a dataset about your company.
  2. Get a more narrow dataset (let's call it the "Silver-Dataset", shown as Dataset B in the image). This new dataset could be about a specific department of your company.
  3. Use the cross-encoder you trained in step 1 to label the "Silver-Dataset" from step 2.
  4. Use this labeled data to train the bi-encoder.
  5. Use the bi-encoder to encode your data.

How do we actually do this? Let’s forget LlamaIndex for a moment and think about implementing the above pipeline. All source code pertinent to this discussion is available on GitHub; please clone the repository and follow along with the post.

3.1 Train your cross-encoder on your “Golden-Dataset”

Earlier we discussed cross-encoders. Now let’s see how to train your cross-encoder with an existing data set.

Here we are using supervised learning to train the cross-encoder. The dataset should consist of sentence pairs, each with a numerical label indicating how similar the two sentences in the pair are.
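In sentence-transformers terms, each training example is an InputExample holding the two sentences and a label. The hypothetical pairs below just illustrate the expected format; the actual labels in the script come from the STSbenchmark dataset, normalized to the 0...1 range.

from sentence_transformers.readers import InputExample

# Hypothetical sentence pairs with normalized similarity labels (0 = unrelated, 1 = equivalent)
example_pairs = [
    InputExample(texts=["A man is playing a guitar.", "Someone plays a guitar."], label=0.9),
    InputExample(texts=["A man is playing a guitar.", "The stock market fell today."], label=0.05),
]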

from torch.utils.data import DataLoader
from sentence_transformers import models, losses, util, LoggingHandler, SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator
from sentence_transformers.evaluation import BinaryClassificationEvaluator
from sentence_transformers.readers import InputExample
from datetime import datetime
from zipfile import ZipFile
import logging
import csv
import sys
import torch
import math
import gzip
import os


# You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
model_name = "bert-base-uncased"
batch_size = 16
num_epochs = 1
max_seq_length = 128
use_cuda = torch.cuda.is_available()

###### Read Datasets ######
sts_dataset_path = "datasets/stsbenchmark.tsv.gz"
qqp_dataset_path = "quora-IR-dataset"


# Check if the STSb dataset exists. If not, download and extract it
if not os.path.exists(sts_dataset_path):
    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)

# Check if the QQP dataset exists. If not, download and extract
if not os.path.exists(qqp_dataset_path):
    logging.info("Dataset not found. Download")
    zip_save_path = "quora-IR-dataset.zip"
    util.http_get(url="https://sbert.net/datasets/quora-IR-dataset.zip", path=zip_save_path)
    with ZipFile(zip_save_path, "r") as zipIn:
        zipIn.extractall(qqp_dataset_path)

# The cross-encoder will be saved in this location. The path will look like output/cross-encoder/stsb_indomain_bert-base-uncased
# Knowing the path will be helpful in the next step
cross_encoder_path = (
    "output/cross-encoder/stsb_indomain_"
    + model_name.replace("/", "-")
)

# The bi-encoder will be saved in this location. The path will look like output/bi-encoder/qqp_cross_domain_bert-base-uncased
# Knowing the path will be helpful in the next step
bi_encoder_path = (
    "output/bi-encoder/qqp_cross_domain_"
    + model_name.replace("/", "-")
)

###### Cross-encoder (sentence-transformers) ######

logging.info("Loading cross-encoder model: {}".format(model_name))
# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for cross-encoder model
cross_encoder = CrossEncoder(model_name, num_labels=1)

###### Bi-encoder (sentence-transformers) ######

logging.info("Loading bi-encoder model: {}".format(model_name))

# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])


# Step 1: Train cross-encoder model with STSbenchmark

logging.info("Step 1: Train cross-encoder: {} with STSbenchmark (source dataset)".format(model_name))

gold_samples = []
dev_samples = []
test_samples = []

with gzip.open(sts_dataset_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row["score"]) / 5.0  # Normalize score to range 0 ... 1

        if row["split"] == "dev":
            dev_samples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=score))
        elif row["split"] == "test":
            test_samples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=score))
        else:
            # As we want to get symmetric scores, i.e. CrossEncoder(A,B) = CrossEncoder(B,A), we pass both combinations to the train set
            gold_samples.append(InputExample(texts=[row["sentence1"], row["sentence2"]], label=score))
            gold_samples.append(InputExample(texts=[row["sentence2"], row["sentence1"]], label=score))


# We wrap gold_samples (which is a List[InputExample]) into a pytorch DataLoader
train_dataloader = DataLoader(gold_samples, shuffle=True, batch_size=batch_size)


# We add an evaluator, which evaluates the performance during training
evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name="sts-dev")

# Configure the training
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) # 10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))

# Train the cross-encoder model
cross_encoder.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
    output_path=cross_encoder_path,
)

Let’s go through the important points of the code.

Here we are using the “bert-base-uncased” model. For demonstration purposes I have set batch_size = 16, num_epochs = 1, and max_seq_length = 128; for production you should use larger values.

model_name = "bert-base-uncased"
batch_size = 16
num_epochs = 1
max_seq_length = 128
use_cuda = torch.cuda.is_available()

Then we download STS benchmark dataset and Quora Question Pair dataset.

# Check if the STSb dataset exists. If not, download and extract it
if not os.path.exists(sts_dataset_path):
    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)

# Check if the QQP dataset exists. If not, download and extract
if not os.path.exists(qqp_dataset_path):
    logging.info("Dataset not found. Download")
    zip_save_path = "quora-IR-dataset.zip"
    util.http_get(url="https://sbert.net/datasets/quora-IR-dataset.zip", path=zip_save_path)
    with ZipFile(zip_save_path, "r") as zipIn:
        zipIn.extractall(qqp_dataset_path)

Then mean pooling is applied to get a fixed-size sentence vector from the token embeddings; this is what turns a plain transformer into a sentence-embedding (bi-encoder) model that can be fine-tuned for a downstream task. Next we build the bi-encoder as a SentenceTransformer from the word-embedding model and the pooling model.

# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])
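To build intuition for what the pooling layer does: the transformer outputs one embedding per token, and mean pooling averages the embeddings of the non-padding tokens into a single fixed-size sentence vector. A toy illustration with made-up tensors (not part of the training script):

import torch

# token_embeddings: (batch, seq_len, dim); attention_mask marks real tokens vs. padding
token_embeddings = torch.randn(1, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 0]])  # the last position is padding

mask = attention_mask.unsqueeze(-1).float()                                   # (1, 4, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 8)
print(sentence_embedding.shape)  # torch.Size([1, 8])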

Then it is time to specify the data loader and the evaluator. After that we can train the cross-encoder with the Golden-Dataset.

# We wrap gold_samples (which is a List[InputExample]) into a pytorch DataLoader
train_dataloader = DataLoader(gold_samples, shuffle=True, batch_size=batch_size)


# We add an evaluator, which evaluates the performance during training
evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name="sts-dev")

# Configure the training
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) # 10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))

# Train the cross-encoder model
cross_encoder.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
    output_path=cross_encoder_path,
)

Now you have a cross-encoder fine-tuned on your Golden-Dataset.

3.2 Label the Silver-Dataset with trained cross-encoder

As you know, a cross-encoder can be used to determine the similarity between sentence pairs. Now we can use the cross-encoder trained on the Golden-Dataset to label sentence pairs in the Silver-Dataset. The Silver-Dataset should have some relevance to the Golden-Dataset.

The aim of this step is to create more high-quality training data, so that we have enough data to fine-tune the bi-encoder. Ideally, the Silver-Dataset is larger than the Golden-Dataset. Since the cross-encoder is more accurate, it can produce a reliably labeled training set for the supervised training of the bi-encoder.

For each sentence pair, the cross-encoder calculates a score, which we then turn into a binary label: if the score is 0.5 or higher the pair is labeled 1, otherwise 0. Also note that because we are running a cross-encoder over every pair, this labeling step takes some time.

# Step 2: Label QQP train dataset using cross-encoder (BERT) model

logging.info("Step 2: Label QQP (target dataset) with cross-encoder: {}".format(model_name))

cross_encoder = CrossEncoder(cross_encoder_path)

silver_data = []

with open(os.path.join(qqp_dataset_path, "classification/train_pairs.tsv"), encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["is_duplicate"] == "1":
            silver_data.append([row["question1"], row["question2"]])

silver_scores = cross_encoder.predict(silver_data)

# All model predictions should be between [0,1]
assert all(0.0 <= score <= 1.0 for score in silver_scores)

binary_silver_scores = [1 if score >= 0.5 else 0 for score in silver_scores]

Now we have a quality dataset that can be used for supervised training of a bi-encoder.

3.3 Train the bi-encoder with the Silver-Dataset

# Step 3: Train bi-encoder (SBERT) model with QQP dataset - Augmented SBERT

logging.info("Step 3: Train bi-encoder: {} over labeled QQP (target dataset)".format(model_name))

# Convert the dataset to a DataLoader ready for training
logging.info("Loading BERT labeled QQP dataset")
qqp_train_data = list(
    InputExample(texts=[data[0], data[1]], label=score)
    for (data, score) in zip(silver_data, binary_silver_scores)
)


train_dataloader = DataLoader(qqp_train_data, shuffle=True, batch_size=batch_size)
train_loss = losses.MultipleNegativesRankingLoss(bi_encoder)

###### Classification ######
# Given (question1, question2), is this a duplicate or not?
# The evaluator will compute the embeddings for both questions and then compute
# a cosine similarity. If the similarity is above a threshold, we have a duplicate.
logging.info("Read QQP dev dataset")

dev_sentences1 = []
dev_sentences2 = []
dev_labels = []

with open(os.path.join(qqp_dataset_path, "classification/dev_pairs.tsv"), encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        dev_sentences1.append(row["question1"])
        dev_sentences2.append(row["question2"])
        dev_labels.append(int(row["is_duplicate"]))

evaluator = BinaryClassificationEvaluator(dev_sentences1, dev_sentences2, dev_labels)

# Configure the training.
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) # 10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))

# Train the bi-encoder model
bi_encoder.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
    output_path=bi_encoder_path,
)

# Evaluate Augmented SBERT performance on QQP benchmark dataset

# Loading the augmented sbert model
bi_encoder = SentenceTransformer(bi_encoder_path)

logging.info("Read QQP test dataset")
test_sentences1 = []
test_sentences2 = []
test_labels = []

with open(os.path.join(qqp_dataset_path, "classification/test_pairs.tsv"), encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        test_sentences1.append(row["question1"])
        test_sentences2.append(row["question2"])
        test_labels.append(int(row["is_duplicate"]))

evaluator = BinaryClassificationEvaluator(test_sentences1, test_sentences2, test_labels)
bi_encoder.evaluate(evaluator)

3.4 Use trained bi-encoder to generate vector embeddings

Now we know the theory behind custom bi-encoder training, so it is time to use the trained bi-encoder and cross-encoder with LlamaIndex. In the Colab directory structure, you will see the trained bi-encoder and cross-encoder as follows.
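Based on the output paths configured in the training script, the output folder should look roughly like this (the exact folder names depend on the model_name you used):

output/
  bi-encoder/
    qqp_cross_domain_bert-base-uncased/
  cross-encoder/
    stsb_indomain_bert-base-uncased/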

In addition, we need some data to encode, right? For demonstration purposes, we have created a directory called data inside datasets and added a text file called paul_graham_essay.txt. It should look like the following.

Now let’s look at the code. We can use the trained bi-encoder for embedding and the trained cross-encoder for reranking.

import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import SentenceTransformerRerank
import openai

openai.api_key = ""  # set your OpenAI API key here

# Load documents from datasets/data
documents = SimpleDirectoryReader('datasets/data').load_data()

# Now let's use the bi-encoder trained earlier
print("######## Using bi-encoder trained with our data")

index = VectorStoreIndex.from_documents(
    documents, embed_model="local:output/bi-encoder/qqp_cross_domain_bert-base-uncased", show_progress=True
)
query_engine = index.as_query_engine()

response = query_engine.query("What did the author do growing up?")
print(response)

# Now let's use the cross-encoder trained earlier for re-ranking
print("######## Using cross-encoder trained with our data for reranking")

# Initialize the reranker
rerank = SentenceTransformerRerank(
    model="./output/cross-encoder/stsb_indomain_bert-base-uncased", top_n=3
)

# Build the query engine with the reranker as a node postprocessor
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])

response = query_engine.query("What did the author do growing up?")
print(response)

In addition to cross-encoder reranking, LlamaIndex supports a few more reranking options, such as LLM rerank, Cohere rerank, and ColBERT rerank. The examples on the LlamaIndex site are really good, and you can follow them.
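For example, the LLM-based reranker ships with llama_index.core. A minimal sketch, reusing the index built above and the OpenAI key set earlier (the parameter values are just illustrative):

from llama_index.core.postprocessor import LLMRerank

# Use an LLM to re-rank the retrieved nodes instead of a cross-encoder
llm_rerank = LLMRerank(choice_batch_size=5, top_n=3)

query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[llm_rerank])
print(query_engine.query("What did the author do growing up?"))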

⭐️ Follow me on LinkedIn or Twitter for updates on AI ⭐️

I’m currently the Co-Founder & CEO @ Effectz.AI. We specialize in Privacy Preserving AI Solutions & AI Consulting.

4. References

  1. https://sbert.net/examples/applications/cross-encoder/README.html
  2. https://sbert.net/examples/training/data_augmentation/README.html
  3. https://arxiv.org/abs/1908.10084
  4. https://arxiv.org/abs/2010.08240
  5. https://docs.llamaindex.ai/en/stable/getting_started/starter_example
  6. https://medium.com/rahasak/build-rag-application-using-a-llm-running-on-local-computer-with-ollama-and-llamaindex-97703153db20
  7. https://docs.llamaindex.ai/en/stable/examples
