Finding the Right Embedding Model for Your RAG Application

Roman Purkhart
Published in CONTACT Research · Jun 11, 2024

Introduction

Selecting the ideal embedding model is crucial in the development of natural language processing (NLP) applications. This choice is particularly vital in Retrieval Augmented Generation (RAG) systems, where the embedding model significantly influences the effectiveness of the retriever component.

This article explores the methods we employed to identify the most suitable embedding model for our specific project: a RAG Chatbot designed to navigate the documentation of our product, CONTACT Elements.

Understanding the Documentation Structure

The first step in our process involved a thorough analysis of the documentation structure. We needed to preprocess the content and segment it into equally structured parts, or “chunks,” which could then be embedded and stored in a vector database. In this step, we defined the preprocessing parameters, such as chunk size, chunk overlap, and the splitting algorithm.
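To make this concrete, the sketch below shows a naive character-based splitter with overlap; the file path and parameter values are purely illustrative, and the actual splitting algorithm was itself one of the preprocessing parameters we varied.

def split_into_chunks(text, chunk_size=512, chunk_overlap=64):
    """Split a documentation page into overlapping, fixed-size chunks."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Hypothetical input file; in practice every documentation page is processed
with open("docs/getting_started.txt", encoding="utf-8") as f:
    page_text = f.read()

chunks = split_into_chunks(page_text)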

To evaluate the quality of the retriever part of our RAG application, we gathered a set of typical questions related to our documentation and paired them with the appropriate documentation pages. Although this was time-intensive, it laid a strong foundation for enhancing the retriever’s accuracy and establishing a reliable evaluation system.
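Conceptually, the evaluation set pairs each question with its relevant documentation pages, and retrieval quality can then be scored with metrics such as Recall@k and NDCG. The snippet below is a simplified sketch of this idea; the example questions, page names, and helper functions are illustrative, not our production code.

import math

# Simplified sketch of the evaluation set: each entry pairs a typical user
# question with the documentation pages that should be retrieved for it.
evaluation_set = [
    {"question": "How do I configure user permissions?",
     "relevant_pages": ["admin/permissions.html"]},
    {"question": "Which file formats can be imported?",
     "relevant_pages": ["import/formats.html", "import/overview.html"]},
]

def recall_at_k(retrieved_pages, relevant_pages, k):
    """Fraction of the relevant pages found among the top-k retrieved pages."""
    hits = len(set(retrieved_pages[:k]) & set(relevant_pages))
    return hits / len(relevant_pages)

def ndcg_at_k(retrieved_pages, relevant_pages, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, page in enumerate(retrieved_pages[:k])
              if page in relevant_pages)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_pages), k)))
    return dcg / ideal if ideal > 0 else 0.0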

Building the Evaluation Pipeline with DVC

For efficient management of our data and processes, we integrated Data Version Control (DVC), a version control system tailored for machine learning projects that facilitates handling large data files, code, and models alongside existing Git workflows. This integration helps track the entire lifecycle of ML models, including data changes, dependencies, and pipelines, in a systematic manner.

Our pipeline included stages for preprocessing, embedding, and uploading the documentation. For the queries, we added embedding and retrieval stages, followed by a final stage to record and analyze the experiment’s metrics. By using DVC, we could easily manage and version large datasets and models, significantly enhancing reproducibility and collaboration across our team. This structured approach facilitated systematic tracking and adjustments, allowing for a comprehensive evaluation of different embedding strategies.
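The dvc.yaml below sketches how such a pipeline can be wired together; the stage names, scripts, and parameter keys are placeholders rather than our actual configuration.

stages:
  preprocess:
    cmd: python preprocess.py
    deps: [docs/, preprocess.py]
    params: [chunking.size, chunking.overlap, chunking.splitter]
    outs: [data/chunks.json]
  embed_docs:
    cmd: python embed_docs.py
    deps: [data/chunks.json, embed_docs.py]
    params: [embedding.model]
    outs: [data/doc_embeddings.json]
  upload:
    cmd: python upload_to_solr.py
    deps: [data/doc_embeddings.json, upload_to_solr.py]
  embed_queries:
    cmd: python embed_queries.py
    deps: [data/queries.json, embed_queries.py]
    params: [embedding.model]
    outs: [data/query_embeddings.json]
  retrieve:
    cmd: python retrieve.py
    deps: [data/query_embeddings.json, retrieve.py]
    outs: [data/retrieved.json]
  evaluate:
    cmd: python evaluate.py
    deps: [data/retrieved.json, evaluate.py]
    metrics:
      - metrics.json:
          cache: false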

Our DVC Pipeline

Selecting an Embedding Model

A crucial decision was whether to use a commercial embedding model available through an API, such as those offered by OpenAI or Cohere, or to host our own model. For our project, commercial models were adequate as our documentation contained no sensitive information.
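With a commercial provider, computing an embedding is a single API call. The snippet below sketches this with the OpenAI Python client; the example query is made up, and the optional dimensions argument is only supported by the text-embedding-3 models.

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I configure user permissions?"],
    dimensions=768,  # text-embedding-3 models can return shortened vectors
)
query_embedding = response.data[0].embedding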

However, for other internal projects requiring confidentiality, self-hosted models were necessary. We compared open-source models from Hugging Face with commercial options using our non-confidential documentation. The Massive Text Embedding Benchmark (MTEB) Leaderboard on Hugging Face was a starting point, but we further validated these models against our specific queries to ensure their effectiveness.

Hosting an Embedding Model

To host models independently, we utilized the Sentence Transformers library from Hugging Face. Loading a model and calculating embeddings can be done in a few lines of code:

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
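The resulting vectors can then be compared directly, for instance with the cosine-similarity helper shipped with the library (a minimal continuation of the snippet above):

from sentence_transformers import util

# Pairwise cosine similarities between all encoded sentences
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)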

Attention to Detail in Model Configuration

For optimal performance in our question answering application, certain models require specific prefixes to be added to the queries and document chunks before embedding. This necessity arises because these models were originally trained with these prefixes, which helps them understand and categorize the input data correctly.

For instance, the model intfloat/e5-large-v2 expects the prefix "passage: " on each chunk of documentation and "query: " on each query. Similarly, the model BAAI/bge-large-en-v1.5 requires that queries begin with "Represent this sentence for searching relevant passages:" to align with its training. Adhering to these specific configurations ensures that the models process and interpret the data as intended, thereby maximizing their effectiveness in retrieving accurate answers.
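In code, this simply means prepending the respective strings before encoding. The sketch below shows the idea for intfloat/e5-large-v2; the example query is made up, and the chunks variable is assumed to hold the preprocessed documentation chunks.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# e5 models expect the prefixes they were trained with on every input
doc_embeddings = model.encode(["passage: " + chunk for chunk in chunks])
query_embedding = model.encode("query: How do I configure user permissions?")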

The final code was run on a dedicated server equipped with an NVIDIA TITAN GPU (12 GB VRAM). Most of the models we tested occupied only between 2 and 4 GB of VRAM.

To enhance the flexibility and accessibility of calculating embeddings across various devices, we integrated Sentence Transformers with Flask. This setup allows us to send texts for embedding directly to an API, mirroring the functionality of commercial models. Embedding our comprehensive documentation, which typically comprises approximately 60,000 chunks, is efficiently completed in about five minutes.
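A minimal version of such an embedding service could look like the following; the endpoint name and payload format are our illustrative choices here, not a documented interface.

from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
model = SentenceTransformer("intfloat/e5-large-v2")

@app.route("/embed", methods=["POST"])
def embed():
    # Expects a JSON body of the form {"texts": ["first text", "second text"]}
    texts = request.get_json()["texts"]
    embeddings = model.encode(texts)
    return jsonify({"embeddings": embeddings.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)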

For storing and retrieving the calculated embeddings, we chose Solr due to its robust vector search capabilities. Our setup allowed us to manage data efficiently through an API, facilitating seamless integration into our DVC pipeline.
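Retrieval then amounts to a k-nearest-neighbour query against a dense vector field. The sketch below assumes a Solr 9 core named "docs" with a DenseVectorField called "embedding"; the host, core, and field names are placeholders.

import requests

def retrieve(query_embedding, k=10):
    """Run a k-nearest-neighbour search against Solr's dense vector field."""
    vector = ",".join(str(x) for x in query_embedding)
    params = {
        "q": f"{{!knn f=embedding topK={k}}}[{vector}]",
        "fl": "id,page,score",
    }
    response = requests.get("http://localhost:8983/solr/docs/select", params=params)
    return response.json()["response"]["docs"]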

Running Experiments and Presenting Results

With everything configured, we utilized DVC to automate a series of experiments that systematically tested various embedding models and preprocessing parameters such as chunk size, overlap, and the splitting algorithm. The following figures show the maximum Recall for 5 and 10 retrieved documents as well as the maximum NDCG for each model across the different preprocessing parameters (commercial models in orange, self-hosted models in green).

The average Recall for k=5 retrieved documents on our test questions
The average Recall for k=10 retrieved documents on our test questions
The average Normalized Discounted Cumulative Gain (NDCG) for retrieved documents on our test questions

+---------------+------------------------------------------------------------+
| Label         | Model (Provider, Dimension)                                |
+---------------+------------------------------------------------------------+
| openai-s      | text-embedding-3-small (OpenAI, 768)                       |
| openai-l      | text-embedding-3-large (OpenAI, 1536)                      |
| openai-s-full | text-embedding-3-small (OpenAI, 1536)                      |
| cohere        | embed-english-v3.0 (Cohere, 1024)                          |
| bge-m3        | BAAI/bge-m3 (Hugging Face, 1024)                           |
| e5-large      | intfloat/e5-large-v2 (Hugging Face, 1024)                  |
| bge-large-en  | BAAI/bge-large-en-v1.5 (Hugging Face, 1024)                |
| snowflake     | Snowflake/snowflake-arctic-embed-l (Hugging Face, 1024)    |
| all-minilm    | sentence-transformers/all-MiniLM-L6-v2 (Hugging Face, 384) |
+---------------+------------------------------------------------------------+

The results showed that some of the self-hosted models performed as well as the state-of-the-art commercial ones. In our case, the two best-performing models were intfloat/e5-large-v2 and Snowflake/snowflake-arctic-embed-l, which even outperformed the commercial models on some of the metrics.

Conclusion

By rigorously testing different embedding models and preprocessing techniques, and by automating these tests with DVC, we were able to significantly enhance the functionality of our RAG Chatbot. This approach not only improved our system’s ability to retrieve relevant documentation but also enriched the user experience by providing accurate, context-aware responses to queries.

About CONTACT Research. CONTACT Research is a dynamic research group dedicated to collaborating with innovative minds from the fields of science and industry. Our primary mission is to develop cutting-edge solutions for the engineering and manufacturing challenges of the future. We undertake projects that encompass applied research, as well as technology and method innovation. An independent corporate unit within the CONTACT Software Group, we foster an environment where innovation thrives.
