Choosing the Right Embedding Model for RAG in Generative AI

Shivika K Bisen
Bright AI
5 min read · Jul 5, 2024

Embedding Models are the key to effective Retrieval Augmented Generation (RAG)

Recap:

Embedding models create fixed-length vector representations of text, focusing on semantic meaning for tasks like similarity comparison.
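As a minimal sketch of what "similarity comparison" means here, the snippet below computes cosine similarity between two fixed-length vectors. The toy 4-dimensional vectors are invented; real embedding models output hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two fixed-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model output.
query_vec = [0.9, 0.1, 0.3, 0.0]
log_vec = [0.8, 0.2, 0.4, 0.1]

print(round(cosine_similarity(query_vec, log_vec), 3))
```

A score near 1.0 means the two texts are semantically close; retrieval ranks candidate documents by this score.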

LLMs (Large Language Models) are generative AI models that can understand and perform general language tasks, and offer more flexibility in input/output formats.

Let’s understand the different embeddings with a use case.

Use Case Scenario

A user reaches out to a customer support chatbot with an issue related to their account. The issue is logged in the company’s data warehouse. The Gen AI chatbot needs to retrieve relevant information from the data warehouse to diagnose and resolve the issue.

A. Embedding Models

1. Static Embeddings

Static embeddings generate a fixed vector representation for each word in the vocabulary, regardless of the context or order in which the word appears. Contextual embeddings, by contrast, produce different vectors for the same word depending on its context within a sentence.

Example

Customer Query: “I can’t access my bank account.”

Error Log: “Account access denied due to incorrect password.”

With Word2Vec, GloVe, or Doc2Vec (dense vector based) and TF-IDF (keyword/sparse vector based), the vectors for “access” and “account” in both the query and the log will be similar, returning relevant results based on cosine similarity.
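A rough sparse-vector sketch of this matching, using plain bag-of-words counts as a stand-in for TF-IDF weighting (the tokenizer below is deliberately crude):

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Sparse bag-of-words vector: token -> count (a crude stand-in for TF-IDF)."""
    return Counter(text.lower().replace("'", "").replace(".", "").split())

def sparse_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

query = "I can't access my bank account."
log = "Account access denied due to incorrect password."

# "access" and "account" appear in both texts, so the score is non-zero.
print(round(sparse_cosine(bow_vector(query), bow_vector(log)), 2))
```

The shared keywords alone drive the match; no meaning is modeled, which is exactly why the limitations below arise.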

Limitations

  • Polysemy Issue: Words with multiple meanings (e.g., “bank”) have the same vector regardless of context (river bank vs. financial bank).
  • Context Insensitivity once embeddings are generated: Cannot differentiate between “access denied” due to various reasons (e.g., incorrect password, account lockout).
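The polysemy issue can be seen in a toy lookup table; the vectors and the `embed_word` helper below are hypothetical, purely to illustrate that a static model ignores context:

```python
# Hypothetical static lookup table: one vector per word, fixed at training time.
static_vectors = {
    "bank": [0.2, 0.7, 0.1],
    "river": [0.1, 0.1, 0.9],
    "account": [0.3, 0.8, 0.0],
}

def embed_word(word: str, context: str) -> list[float]:
    """Static embedding lookup. The context argument is ignored --
    that is exactly the limitation."""
    return static_vectors[word]

v1 = embed_word("bank", "I sat on the river bank")
v2 = embed_word("bank", "My bank account is locked")
assert v1 == v2  # identical vector in both sentences
```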

Comparison Summary

2. Contextual Embeddings

BERT, RoBERTa, SBERT, ColBERT, MPNet

  • Bidirectional: Captures context from both directions within a sentence, leading to a deep understanding of the entire sentence.
  • Focused Context: Primarily designed for understanding the context within relatively short spans of text (e.g., sentences or paragraphs).

Example

Customer Query: “I can’t access my bank account.”

Error Log 1: “Account access denied due to incorrect password.”

Error Log 2: “Account access denied due to multiple failed login attempts.”

Error Log 3: “User cannot login after password reset.”

BERT, RoBERTa, all-MiniLM-L6-v2/SBERT (masked language models), and Paraphrase-MPNet-Base-v2 (permuted language model) embeddings capture context and understand that “can’t access my account” is related to “access denied” and “cannot login,” because they all involve issues with account access. These are a good choice for the retrieval step.

ColBERT (Contextualized Late Interaction over BERT) is a retrieval model that scores query–document pairs by comparing per-token BERT embeddings (late interaction). In practice it is often paired with a cheap first stage such as BM25 that fetches candidate documents, which ColBERT then re-ranks, balancing efficiency and contextual relevance in information retrieval tasks.
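A highly simplified sketch of that two-stage retrieve-then-re-rank pattern; both scoring functions below are crude stand-ins (keyword overlap rather than BM25, length-normalized overlap rather than ColBERT’s per-token MaxSim):

```python
def tokens(text: str) -> set[str]:
    """Crude tokenizer: lowercase, strip apostrophes and periods."""
    return set(text.lower().replace("'", "").replace(".", "").split())

def lexical_score(query: str, doc: str) -> int:
    """First stage: cheap keyword-overlap score (stand-in for BM25)."""
    return len(tokens(query) & tokens(doc))

def late_interaction_score(query: str, doc: str) -> float:
    """Second-stage stand-in: length-normalized overlap.
    Real ColBERT compares per-token BERT embeddings (MaxSim) instead."""
    return lexical_score(query, doc) / max(len(doc.split()), 1)

logs = [
    "Account access denied due to incorrect password.",
    "Account access denied due to multiple failed login attempts.",
    "User cannot login after password reset.",
]
query = "I can't access my bank account."

# Stage 1: keep the top candidates by the cheap lexical score.
candidates = sorted(logs, key=lambda d: lexical_score(query, d), reverse=True)[:2]
# Stage 2: re-rank only those candidates with the more expensive scorer.
reranked = sorted(candidates, key=lambda d: late_interaction_score(query, d), reverse=True)
print(reranked[0])
```

The point of the pattern is cost: the expensive scorer only ever sees the handful of candidates the cheap stage lets through.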

Limitations

  • Context Limitation: Masked and permuted language models are good at understanding context within a given text span (like a sentence or paragraph), but they cannot generate text or handle tasks beyond understanding and retrieving relevant documents.

Comparison Summary

3. GPT-Based Embeddings

  • Unidirectional: Captures context from the left side only, building understanding sequentially as it generates text.
  • Broad Context: Can maintain coherence over longer text sequences, making them effective for generating extended passages of text.

OpenAI’s text-embedding-3-large

google-gecko-text-embedding

amazon-titan

GTR-T5 is Google’s open-source embedding model for semantic search using the T5 LLM as a base

E5 (v1 and v2) is a recent embedding model family from Microsoft.

Example

Customer Query: “I can’t access my bank account.”

Error Log 1: “Account access denied due to incorrect password.”

Error Log 2: “Account access denied due to multiple failed login attempts.”

Error Log 3: “User cannot login after password reset.”

Error Log 4: “Login failed after updating security settings.”

Embeddings from generative models: good for the generation step of RAG. They recognize that “cannot login after password reset” and “login failed after updating security settings” are related to “can’t access my account,” and the underlying models can also generate relevant responses based on deeper understanding and broader context.
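The retrieval step these embeddings support can be sketched as a top-k cosine search. The 3-dimensional vectors below are hand-made stand-ins for what an embedding API (e.g. text-embedding-3-large) would return:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

logs = [
    "Account access denied due to incorrect password.",
    "Account access denied due to multiple failed login attempts.",
    "User cannot login after password reset.",
    "Login failed after updating security settings.",
]
# Hand-made 3-d vectors; a real system would call an embedding API here.
query_vec = [1.0, 0.2, 0.0]
doc_vecs = [[0.9, 0.1, 0.1], [0.8, 0.3, 0.0], [0.2, 0.9, 0.1], [0.1, 0.8, 0.3]]

print(top_k(query_vec, doc_vecs, logs, k=2))
```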

Limitations:

  • Generative models like GPT can be more resource-intensive than purely contextual models like BERT.

B. Large Language Models (LLMs)

The LLM combines the retrieved information (found via embeddings) to generate the final response.

OpenAI GPT-4o

Google Gemini Pro

Anthropic Claude 3.5 Sonnet

Here is an example that combines an embedding model with an LLM.

Example:

Customer Query: “I can’t access my bank account.”

Retrieved Logs:

“Account access denied due to incorrect password.”

“Account access denied due to multiple failed login attempts.”

“User cannot login after password reset.”

“Login failed after updating security settings.”

Generated Response: “It appears you are unable to access your account due to multiple failed login attempts with an incorrect password. If you have recently reset your password or updated your security settings, please ensure you are using the latest credentials. You may also try resetting your password again.”
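The combining step above can be sketched as simple prompt assembly; the prompt wording below is illustrative, and in production the resulting string would be sent to GPT-4o, Gemini Pro, or Claude via the provider’s API:

```python
def build_prompt(query: str, retrieved_logs: list[str]) -> str:
    """Assemble the retrieved context and the user query into one LLM prompt."""
    context = "\n".join(f"- {log}" for log in retrieved_logs)
    return (
        "You are a customer-support assistant. Use only the logs below.\n\n"
        f"Logs:\n{context}\n\n"
        f"Customer query: {query}\n"
        "Diagnose the likely cause and suggest a fix."
    )

prompt = build_prompt(
    "I can't access my bank account.",
    ["Account access denied due to incorrect password.",
     "User cannot login after password reset."],
)
print(prompt)
```

Grounding the model in the retrieved logs (rather than asking it to answer from memory) is what makes the generated response specific to this user’s issue.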

Overall Summary

Metrics for choosing Embeddings

  1. MTEB retrieval score (Hugging Face Massive Text Embedding Benchmark)
  • e.g., Google gecko > OpenAI text-embedding-3-large > all-MiniLM (SBERT)
  • GTR-T5 (Google’s open-source model) has a good MTEB retrieval score but is slow

2. Latency (speed of response)

  • Latency is often measured at the 95th percentile (p95, not on a log scale). OpenAI API p95 responses took almost a minute from GCP and almost 600 ms from AWS.
  • e.g., all-MiniLM (SBERT) < Google gecko < OpenAI text-embedding-3-large
  • all-MiniLM (SBERT), being a small model, is fast; it is also the default embedding model for vector databases like Chroma, which makes deployment easy if it works for your use case.
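A nearest-rank p95 can be computed in a few lines; the latency samples below are invented, chosen to show how a couple of tail outliers dominate the p95 figure even when the median looks healthy:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: pct=95 gives the p95 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds; the two outliers
# are what p95 is designed to surface.
latencies_ms = [110, 115, 118, 120, 122, 125, 127, 129, 130, 132,
                134, 135, 138, 140, 142, 145, 148, 150, 600, 900]

print(percentile(latencies_ms, 50))  # median looks fine
print(percentile(latencies_ms, 95))  # tail tells the real story
```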

Resources
https://huggingface.co/blog/mteb

https://blog.getzep.com/text-embedding-latency-a-semi-scientific-look/

Shivika K Bisen
Bright AI

Gen AI/ML, Data Scientist | University of Michigan Alum | Generative AI, Recommendation & Search & NLP, Predictive models. https://sbisen.github.io/