
Why Your RAG Doesn’t Work

RAG is still promising, but today it’s a DRAG

11 min read · Mar 4, 2024

Countless businesses are experimenting with Retrieval Augmented Generation (RAG), yet there's broad disillusionment as they struggle to make these systems production-quality. Not only do their RAGs work poorly, but teams are also at a loss as to why and what steps to take next.

Over the last few months, I've spoken with dozens of AI teams and experts. Through these conversations and personal experience, I've found that a key culprit hindering RAG systems is semantic dissonance: the discordance between your task's intended meaning, the RAG's understanding of it, and the underlying knowledge that's stored. And because the underlying technology of vector embeddings is magic (i.e. finicky and dreadfully opaque), the overarching discordance is challenging to diagnose, making it a substantial barrier to productionization.

Our goal is to demystify key reasons why Vanilla RAG fails, and to give concrete tactics and strategies for getting your RAG one step closer to production.

In this post, we’ll:

  • Distinguish the promise of RAG in its ideal form from the realities of Vanilla RAG
  • Explain how semantic dissonance creeps in
  • Illustrate diagnosing and curbing semantic dissonance
  • Wrap up with additional high-ROI strategies for getting your RAG production-ready

(Note: For simplicity, we focus on Q&A text-based examples, but the core ideas can generalize to other use cases).

Why RAG?

RAG (Retrieval Augmented Generation) is a paradigm currently going through a hype cycle. It sounds snappy and, in essence, is a search engine for your AIs. As a once aspiring musician, I kind of wish someone else dubbed it something more like ROCK (Retrieval of Curated Knowledge?).

RAG gained ground soon after GPT-3 became a big hit. An immediate problem that businesses face when building LLM-powered AIs is that models like GPT aren’t trained on their specific data and domain. However, LLM practitioners quickly discovered that GPT functioned surprisingly well when business-specific context (such as support docs) was provided directly in the prompt. This gave businesses an alternative to the daunting task of fine-tuning models.

Enter RAG. In principle, it’s a specialized search engine for your AIs. Give it a question, perhaps along with user-specific information, and it will return the most relevant context for GPT.

While this sounded great in theory, there have been major challenges in building production-grade RAGs, which we'll explore in the following sections.

RAG is a specialized search engine for your AIs (image source)

RAG Is Promising, Vanilla RAG Is Just the Beginning

RAG is merely a framework, and a perfectly functioning RAG, no matter its backend, would provide enormous value to countless use cases. In this section we provide a pedagogical overview of Vanilla RAG and the underlying workings of semantic search. If you’ve already gone through the mind-bending journey of rationalizing, rejecting, and ultimately embracing the magic of vector embeddings, then feel free to skip this section.

Vanilla RAG (def): A single-step semantic search engine that stores business knowledge, such as support documents, in a vector database, such as Pinecone, using an off-the-shelf embedding model. Information retrieval is then performed by creating a vector embedding from the text of the question and using a comparison metric, such as cosine similarity, to rank the top-k most relevant documents.


Let’s break these ideas down further.

A vector embedding model takes an arbitrary string and returns a fixed-dimensional mathematical vector. Popular embedding models include OpenAI's text-embedding-ada-002 and their newest model, text-embedding-3-small. These models translate text blobs into ~1500-dimensional vectors and have virtually no human interpretability.

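To make this concrete, here is a minimal sketch of generating an embedding with OpenAI's Python client. This assumes the openai package v1+ and an OPENAI_API_KEY in your environment; the exact call shape may differ in your setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is rain?",
)
vector = response.data[0].embedding  # a plain list of floats
print(len(vector))  # 1536 dimensions for text-embedding-3-small
```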

Vectors are ubiquitous and incredibly useful tools because you can take non-quantitative things and 1) break them down into a rich array of dimensions, and 2) compare them quantitatively. Some examples are:

  • The (red, green, blue) color palette is a vector, where each value lies between 0–255.
  • With industry standards like Barra, a stock can be represented as a vector quantifying its sensitivity to economic factors like broad US growth, changes in interest rates, etc.
  • Platforms like Netflix can decompose user preferences as a vector, where components can represent genres and other features.

Cosine similarity is arguably the de facto metric for comparing vectors in semantic search, and it works by computing the cosine of the angle between two vectors via the dot product. The closer the cosine is to 1, the more similar the vectors. (There are other ways of measuring semantic similarity, but typically this isn't where the low-hanging fruit lies, and we will use cosine similarity throughout).

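Here is a minimal sketch of the metric itself, using numpy. (Since OpenAI embeddings come back normalized to unit length, the dot product alone gives the same ranking in practice.)

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```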

It cannot be emphasized enough, however, that vector comparison metrics like cosine similarity are delicate to work with because they have no absolute meaning: the values depend entirely on the embedding model and the context of the text involved. Let's say you match a question with an answer and get a cosine similarity of 0.73. Is this a good match?

As a quick illustration, let's take the question, "What is rain?", and compare it to three texts of varying relevance. We see in the table below that the range and interpretation of cosine similarities from two different OpenAI models are wildly different. For the first model, 0.73 indicates a totally irrelevant match, yet for the second model 0.73 indicates high relevance. Any well-functioning RAG system needs to calibrate its own understanding of what these scores mean.

Text1 (definition): “Rain is the precipitation of water droplets from clouds, falling to the ground when they become too heavy to stay suspended in air.”

Text2 (mentions rain): “The winds blowing moisture over the mountains are responsible for rain in Seattle.”

Text3 (irrelevant info): “Stripe is a payments infrastructure business.”

Cosine similarities of the question “What is rain?” against texts of different levels of relevance. This illustrates that for the same texts, different models can evaluate to wildly different values.
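If you want to reproduce this kind of calibration check yourself, the sketch below reruns the comparison with both models, reusing the client and the cosine_similarity helper from the earlier sketches. The exact values you get will depend on the model (and may drift over time), which is precisely the point.

```python
question = "What is rain?"
texts = {
    "definition":    "Rain is the precipitation of water droplets from clouds, "
                     "falling to the ground when they become too heavy to stay suspended in air.",
    "mentions rain": "The winds blowing moisture over the mountains are responsible for rain in Seattle.",
    "irrelevant":    "Stripe is a payments infrastructure business.",
}

for model in ("text-embedding-ada-002", "text-embedding-3-small"):
    # Embed the question and all candidate texts in a single batched call.
    response = client.embeddings.create(model=model, input=[question, *texts.values()])
    q_vec, *text_vecs = (np.array(d.embedding) for d in response.data)
    print(model)
    for label, vec in zip(texts, text_vecs):
        print(f"  {label}: {cosine_similarity(q_vec, vec):.2f}")
```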

Semantic Dissonance Creates Problems

Several challenges with Vanilla RAG can be attributed to semantic dissonance and poor explainability of embeddings. Semantic dissonance is the discordance between your task’s intended meaning, the RAG’s understanding of it, and the underlying knowledge that’s stored.

How does this come into play?

Comparing apples-to-oranges

Roughly speaking, "questions aren't semantically the same as their answers," so a direct comparison between a question and your raw knowledge base will only be so fruitful.

Imagine a lawyer needs to search thousands of documents for evidence of investor fraud. The question “What evidence shows Bob committed financial fraud?” has essentially no semantic overlap with “Bob bought stock XYZ on March 14th” (where it is implicit XYZ is a competitor and March 14th is a week before earnings announcements).

Vector embeddings and cosine similarity are fuzzy

There’s inherent imperfection in a vector’s ability to fully capture the semantic content of any given statement. Another subtle imperfection is that it’s not a given that cosine similarity should result in precise ranking, as it implicitly assumes each dimension is on equal footing.

In practice, semantic search with cosine similarity tends to be directionally correct, but inherently fuzzy. It can be great for ball-parking top-20 results, but it’s typically a lot to ask for it alone to reliably rank the best answer first.

Embedding models trained on the internet don’t understand your business and domain

I used to work at Stripe where we had products such as Connect, Radar, and Link. On top of that, Direct was a common adjective that had very different meanings depending on which product we were talking about. Needless to say, semantic dissonance was apparent even between employees at Stripe. This is a deep and important topic that can be explored further and merits its own blog post.

Overall, sources of semantic dissonance compound and contribute to unreliable rankings. In the next section, we illustrate diagnosing and addressing semantic dissonance, and in the last section, we outline high-ROI strategies to improve RAG implementation.

Illustration: Diagnosing and Curbing Semantic Dissonance

In this illustration, we’re going to diagnose complete semantic dissonance in your RAG — that is, when your comparisons are consistent with random noise and therefore unreliable. We’re also going to see early indications of how to improve performance with additional structure.

This example is drawn from a real-life use case, but it is intentionally simplified for the purposes of this blog post so we can get into the weeds and illustrate key points.

Setup

(The full details of the setup can be found in this Google Colab Notebook).

Imagine the use case of an e-commerce startup that’s building a RAG for internal use that finds the best SQL table for a given business question. Below is the setup of the example, in which we:

1) Created two distinct SQL table schemas (using ChatGPT)

  • events.purchase_flow: Highly detailed, raw user events within a product flow
  • aggregates.purchases: Rolled-up table with summary analytics

2) Created a few hypothetical questions (using ChatGPT) for evaluation

  • What is the impact of IP address on the types of products viewed and purchased?
  • What is the overall trend in shoe sales this quarter?
  • Is there unusual behavior within a few seconds of each hour?
  • How does user engagement change around major events like New Years?

3) Generated additional metadata (using ChatGPT) including

  • Brief descriptions of each table
  • Sample questions each table is uniquely qualified to answer

4) Inspected what noisy cosine similarity scores look like by comparing our input texts to “garbage”

5) Compared four different retrieval strategies for ranking, to see which types of text are “most semantically similar” to our inputs.

  • Strategy 1: Table schema only
  • Strategy 2: Table schema + brief description
  • Strategy 3: Table schema + brief description + sample questions
  • Strategy 4: Sample questions only
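For concreteness, here is a sketch of how the four strategy texts might be assembled and ranked, reusing the client and cosine_similarity helper from earlier. The schema, description, and sample-question strings are illustrative placeholders rather than the actual contents of the notebook.

```python
def embed(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """Thin wrapper around the embeddings call shown earlier."""
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Placeholder metadata per table; the real schemas and ChatGPT-generated
# descriptions/questions live in the notebook.
tables = {
    "events.purchase_flow": {
        "schema": "CREATE TABLE events.purchase_flow (event_id BIGINT, user_id BIGINT, ip_address TEXT, ...)",
        "description": "Highly detailed, raw user events within a product flow.",
        "sample_questions": ["Is there unusual behavior within a few seconds of each hour?"],
    },
    "aggregates.purchases": {
        "schema": "CREATE TABLE aggregates.purchases (day DATE, category TEXT, revenue NUMERIC, ...)",
        "description": "Rolled-up table with summary analytics.",
        "sample_questions": ["What is the overall trend in shoe sales this quarter?"],
    },
}

def strategy_text(meta: dict, strategy: int) -> str:
    """Assemble the text that gets embedded for each of the four strategies."""
    parts = {
        1: [meta["schema"]],
        2: [meta["schema"], meta["description"]],
        3: [meta["schema"], meta["description"], *meta["sample_questions"]],
        4: meta["sample_questions"],
    }[strategy]
    return "\n".join(parts)

def rank_tables(question: str, strategy: int) -> list[tuple[str, float]]:
    """Rank tables by cosine similarity between the question and each strategy text."""
    q_vec = embed(question)
    scores = [(name, cosine_similarity(q_vec, embed(strategy_text(meta, strategy))))
              for name, meta in tables.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

print(rank_tables("How does user engagement change around major events like New Years?", strategy=4))
```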

Spotting Noisy Cosine Similarities

To build an intuition for what noise could look like, we compared the cosine similarities of random snippets of text against each question and the raw table text (an illustration is below). We found that cosine similarities for junk inputs were around 0.04–0.23. Below is an example comparison:

Cosine similarity values between the irrelevant text, “Silly text”, and the raw text of the questions and SQL table statements. This helps develop a baseline for identifying when there’s weak-to-no semantic overlap.
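A baseline like this is cheap to build. Below is a sketch under the same assumptions as the previous snippets (it reuses embed, cosine_similarity, and the placeholder tables dict; the extra junk snippets besides "Silly text" are my own additions).

```python
# Deliberately irrelevant snippets used to establish the noise floor.
junk_snippets = [
    "Silly text",
    "The quick brown fox jumps over the lazy dog.",
    "Lorem ipsum dolor sit amet.",
]
questions = [
    "What is the impact of IP address on the types of products viewed and purchased?",
    "What is the overall trend in shoe sales this quarter?",
]
inputs = questions + [meta["schema"] for meta in tables.values()]

noise_scores = [cosine_similarity(embed(junk), embed(text))
                for junk in junk_snippets for text in inputs]
print(f"noise range: {min(noise_scores):.2f} to {max(noise_scores):.2f}")
# Any retrieval strategy whose scores fall inside this range is
# effectively indistinguishable from random noise.
```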

Comparisons of the Four Strategies

As we can see from the results below, Strategy 4, comparing questions to sample questions only, had the highest semantic overlap and the best rankings. Strategies 1 and 2 performed similarly to each other and were consistent with noise — that is, there was weak, if any, semantic overlap between the business questions and the SQL table statements.

This might feel obvious, but then again, I frequently see RAGs being developed with similar apples-to-oranges comparisons. But what might not be obvious is that Strategy 3, which mashes everything together, performed worse than Strategy 4, which isolated the questions without additional detail. Sometimes it’s better to use a scalpel than a sledgehammer.

Noise (Random, irrelevant text): Cosine similarities lay between 0.04–0.23.

Strategy 1 (Table Schema Only): Values lie between 0.17–0.25 (consistent with noise).

Strategy 2 (Table Schema + Description): Values lie between 0.14–0.25 (still consistent with noise).

Strategy 3 (Table Schema + Description + Sample Questions): Values lie between 0.23–0.30. Clear improvement, we’re beginning to see signal from noise.

Strategy 4 (Sample Questions Only): Values lie between 0.30–0.52. Clearly the best performing strategy, and it lies completely outside the noise range. Furthermore, it led to the biggest separation, and thus stronger signal, between the cosine similarities of the correct table and the incorrect one.

Takeaways

As a recap, we first built out a baseline range of cosine similarity values that indicate comparisons with random junk. We then compared four different retrieval strategies. Using the baseline we developed, we found that two strategies looked consistent with noise. The best strategy didn’t directly match business questions to the raw SQL tables, but rather matched them to example business questions that the tables were known to answer.

Further Strategies to Improve Your RAG

We’ve only scratched the surface. Here are some worthwhile approaches for step-function improvements in your RAGs.

Structuring your data for apples-to-apples comparisons

In our illustration above, we saw early hints that you can improve RAG with additional structure: rather than directly linking the question to the correct text in a single step, first link it to an existing question bank, which then directs you to the right answer.

For your Q&A system built on support docs, you may very well find that question→question comparisons materially improve performance compared to question→support doc comparisons. Pragmatically, you can ask ChatGPT to generate example questions for each support doc and have a human expert curate them. In essence, you'd be pre-populating your own Stack Overflow.

Want to take this “Stack Overflow” methodology one step further?

  • For each document, ask ChatGPT to generate a list of 100 questions it can answer
  • These questions won’t be perfect, so for each question you generate, compute cosine similarities with each other document
  • Filter those questions which would rank the correct document #1 against every other document
  • Identify the highest-quality questions by sorting on the difference between the cosine similarity of the correct document and that of the second-ranked document
  • Send to human for further curation
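A rough sketch of that loop follows, reusing the client, embed, and cosine_similarity helpers from earlier. The chat model name and prompt are illustrative assumptions, the generated questions will need prompt iteration, and the final human curation step is deliberately left out.

```python
def generate_questions(doc_text: str, n: int = 100) -> list[str]:
    """Ask a chat model for questions this document can answer, one per line.
    (The model name and prompt are placeholders.)"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"List {n} questions this document can answer, one per line:\n\n{doc_text}"}],
    )
    return [q.strip("- ").strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def rank_generated_questions(docs: dict[str, str]) -> list[tuple[str, str, float]]:
    """Keep questions that rank their source document #1, sorted by the margin over the
    runner-up document (a proxy for question quality). Assumes at least two documents."""
    doc_vecs = {name: embed(text) for name, text in docs.items()}
    keepers = []
    for name, text in docs.items():
        for question in generate_questions(text):
            q_vec = embed(question)
            scores = {doc: cosine_similarity(q_vec, vec) for doc, vec in doc_vecs.items()}
            ranked = sorted(scores, key=scores.get, reverse=True)
            if ranked[0] == name:  # the correct document ranks first
                keepers.append((name, question, scores[ranked[0]] - scores[ranked[1]]))
    # Highest-margin questions first; these go to a human for further curation.
    return sorted(keepers, key=lambda item: item[2], reverse=True)
```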

Semantic + Relevance Ranking

This might be one of the bigger bangs for your buck, and virtually every major search engine you use does this. We’ve seen cosine similarity is great for ball-parking, but is ultimately incapable of higher fidelity ranking.

Fortunately, your business probably has more information available to help AIs make better decisions. For example, you might have collected metrics such as page views and thumbs-up, and even better, you may have these metrics by persona. You can create a relevance score incorporating a wide array of user/task features to fine-tune your rankings and get your RAG working much better. Concretely, you could make your ranking a linear combination,

rank = (cosine similarity) + (weight) x (relevance score)
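A minimal sketch of that re-ranking step is below. The document names, scores, and weight are made up for illustration; in practice the relevance score and the weight would be built from your own metrics and tuned (or learned) offline.

```python
def combined_score(cosine_sim: float, relevance: float, weight: float = 0.5) -> float:
    """rank = cosine similarity + weight * relevance score."""
    return cosine_sim + weight * relevance

# Hypothetical top-k candidates from the semantic search step.
candidates = [
    {"doc": "how-to-refund.md",    "cosine": 0.41, "relevance": 0.9},  # heavily viewed, well-rated doc
    {"doc": "legacy-api-notes.md", "cosine": 0.44, "relevance": 0.1},  # rarely useful doc
]
reranked = sorted(candidates,
                  key=lambda c: combined_score(c["cosine"], c["relevance"]),
                  reverse=True)
print([c["doc"] for c in reranked])  # the popular doc wins despite a slightly lower cosine score
```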

Using AIs as a scalpel, not a sledgehammer

Over decades, software engineering practices evolved towards favoring designs with many small components that have tight, well-defined guarantees. The craze around chat interfaces has turned this paradigm wildly on its head, and in five years that shift could easily be seen as dubious.

ChatGPT, and much of the emerging ecosystem, incentivizes the paradigm of "Give me any text, and I'll give you any text." There are no guarantees of efficacy, or even of cost and latency; rather, these AIs offer the hand-wavy promise of "I'm probably somewhat right, some of the time." However, businesses can build more robust AIs by providing them with more scoped and opinionated interfaces.

Using analytics as an example, today no one has succeeded in delivering the promise of taking an arbitrary data question and providing an accurate SQL query. Not to be discouraged, you can still build remarkably useful tech. For example, a more scoped AI could help users search from a fixed universe of SQL tables and templated queries curated by your data scientists. Even better, because most data-driven business questions have been answered in the past, maybe your AI just needs to be a search-bot against data questions in Slack.

Closing Remarks

We’re seeing a new era of AI being ushered in. What’s new about this era is not the advent of NLP and language models — Google’s been at this for ages. Rather, a major component is that off-the-shelf technology has lowered the barrier-to-entry for businesses to leverage natural language technology for their specific use cases. But we shouldn’t lose sight of the fact that this technology today is still in early development, and that when building RAGs for your AIs, you are building a complex search engine on top of your knowledge base. It is achievable, but knowing these challenges and addressing these limitations is half the battle.

Contact

If you’d like to discuss these topics further and see if our team can help, feel free to reach out at cdg at ellipticlabs dot ai.

Christian Griset