Are You Pre-training your RAG Models on Your Raw Text?

Anshu
Published in ThirdAI Blog
6 min read · Aug 30, 2023

Retrieval-Augmented Generation (RAG) experiences significant enhancements through targeted pre-training on raw text. Currently, only ThirdAI’s NeuralDB offers push-button pre-training and/or fine-tuning capabilities for any text during insertion.

RAG is currently the most dependable and proven technique for creating grounded AI agents over any specific collection of text. To learn more about RAG and how to build hallucination-free AI agents, check out our previous blog. In this post, we focus on retrieval quality and the importance of pre-training models on the available text corpus, which today is used solely for indexing.

A common misunderstanding is that a ‘good enough’ pre-trained foundation model is sufficient for generating embeddings, followed by retrieval with a vector database. This misconception is now being widely challenged. Well-known models like Google’s T5 or OpenAI’s Ada struggle to match the accuracy of much simpler models that are properly adapted to the retrieval task.

An increasing number of case studies highlight the significant impact of choosing the right embedding model, and there is no one-size-fits-all solution. Businesses that assume customizing and further refining embedding models is rarely necessary are only postponing an inevitable reality check. Search is a fundamentally challenging task, especially at scale, where many pieces of information may seem related yet not be relevant. Even after more than a decade of research, all the popular semantic search systems we are aware of, including those at Google and Amazon, are constantly fine-tuned and frequently re-trained to accommodate behavioral and domain-dependent changes.

Most likely you are not getting the best accuracy if you are not specializing your embedding models for your task.

There are two ways to specialize an embedding or neural model for domain-dependent retrieval: 1. self-supervised pre-training on the target text, and 2. fine-tuning on supervised behavioral data. We describe both briefly and later argue why both are needed.

What is Self-supervised Pre-training on Raw Text and Why Does it Help RAG? A model that performs well on general English still needs to be adapted to excel on enterprise data. Given any text corpus, we can define natural ‘auxiliary tasks’ for training neural networks. Training on these auxiliary tasks refines the neural network’s internal representations of text, specializing them for the specific domain. One of the most popular tasks is next-word prediction, the familiar generative objective; when applied to domain-specific text, it can yield significant improvements (see experiments below). Another prevalent self-supervised task uses pairs of consecutive sentences to fine-tune the network’s representations so that consecutive sentences end up with higher cosine similarity than non-consecutive ones. A minimal sketch of this second objective follows.
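To make the consecutive-sentence objective concrete, here is an illustrative sketch using the open-source sentence-transformers library. This is not NeuralDB’s internal training procedure; the tiny corpus, the base model name, and the naive sentence-splitting heuristic are all assumptions for the example. Consecutive sentences from the same document become positive pairs, and the other sentences in the batch play the role of the non-consecutive negatives.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative raw-text corpus; in practice this is the same text you index.
docs = [
    "NeuralDB indexes enterprise text. It can also pre-train on that text. "
    "Pre-training specializes the representations to the domain.",
    "Contracts contain termination clauses. A termination clause sets the notice period.",
]

# Any small general-purpose embedding model works as a starting point.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Build positive pairs from consecutive sentences (naive split on '. ').
pairs = []
for doc in docs:
    sentences = [s.strip() for s in doc.split(". ") if s.strip()]
    for first, second in zip(sentences, sentences[1:]):
        pairs.append(InputExample(texts=[first, second]))

loader = DataLoader(pairs, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss pushes the cosine similarity of consecutive
# sentences above that of the in-batch (non-consecutive) alternatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```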

What is Supervised Fine-tuning in RAG and Why is it Necessary for Semantic Search? It is well known that, without behavioral data and ongoing engineering, semantic search cannot be made practical. For example, when we type ‘apple login’, our intent is to log into our Apple account. Relying on semantic meaning alone surfaces numerous FAQ and informational pages about logging into Apple, even though we have no intention of visiting them. The information retrieval community recognizes that only past behavioral data can accurately capture intent. Without these behavioral signals, semantic search based solely on text and its structure is hardly as usable. The term ‘semantic’ was in fact meant to encompass ‘user intent’ rather than English understanding alone. It is therefore not surprising that most representations from GenAI models are unusable for ‘semantic search’ out of the box. A sketch of how behavioral signals become training data follows.
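As an illustration of how behavioral data turns into supervision, the sketch below builds (query, clicked-passage) pairs from a hypothetical click log and fine-tunes an embedding model on them with a contrastive loss. Again, this uses sentence-transformers rather than NeuralDB, and the log format and field names are assumptions made up for the example.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical behavioral log: each entry pairs a query with the passage the
# user actually clicked on, i.e., the ground-truth intent signal.
click_log = [
    {"query": "apple login",
     "clicked": "Sign in to your Apple account to manage devices and subscriptions."},
    {"query": "reset icloud password",
     "clicked": "To reset your iCloud password, go to appleid.apple.com and choose Forgot Password."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each (query, clicked passage) pair is a positive example; other passages in
# the batch act as negatives, teaching the model user intent, not just topicality.
examples = [InputExample(texts=[row["query"], row["clicked"]]) for row in click_log]
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```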

A randomly initialized NeuralDB model with a tiny amount of pre-training and fine-tuning beats popular foundation models!

We chose well-known BEIR benchmark datasets of different sizes (two small and two large), each of which includes a standard evaluation set for retrieval. We used three popular foundation embedding models as reference baselines for evaluating RAG: 1. Google’s T5, 2. Instructor-Large, and 3. OpenAI’s Ada. These foundation models are somewhat biased towards BEIR due to its status as a standard benchmark. Hence, we additionally included the CUAD dataset, derived from a real business scenario involving contract reviews. All of these datasets contain a test set.

To highlight the importance of pre-training in RAG, we conducted an extreme experiment. We took ThirdAI’s NeuralDB model and pre-trained it exclusively on the indexed text corpus, also fine-tuning it whenever supervised data was available. Essentially, NeuralDB learned only from the text that was indexed, without any exposure whatsoever to other text. In contrast, the popular foundation models underwent extensive pre-training on hundreds of gigabytes to terabytes of text. Contrary to the prevailing belief, we find that with only a few minutes of domain-specific pre-training, NeuralDB can surpass the performance of these foundation models.

The results are surprising and are summarized in the table below:

Retrieval accuracy (Precision@1, higher is better) on the test set for different pre-trained foundation models compared with NeuralDB. The NeuralDB model was randomly initialized and underwent a tiny amount of unsupervised pre-training (on CPUs) exclusively on the text intended for insertion (set the pre-train flag to ‘true’ during insertion in NeuralDB). Whenever fine-tuning data was available, it was used. You can reproduce the NeuralDB numbers by running this notebook.
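For readers unfamiliar with the metric, Precision@1 is simply the fraction of test queries for which the top-ranked retrieved document is a correct (gold) document. A minimal sketch, with hypothetical variable names:

```python
def precision_at_1(top1_ids, gold_ids_per_query):
    """Fraction of queries whose top-1 retrieved document is in the gold set.

    top1_ids: the id of the highest-ranked document for each query.
    gold_ids_per_query: a set of correct document ids for each query.
    """
    hits = sum(1 for top1, gold in zip(top1_ids, gold_ids_per_query) if top1 in gold)
    return hits / len(top1_ids)


# Example: 2 of 3 queries retrieve a gold document at rank 1 -> 0.67
print(precision_at_1(["d1", "d9", "d4"], [{"d1"}, {"d2"}, {"d4", "d7"}]))
```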

Why is this observation overlooked? Such experiments are rare because pre-training has typically been considered an expert task, limited to sophisticated data science and engineering teams. With NeuralDB, however, both pre-training and fine-tuning are made incredibly accessible. All you need to do is set a flag to “true” during index insertion, and within a matter of minutes you have a retrieval model that is pre-trained on the designated text. For detailed instructions on how straightforward it is to pre-train and to replicate the experiments outlined above, please consult this notebook. Anyone can pre-train and/or fine-tune a NeuralDB model on any text corpus using simple CPUs.
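For orientation, the push-button flow looks roughly like the sketch below. The argument names follow ThirdAI’s public demo notebooks as we recall them and may differ between package versions; the linked notebook is the authoritative reference, and the file name is a placeholder.

```python
from thirdai import neural_db as ndb

db = ndb.NeuralDB()

# Setting train=True during insertion triggers the self-supervised
# pre-training on the inserted text (the flag described above).
db.insert([ndb.PDF("contract.pdf")], train=True)

# Retrieval over the freshly pre-trained index, entirely on CPUs.
for result in db.search("What is the termination notice period?", top_k=5):
    print(result.text)
```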

Does the amount of pre-training matter? It seems not. The text volume varies significantly across datasets. For instance, the SCIFACT dataset contains only about 7.5 megabytes of text. NeuralDB was pre-trained from scratch (without any supervised labels) on a mere few megabytes of domain-specific text, yet it still outperforms all the foundation models.

Contract Reviewing AI Agent (Real Use Case): Pre-training and Fine-tuning are critical! We share results based on a real customer need: a practical AI agent for contract review and legal Q&A. We used the comprehensive CUAD dataset, designed for contract-related retrieval-augmented generation (RAG) tasks, which provides real-world context. By focusing on domain-specific pre-training and fine-tuning, NeuralDB outperforms the foundation models by a large margin.

The results also underscore the inherent bias of open-source foundation models towards established BEIR benchmarks. On non-BEIR benchmarks, foundation models appear notably subpar compared with the results that can be attained with relatively little effort. Not surprisingly, all three foundation models achieve similarly poor accuracies.

Discussions: An ICML 2023 paper sheds some light.

A recent paper presented at ICML 2023, titled ‘Large Language Models Struggle to Learn Long-Tail Knowledge’, effectively echoes what we have seen. The study’s comprehensive experiments reveal that, regardless of a language model’s size, its performance declines when it lacks exposure to data relevant to the task. Quoting a line from the abstract: ‘Specifically, our study shows that a language model’s capacity to answer factual questions depends on the number of documents associated with that question, seen during its initial training.’

Clearly, by exclusively pre-training the model on the text intended for indexing, we provide it with more relevant information than months of expensive pre-training on terabytes of data.

Bottom Line

Currently, fine-tuning and pre-training foundation models to make them functional remains a significant challenge for enterprises. With the ongoing GPU shortage and the specialized expertise required for model tuning, this task is not expected to become any easier in the near future. We’re pleased to announce that ThirdAI’s NeuralDB is available right now for everyone. With NeuralDB, you can harness all your text, whether through pre-training and/or fine-tuning, without waiting for GPUs or assembling an expert team.

Shape your own AI as you desire. Scale it to any size you need, without hardware or people constraints. The future has arrived!

Anshu

Professor of Computer Science specializing in Deep Learning at Scale, Information Retrieval. Founder and CEO ThirdAI. More: https://www.cs.rice.edu/~as143/