RAG Challenge Dataset I: Can Advanced RAG Beat NeuralDB's Online-Tuning?

Anshu
Published in ThirdAI Blog
Jan 17, 2024

The market is slowly coming to the realization that the existing RAG stack for building AI chatbots falls short of business expectations. Many recommendations are floating around, and new ideas pop up every day. Advanced RAG, a significantly more complex system with many components and associated knobs to tune, is now the recommended solution. Adjusting those knobs for any given use case requires a dedicated team of experts, which makes the solution non-repeatable.

Online-Tuning: Why You Need It and How It Differs from Fine-Tuning

Fine-tuning is the process of updating the embedding model to specialize it for a specific use case. The model update is done offline, before the vector index is built. Note that whenever concepts drift or new associations emerge, such as “Flu” becoming closely related to “Covid,” the embedding model needs to be updated. Unfortunately, any update to the embedding model requires rebuilding your vectordb all over again. The hope of the existing semantic search ecosystem is that concept drift is rare and that the disadvantages of a stale model can be overcome with hand-tailored advanced RAG components.
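To make the rebuild cost concrete, here is a minimal sketch (not from this post) using the sentence-transformers library, with two public checkpoints standing in for "before" and "after" fine-tuning: once the embedder changes, every stored vector is stale, so the whole corpus has to be re-encoded and the vector index rebuilt.

```python
# Minimal sketch of why an embedding-model update forces an index rebuild.
# The toy corpus is illustrative; two public checkpoints stand in for the
# "before fine-tuning" and "after fine-tuning" embedding models.
from sentence_transformers import SentenceTransformer

corpus = ["flu shot side effects", "covid booster guidance"]  # toy documents

# v1 index: every document was encoded with the old model.
old_model = SentenceTransformer("all-MiniLM-L6-v2")
old_vectors = old_model.encode(corpus)

# After the embedder is fine-tuned (or swapped), the old vectors live in a
# different vector space. The entire corpus must be re-encoded and the
# vector index rebuilt before queries against it make sense again.
new_model = SentenceTransformer("all-mpnet-base-v2")  # stand-in for the updated model
new_vectors = new_model.encode(corpus)                # full re-embedding pass
```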

There is an alternative, embedding-free “learning to index” technology, NeuralDB [here], in which the end-to-end neural index can be tuned in an online fashion, so the model can keep up with concept drift during production usage. We refer to this capability to continually evolve the neural search engine with usage-driven concept drift as online-tuning. NeuralDB is currently the only commercially available library with this capability.

The difference between fine-tuning and online-tuning, in the context of neural search, is summarized in the figure below:

The top figure shows fine-tuned retrieval, while the bottom figure highlights NeuralDB's retrieval system with constant feedback-driven online-tuning.

There are two major advantages of online-tuning:

  1. Claim 1: Online-tuning results in significantly better accuracy due to constant adaptation to user behavior.
  2. Claim 2: Online-tuning is a repeatable neural search solution across domains and use cases. It does not require any application-specific knobs to materialize accuracy gains.

Below, we present verifiable evidence and a case study to support the two claims above. We are releasing a real Amazon product search case study built from a publicly available dataset, and we also provide all the scripts to replicate our numbers and findings [here]. Note: In an earlier post we presented another case study with similar findings; however, the dataset used there was proprietary.

The Challenge Dataset and the High Bar of Online-Tuning on Real User Queries over Amazon Products

The AmazonTitles-1.3MM dataset is publicly available and consists of real user-typed queries on amazon.com, each paired with the titles of products found statistically significantly relevant to that query, as measured by users' implicit or explicit actions recorded by the production system. Performing well on this dataset directly translates into click relevance. We cross-reference this dataset with the 3-million-product Amazon catalog from Kaggle, joining on product titles to obtain textual descriptions of the products.

Overall, we obtain a dataset of 49,602 products along with their textual descriptions. We additionally have roughly 885,000 query-product associations (647,127 training, 238,435 testing). Any product search system in production will have these associations. Note: The train-test partition is a direct subset of the original train and test sets of the Amazon 1.3MM dataset, restricted to products for which we have a description, ensuring that we don't bias the partition in any way.
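For readers who want to reproduce the construction, the join described above boils down to something like the following; the file names and column labels here are illustrative assumptions, not the exact schema of the released scripts.

```python
# Illustrative sketch of the dataset construction; file names and column
# labels are assumptions, not the exact schema of the released scripts.
import pandas as pd

# Real user queries and the product titles users found relevant (AmazonTitles-1.3MM).
queries = pd.read_csv("amazon_titles_queries.csv")    # columns: query, product_title
# 3M-product Amazon catalog from Kaggle with textual descriptions.
catalog = pd.read_csv("amazon_product_catalog.csv")   # columns: product_title, description

# Join on product title to attach a description to each query-product association.
merged = queries.merge(catalog, on="product_title", how="inner")
print(len(merged), merged.columns.tolist())
```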

We present a starter evaluation of semantic search on this dataset and report search relevance (Precision@1). For the comparison we used default Elasticsearch, default ChromaDB (which uses the all-MiniLM-L6-v2 embedding model), and the default NeuralDB solution with online-tuning. We mostly kept default configurations since our aim is to measure accuracy out of the box. NeuralDB was pre-trained on the product descriptions and then tuned on the available training set.
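As a rough illustration of the protocol, the sketch below scores Precision@1 for the ChromaDB baseline with its default embedding function. The product IDs and the tiny in-memory test set are made-up placeholders for the example; the real evaluation scripts are linked at the end of the post.

```python
# Precision@1 sketch for the ChromaDB baseline (default embedding function).
# The products, IDs, and test pairs below are made-up placeholders; the real
# evaluation uses the full 49,602-product corpus and 238,435 test queries.
import chromadb

products = {
    "P1": "san francisco giants disposable pens",
    "P2": "built comfy reusable shopping tote bag",
}
test_set = [("san francisco giants black reusable tote bag", {"P1"})]

client = chromadb.Client()
collection = client.create_collection("amazon_products")  # uses Chroma's default embedder
collection.add(ids=list(products.keys()), documents=list(products.values()))

hits = 0
for query, relevant_ids in test_set:
    top1 = collection.query(query_texts=[query], n_results=1)["ids"][0][0]
    hits += int(top1 in relevant_ids)
print(f"Precision@1: {hits / len(test_set):.3f}")
```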

We are also opening the dataset to the broader community to see if RAG or advanced RAG solutions can match or surpass the benefits of online-tuning.

The table below summarizes the basic comparisons.

Precision@1 (search relevance) of different methods.

We would like to highlight four observations:

1. Online-tuning brings a drastic improvement in real search relevance: Relevance (Precision@1) improves from 9% to a whopping 42%.

2. It is unlikely that pure language understanding alone will bring about such improvements. One could argue that there are better baselines than Chroma, but they are unlikely to close the remarkable gap of 366% relative improvement. A cursory look at a few examples where vectordbs got it wrong and NeuralDB got it right reveals why language understanding is unlikely to solve the problem.

For instance, the evaluation set contains a query where the user typed “san francisco giants black reusable tote bag.” It turns out that there is no product with an exact match. The vectordb's best approximate match is “built comfy reusable shopping tote bag,” which gives more weight to the concept “tote bag,” a choice that seems reasonable based on our knowledge of English. However, the statistically significant product, as per the gold standard where users showed real interest in production, was actually “san francisco giants disposable pens.” Clearly, user behavior shows a preference for ‘san francisco giants’ merchandise, emphasizing its significance in this domain-specific context; the ‘tote bag’ is the lower priority.

Unsurprisingly, NeuralDB, which was online-tuned on this dataset, got it right, indicating that user preferences can be inferred automatically from past behavior.

3. The need for online-tuning over fine-tuning: As we can see, more tuning data increases precision. With only 50% of the tuning data we reach just 17% precision, while with the full tuning set we reach 42%. Since the tuning data here is static and the evaluation set is fixed, a one-time fine-tuning could also get there, but that misses an important point: in production, the tuning data keeps growing as more users interact with the system, and we need a mechanism to continually use it. Unfortunately, the nature of vectordbs and embedding models necessitates a complete rebuild with any change to the embedding model, which makes this quite cumbersome in practice. (A code sketch of this continual feedback loop appears after this list.)

4. The improvements do not require any knobs specific to the application or domain of interest. All we used was the default NeuralDB configuration, with no application-specific tuning: the gains are purely feedback-driven. This is a critical point, because NeuralDB can be used and deployed for a variety of use cases and domains without the heavy customization and complex workflows employed in advanced RAG. Out of the box, NeuralDB adapts to any domain through the power of online-tuning.
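To make the contrast in observation 3 concrete, here is a sketch of the continual feedback loop. The method names (insert, search, text_to_result) follow ThirdAI's public NeuralDB demos, but treat the exact signatures, and the two helper functions, as assumptions rather than a reference implementation. The point is the shape of the loop: feedback arrives, the neural index is updated in place, and there is no re-embedding or rebuild step.

```python
# Feedback-driven online-tuning loop, sketched with method names from ThirdAI's
# public NeuralDB demos; exact signatures and both helpers are assumptions.
from thirdai import neural_db as ndb

db = ndb.NeuralDB()
db.insert([ndb.CSV("products.csv")], train=True)  # unsupervised pre-training on descriptions

NUM_SERVING_STEPS = 1000                          # stand-in for the production serving loop
for _ in range(NUM_SERVING_STEPS):
    query = get_next_user_query()                 # hypothetical helper: incoming traffic
    results = db.search(query, top_k=5)
    clicked = get_user_click(query, results)      # hypothetical helper: implicit feedback
    if clicked is not None:
        # Online-tuning step: associate the raw query with the product the user
        # actually engaged with. The neural index updates in place; there is no
        # embedding-model retrain and no vector-store rebuild.
        db.text_to_result(query, clicked.id)
```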

Final Remarks

While the AI community has achieved remarkable improvements in the zero-shot capabilities of Large Language Models (LLMs), zero-shot alone is still far from what enterprises can achieve in practice by combining zero-shot capabilities with domain-specific improvements.

It is surprising that, despite its many components, advanced RAG does not include fine-tuning among them, even though many studies note that fine-tuning can provide significant benefits. This neglect of fine-tuning is likely because vector databases do not cope well with evolving embedding models. The irony is that, in a world where embedding models become obsolete by the day, once you have invested enough in one embedding model to build your vector store, you have to hope that the embedding model doesn't evolve.

Currently, all comparisons are done on artificially curated datasets where the test set itself was generated in an ad-hoc, artificial way or, worse, the evaluation queries were generated by GenAI models. This could be a major reason why progress on benchmarks is not translating into business productivity.

Don't underestimate the importance of the right benchmarks. Consider this: the best-performing methods on the benchmarks of the '90s and even the early 2000s were SVMs (or advanced SVMs), and deep learning lagged behind for two decades.

Links to Dataset and Code: Here


Anshu
Professor of Computer Science specializing in Deep Learning at Scale and Information Retrieval. Founder and CEO of ThirdAI. More: https://www.cs.rice.edu/~as143/