Cross the Chasm with RAG: Implicit Feedback and Click-Through Data

Anshu
ThirdAI Blog
Published in
6 min read · Mar 5, 2024

Time to learn from more than two decades of advancements in making commercial web search engines practical and useful for end users.

Retrieval Augmented Generation (RAG) has caught the attention of numerous businesses aiming to extract value from their unstructured text data. Many enterprises have conducted initial assessments of RAG across various business use cases. However, it is becoming clear in the market that the current RAG stack used for developing AI Chatbots falls short of meeting end users’ expectations. In response, the RAG software stack has undergone significant revisions, now incorporating complex query processing and knowledge graphs. Despite these efforts, RAG continues to face challenges, such as poor accuracy in downstream tasks and unpredictability, hindering its widespread adoption in practice.

Image Credits: Stability AI

The current narrative of building chatbots on your data has shifted from the expectation that they will work out of the box to the realization that they need a compound strategy requiring customization, constant experimentation, and development, as illustrated in this recent paper from UC Berkeley. LLM-driven customized chatbots now resemble yet another “non-repeatable” AI solution, with numerous sensitive knobs, requiring specialized teams working for years with expensive resources to produce something that may still struggle to find its way into real production.

Fortunately, there is a positive twist to the story: the longstanding magic of implicit feedback, which has been enhancing the semantic relevance of commercial search and recommendation engines for about two decades.

Before exploring implicit feedback, let’s quickly understand why LLM-driven search is failing to meet users’ expectations.

Why LLM-Driven Search Fails to Meet User Expectations: Because RAG is clueless about day-to-day business dynamics.

The rush to create intelligent knowledge agents through semantic search on unstructured data using Large Language Models (LLMs) is based on the reasonable assumption that user inquiries are often repetitive and factual. Such queries, with expected responses that don’t change, can be efficiently automated with AI. However, is this enough to justify the complexity involved? The allure of Generative AI (GenAI) lies in its ability to surpass the realm of static factual queries. Even existing methods like “smart caching” and rule-based chatbots from before the GenAI era are effective for repetitive factual questions.

Power Law Phenomena: Once we move beyond static and repetitive factual queries, we cannot ignore day-to-day, business-specific dynamics. In any business domain, the distribution of queries, and of their desired answers, typically follows a power law that evolves over time. We illustrate this with an example.

Let’s consider an automated customer support desk, where a large number of customers are reporting the issue “My credit card is not working.” In the past, there have been 10 known possibilities causing this problem, ranging from customers mistyping information to exceeding credit limits, bank authorization issues, and using unsupported cards like American Express. Suppose that in the past week, 90% of cases involved customers using American Express, and their cards were consistently being denied. In a scenario where the same issue is reported again, a human representative, informed by recent patterns, would probably respond, “Are you using American Express by any chance? We’ve noticed this issue a lot lately.” It’s clear that a Large Language Model (LLM) that only comprehends general English and reasoning would be unaware of the “situational trend,” and it would likely hand the customer the same generic laundry list of ten possibilities, causing frustration.
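The trend-awareness described above can be sketched as a simple re-ranking step. This is a hypothetical illustration (the cause names and blending weight are made up): a static list of known causes is re-ordered by how often each was the confirmed root cause in recent tickets, so the “situational trend” surfaces first.

```python
from collections import Counter

# Hypothetical list of known root causes for "my credit card is not working".
KNOWN_CAUSES = [
    "mistyped card information",
    "exceeded credit limit",
    "bank authorization failure",
    "unsupported card network (e.g. American Express)",
]

def rerank_by_recent_trend(causes, recent_resolutions, weight=0.8):
    """Blend a uniform prior over causes with recent resolution frequencies."""
    counts = Counter(recent_resolutions)
    total = sum(counts.values()) or 1
    prior = 1.0 / len(causes)
    scored = {
        c: (1 - weight) * prior + weight * counts[c] / total
        for c in causes
    }
    return sorted(causes, key=lambda c: scored[c], reverse=True)

# 90% of last week's tickets were resolved as unsupported-card issues.
recent = ["unsupported card network (e.g. American Express)"] * 9 + [
    "exceeded credit limit"
]
ranked = rerank_by_recent_trend(KNOWN_CAUSES, recent)
print(ranked[0])  # the trending cause surfaces first
```

A retriever without access to this recent-resolution signal has no choice but to return the list in its static order, which is exactly the generic laundry list a frustrated customer receives.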

Public Benchmarks Are a Distraction in Power-Law-Dominated Use Cases: Just as with commercial search engines, overall user satisfaction and sentiment depend on the relevance of semantic search on the most popular queries. Even if RAG is 99% accurate overall, failing to retrieve the right information on the 1% of queries that are most popular ends up annoying almost all the customers. The interesting part: popularity drifts over time, which means that critical 1% of queries will not be the same after a while.
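The arithmetic behind this claim is easy to check. The sketch below assumes a Zipf-like query distribution over 1,000 distinct queries (an assumption, not data from the post) and compares traffic-weighted accuracy when the 1% of failing queries sit in the tail versus at the head of the distribution:

```python
# Traffic-weighted accuracy under an assumed Zipf-like query distribution.
def zipf_weights(n):
    """Normalized 1/rank weights over n distinct queries."""
    raw = [1.0 / r for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

n = 1000
weights = zipf_weights(n)

# Scenario A: fail on the 10 least popular queries (1% of distinct queries).
acc_tail = 1.0 - sum(weights[-10:])
# Scenario B: fail on the 10 most popular queries (also 1% of queries).
acc_head = 1.0 - sum(weights[:10])

print("per-query accuracy is 99% in both cases")
print(f"traffic-weighted accuracy, tail failures: {acc_tail:.3f}")
print(f"traffic-weighted accuracy, head failures: {acc_head:.3f}")
```

Under this assumption, both systems are “99% accurate” on a uniform benchmark, yet failing the head of the distribution drops traffic-weighted accuracy to roughly 60%, while failing the tail barely registers. Public benchmarks, which weight queries uniformly, cannot distinguish the two.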

Talk of the town: Complex Fine-tuned Retrieval and Data Labeling Pipelines

In an earlier series of blogs [Link1, Link2], we emphasized the significant difference in search relevance achievable through online-tuning or perpetual fine-tuning, which yielded a remarkable 360% relative improvement. We are pleased to note the market’s gradual recognition that the true value of AI lies in the frequent overhaul of AI pipelines with fine-tuning of foundation and/or embedding models, enabling customized retrieval.

Fine-Tuning Challenges: Fine-tuning typically requires manual data labeling, which is itself a challenge. As a consequence, an emerging trend in the market is the rise of players offering labeling services and dashboards that let domain experts contribute valuable training data. But manual labeling at scale is yet another problem the AI community has grappled with in the past, with limited success.

Good Old Implicit Feedback and Personalization: High-quality labels without any manual data labeling.

Despite the recent advancements in GenAI-driven semantic search, there’s valuable wisdom to be gained from commercial search engines. Since the early 2000s, these engines have effectively eliminated the need for manual data labeling by remaining perpetually adaptive to user behaviors. The key lies in leveraging implicit feedback, relying heavily on the click-through data freely generated by user interactions.

Understanding Implicit Feedback and Click-Through Data: Implicit feedback involves obtaining labeled data from natural human interactions without explicit labeling. For instance, in a user-facing system presenting search results or a social media feed, basic user actions like clicks, expands, mouse hover, eye movement, and dwell time can be tracked. Analyzing such actions provides essential feedback. For example, if a user searches for “apple” and clicks (or spends more time) on results related to “apple company” rather than “apple fruit,” it indicates, without explicit labeling, the semantic meaning of “apple” in the given context.
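The conversion from raw interactions to labels can be made concrete. Below is an illustrative sketch (the log format, field names, and dwell-time threshold are all hypothetical) that turns click-through events into weighted (query, result) training pairs, filtering out short dwells that often indicate an unsatisfying click:

```python
from collections import defaultdict

# Hypothetical click-through log: each event records what the user searched,
# what they clicked, and how long they stayed on the result.
events = [
    {"query": "apple", "result": "apple-company-overview", "dwell_sec": 45},
    {"query": "apple", "result": "apple-fruit-nutrition", "dwell_sec": 2},
    {"query": "apple", "result": "apple-company-overview", "dwell_sec": 60},
]

MIN_DWELL = 5  # short dwells often signal an accidental or unsatisfying click

def clicks_to_labels(events, min_dwell=MIN_DWELL):
    """Aggregate implicit feedback into per-(query, result) positive weights."""
    labels = defaultdict(float)
    for e in events:
        if e["dwell_sec"] >= min_dwell:
            labels[(e["query"], e["result"])] += e["dwell_sec"]
    return dict(labels)

labels = clicks_to_labels(events)
# "apple" -> "apple-company-overview" dominates, with no manual labeling.
best = max(labels, key=labels.get)
print(best)
```

The resulting weighted pairs are exactly the kind of supervision a fine-tuning pipeline would otherwise have to buy from a labeling service, and they arrive continuously as a free by-product of usage.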

The Importance of Implicit Labels: Around the early 2000s, studies of human behavior showed that the information content in freely available implicit signals can be a game changer in improving the accuracy of learning-based search [link]. The success of commercial search and recommendation engines, such as Google’s, depends on their ability to automatically adapt to user behavior over time, driven by usage and clicks. Without this capability and implicit feedback, a search engine relying solely on textual relevance, language, or common reasoning is unlikely to provide the desired level of search relevance.

Meeting Business Expectations with a Repeatable Solution: RAG We Have vs. RAG We Need

Upon reflection, it is evident that both the existing RAG and advanced RAG fall short of fully automating complex tasks, or “crossing the chasm,” without acquiring substantial domain expertise and ongoing training through perpetual tuning. However, a phased approach driven by implicit feedback, akin to the evolution of self-driving cars, can be game-changing in getting these systems into production.

We illustrate the disparity by contrasting two scenarios:

RAG We Have (Not repeatable across domains and use cases): Following the standard recommendation, achieving repeatability across use cases seems unattainable. It requires a dedicated team to undertake the following for each specific use case:

  1. Data collection
  2. Data labeling
  3. Continuous fine-tuning of embedding models
  4. Rebuilding and re-indexing the complete vector database for every embedding-model refinement.

Only after achieving satisfactory accuracy can we navigate production and deployment constraints. This approach resembles the old, cumbersome AI pipelines, where an oft-cited statistic holds that 87% of models fail to reach production due to mismatches between real usage and offline testing.

RAG We Need (Repeatable across domains and use cases): As mentioned earlier, expecting immediate, full automation of domain-specialized RAG solutions is unrealistic. Instead, we can progress toward full automation incrementally, in phases. The idea is to start with perpetually adaptive RAG software designed for continuous online tunability: a system that improves daily. The software, capable of auto-tuning, manages the entire AI lifecycle, from pre-training and fine-tuning to deployment. We introduce it in a shadow or partial-automation mode alongside existing, predominantly manual systems, where it continuously logs implicit user clicks, glance statistics, and other natural feedback, and automatically fine-tunes itself on that usage data. Once sufficiently fine-tuned, the system progresses from no automation to partial automation and eventually reaches full automation. Notably, the software remains repeatable across use cases and is always production-ready, since it operates (or shadows) in real time alongside existing systems.
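The shadow-mode loop described above can be sketched in a few lines. Everything here is hypothetical, not NeuralDB’s actual API: a retriever accumulates implicit click feedback while shadowing the production system, and its ranking improves as that signal arrives.

```python
# Minimal sketch of a shadow-mode, perpetually tuned retriever.
# Class and method names are illustrative assumptions.
class OnlineTunedRetriever:
    def __init__(self):
        # Accumulated implicit positive feedback per (query, doc) pair.
        self.feedback = {}

    def rank(self, query, candidates):
        """Boost candidates that past users engaged with for this query."""
        return sorted(
            candidates,
            key=lambda d: self.feedback.get((query, d), 0.0),
            reverse=True,
        )

    def log_click(self, query, doc, weight=1.0):
        """Shadow mode: record implicit feedback without touching production."""
        key = (query, doc)
        self.feedback[key] = self.feedback.get(key, 0.0) + weight


retriever = OnlineTunedRetriever()
docs = ["troubleshooting-guide", "amex-outage-notice", "billing-faq"]

# Day 1: users consistently click the outage notice for this query.
for _ in range(5):
    retriever.log_click("credit card not working", "amex-outage-notice")

# Day 2: the trending document now ranks first, with no manual labels.
top = retriever.rank("credit card not working", docs)[0]
print(top)
```

In a real deployment the feedback would update model weights rather than a lookup table, but the lifecycle is the same: log naturally occurring signals, tune continuously, and graduate from shadowing to automation once ranking quality is proven.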

ThirdAI’s NeuralDB: Carefully designed to be the RAG system that we need, NeuralDB stands out with its technological differentiators. Explore NeuralDB’s capabilities in detail [link]. Check out two case studies (here and here) showcasing significant enhancements in search relevance achieved through perpetual online tuning. The API documentation delves into diverse functionalities, including turnkey online-tuning of embedding models.

Anshu
ThirdAI Blog

Professor of Computer Science specializing in Deep Learning at Scale, Information Retrieval. Founder and CEO ThirdAI. More: https://www.cs.rice.edu/~as143/