Generative AI & RAG Agents and Applications: Ready for Prime Time or Still Prototypes? How AI Leaders Can Tell the Difference

Anshu · Published in ThirdAI Blog · 6 min read · Aug 14, 2024

Goldman Sachs released a report titled 'Gen AI: Too Much Spend, Too Little Benefit?', raising concerns about the promise of Generative AI. The report summarizes observations after more than a year of expensive efforts by leading enterprises to rush GenAI into production, with limited success. It is becoming evident that GenAI, like traditional AI before it, faces significant challenges when scaling from prototypes and demos to production systems that can directly impact real business outcomes.

For teams and businesses that have experienced AI’s success, there is no doubt that it is a groundbreaking technology with significant gains that continue to improve over time. However, these teams also understand that extracting value from AI requires a certain finesse. Business leaders, in particular, must focus on subtle differences that demand careful investigation and monitoring to ensure they are moving in the right direction.

Below, I summarize some of the key distinctions that separate teams with successful AI-first products from those that merely sprinkle AI onto their existing products, resulting in limited gains and subsequent struggles to justify the cost of AI.

Image Credits: ChatGPT

Distinction 1: Prioritize Controlled Deployments Over AI Prototypes: Experienced AI practitioners know the fastest route to production is to work backward — start with a basic, end-to-end version that delivers initial value to customers. Deploy it for live testing in a controlled environment, using an agile, iterative process to refine the system based on real-world feedback and metrics. Live A/B testing is critical in this phase, allowing for the evaluation of AI models and strategies in real conditions.
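
As a concrete illustration, below is a minimal sketch of deterministic traffic splitting for a live A/B test. The variant names, models, and logging schema are hypothetical, not a specific product's API; the point is that each user is stably bucketed so real-world metrics can later be joined per variant.

```python
import hashlib

# Hypothetical candidate models under live A/B test.
VARIANTS = {"control": "fast_model_v1", "treatment": "accurate_model_v2"}

def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_share * 100 else "control"

def handle_query(user_id: str, query: str) -> dict:
    variant = assign_variant(user_id)
    # In a real system this would call the chosen model; here we only record
    # which variant served the request so downstream metrics can be attributed.
    return {"user_id": user_id, "variant": variant,
            "model": VARIANTS[variant], "query": query}

if __name__ == "__main__":
    for uid in ["alice", "bob", "carol"]:
        print(handle_query(uid, "return policy for electronics"))
```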

The choice of AI model and platform dictates the solution's scalability, latency, data movement, and cost. AI accuracy is malleable and typically improves iteratively over time; business constraints, like latency in e-commerce, are non-negotiable. For instance, Amazon famously estimated that every 100ms of added latency in results costs about 1% in sales, which leads many e-commerce platforms to opt for less accurate but faster models.
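
To make that trade-off concrete, here is a minimal sketch, with illustrative model names and numbers, of treating latency as a hard filter and optimizing accuracy only within the budget:

```python
# Illustrative candidates: measured p99 latency and offline accuracy.
CANDIDATES = [
    {"name": "large_reranker",  "p99_ms": 220, "accuracy": 0.91},
    {"name": "distilled_model", "p99_ms": 45,  "accuracy": 0.86},
    {"name": "bi_encoder",      "p99_ms": 12,  "accuracy": 0.81},
]

def pick_model(budget_ms: float) -> dict:
    """Latency is a hard constraint; accuracy is maximized within it."""
    feasible = [m for m in CANDIDATES if m["p99_ms"] <= budget_ms]
    if not feasible:
        raise ValueError(f"No model meets the {budget_ms}ms budget")
    return max(feasible, key=lambda m: m["accuracy"])

print(pick_model(budget_ms=100))  # -> distilled_model, not the most accurate one
```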

Real Evaluation Metrics, Not Just Textbook Ones: Once we are in the deployment stage, the focus can shift to the accuracy of AI models, allowing different strategies to be swapped in and compared. Until this stage, it is impossible to accurately estimate the real impact of AI or to capture the right observable metrics, such as engagement, implicit clicks, and feedback. This approach shifts the focus away from mere textbook metrics and toward metrics that directly translate into business outcomes.
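
For instance, here is a minimal sketch, with a hypothetical event schema, of capturing implicit feedback (impressions and clicks) in production so that click-through rate, rather than an offline benchmark, drives model comparisons:

```python
import time
from dataclasses import dataclass, field

@dataclass
class InteractionLog:
    """Illustrative store for implicit-feedback events; not a product schema."""
    events: list = field(default_factory=list)

    def log(self, user_id: str, query: str, result_id: str, event: str):
        self.events.append({
            "ts": time.time(), "user_id": user_id,
            "query": query, "result_id": result_id, "event": event,
        })

    def click_through_rate(self) -> float:
        shown = sum(1 for e in self.events if e["event"] == "impression")
        clicked = sum(1 for e in self.events if e["event"] == "click")
        return clicked / shown if shown else 0.0

log = InteractionLog()
log.log("alice", "q1", "doc7", "impression")
log.log("alice", "q1", "doc7", "click")
print(f"CTR: {log.click_through_rate():.2f}")
```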

Relying on AI prototypes with theoretical benchmarks for production is a flawed strategy. Successful AI systems, like ChatGPT, Google's Gemini, or Claude by Anthropic, evolve through continuous, usage-driven refinement, converging over time toward consistently delivering high value to users. Google Search, for example, evolves daily based on user interactions. Notably, ChatGPT's predecessors, GPT-1 and GPT-2, weren't top performers on benchmarks, but OpenAI's AI-first approach allowed them to evolve rapidly. Simplified prototypes may be suitable for academic exercises, but they often mislead production-focused teams.

Distinction 2: Focus on Entire AI Systems, Not Just Models: Investing in AI infrastructure and PoC experiments without accounting for end-to-end costs, latency, data movement, and hardware requirements will likely keep a project from ever reaching production. For example, if a RAG pipeline has a 1-second latency per million documents with no path to a 10x improvement, or if it requires duplicating data across locations, it won't scale. AI solutions that aren't optimized for the specific application, such as e-commerce or security, are likely to fail because network-call latency alone can exceed production limits.
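
One lightweight practice is to instrument every stage of the pipeline before worrying about model accuracy. The sketch below, with sleep calls standing in for real embedding, search, re-ranking, and generation steps, shows how a per-stage latency budget might be tracked end to end:

```python
import time
from contextlib import contextmanager

timings = {}  # accumulated milliseconds per pipeline stage

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start) * 1000

def answer(query: str) -> str:
    with timed("embed"):
        time.sleep(0.005)   # placeholder for the embedding call
    with timed("vector_search"):
        time.sleep(0.040)   # placeholder for the ANN lookup
    with timed("rerank"):
        time.sleep(0.030)   # placeholder for the cross-encoder
    with timed("generate"):
        time.sleep(0.200)   # placeholder for the LLM call
    return "answer"

answer("example query")
for stage, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>14}: {ms:6.1f} ms")
```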

ChatGPT is a prime example of the need to focus on entire systems. It’s not just a single AI model but a complex system with multiple components, such as query understanding, expert model routing, response post-processing, and feedback mechanisms. This system continually evolves with usage data and human input. Viewing ChatGPT as just one AI model is a misconception; relying on a single model to deliver such complexity is bound to fail.

Distinction 3: An AI-First End-to-End Approach vs. Multiple Piecemeal Components (Fewer Components Mean More Control, Accountability, and Better Accuracy): Consider the popular example of a RAG (Retrieval-Augmented Generation) pipeline. Suppose we use three different components: one for embedding, one for the vector database, and one for re-ranking, each managed by a separate team. After integrating these components, we might find that the accuracy is poor and the system is five times slower than expected.

In this scenario, the accuracy issues could be blamed on any of the components: the re-ranker, the embedding models, or the vector database’s similarity function. Additionally, it’s hard to determine which team is responsible for the latency problem. This situation leads to endless blame games, and business leaders, who are responsible for timelines and deliverables, struggle to identify and understand the root cause of the issues. As a result, debates continue without resolution because ownership and accountability are unclear.
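
One pragmatic way out of the blame game is a controlled ablation: hold two components fixed, swap the third, and measure retrieval quality. Below is a minimal sketch of that methodology, where the recall numbers are toy stand-ins for a real held-out evaluation:

```python
def evaluate(embedder: str, index: str, reranker: str) -> float:
    # Placeholder for: embed the eval queries, search the index, rerank,
    # and compute recall@k against labeled relevant documents.
    toy_gain = {"emb_v2": 0.15, "ivf_flat": -0.05, "ce_large": 0.02}
    return 0.60 + sum(toy_gain.get(c, 0.0) for c in (embedder, index, reranker))

baseline = ("emb_v1", "hnsw", "ce_small")
for i, alternative in enumerate(["emb_v2", "ivf_flat", "ce_large"]):
    config = list(baseline)
    config[i] = alternative  # swap exactly one component at a time
    delta = evaluate(*config) - evaluate(*baseline)
    print(f"swap {baseline[i]} -> {alternative}: recall change {delta:+.2f}")
```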

However, using a single end-to-end AI-first system like a Neural Database, which replaces all three components with one learning-to-index neural network, makes attribution straightforward. To improve accuracy or reduce inference time, we only need to tune the neural network and speed up its inference. This refinement can even be done by knowledge workers through a simple UI, eliminating the need for a developer or data science expert. If a retrieval mistake occurs, it can be fixed in minutes.

Distinction 4: Avoid AI Solutions Where Hyper-Customization Is a Future Conversation — Zero-Shot Accuracy Is Far from Business-Ready: Most businesses require hyper-customized AI tailored to their specific problems and domain specializations to be truly useful. However, in these early stages, the prevailing approach has been to build a GenAI prototype with zero-shot capabilities, evaluate its value, and then hope the community will eventually address the need for customization. After about a year of experiments, enterprises are finding that without hyper-customization, GenAI will not make it into production. Zero-shot accuracy falls significantly short of expectations, especially at scale. (Read two case studies here and here.)

Worse, the infrastructure built for these prototypes is prohibitively rigid, making it difficult to incorporate fine-tuning or hyper-customization — both of which are essential for production deployment. Even minor modifications to retrieval or NLP models can trigger months-long cycles of code changes, fixes, testing, and redeployment. Additionally, there is no guarantee that the redeployed model will meet engineering constraints like latency, potentially requiring a complete re-optimization.

We highlighted the fundamental problems with embeddings and vector databases in an earlier blog. If we build a RAG pipeline with embedding models and a vector database and later need to upgrade the embedding models, we are forced to rebuild the entire vector database. Furthermore, the memory requirements of embedding storage and vector databases are prohibitive for applications at scale. Rebuilding the vector database after even a small update to the embedding models is a cost no one considered during zero-shot prototyping, and those teams are now stuck with a rigid stack where customization is prohibitively expensive.
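
A quick back-of-the-envelope calculation shows why. The sketch below estimates the storage and rebuild cost that every embedding-model swap re-incurs; the corpus size, dimension, and throughput are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope cost of an embedding-model upgrade in a
# vector-database RAG stack. All numbers are illustrative assumptions.
num_chunks   = 100_000_000   # document chunks in the corpus
dim          = 768           # embedding dimension
bytes_per_f  = 4             # float32 per dimension
embeds_per_s = 500           # re-embedding throughput of one GPU worker

storage_gb = num_chunks * dim * bytes_per_f / 1e9
rebuild_h  = num_chunks / embeds_per_s / 3600

print(f"vector storage: ~{storage_gb:,.0f} GB (before index overhead)")
print(f"full re-embed on one worker: ~{rebuild_h:,.0f} hours")
# Every model swap repeats this cost: vectors produced by the old model
# are incompatible with the new one, so the whole index must be rebuilt.
```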

Finally — Call to Action for Business Leaders: For leaders managing product delivery and timelines, understanding the four key distinctions described above and their subtleties is crucial. It’s increasingly clear that leaders must ask many tough questions before committing time and resources to AI projects. Whether you’re weighing open-source versus closed-source, build versus buy, or software versus consulting, these four distinctions form the foundation of a successful AI strategy.

In our next blog, we'll dive into a platform we've developed specifically for this space: ThirdAI's platform for building hyper-customized AI Agents with unmatched scale and latency. We'll explore how it empowers line-of-business teams to achieve the four key distinctions without needing a team of AI/ML experts, because these distinctions are baked into its very design. Stay tuned!


Anshu · ThirdAI Blog

Professor of Computer Science specializing in Deep Learning at Scale and Information Retrieval. Founder of ThirdAI. More: https://www.cs.rice.edu/~as143/