Build & Deploy a Perpetually-Improving Medical Q&A Engine at Scale (120M Chunks) with NeuralDB (No GPUs)

Anshu · ThirdAI Blog · 6 min read · Mar 27, 2024

Got spare CPU boxes? Let's build and deploy Generative AI applications at scale!

We present a case study in which we used NeuralDB's automated scaling to construct the largest auto-tuning, auto-scaling Q&A system, incorporating over 120 million text excerpts from PubMed articles. This was accomplished with just six AMD Bergamo CPU machines, no GPUs, all within a weekend. Importantly, after its initial setup, the system can be efficiently deployed on a single AMD Milan box at a throughput of 10 queries per second. Furthermore, the deployed system continuously improves its responses through usage and implicit feedback over time via online tuning, without requiring any engineering or code modifications.

Goal: Developing a User-Facing, Self-Improving Medical Q&A System (RAG) Based on 120 Million Text Chunks from PubMed

PubMed has made 35 million articles available which, after standard processing, yield 120 million text chunks. These form the basis for a Q&A bot designed to provide accurate, reliable information grounded in PubMed citations, a source trusted by the medical community. The aim is for this system to self-adjust and improve over time, akin to Google Search. For instance, if users consistently search for "cold and cough" and prefer COVID-related answers over influenza-related ones, the system learns these patterns through implicit feedback from natural usage, eliminating the need for manual updates.

We are utilizing NeuralDB to implement Retrieval Augmented Generation (RAG) for contextually grounded generation, with a strong emphasis on online-tunable retrieval to facilitate continuous self-optimization.

Fine-tuning Generative Models vs. Fine-tuning Retrieval Models: It is critical to recognize that fine-tuning a generative model is a sensitive and complex task. Although ChatGPT can be used freely for generation, fine-tuning it by updating its weights introduces a host of difficulties in managing the GenAI model. A viable alternative is to fine-tune the retrieval component using the straightforward NeuralDB APIs. By customizing the retrieval system, we automatically tailor generation by adjusting the context supplied with each query. This is often sufficient to customize the generated responses without modifying the generative model at all.
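To make this concrete, here is a minimal sketch of the pattern, assuming the publicly documented `thirdai.neural_db` Python API and the standard `openai` client; the model name, prompt, and the way the index is loaded are illustrative, not the exact choices used in this deployment:

```python
# Minimal RAG sketch: generation stays frozen; only retrieval is ever tuned.
from thirdai import neural_db as ndb
from openai import OpenAI

db = ndb.NeuralDB()   # stand-in; in practice the pre-built PubMed index is loaded here
client = OpenAI()

def answer(query: str, top_k: int = 5) -> str:
    # Retrieval: the tunable component that shapes every generation.
    references = db.search(query=query, top_k=top_k)
    context = "\n\n".join(ref.text for ref in references)
    # Generation: an off-the-shelf model, used as-is with the retrieved context.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided PubMed excerpts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```

Because all customization lives in `db`, the generative model's weights are never touched; tuning retrieval changes what context the generator sees, and therefore what it says.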

NeuralDB Difference: No Embeddings, No GPUs, No VectorDBs, No Network Calls. Turnkey fine-tuning with data residency, an AI-first approach to retrieval. Read more [here].

Introducing NeuralDB: An Auto-Scaling Software Library with Online Tunability at Scale

We deployed NeuralDB Enterprise on six AMD EPYC 9754 (Bergamo) machines, each with 128 cores, acquired through our collaboration with AMD. Using a straightforward script that requires only a single function call, with no manual tuning or engineering effort, the scheduling, parallelization, and monitoring of tasks become effortless. NeuralDB automates parallel data reading, domain-specific pre-training of its models, and subsequent indexing in a local key-value database, completing the entire setup in 60 hours. The process benefits from data parallelism and is designed for horizontal scaling: expanding the installation from six to twelve CPU machines would roughly halve the setup time to 30 hours.
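For reference, below is a single-node sketch of that build step, based on ThirdAI's public `neural_db` examples. The shard file names and count are hypothetical, and the CSV column configuration is omitted (the linked script at the end of this post has the exact setup); NeuralDB Enterprise runs the same insertion distributed across machines without code changes:

```python
# Single-node equivalent of the build step: one insert() call covers parsing,
# domain-specific pre-training, and indexing.
from thirdai import neural_db as ndb

db = ndb.NeuralDB()

# Each CSV holds pre-chunked PubMed text (shard names/count are illustrative;
# column arguments omitted for brevity).
shards = [ndb.CSV(f"pubmed_chunks_{i}.csv") for i in range(100)]

# train=True triggers the domain-specific pre-training described above.
db.insert(shards, train=True)

# Persist the index (~100GB at full 120M-chunk scale) for one-click deployment.
db.save("pubmed.ndb")
```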

Once pre-training and indexing are complete, the system is ready for one-click deployment, and its footprint is remarkably lightweight. A single dual-socket AMD Milan server (64 cores per socket) is ample for hosting this extensive chatbot system. In contrast to traditional embedding-plus-vector-database approaches, which can require managing up to two terabytes of index (refer to the table), NeuralDB needs only a 100GB index, making a single CPU box more than sufficient for large-scale hosting. The hosted model is accessible here.

NeuralDB facilitates continuous improvement via both implicit and explicit user feedback. It answers user queries with citations, diversified to ensure broad coverage akin to commercial search engines [Read More]. Users enhance system accuracy by endorsing helpful citations or linking questions with specific insights or partial answers. Moreover, the system captures implicit feedback through standard use. Regular updates, informed by these feedback mechanisms, transform it into an ever-evolving medical RAG Q&A system. This establishes NeuralDB as the first medical chatbot specifically designed to increase its relevance and utility through natural user interactions, paralleling Google Search.
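The feedback hooks look like the following sketch, using the `text_to_result` (endorse a citation for a query) and `associate` (link two phrasings) calls from ThirdAI's public `neural_db` examples; the queries and the way the index is loaded are illustrative:

```python
# Online tuning from user feedback; no retraining pipeline, no redeployment.
from thirdai import neural_db as ndb

db = ndb.NeuralDB()   # stand-in; the deployed system loads the saved PubMed index

# Explicit feedback: the user marks a returned citation as the right answer.
results = db.search(query="cold and cough", top_k=5)
db.text_to_result("cold and cough", results[0].id)

# Association: teach the retriever that two phrasings should behave alike,
# e.g. the usage pattern described above.
db.associate(source="cold and cough", target="covid symptoms")
```

Implicit feedback (clicks on citations during normal use) can be funneled into the same calls, which is what makes the system improve without engineering effort.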

Why not the usual suspects: Embeddings and VectorDBs?

Existing RAG approaches pose significant challenges to our objectives for two primary reasons:

  1. Prohibitive Cost, Memory, and Data Movement at the Scale of 120M Chunks: In the following section, we detail the cost, memory footprint, and data-movement requirements. Memory emerges as a major bottleneck at this scale. For approximately 120M text chunks, the vector embeddings (1536 dimensions) and the index together occupy about 2TB of storage (1TB for the embeddings + 1TB for the search index), which must be constantly managed. In contrast, NeuralDB needs less than 100GB for its index, which, along with the 50GB of raw data, fits comfortably on a server-grade CPU box such as Milan.
  2. Lack of Adaptability to Constant Evolution (Concept Drift) with Embeddings and VectorDBs: Any modification to the embedding model forces a complete rebuild of the VectorDB, because stale embeddings cannot be meaningfully compared with updated ones. Consequently, online tuning is not feasible in this ecosystem, and a dedicated team must continuously monitor retrieval quality as concepts and usage drift over time.

Cost and Resource Analysis:

Enterprises concerned with scale prioritize data residency, latency, high availability, and ease of use. Currently, many potential large-scale use cases are not pursued due to concerns about data residency. NeuralDB adopts an enterprise-first approach to Retrieval-Augmented Generation (RAG), designed with these priorities in mind.

To illustrate its scalability and cost-effectiveness, we compare NeuralDB with a popular enterprise-grade solution for RAG — OpenAI Embedding combined with Weaviate VectorDB. This solution is known for its transparent pricing and high availability. The comparison, detailed below and in the accompanying table, focuses on:

  • Target Querying Latency: 20 QPS (Queries Per Second)
  • Desirable Property: Managing Concept Drift Over Time

Option 1: Embedding + VectorDB (Cannot Tolerate Drift):

  • Cost for 120M vectors: $80,217.9 per month ($13,363.2 base, multiplied by 6 for enterprise-level availability), assuming 1536-dimensional vectors (OpenAI Ada), as calculated with the Weaviate pricing calculator. The one-time embedding cost for indexing with OpenAI Ada-2 is $2,400, with details available here. Fine-tuning at this scale is challenging and expensive.

Memory Overhead: Approximately 2TB in total. Embeddings account for about 1TB (1536 × 120M ≈ 200 billion floats → roughly 800GB), and the index requires another 1TB. Managing this 2TB of memory, especially loading it, affects latency, and any fine-tuning requires a complete rebuild of both the embeddings and the index. The underlying data is only 50GB of text.
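As a quick sanity check, the footprint arithmetic (assuming float32 embeddings):

```python
# Back-of-the-envelope memory footprint for 120M chunks of 1536-dim embeddings.
num_chunks = 120_000_000
dim = 1536                # OpenAI Ada embedding dimension
bytes_per_float = 4       # float32

embedding_bytes = num_chunks * dim * bytes_per_float
print(f"Embeddings: {embedding_bytes / 1e12:.2f} TB")   # ~0.74 TB, i.e. the ~800GB above

# Add a comparable amount for the vector index itself: ~2TB total to manage,
# versus NeuralDB's <100GB index over the same 50GB of raw text.
```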

Option 2: NeuralDB (Designed for Scale and Constant Drift): NeuralDB offers a simplified, software-only solution. If you use existing CPUs, the hardware cost is negligible; only a flat per-core software subscription fee applies. By deploying NeuralDB Enterprise, you can turn all your CPU cores into an on-demand semantic-search infrastructure while still allocating free cores to other tasks.

To estimate the total cost, we use premium EC2 on-demand pricing, even though many more cost-effective sources of CPU cores exist. Sixteen hpc6a.48xlarge instances (96 vCPUs of AMD EPYC each) can build the index over 120M chunks in about 60 hours. At on-demand pricing of $2.88/hour, the hardware cost is approximately $2,700. The software subscription, at roughly 2 cents per core-hour for about 1,600 cores over 60 hours, totals approximately $1,980. The entire one-time index construction therefore costs around $4,680.

Hosting: After the index is built, a single r7a.32xlarge instance (4th-generation AMD EPYC CPU) can support 20 QPS of inference, for an estimated $3,224 per month (reserved pricing) in EC2 hardware plus approximately $1,920 per month in software for 128 cores. The software cost can be reduced further with bulk enterprise pricing.
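The build and hosting figures above can be reproduced with back-of-the-envelope arithmetic; the small gaps from the quoted numbers come from rounding the per-core rate and core counts:

```python
# Sanity check of the cost figures quoted above, using the stated rates.

# One-time index construction on 16x hpc6a.48xlarge (96 vCPUs each).
build_hw = 16 * 60 * 2.88            # instances * hours * $/hour  ≈ $2,765
build_sw = 0.02 * 1600 * 60          # $/core-hour * cores * hours ≈ $1,920
print(f"Build total: ~${build_hw + build_sw:,.0f}")        # ≈ $4,700

# Monthly hosting on a single 128-core r7a.32xlarge at 20 QPS.
host_sw = 0.02 * 128 * 730           # $/core-hour * cores * hours/month ≈ $1,869
print(f"Hosting software: ~${host_sw:,.0f}/month, plus ~$3,224 reserved EC2")
```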

Memory Overheads: The data is 50GB, and the index is under 100GB, including the deep models. Notably, NeuralDB does not use embeddings: the 120M text chunks are served by a roughly 10-billion-parameter neural network whose parameters occupy less than 40GB.

Challenges in Medical Q&A: Reliability Concerns and Limited User Feedback

Medical chatbots, including ChatGPT, answer health-related inquiries but often face scrutiny over the reliability of their sources. Databases like PubMed provide credible citations, but their interfaces are not as user-friendly as chatbots.

A significant issue is that neither ChatGPT nor PubMed integrates user feedback into its learning process. For instance, if users frequently search for "cold and cough" and consistently select COVID-related results, an effective system would identify such patterns and adjust its responses accordingly. The failure of current PubMed search and of chatbots such as ChatGPT to use this "implicit feedback" diminishes user satisfaction. The concept of implicit feedback, now over two decades old, is a cornerstone of most successful commercial search engines, improving user experience daily. To learn more about the role of implicit feedback in popular search engines, read this blog.

Important Links:

Hosted PubMed Search and Q&A

Python Script to Build Your Own PubMed Search with 120M Text Chunks

NeuralDB Enterprise Installation and More Information


Anshu: Professor of Computer Science specializing in Deep Learning at Scale and Information Retrieval. Founder and CEO of ThirdAI. More: https://www.cs.rice.edu/~as143/