Is long context > RAG?
The performance of RAG with the latest models vs. long context with Gemini 1.5 Pro
By Gopi Vikranth, Thompson Nguyen, Abhilash Asokan and Sourabh Deshmukh
As organizations increasingly harness the power of Large Language Models (LLMs), they are unlocking new opportunities to extract valuable insights and answers from vast amounts of unstructured data, including PDFs and product documents. This has given rise to a range of use cases that can drive business value, including:
- Streamlined Productivity: Automating routine tasks and freeing up human resources for more strategic work. For example, businesses can use LLM-powered chatbots to handle routine customer inquiries, freeing up human representatives to focus on more complex and high-value tasks.
- Insight Generation: Deriving valuable patterns and insights from large datasets, enabling data-driven decision-making. For example, by analyzing customer purchase data and behavioral patterns to identify trends and preferences, businesses can optimize targeted marketing campaigns and product development strategies that drive business growth.
- Enhanced Customer Experience: Providing accurate and informative responses to customer inquiries through intelligent chatbots. For example, businesses can utilize LLM-powered chatbots to answer frequently asked questions, provide order tracking updates, and even offer personalized recommendations based on customers’ ordering history.
- Customer Service Optimization: Improving response times and quality for Level 1 and Level 2 support representatives, enhancing customer satisfaction.
- Regulatory Compliance: Staying ahead of regulatory updates and changes through proactive monitoring, tracking, and impact assessment. For example, businesses can use LLMs for compliance monitoring to track changes in regulatory requirements and assess their impact on business operations.
To ensure the accuracy and effectiveness of these use cases, a robust Retrieval-Augmented Generation (RAG) framework is crucial. Recent advancements in LLMs have also significantly expanded their context windows, with some models, such as Gemini 1.5 Pro, now capable of processing 1M tokens. This expanded capacity enables more accurate and informed decision-making, ultimately driving business success.
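To make the comparison concrete, the sketch below contrasts the two strategies: retrieving a handful of relevant chunks (RAG) versus placing an entire corpus into a large context window. All names here (call_llm, score, rag_answer, long_context_answer) are our own illustrative placeholders rather than any vendor's API, and the lexical retrieval is deliberately simplified; a production RAG stack would use embeddings and a vector index.

```python
# Minimal sketch contrasting RAG retrieval with long-context prompting.
# `call_llm` is a stand-in for whichever model API you use; it is NOT a real SDK call.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; wire this to your LLM provider."""
    raise NotImplementedError("replace with your provider's client")

def score(chunk: str, question: str) -> int:
    """Toy lexical relevance score: count of shared words."""
    return len(set(chunk.lower().split()) & set(question.lower().split()))

def rag_answer(question: str, chunks: list[str], k: int = 5) -> str:
    """RAG: retrieve the k most relevant chunks, then ground the answer in them."""
    top = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]
    context = "\n\n".join(top)
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

def long_context_answer(question: str, documents: list[str]) -> str:
    """Long context: place the whole corpus in the prompt (feasible only for
    models with very large windows, e.g. ~1M tokens)."""
    corpus = "\n\n".join(documents)
    return call_llm(f"Documents:\n{corpus}\n\nQuestion: {question}")
```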
Taking a Systematic Approach
In response to the recent advancements in Large Language Models, we have conducted a thorough analysis to assess the requirements for Retrieval-Augmented Generation (RAG) and its implications on various use cases. To inform this evaluation, we drew upon domain-specific datasets in medical, business, and legal domains and tested them against a selection of top-performing models, including GPT-4o, GPT-4-turbo, Claude 2, Claude 3, Claude 3.5, and Gemini 1.5 Pro.
Key Takeaways
Our evaluation reveals the following insights:
Gemini 1.5 Pro: If budget is not a constraint and speed and accuracy are paramount, Gemini 1.5 Pro can deliver satisfactory results without RAG, making it an excellent choice for MVPs and rapid research queries that do not require technical expertise.
Cost, Speed, and Efficiency: For deploying use cases in production, our evaluation highlights the following key takeaways:
- Models with RAG: Models with Retrieval-Augmented Generation (Claude 2/3/3.5, GPT-4-turbo, GPT-4o) demonstrate strong performance, but require more time and resources to implement.
- GPT-4o with RAG: GPT-4o with RAG achieves the best results among all RAG implementations, delivering outcomes comparable to a non-RAG implementation of Gemini 1.5 Pro. Claude 3.5 with RAG is a close contender, with a slightly lower accuracy score but better cost efficiency than both GPT-4o and Gemini 1.5 Pro.
- Gemini 1.5 Pro without RAG: This option offers a faster path to production, but may necessitate additional prompt tuning and fine-tuning.
Significant Time-to-Market Impact: A good RAG implementation can significantly affect the time-to-market for use cases, as seen in our previous RAG implementations and use cases.
By understanding the strengths and limitations of various Large Language Models and RAG approaches, organizations can make informed decisions about which solutions best suit their specific needs and goals.
Approach
To assess the capabilities of Large Language Models (LLMs) across diverse domains and use cases, we developed a comprehensive evaluation framework that measured their performance along multiple dimensions. Our approach rigorously examined and tested the capabilities of both a non-RAG process (Gemini 1.5 Pro) and RAG processes (Claude 2, Claude 3, Claude 3.5, GPT-4-turbo, GPT-4o) on a range of documents within three distinct domain areas: Medical, Business, and Legal, each with its own distinct set of question types; a minimal sketch of the evaluation loop follows the list below. The three domains are:
- Medical: This domain focuses on healthcare-related questions, drawing from clinical studies. The corpus consists of 50 complete documents on clinical assessments and clinical trial studies. Example questions include summarizing the comparative effectiveness of anti-vascular endothelial growth factor (anti-VEGF) agents in treating diabetic macular edema and central retinal vein occlusion, as detailed in the corpus.
- Business: This domain targets questions related to business operations, finance, and management, leveraging over 10 complete 10-K filings. An example question is summarizing the company’s approach to employee welfare and benefits over the last four years, highlighting notable enhancements or changes in policies.
- Legal: This domain concerns legal documents, covering explanations and legal jargon. The corpus includes 33 complete license and contractor agreements. An example question is summarizing the terms and conditions related to sublicensing rights within license agreements, including common limitations or requirements imposed.
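As an illustration of the evaluation loop, and under the assumption that the harness pairs each question with a domain and category label, the flow might look like the sketch below. The structure and names are our own, not the exact code used in this study; the answer function could be either the rag_answer or long_context_answer placeholder sketched earlier.

```python
# Illustrative evaluation harness: run every question through one model
# configuration and collect rubric scores for later aggregation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    domain: str      # "medical", "business", or "legal"
    category: str    # one of the four evaluation categories below
    question: str

def run_evaluation(
    cases: list[EvalCase],
    answer_fn: Callable[[str], str],                  # e.g. RAG or long-context answering
    judge_fn: Callable[[str, str], dict[str, int]],   # scores a (question, answer) pair
) -> list[dict]:
    results = []
    for case in cases:
        answer = answer_fn(case.question)
        scores = judge_fn(case.question, answer)
        results.append({"domain": case.domain, "category": case.category, **scores})
    return results
```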
Evaluation Categories
We assessed the Large Language Models’ capabilities by asking a large number of questions across four distinct evaluation categories:
- Data-Specific Understanding: These questions test the model’s ability to extract specific information from a given dataset or text, including identifying key statistics, recalling dates, or extracting numerical values.
- Contextual Depth and Detail: These assessments evaluate the model’s capacity to provide in-depth answers, considering the context and nuances of a given situation. Examples include explaining complex concepts, providing examples, or discussing hypothetical scenarios.
- Analytical and Inferential Questions: These questions require the model to analyze complex data, draw conclusions, or make inferences, such as identifying patterns, making predictions, or drawing analogies.
- Data Retrieval and Synthesis: These questions test the model’s ability to retrieve relevant information and synthesize it into a coherent response, including summarizing texts, generating reports, or creating hypothetical scenarios.
The scoring rubric below was used to measure the responses:
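As an illustration of how such a rubric can be applied programmatically, the sketch below has an LLM-as-judge score each answer. The prompt wording, the dimension names, and the 1-10 scale are our assumptions rather than the exact rubric used in this evaluation, and call_llm is the placeholder defined in the earlier sketch.

```python
# Hypothetical judge_fn for the harness above: an LLM grades each answer on
# assumed rubric dimensions. Replace dimensions and scale with your own rubric.
import json

RUBRIC_DIMENSIONS = [
    "accuracy", "relevance", "informativeness", "completeness",
    "data_specific_understanding", "contextual_depth",
    "analytical_inference", "data_synthesis",
]

def judge_fn(question: str, answer: str) -> dict[str, int]:
    prompt = (
        "Score the answer on each dimension from 1 (poor) to 10 (excellent). "
        f"Dimensions: {', '.join(RUBRIC_DIMENSIONS)}.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Respond as a JSON object mapping each dimension to an integer score."
    )
    return json.loads(call_llm(prompt))  # call_llm: placeholder from the earlier sketch
```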
Results
Cost Efficiency: Claude 3.5 stands out as the most cost-effective option at $52.58 per 1,000 queries, significantly lower than GPT-4o ($87.63), GPT-4-turbo ($167.75), Claude 2 ($126.33), and Claude 3 ($254.50). Gemini 1.5 Pro is the most expensive option at $1,176.33 because it does not use a RAG pipeline: in this evaluation, thousands of documents were passed directly into its context window, so its token counts are significantly higher than those of the RAG-based models.
Speed: GPT-4o is the fastest model, generating responses at an average rate of 53.8 tokens per second and outperforming its competitors.
Consistency: Despite its speed, GPT-4o exhibits higher variability in generation speed, with a standard deviation of 6.8 tokens per second. This suggests less predictability than other models such as Claude 3, which has the lowest variability (1.6 tokens per second).
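For reference, the cost and speed figures above can be reproduced from raw token counts and throughput samples along the following lines; the per-million-token prices in the signature are placeholders, not the rates used in this evaluation.

```python
# Back-of-the-envelope versions of the cost and speed metrics reported above.
from statistics import mean, stdev

def cost_per_1000_queries(
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    input_price_per_m: float,   # USD per 1M input tokens (placeholder value)
    output_price_per_m: float,  # USD per 1M output tokens (placeholder value)
) -> float:
    """Cost of 1,000 queries given average token counts and per-token pricing."""
    per_query = (
        input_tokens_per_query * input_price_per_m
        + output_tokens_per_query * output_price_per_m
    ) / 1_000_000
    return 1000 * per_query

def throughput_stats(tokens_per_second_samples: list[float]) -> tuple[float, float]:
    """Mean and standard deviation of generation speed (the Speed and Consistency figures)."""
    return mean(tokens_per_second_samples), stdev(tokens_per_second_samples)
```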
Quality of Responses:
- Accuracy and Relevance: GPT-4o, Gemini 1.5 Pro, Claude 3.5, and GPT-4-turbo all share the same accuracy score of 8, indicating strong alignment with query intent and relevance.
- Informativeness and Completeness: GPT-4o scores well (Informativeness of 8 and Completeness of 8), closely competing with Gemini 1.5 Pro (9 and 9, respectively), which is recognized for its detailed responses. Claude 3.5 follows closely at 7 and 7.
Analytical and Data Handling Abilities:
- Data-Specific Understanding: GPT-4o’s performance (8) demonstrates a strong capability to handle detailed and specific data inquiries, outperforming Claude 3.5 (7) and GPT-4-turbo (7), and approaching the level of Gemini 1.5 Pro (9) in this regard.
- Contextual Depth and Detail: GPT-4o and Gemini 1.5 Pro excel in handling complex contextual inquiries, achieving scores of 9, which indicates a superior ability to manage and synthesize extensive data. Claude 3.5 and GPT-4-turbo, with scores of 8, prove to be contenders in this category.
- Analytical Skills: With scores of 9, GPT-4o and Gemini 1.5 Pro demonstrate exceptional capability in inferential reasoning and analytical tasks. Claude 3.5 and GPT-4-turbo also perform very well with scores of 8.
- Data Synthesis: GPT-4o’s ability to integrate and synthesize data (6) falls short of its other capabilities: it trails Gemini 1.5 Pro’s higher synthesis performance (8), and also underperforms GPT-4-turbo (7) and Claude 3.5 (7) on this dimension.