Contract Advisor RAG

Abel Bekele
15 min read · Feb 24, 2024

Retrieval Augmented Generation (RAG) systems built on large language models (LLMs) have emerged as a game-changer in the field of contract management. The primary goal of this project is to develop, assess, and refine a RAG system for Contract Q&A. This approach merges the robust capabilities of powerful language models with external data sources, resulting in highly accurate and context-rich outputs. RAG is particularly well suited to contract Q&A, as it grounds the AI's answers in the context of each question.

The evaluation process is characterized by its thoroughness, ensuring that the AI is not only precise but also adaptable. The potential applications of RAG are wide-ranging, extending to areas such as legal research and compliance monitoring. This has led to a significant demand for experts in this field, who are equipped with the skills and knowledge to leverage this groundbreaking technology.

Project objective

The primary objective of our project is to construct, assess, and enhance a Retrieval Augmented Generation (RAG) system tailored for Contract Q&A. This system will facilitate seamless interaction with contracts, enabling users to engage in conversations and pose inquiries about contractual agreements.

Technology overview


RAG, or Retrieval-Augmented Generation, is a cutting-edge technique that enhances the accuracy and reliability of generative AI models by integrating external knowledge sources. This approach enables AI models to retrieve relevant information and seamlessly incorporate it into text generation tasks. For instance, RAG can be employed to enhance our bot’s ability to respond to user inquiries by cross-referencing the contract provided to it. Additionally, RAG improves the quality and diversity of text generation for a variety of tasks, including summarization, translation, and storytelling. This capability allows RAG to deliver more accurate and nuanced responses.


Chunking, on the other hand, involves breaking down input text into smaller segments to facilitate easier processing by language models. This process optimizes semantic responses, reduces text complexity, and enhances content relevance. Various chunking methods, such as fixed-size, sentence-based, paragraph-based, and topic-based chunking, can be employed, each with its own advantages and disadvantages. Chunking is a critical design decision in RAG systems, as it impacts retrieval and generation performance and accuracy.
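As a minimal sketch, fixed-size chunking with overlap can be implemented in a few lines (the chunk_text helper below is illustrative, not the project's actual splitter):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping edges."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks

contract = "This Agreement is made between the Company and the Advisor. " * 20
pieces = chunk_text(contract, chunk_size=200, overlap=50)
print(len(pieces), max(len(p) for p in pieces))
```

The overlap keeps a sentence that straddles a boundary present in both neighboring chunks, which helps retrieval at the cost of some redundancy.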

Vector databases

Vector databases, meanwhile, are specialized databases that store and retrieve data as high-dimensional vectors, which are mathematical representations of features or attributes. These databases enable efficient and fast similarity searches, making them invaluable for AI applications such as image recognition, natural language processing, and recommendation systems. Unlike traditional relational databases that store data in rows and columns and use SQL queries to access them, vector databases utilize specialized algorithms, such as k-nearest neighbor (k-NN), to find the most similar vectors to a given query vector.
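The core idea behind a vector similarity search can be sketched with a brute-force k-nearest-neighbor lookup over cosine similarity (a toy stand-in for what a vector database does at scale; the knn helper and the example vectors are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query, vectors, k=2):
    """Return the k most similar (id, score) pairs by cosine similarity."""
    scored = [(vid, cosine(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

db = {
    "termination clause": [0.9, 0.1, 0.0],
    "payment terms":      [0.1, 0.8, 0.2],
    "governing law":      [0.0, 0.2, 0.9],
}
print(knn([0.85, 0.15, 0.05], db, k=1))
```

A real vector database replaces this linear scan with an approximate index (such as k-NN over an IVF or HNSW structure) so the search stays fast across millions of vectors.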

Integration Frameworks

Integration Frameworks, like LangChain, play a crucial role in developing applications powered by large language models (LLMs). They enable context-aware applications that can connect a language model to sources of context, such as prompt instructions, few-shot examples, or external content. Additionally, integration frameworks enable applications that can reason, allowing them to answer questions, take actions, or generate text based on the provided context.


Retrieval Techniques

Retrieval techniques are methods that enhance the quality and efficiency of information retrieval for text generation. These techniques aim to overcome the limitations of the naive RAG approach, which simply retrieves the most relevant documents based on the input query and feeds them to a language model. Advanced retrieval techniques use various strategies, such as auto-merging, sentence-window retrieval, dense retrieval, and hybrid retrieval, to improve the quality and efficiency of information retrieval.

RAG Generators

RAG Generators, such as RAG-Token, RAG-Sequence, and FiD, are models that produce text output based on input queries and retrieved documents. They leverage the natural language understanding and reasoning capabilities of large language models (LLMs) to generate relevant, accurate, and informative responses. Each RAG Generator has its own strengths and weaknesses, with RAG-Token offering precise responses but potentially suffering from repetition or inconsistency, RAG-Sequence generating coherent and diverse responses but potentially losing detail or specificity, and FiD producing fluent and consistent responses but requiring more computation and memory.

RAG Evaluation

RAG Evaluation is the process of measuring and improving the performance and accuracy of Retrieval-Augmented Generation (RAG) applications. It involves several components and dimensions, including indexing, retrieval, and generation. RAG Evaluation requires a systematic and robust framework that can measure and compare different aspects and trade-offs of the RAG pipeline, including relevance, diversity, faithfulness, and quality.

Project Blueprint

Enhancing RAG Systems: Conducting a comprehensive review of existing academic literature to refine RAG systems. Focusing on improving performance metrics, scalability, personalization, and contextualization, while also addressing biases in the system’s responses.

Building Q&A Pipeline: Creating an efficient and functional Q&A pipeline using a suitable Large Language Model (LLM). This will involve blending retriever and generator components, designing APIs, and thoroughly testing the system.

Evaluating RAG Systems: Establishing and implementing metrics to accurately gauge the performance of RAG systems. This includes identifying and employing the requisite evaluation tools and frameworks, constructing an evaluation dataset, and creating automated testing procedures for consistency and efficiency.

Optimizing Contract Q&A: Enhancing the system’s ability to process and understand legal language. This includes implementing techniques to heighten the accuracy and reliability of responses and developing or adapting custom models specifically for legal texts.

Implementing Enhancements: Implementing and assessing enhancements to the system’s performance. This involves implementing two enhancements and evaluating their impact on the system’s effectiveness through rigorous evaluation and analysis.

Project implementation

As I embark on this project, I am adopting an Evaluation Driven Development (EDD) approach. Just as Test Driven Development (TDD) is integral to Java/Spring development, EDD plays a crucial role in ensuring the high-quality development of Retrieval-Augmented Generation (RAG) pipelines. By focusing on the evaluation of the pipeline’s outputs, EDD helps guarantee that the responses generated are not only accurate but also relevant and aligned with user expectations.

The following are the six main benefits that EDD brings to the development of RAG pipelines:

Enhancing Accuracy and Relevance: EDD aids in identifying and addressing potential issues with the pipeline’s output, ensuring that the generated responses are accurate, relevant, and consistent with the provided context.

Identifying Weaknesses and Opportunities: By continuously evaluating the pipeline, EDD helps pinpoint areas where improvements can be made, allowing developers to refine specific aspects and optimize overall performance.

Guiding Model Selection and Parameter Tuning: Through the evaluation of different models and parameter configurations, EDD guides the selection of the most suitable model architecture and hyperparameters for the specific task at hand.

Ensuring Robustness and Generalization: EDD ensures that the pipeline performs consistently across various input scenarios and data distributions, enhancing its robustness and generalization capabilities.

Aligning with User Expectations: EDD helps ensure that the pipeline’s output is tailored to the specific needs of the target audience, aligning with user expectations and requirements.

Continuous Improvement and Iteration: EDD fosters a culture of continuous improvement and iteration, allowing developers to make informed decisions based on objective evaluation metrics.

By adopting an EDD approach, I aim to develop RAG pipelines that not only meet high standards of accuracy and relevance but also provide a seamless user experience that aligns with user expectations.

Selecting vector database

When selecting the right vector database for our project, there are several factors to consider, including the specific requirements of our application, the type of data we will be storing and querying, and the scalability and performance needs of our system.

The Milvus vector database is an ideal choice for this project, as it is designed to handle fast and scalable similarity searches in high-dimensional data. It offers improved search efficiency, supports various search algorithms, and scales well for large datasets, handling billions of vectors with concurrent queries at low latency and high throughput. Moreover, Milvus can be easily integrated with popular AI frameworks like TensorFlow, PyTorch, OpenAI, and Spark, making it suitable for a wide range of applications. Additionally, it provides features like role-based access control, quotas and limits, and data backup for reliability and security.


Exploring a range of chunking methods, including TikToken, spaCy, SentenceTransformers, and KoNLPY, showed that TikToken performed best when dealing with contracts. This finding led to improved accuracy and better outcomes for this project.

Selecting the ideal chunk size

Choosing the right chunk_size is crucial for RAG systems, impacting efficiency and accuracy. A smaller chunk_size (e.g., 128) offers more granularity but risks missing vital information. A larger chunk_size (e.g., 512) ensures comprehensive context but may slow the system down. To balance this, we use Faithfulness and Relevancy metrics to measure accuracy and responsiveness. Testing with different sizes is key to finding the optimal configuration for each use case and dataset.
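The granularity trade-off can be illustrated with a toy word-based splitter (word counts stand in for tokens here; split_words is illustrative only):

```python
def split_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

doc = "word " * 1024          # a 1024-word stand-in document
small = split_words(doc, 128)  # more, finer-grained chunks
large = split_words(doc, 512)  # fewer chunks, more context each
print(len(small), len(large))
```

The smaller setting yields eight chunks to index and rank, each with less surrounding context; the larger setting yields two chunks that carry more context per retrieval hit.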

In our testing, chunk sizes smaller than 1024 exceeded the configured limit, while a chunk size of 1024 produced better outcomes.


RAG evaluation is the process of assessing the performance and effectiveness of a Retrieval-Augmented Generation (RAG) system. It involves measuring how well the system retrieves relevant information and generates accurate and coherent responses in response to user queries.

RAG evaluation is important for several reasons:

Improvement of System Performance: By evaluating the RAG system, we can identify areas that need improvement and refine the system to enhance its performance.

User Satisfaction: Evaluating the system helps ensure that the generated responses are accurate and relevant to the user’s queries, leading to higher user satisfaction.

Quality Control: Regular evaluation of the RAG system helps maintain the quality of the generated responses and ensures that the system continues to meet the required standards.

Research and Development: Evaluation of the RAG system provides valuable insights that can be used for research and development purposes, leading to the creation of more advanced and efficient systems.


Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. Existing tools and frameworks help you build these pipelines, but evaluating them and quantifying pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in.

In the evaluation, a set of questions and their corresponding ground truth answers were utilized. These questions were specifically crafted to test various aspects of the agreement between the Company and the Advisor. The examples dictionary was created to organize the questions and their corresponding ground truth answers for the evaluation.

Utilizing the evaluation set, Ragas was employed to evaluate the RAG pipeline by generating answers to the questions and comparing them against the ground truth answers. This facilitated a quantitative assessment of the pipeline’s performance, aiding in the identification of areas for improvement of the RAG.
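As a simplified, self-contained stand-in for this kind of answer scoring, a token-overlap F1 between generated answers and ground truth captures the flavor of the comparison (token_f1 is our own toy metric, not a Ragas API):

```python
def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1: a simplified stand-in for answer-quality metrics."""
    a = set(answer.lower().split())
    g = set(ground_truth.lower().split())
    common = len(a & g)
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)

# Hypothetical evaluation pair in the style of the examples dictionary.
examples = {"What is the termination notice period?": "thirty days written notice"}
generated = {"What is the termination notice period?": "thirty days prior written notice"}
for q, truth in examples.items():
    print(q, round(token_f1(generated[q], truth), 3))
```

Ragas itself goes much further, using an LLM to judge faithfulness, answer relevancy, and context precision/recall rather than raw token overlap.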

Ideas for Optimizing Contract Q&A

  1. Vector Search Mechanism: This involves the implementation of specialized algorithms and data structures that efficiently handle similarity searches in high-dimensional data. For example, Faiss is known for its efficient indexing and search capabilities, while Annoy offers a lightweight, approximate nearest neighbor search solution. Milvus, on the other hand, is a comprehensive vector database that supports both CPU and GPU computation, making it suitable for large-scale datasets.
  2. Memory Management: This aspect focuses on managing the history of conversations within the system. It involves storing and retrieving past interactions between users and the AI, which is crucial for maintaining context and personalization. Optimizing conversation history management includes techniques such as efficient storage, context preservation, privacy and security measures, and performance optimization. These strategies ensure that the system can recall previous conversations accurately and use them to enhance the user experience by providing relevant and personalized responses.
  3. Legal Language Understanding: This involves developing or adapting natural language processing (NLP) models and techniques specifically for legal language. Legal texts often contain complex terminology, jargon, and nuanced meanings, which require specialized handling. Techniques such as domain-specific word embeddings, fine-tuning pre-trained language models on legal corpora, and incorporating legal ontologies or knowledge graphs can enhance the system’s understanding of legal language.
  4. Evaluation Frameworks: This involves creating standardized evaluation metrics, datasets, and test procedures to assess the system’s performance accurately. Common metrics for RAG systems include accuracy, precision, recall, and F1-score, while evaluation datasets consist of queries, contexts, and expected answers. Automated testing procedures can be implemented to ensure consistency and efficiency in the evaluation process.
  5. User Feedback Mechanisms: This involves implementing mechanisms for users to provide feedback on the system’s responses, which can be used to continuously improve the system. This can include features such as like/dislike buttons, user ratings, and comment sections. Advanced techniques such as sentiment analysis and natural language processing can be used to analyze and interpret user feedback, providing valuable insights for system refinement.
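For the retrieval side of the evaluation frameworks above, precision, recall, and F1 over retrieved versus relevant chunk ids can be computed as follows (retrieval_metrics and the chunk ids are illustrative):

```python
def retrieval_metrics(retrieved, relevant):
    """Precision, recall, and F1 over sets of retrieved vs. relevant chunk ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = len(retrieved & relevant)
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hit else 0.0
    return precision, recall, f1

# Hypothetical run: four chunks retrieved, three actually relevant.
p, r, f = retrieval_metrics(retrieved=["c1", "c2", "c3", "c4"],
                            relevant=["c2", "c4", "c7"])
print(p, r, round(f, 3))
```

Running such metrics over a fixed evaluation dataset on every change is what makes the automated testing procedure consistent and comparable across pipeline versions.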

Implementing Enhancements

Better Vector Search Mechanism

The choice of search mechanism depends on various factors like the size and dimensionality of the data, the desired accuracy, and performance requirements. Milvus provides flexibility in selecting the most suitable search mechanism for a specific use case.

  1. Inner product search: This is the most fundamental search mechanism used in Milvus. It scores each vector in the database against the query vector using the dot product operation. Vectors with higher dot product values are considered more similar to the query vector.
  2. IVF search (Inverted File index): This technique partitions the vector space into smaller subspaces using a clustering algorithm. The query vector is then compared only to vectors in the relevant subspaces, significantly improving search efficiency for large datasets.
  3. HNSW search (Hierarchical Navigable Small World): This method constructs a hierarchical graph where similar vectors are connected as neighbors. During a search, Milvus traverses the graph, starting from the node closest to the query vector, and explores neighboring nodes that are likely to contain similar vectors. This approach reduces the search space and improves efficiency for high-dimensional data.
  4. Hybrid search: Milvus often combines these techniques to achieve optimal performance. For instance, it might use IVF for initial filtering and then refine the results using HNSW for more precise similarity retrieval.
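The IVF idea, partitioning vectors and probing only the nearest partition, can be sketched in miniature (hand-picked centroids stand in for the clustering step; this is not Milvus's implementation):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two hand-picked "centroids" stand in for the clustering step of IVF.
centroids = {"A": [1.0, 0.0], "B": [0.0, 1.0]}

vectors = {
    "v1": [0.9, 0.1], "v2": [0.8, 0.3],   # cluster A
    "v3": [0.1, 0.9], "v4": [0.2, 0.7],   # cluster B
}

# Index step: assign each vector to its nearest centroid (inverted lists).
index = {name: [] for name in centroids}
for vid, vec in vectors.items():
    best = max(centroids, key=lambda c: dot(vec, centroids[c]))
    index[best].append(vid)

def ivf_search(query, k=1):
    """Probe only the inverted list of the centroid closest to the query."""
    probe = max(centroids, key=lambda c: dot(query, centroids[c]))
    scored = [(vid, dot(query, vectors[vid])) for vid in index[probe]]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

print(ivf_search([0.95, 0.05], k=1))  # scans only cluster A's two vectors
```

The savings come from scanning one inverted list instead of the whole collection; real IVF indexes probe several lists (`nprobe`) to trade recall against speed.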

For this project, the similarity_search_with_score function plays a crucial role in finding similar items in a database by comparing a query vector to others. This process returns the most relevant items along with scores indicating their similarity level, based on a chosen distance metric such as L2 or cosine similarity. This functionality is invaluable for tasks like recommendation systems, image retrieval, and document search, as it allows for the efficient identification of items that closely match a given query.

Memory Management

Langchain’s Memory module empowers large language models (LLMs) with context and memory, enabling them to hold information across interactions. This functionality is crucial for natural and informative conversations. Here’s a breakdown of its key aspects:


Stores information: The memory module retains user inputs, system responses, and other relevant details from past interactions.
Informs decisions: During subsequent interactions, the LLM accesses stored information to understand the context and make informed decisions, leading to more coherent and relevant responses.
Multiple memory types: Langchain offers various memory types, each suited for specific purposes. These include conversation buffers, entity trackers, and custom memory implementations.


Improved conversation flow: By remembering past interactions, the LLM can maintain a coherent conversation thread and avoid repetitive responses.
Personalized experiences: The LLM can tailor its responses to individual users based on their past preferences and interactions.
Enhanced task completion: By retaining information across steps, the LLM can effectively complete multi-step tasks or answer complex questions requiring previous context.
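A minimal conversation-buffer memory, in the spirit of LangChain's conversation buffer but not its actual API, might look like this:

```python
class ConversationBuffer:
    """Minimal conversation memory: stores turns and renders a context prompt."""

    def __init__(self, max_turns: int = 10):
        self.max_turns = max_turns
        self.turns: list[tuple[str, str]] = []

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))
        self.turns = self.turns[-self.max_turns:]  # drop turns beyond the window

    def as_prompt(self, question: str) -> str:
        history = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
        return f"{history}\nUser: {question}\nAssistant:"

memory = ConversationBuffer(max_turns=2)
memory.add("What is the notice period?", "Thirty days.")
memory.add("Who signs the agreement?", "The Company and the Advisor.")
print(memory.as_prompt("Can it be extended?"))
```

The bounded window is the simplest form of memory management; LangChain's entity trackers and summary memories compress old turns instead of discarding them.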

After enhancing the RAG pipeline, the same evaluation set was used to assess its performance. The evaluation set was carefully crafted to cover various aspects of the agreement between the Company and the Advisor. An examples dictionary was created to organize the questions and their corresponding ground truth answers for evaluation.

Ragas was employed once again to evaluate the enhanced RAG pipeline. This tool generated answers to the questions and compared them against the ground truth answers, facilitating a quantitative assessment of the pipeline’s performance. The evaluation results were used to identify areas for further improvement in the RAG pipeline.

The comparison between the original evaluation results and the results after enhancement shows several improvements in the RAG pipeline’s performance:

Faithfulness Score

The original average faithfulness score was 0.8166, which increased to 0.825 after enhancement.
This indicates that the enhanced RAG pipeline is more faithful to the context of the questions.

Answer Relevancy Score

The original average answer relevancy score was 0.9155, which decreased to 0.7807 after enhancement.
This suggests that while answer relevance decreased, the overall faithfulness and contextual quality of the answers improved. To address the drop, prompt engineering was later applied to guide the generated output.

Context Precision Score

The original average context precision score was 0.3, which increased significantly to 0.5265 after enhancement.
This indicates that the enhanced RAG pipeline is better at providing precise context for the answers.

Context Recall Score

The original average context recall score was 0.65, which increased significantly to 0.95 after enhancement.
This suggests that the enhanced RAG pipeline has improved recall for the context of the questions.

In summary, the enhancements made to the RAG pipeline have produced a system that is more faithful, with markedly better context precision and recall, at the cost of a decrease in answer relevancy.

User interaction


I’ve built a web API using FastAPI that allows users to ask questions through a web interface and receive answers. The API utilizes a custom ContractAdvisor class to process questions, leveraging data from an uploaded file and interacting with external systems like Milvus. CORS middleware is configured so the browser-based front end can call the API from a different origin. While one endpoint provides a placeholder response, the main functionality is in the answer2 endpoint, which handles user questions and returns relevant answers.

Front end

I’ve successfully created a React-based front end with a host of user-friendly features:

Document Upload: This feature allows users to upload documents as PDFs. The system processes the upload and provides users with a preview of the document, making the interaction seamless and efficient.

Conversation History: The platform maintains a comprehensive conversation history, storing all user interactions and responses. Users can easily refer back to previous discussions, view past responses, and track the progress of their conversations.

Export History: This feature enables users to export their conversation history, allowing for easy access and sharing of valuable information. Users can choose the format in which they want to export the data, such as CSV, JSON, or PDF.

Copy to Clipboard: Users can quickly copy relevant information to their clipboard, streamlining their workflow. This feature is particularly useful when users need to share specific details from their conversation history with others.

Responsive Design: The front end is designed to be fully responsive, ensuring a seamless experience on both desktop and mobile devices. The interface adjusts dynamically to different screen sizes, ensuring that users can access and interact with the platform from any device.

Time Stamps: Each message is accompanied by a timestamp, providing users with a clear understanding of when each interaction occurred. This feature helps users track the progress of their conversations and ensures that they have a complete record of their interactions.

Theme: The front end offers the ability to switch between dark and light modes, as well as automatically adapting to the system’s settings for a seamless experience. This feature allows users to choose the appearance that suits their preferences or let their device’s settings dictate the theme for them.


In conclusion, the development of the ContractAdvisor RAG system represents a significant advancement in the field of contract management. By harnessing the power of hybrid Large Language Models (LLMs) and integrating external data sources, this innovative system is poised to revolutionize the way contracts are managed, evaluated, and understood. The thorough evaluation process ensures that the system is not only precise but also adaptable, making it suitable for a wide range of applications, including legal research and compliance monitoring. As the demand for experts in this field continues to grow, the ContractAdvisor RAG system stands as a testament to the potential of Retrieval Augmented Generation (RAG) technology and its ability to transform industries.

Future works

Continual Evaluation and Optimization: Continuous evaluation and optimization of the system are critical to ensuring its reliability and effectiveness. This includes regular monitoring of system performance, identifying areas for improvement, and implementing enhancements to address any issues that arise.

Enhanced User Experience: Improving the user interface and experience should be an ongoing effort. This includes refining the document upload process, optimizing conversation history management, and implementing additional user feedback mechanisms.



Abel Bekele

Generative AI Engineer passionate about integrating technology to solve complex challenges. With a background in Mechatronics, Data Science & Machine Learning.