Build Agentic-RAG with OpenVINO™ and LlamaIndex

Published in OpenVINO-toolkit · Aug 16, 2024

Authors: Ethan Yang, Raymond Lo, and Stephanie Maluso

Background

Retrieval-Augmented Generation (RAG) is a form of prompt engineering that enhances large language models (LLMs) by integrating external data into prompts, improving the relevance, accuracy, and domain depth of their responses. However, traditional RAG systems have limitations. They often lack flexibility, relying heavily on retrieval results from vector databases, which can lead to the uncritical incorporation of data. As these databases scale, standard RAG struggles to efficiently classify and filter input requests, turning retrieval into a time-consuming, labor-intensive search for a needle in a haystack. In this blog, we’ll walk through a detailed example of building an Agentic-RAG system with LlamaIndex and the performance benefits gained from OpenVINO™.

  1. Background
  2. Full Example: Creating a RAG System with OpenVINO™ and LlamaIndex
    2.1 Model Conversion and Quantization
    2.2 Model Task Initialization
    2.3 Create a RAG Tool
    2.4 Build the Agent Pipeline
  3. Summary and Outlook
This diagram illustrates how an AI system processes a user’s query. The process begins with the user submitting a query, which is received by an agent — the central unit responsible for determining how to handle it. The agent may use specialized tools to gather data or perform specific tasks, or it may rely on a RAG module to pull in external information. Finally, the agent compiles the results from the tools and/or RAG to generate and deliver a response back to the user.

The Agentic-RAG system addresses the limitations of standard RAG by employing AI agents to integrate various RAG retrievers as specialized tools. These agents intelligently determine whether to utilize RAG’s context search and select the appropriate retriever based on the user’s query. For instance, historical questions may prioritize a historical RAG retriever, while mathematical questions might bypass RAG entirely, directly engaging calculation-related tools for immediate answers from LLMs. Additionally, Agentic-RAG can combine RAG with other tools to tackle complex tasks collaboratively. In the accompanying diagram, the AI agent delegates tasks, invoking different tools or RAG components sequentially to produce the final solution.

Next, we’ll explore how to construct an Agentic-RAG system using OpenVINO™ and LlamaIndex.

Full Example: Creating a RAG System with OpenVINO™ and LlamaIndex

Model Conversion and Quantization

A RAG system requires both LLM and embedding models as essential components. These models can be converted into the OpenVINO™ IR format and quantized using the Optimum-Intel CLI.

Installation:

To install the necessary tools for model conversion and quantization, use the following pip command:

pip install optimum[openvino]

This command installs the optimum package with OpenVINO™ support, enabling model conversion and optimization.

LLM Conversion:

An OpenVINO™ LLM model can be exported through the text-generation-with-past task in Optimum Intel. Since the LLM serves as the agent in this pipeline, selecting a model with strong reasoning capabilities, such as Llama3-8B or Phi3-Medium, is recommended.

optimum-cli export openvino --model {llm_model_id} --task text-generation-with-past --trust-remote-code --weight-format int4 {llm_model_path}

Embedding Model Conversion:

Similarly, an OpenVINO™ embedding model can be exported through the feature-extraction task. Popular models like BGE or Jina are compatible. For this example, we use BAAI/bge-small-en-v1.5 for its balance of performance and accuracy.

optimum-cli export openvino --model {embedding_model_id} --task feature-extraction {embedding_model_path}

Model Task Initialization

Once the LLM, Embedding, and Reranker tasks based on OpenVINO™ are integrated into the LlamaIndex framework, developers can effortlessly initialize these tasks using the exported models.

Installation:

pip install llama-index llama-index-llms-openvino llama-index-embeddings-openvino

LLM Initialization:

In LlamaIndex, an OpenVINO™-based LLM task is created using the OpenVINOLLM class.

from llama_index.llms.openvino import OpenVINOLLM

llm = OpenVINOLLM(
    model_id_or_path=str(llm_model_path),
    max_new_tokens=1000,
    model_kwargs={"ov_config": ov_config},
    device_map=llm_device.value,
)
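Here, ov_config and llm_device are defined earlier in the accompanying notebook (llm_device is a device-selection widget, hence the .value). As a minimal sketch, a configuration along the following lines is commonly used with OpenVINO™; the exact values are assumptions you should tune for your hardware, and if you use a plain string for the device you can pass it to device_map directly:

# Sketch only: illustrative runtime configuration for the OpenVINO backend.
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",  # optimize the runtime for single-request latency
    "NUM_STREAMS": "1",             # one inference stream suits interactive chat workloads
    "CACHE_DIR": "",                # empty string leaves the compiled-model cache disabled
}
llm_device = "CPU"  # or "GPU" on machines with an Intel GPU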

Embedding Initialization:

The text embedding model, created with the OpenVINOEmbedding class, converts input text into feature vectors.

from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding

embedding = OpenVINOEmbedding(model_id_or_path=embedding_model_path, device=embedding_device.value)
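Although this example only uses the LLM and embedding models, the reranker task mentioned above is also available through LlamaIndex’s OpenVINO™ integration. A minimal sketch, assuming the llama-index-postprocessor-openvino-rerank package is installed and a reranker model (for example, BAAI/bge-reranker-large) has been exported with the same Optimum-Intel CLI to reranker_model_path (a hypothetical path variable):

from llama_index.postprocessor.openvino_rerank import OpenVINORerank

# Re-scores retrieved chunks and keeps only the top_n most relevant ones.
reranker = OpenVINORerank(
    model_id_or_path=str(reranker_model_path),  # assumed path to the exported reranker IR
    device="CPU",
    top_n=2,  # keep the two highest-scoring retrieved nodes
)

The reranker can then be attached to a query engine through the node_postprocessors argument of index.as_query_engine.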

Create a RAG Tool

Next, we develop a RAG tool using the LLM and Embedding components. In this example, Meta-Llama-3-8B-Instruct is used as the LLM, and bge-small-en-v1.5 as the embedding model. The initial setup involves a standard RAG retriever within LlamaIndex, using the default vector similarity search method for contextual filtering. For a comprehensive RAG implementation, refer to the example in the OpenVINO™ notebooks repository.

from llama_index.readers.file import PyMuPDFReader
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.tools import FunctionTool

# Register the OpenVINO-backed models as LlamaIndex's global defaults.
Settings.embed_model = embedding
Settings.llm = llm

# Load the source document, build a vector index over it, and expose it as a
# query engine that retrieves the two most similar chunks as context.
loader = PyMuPDFReader()
documents = loader.load(file_path=text_example_en_path)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2)
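Before wrapping the query engine as a tool, it can be useful to sanity-check the plain RAG pipeline with a direct question (the question below is illustrative):

# Quick check that retrieval and generation work before handing the engine to an agent.
response = query_engine.query("How many cores does an Intel Xeon 6 processor with E-cores support?")
print(response)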

Once the RAG retriever is created, you can use the LlamaIndex interface to wrap it as a tool for the agent. It’s essential to describe the tool well so that the LLM can determine which tool to invoke based on the task.

from llama_index.core.tools import QueryEngineTool

budget_tool = QueryEngineTool.from_defaults(
    query_engine,
    name="Xeon6",
    description="A RAG engine with basic facts about Intel Xeon 6 processors with E-cores",
)

To demonstrate Agentic-RAG’s capability to handle complex tasks, we also develop two distinct mathematical tools for LLMs to choose from.

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the product."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)
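The ReAct agent chooses which tool to call based on each tool’s name and description (FunctionTool typically derives these from the function name, signature, and docstring when they are not set explicitly), so it is worth inspecting the metadata the agent will actually see. A quick illustrative check:

# Print the name and description the agent uses when deciding which tool to call.
for tool in (multiply_tool, add_tool, budget_tool):
    print(tool.metadata.name, "->", tool.metadata.description)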

Build the Agent Pipeline

Since Llama3 does not natively support function calling, we create the agent using the ReAct approach. The ReActAgent.from_tools interface lets you build an agent pipeline in a single line of code: bind the tools and the LLM component to create a basic ReAct agent.

from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools([multiply_tool, add_tool, budget_tool], llm=llm, verbose=True)

Finally, let’s test this example. When asked, “What is the maximum number of cores in an Intel Xeon 6 processor server with 4 sockets?” the agent proceeds as follows: it queries the “Xeon6” RAG tool for the maximum number of cores in a single CPU socket, then uses the multiply tool to multiply that result by 4, producing the final answer.

response = agent.chat("What's the maximum number of cores in an Intel Xeon 6 processor server with 4 sockets? Go step by step, using a tool to do any math.")

Result:

  • Action: Xeon6
    - Input: "maximum cores in a single socket"
    - Observation: Maximum cores in a single socket: 144
  • Action: multiply
    - Input: {"a": 144, "b": 4}
    - Observation: 576
  • Answer: The maximum number of cores in an Intel Xeon 6 processor server with 4 sockets is 576.

Compared to standard RAG, this Agentic-RAG pipeline supports more complex tasks, such as data analytics, by reasoning about the request and invoking external tools. However, because each response requires multiple rounds of reasoning and tool calls, the performance of an Agentic-RAG system deserves particular attention, which is where optimizations such as the int4 weight compression applied above pay off.
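A quick way to see this cost is to time a full agent turn. A minimal sketch using only the standard library (the timing wrapper is our addition, not part of the original example):

import time

# Measure the end-to-end latency of one multi-step agent response.
start = time.perf_counter()
response = agent.chat("What's the maximum number of cores in an Intel Xeon 6 processor server with 4 sockets? Go step by step, using a tool to do any math.")
print(f"Agent response generated in {time.perf_counter() - start:.1f} s")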

Summary and Outlook

By integrating agents and RAG, we enhance LLMs’ ability to handle complex tasks directly, making this approach more suitable for industry applications compared to traditional RAG methods. With the adoption of a multi-agent approach, agent-based RAG systems are likely to replace standard RAG systems, offering a more adaptable and precise framework for LLM applications over time.

About the Authors:

Ethan Yang, based in Shanghai, China, focuses on scaling and integrating the OpenVINO ecosystem with over six years of AI solution deployment and customer service experience. He holds a master’s degree in communication engineering from the University of York and previously worked at Open AI Lab on the Tengine AI inference framework.
Raymond Lo, currently based in Silicon Valley, is the global lead of Intel’s AI evangelist team, focusing on the OpenVINO™ Toolkit. With a diverse background that includes founding the augmented reality company Meta, Raymond has also held key roles at Samsung NEXT and Google Cloud AI. His work spans startup entrepreneurship and enterprise innovation, with a strong presence in global conferences like TED Talks and SIGGRAPH.
Stephanie Maluso is a product marketer and analyst for Intel, specializing in the OpenVINO™ Toolkit. With over three years of experience on the team, beginning as an intern, she has developed a deep passion for creating impactful content around the innovative AI products and tools she supports.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
