GraphRAG local setup via vLLM and Ollama: A detailed integration guide
Introduction to GraphRAG
GraphRAG is an innovative approach to Retrieval-Augmented Generation (RAG) that leverages graph-based techniques for improved information retrieval. It is a structured, hierarchical approach, as opposed to naive semantic-search approaches that work over plain text snippets. In this comprehensive guide, we’ll walk through setting up and using GraphRAG with open-source inference alternatives: vLLM for serving our Large Language Model and Ollama for embeddings.
Prerequisites
Before we begin, ensure you have:
- A system with NVIDIA GPUs
- Python 3.10–3.11 (vLLM supports 3.8–3.11, while GraphRAG requires 3.10 or above)
- Patience, as some steps can be time-consuming
Setting Up the Environment
We’ll start by creating a fresh environment for our GraphRAG setup. You can use either Conda or Python’s built-in venv. Since we will also need to install CUDA, let’s proceed with Conda for an easier setup.
conda create --name graphenv python=3.11 -y && conda activate graphenv
Note: We are using Python 3.11 because of vLLM compatibility.
Installing GraphRAG
Once your environment is set up, install GraphRAG:
pip install graphrag==0.1.1 ollama
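As a quick sanity check that the install worked, the indexer module should respond with its usage text (this assumes the pip install above completed without errors):
python -m graphrag.index --help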
Preparing the Workspace
Create a directory for your RAG project:
mkdir -p ./ragdir/input
Now add a text file to the input directory. Keep the text content very short, because GraphRAG is computationally expensive and indexing will take a while.
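For example, you could seed the workspace with a small plain-text file; the filename and content below are just placeholders:
echo "GraphRAG builds a knowledge graph over your documents and answers questions using community summaries." > ./ragdir/input/sample.txt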
Initializing the GraphRAG workspace:
python -m graphrag.index --init --root ./ragdir
This will create two files, .env and settings.yaml, in the ./ragdir directory.
.env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or Azure OpenAI endpoint.
settings.yaml contains the settings for the pipeline. You can modify this file to change the pipeline's settings.
Configuring GraphRAG
Modifying .env
Replace the API key in .env with:
GRAPHRAG_API_KEY=EMPTY
Updating settings.yaml
Make these four changes in settings.yaml to point it at vLLM and Ollama:
llm:
  api_base: http://localhost:8000/v1
  model: meta-llama/Meta-Llama-3.1-8B-Instruct

embeddings:
  llm:
    model: nomic-embed-text
    api_base: http://localhost:11434/api
vLLM runs on a default port of 8000 and Ollama on 11434.
I will soon explain why we are making these changes. But for the time being, let’s continue with the setup.
Setting up vLLM
vLLM requires CUDA 12.1 for its compiled binaries. Let’s set it up:
Check and install the NVIDIA drivers
First, check your driver version of NVIDIA and its CUDA compatibility:
nvidia-smi
- If you don’t see any output, you need to install the drivers first. You can refer to this link for more details on installing NVIDIA drivers on Ubuntu 22.04 LTS.
- After running the command, you’ll see a table-like output. At the top, check the driver version and the CUDA version it supports.
- We need the supported CUDA version to be 12.1 or above. If it’s below that, upgrade the drivers by referring to the link provided above.
Install CUDA
Once your CUDA drivers are up to date, let’s install CUDA in our activated Conda environment:
conda install nvidia/label/cuda-12.1.0::cuda-toolkit -y
Now check our CUDA version: nvcc --version
Verify that the version is 12.1 or above before proceeding.
Install PyTorch and vLLM
Next, make sure we have PyTorch installed:
pip install torch torchvision
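Before installing vLLM, it’s worth confirming that PyTorch can actually see your GPUs; if this prints False or 0, revisit the driver and CUDA steps above:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"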
Then we can install vLLM:
pip install vllm
Download the Model
GraphRAG requires a model with a 32k context window, as mentioned in its docs. However, for GPU-constrained setups, meta-llama/Meta-Llama-3.1-8B-Instruct can be a good option to test it out:
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --exclude "original/*" --local-dir meta-llama/Meta-Llama-3.1-8B-Instruct
The path after --local-dir is where the model will be stored.
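Note that the Llama 3.1 weights are gated on Hugging Face, so you may need to accept the license on the model page and authenticate before the download will work:
huggingface-cli login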
Running the vLLM inference Server
Now let’s go ahead and run our inference server. I have a setup with 4 Tesla T4 GPUs, which have a compute capability of 7.5. Hence, I cannot use the bfloat16 dtype, so I’ll go with fp16 by specifying half, and a tensor parallel size of 4 to utilize all my GPUs.
I had to adjust the extra flags and sampling params due to limited compute, so I reduced the context window length to 65,536 tokens and the concurrent sequence requests to 128. One very important setting is guided_decoding_backend. We have two options here: outlines and lm_format_enforcer. By default, vLLM uses outlines, which gave me extremely slow responses, so we will switch to lm-format-enforcer for guided JSON responses.
We could have avoided guided decoding entirely by setting model_supports_json: false in our settings.yaml in ragdir, but “in the spirit of inflating my debug-warrior ego”, I chose to do it anyway, because it took a lot of time to get a hang of what was happening.
Here’s the command to start the vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype half \
--api-key EMPTY \
--tensor-parallel-size 4 \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--max-num-seqs 128 \
--max-model-len 65536 \
--guided-decoding-backend lm-format-enforcer
This will run our server on the default port of 8000. You can modify the parameters as per your GPU configurations.
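Once the server is up, a quick way to confirm the OpenAI-compatible endpoint is reachable is to list the served models; the EMPTY token matches the --api-key we passed above:
curl http://localhost:8000/v1/models -H "Authorization: Bearer EMPTY"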
Setting up Ollama for Embeddings
We are using Ollama for embeddings because vLLM’s support for embedding models is still limited, with encoder models still on its roadmap (we could try intfloat/e5-mistral-7b-instruct, but my current setup does not allow me to).
- Install Ollama by visiting the official Ollama download page.
- Once that’s done, open the terminal and pull the embedding model:
ollama pull nomic-embed-text
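You can verify the embedding model works with a direct call to Ollama’s embeddings endpoint; it should return a JSON object containing an "embedding" array:
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "hello graphrag"}'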
Modifying GraphRAG Library
Now we are required to make two very necessary changes for Ollama and vLLM to work with GraphRAG:
1. Search for the directory where GraphRAG is installed. For that, type:
pip show graphrag
From there, look at the “Location” field. You’ll see something like: /Users/username/miniconda3/envs/graphenv/lib/python3.11/site-packages
2. Go to that location and navigate into the graphrag folder.
From there, we need to modify two files:
a. In graphrag/llm/openai/openai_configuration.py:
- Search for self._n in the __init__ function of the class OpenAIConfiguration(Hashable, LLMConfig).
- Hardcode the value to self._n = 1.
This will resolve our equality error from vLLM.
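If you’d rather not hunt for that assignment by hand, a quick grep will point you at the line to edit; substitute the Location path that pip show reported for your own environment (the path below is just the example from above):
grep -n "self._n" /Users/username/miniconda3/envs/graphenv/lib/python3.11/site-packages/graphrag/llm/openai/openai_configuration.py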
b. Replace the entire content of graphrag/llm/openai/openai_embeddings_llm.py with the following code:
from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes

import ollama


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """Embeddings implementation that delegates to a local Ollama server instead of OpenAI."""

    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self._client = client
        self._configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        # args is built from the configuration for parity with the original file,
        # but the Ollama model is hardcoded below.
        args = {
            "model": self._configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        # Embed each input text with Ollama and collect the raw vectors.
        embedding_list = []
        for inp in input:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list
Save the files after making these changes.
We are now ready to index, since we have already made the changes to the settings.yaml and .env files.
Running GraphRAG
Finally, navigate to the directory containing the ragdir folder we created initially and run:
python -m graphrag.index --root ./ragdir
If you’re running it from inside the ragdir directory itself, change --root to ./.
Now wait for a while and let it build the index and all related artifacts.
Querying GraphRAG
Once that is done, we can start making our --global or --local queries.
To query our GraphRAG, use:
python -m graphrag.query --root ./ragdir --method global "Your query from the context"
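A local query uses the same command with --method local, which is better suited to questions about specific entities in your documents:
python -m graphrag.query --root ./ragdir --method local "Your query from the context"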
For more details on the different ways to query, refer to the GraphRAG Docs.
Alternative Local Inference Options
Ollama now has support for concurrency, so we can pull LLMs with context lengths of 32k and beyond and use them for inference on GPU-scarce devices. You can refer to this or other similar sources for setting it up; a rough sketch follows below.
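As a sketch of that route (the model tag and endpoint here are assumptions based on Ollama’s model library and its OpenAI-compatible API, so double-check against its docs): pull a long-context chat model and point the llm section of settings.yaml at Ollama instead of vLLM.
ollama pull llama3.1
# then in settings.yaml, under llm:
#   model: llama3.1
#   api_base: http://localhost:11434/v1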
Conclusion
We’ve successfully set up GraphRAG with vLLM inference engine for our language model and Ollama for embeddings. This configuration provides a powerful, open-source alternative to OpenAI.
While this setup requires some initial configuration and modifications to the GraphRAG library, it offers great flexibility and control to set up our local GraphRAG sandbox. The use of vLLM allows for efficient utilization of GPU resources, while Ollama provides a lightweight solution for embeddings.
By using vLLM and Ollama, we’ve created a setup that can work well even on systems with limited GPU resources, but the irony is we are still required to use the almighty vLLM which definitely doesn’t go hand in hand with limited resources. 😂😂😂
Remember, GraphRAG can be computationally intensive, especially during the indexing phase. Keep your input documents concise for optimal performance. Also, the library is relatively nascent, so we can expect broader integrations in the future, in which case we wouldn’t need to make any modifications to the library at all.
This is my first medium article, so apologies for the lengthy writeup.
Happy querying! May god bless everybody with tons of GPUs.