GraphRAG local setup via vLLM and Ollama: A detailed integration guide

Saurabh Rajaram Yadav
6 min read · Jul 21, 2024


AI-generated image for representational purposes. Source: Gemini

Introduction to GraphRAG

GraphRAG is an innovative approach to Retrieval-Augmented Generation (RAG) that leverages graph-based techniques for improved information retrieval. It is a structured, hierarchical approach, as opposed to naive semantic-search approaches over plain text snippets. In this guide, we’ll walk through setting up and using GraphRAG with open-source alternatives: vLLM for serving the large language model and Ollama for embeddings.

Prerequisites

Before we begin, ensure you have:

  • A system with NVIDIA GPUs
  • Python 3.10–3.11 (vLLM supports 3.8–3.11; GraphRAG requires 3.10 or above)
  • Patience, as some steps can be time-consuming

Setting Up the Environment

We’ll start by creating a fresh environment for our GraphRAG setup. You can use either Conda or Python’s built-in venv. Since we will also need to install CUDA, let’s proceed with Conda for an easier setup.

conda create --name graphenv python=3.11 -y && conda activate graphenv

Note: We are using Python 3.11 because of vLLM compatibility.

Installing GraphRAG

Once your environment is set up, install GraphRAG:

pip install graphrag==0.1.1 ollama

Preparing the Workspace

Create a directory for your RAG project:

mkdir -p ./ragdir/input

Now add a text file inside the input directory. Keep the text content very short, because GraphRAG is computationally expensive and indexing will take time. A quick way to create a small sample file is shown below.
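For instance, a tiny sample document can be created like this (a hypothetical example; any short .txt file placed in ./ragdir/input works, since GraphRAG's default input config reads plain .txt files):

# Create a short sample document for indexing.
from pathlib import Path

input_dir = Path("./ragdir/input")
input_dir.mkdir(parents=True, exist_ok=True)
(input_dir / "sample.txt").write_text(
    "GraphRAG builds a knowledge graph over the input documents and uses it "
    "to answer global and local questions about the corpus."
)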

Initializing the GraphRAG workspace:

python -m graphrag.index --init --root ./ragdir

This will create two files: .env and settings.yaml in the ./ragdir directory.

  • .env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or Azure OpenAI endpoint.
  • settings.yaml contains the settings for the pipeline. You can modify this file to change the settings for the pipeline.

Configuring GraphRAG

Modifying .env
Replace the API key in .env with:

GRAPHRAG_API_KEY=EMPTY

Updating settings.yaml
Make these four changes in settings.yaml so that GraphRAG uses vLLM and Ollama (the keys sit inside the existing llm and embeddings sections):

llm:
  api_base: http://localhost:8000/v1
  model: meta-llama/Meta-Llama-3.1-8B-Instruct

embeddings:
  llm:
    model: nomic-embed-text
    api_base: http://localhost:11434/api

vLLM serves on port 8000 by default and Ollama on 11434.
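If you want to double-check the edits, a minimal sketch like the following reads the file back (assuming PyYAML is available and the default graphrag 0.1.1 key names llm and embeddings):

# Sanity-check that settings.yaml points at the local vLLM and Ollama endpoints.
import yaml

with open("./ragdir/settings.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["llm"]["api_base"])                # expected: http://localhost:8000/v1
print(cfg["embeddings"]["llm"]["api_base"])  # expected: http://localhost:11434/api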

I will soon explain why we are making these changes. But for the time being, let’s continue with the setup.

Setting up vLLM

vLLM requires CUDA 12.1 for its compiled binaries. Let’s set it up:

Check and install the NVIDIA drivers

First, check your driver version of NVIDIA and its CUDA compatibility:

nvidia-smi
  1. If you don’t see any output, you need to install the drivers first. You can refer to this link for details on installing NVIDIA drivers on Ubuntu 22.04 LTS.
  2. After running the command, you’ll see a box-like summary. At the top, check the driver version and the CUDA version it is compatible with.
  3. That CUDA compatibility needs to be 12.1 or above. If it’s lower, upgrade the drivers by referring to the link above.

Install CUDA

Once your NVIDIA drivers are up to date, let’s install the CUDA toolkit in our activated Conda environment:

conda install nvidia/label/cuda-12.1.0::cuda-toolkit -y

Now check our CUDA version: nvcc --version

Verify that the version is 12.1 or above before proceeding.

Install Pytorch and vLLM

Next, make sure we have PyTorch installed:

pip install torch torchvision
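Before moving on, it’s worth confirming that PyTorch sees the GPUs and was built against CUDA 12.x; a quick check (assuming the install above pulled a CUDA-enabled wheel):

# Verify the PyTorch install: CUDA build version, GPU visibility, and GPU count.
import torch

print(torch.__version__, torch.version.cuda)   # e.g. 2.3.1 12.1
print(torch.cuda.is_available())               # should be True
print(torch.cuda.device_count())               # number of visible GPUs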

Then we can install vLLM:

pip install vllm

Download the Model

GraphRAG requires a model with a context window of 32k, as mentioned in its docs. However, for GPU-constrained setups, meta-llama/Meta-Llama-3.1-8B-Instruct is a good option to test things out:

huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --exclude "original/*" --local-dir meta-llama/Meta-Llama-3.1-8B-Instruct

The path after --local-dir is where the model will be stored. Note that the Llama 3.1 models are gated on Hugging Face, so you may need to run huggingface-cli login with an access token from an approved account first.

Running the vLLM inference Server

Now let’s run our inference server. I have a setup with 4 Tesla T4 GPUs, which have a compute capability of 7.5. That means I cannot use the bfloat16 dtype, so I’ll go with fp16 by specifying half, and set a tensor parallel size of 4 to utilize all four GPUs.

Due to limited compute, I also had to adjust a few extra flags and sampling parameters: I reduced the context window length to 65k and the number of concurrent sequence requests to 128. One more very important setting is guided_decoding_backend.

We have two options here: outlines and lm-format-enforcer. By default, vLLM uses outlines, which gave me extremely slow responses, so we will switch to lm-format-enforcer for guided JSON responses.

We could have avoided guided decoding entirely by setting model_supports_json: false in our settings.yaml inside ragdir, but in the spirit of inflating my debug-warrior ego, I chose to keep it, because it took a lot of time to figure out what was happening.

Here’s the command to start the vLLM server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dtype half \
    --api-key EMPTY \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 128 \
    --max-model-len 65536 \
    --guided-decoding-backend lm-format-enforcer
This will run our server on the default port of 8000. You can modify the parameters as per your GPU configurations.
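Before wiring it into GraphRAG, you can sanity-check the server with the OpenAI-compatible client. This is a minimal sketch; it assumes the openai Python package (v1+) is installed and that the model name matches the one passed to --model:

# Minimal smoke test against the local vLLM OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)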

Setting up Ollama for Embeddings

We are using Ollama because support for encoder-style embedding models is still on vLLM’s roadmap (we could try intfloat/e5-mistral-7b-instruct, but my current setup does not allow it).

  1. Install Ollama by visiting the official Ollama download page.
  2. Once that’s done, open the terminal and pull the embedding model (a quick check follows below):
ollama pull nomic-embed-text
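You can verify the embedding model with the ollama Python package we installed earlier. A minimal sketch; nomic-embed-text should return 768-dimensional vectors:

# Quick check that Ollama serves embeddings for nomic-embed-text.
import ollama

result = ollama.embeddings(model="nomic-embed-text", prompt="GraphRAG test sentence")
print(len(result["embedding"]))  # embedding dimensionality (768 for nomic-embed-text)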

Modifying GraphRAG Library

Now we need to make two essential changes for Ollama and vLLM to work with GraphRAG:

  1. Find the directory where GraphRAG is installed. To do that, run:
pip show graphrag

From there, look at the “Location” field. You’ll see something like: /Users/username/miniconda3/envs/graphenv/lib/python3.11/site-packages

2. Visit that directory and navigate to the GraphRAG folder.

From there, navigate to two important files:

a. In graphrag/llm/openai/openai_configuration.py:

  • Search for self._n in the __init__ function of the class OpenAIConfiguration(Hashable, LLMConfig).
  • Hardcode the value to self._n = 1. This will resolve the equality error raised from vLLM.

b. Replace the entire content of graphrag/llm/openai/openai_embeddings_llm.py with the following code:

import ollama
from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """Drop-in replacement that generates embeddings via Ollama instead of OpenAI."""

    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self._client = client
        self._configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        # Collected for parity with the original implementation; the Ollama call
        # below only needs the model name and the prompt.
        args = {
            "model": self._configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        embedding_list = []
        for inp in input:
            # Embed each input text with the locally pulled nomic-embed-text model.
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list

Save the files after making these changes.

We are now ready to index, since we have already made the changes to the settings.yaml and .env files.

Running GraphRAG

Finally, navigate to the directory that contains the ragdir folder we created earlier and run:

python -m graphrag.index --root ./ragdir

If you’re inside ragdir itself, change --root to ./.

Now wait for a while as it builds the index and the related artifacts.

Querying GraphRAG

Once that is done, we can start making our --global or --local queries.

To query our GraphRAG, use:

python -m graphrag.query --root ./ragdir --method global "Your query from the context"

For more details on the different ways to query, refer to the GraphRAG docs.

Alternative Local Inference Options

Ollama now supports concurrency, so we can pull LLMs with context lengths of 32k and beyond and use them for inference on GPU-scarce devices. You can refer to this or similar sources to set it up.

Conclusion

We’ve successfully set up GraphRAG with vLLM inference engine for our language model and Ollama for embeddings. This configuration provides a powerful, open-source alternative to OpenAI.

While this setup requires some initial configuration and modifications to the GraphRAG library, it offers great flexibility and control to set up our local GraphRAG sandbox. The use of vLLM allows for efficient utilization of GPU resources, while Ollama provides a lightweight solution for embeddings.

By using vLLM and Ollama, we’ve created a setup that can work well even on systems with limited GPU resources, though the irony is that we still need the almighty vLLM, which definitely doesn’t go hand in hand with limited resources. 😂😂😂

Remember, GraphRAG can be computationally intensive, especially during the indexing phase, so keep your input documents concise for optimal performance. The library is also relatively nascent, so we can expect broader integrations in the future; once those land, we won’t need to make any modifications to the library.
This is my first Medium article, so apologies for the lengthy writeup.

Happy querying! May god bless everybody with tons of GPUs.


