How can Graph Databases help in enhancing the Retrieval Augmented Generation (RAG) capabilities of LLMs?

Shobhan
5 min read · Oct 13, 2023


LLMs or Large Language Models are highly sophisticated text generation models that can produce nearly human-like text given proper context and prompts. With the amount of research that has been going on regarding LLMs, people all over the world have been looking for ways to enhance the output generated by these models to better suit their specific use cases. There are three main ways to go about personalizing an LLM:

1) Fine Tuning:

An approach in which we take a pre-trained model and update its weights by training it further on data specific to our use case.

2) Few Shot Prompting:

We include examples of the desired output for given inputs in the prompt itself that is passed to the LLM. This is mainly used when we expect the output to follow a specific format.

3) Retrieval Augmented Generation (RAG):

RAG is the method of providing additional context to the LLM based on the query that has been passed to it. This additional context can be retrieved from various sources like Vector Databases.

Out of these three, RAG has been rising in popularity due to its simplicity and effectiveness. Fine-tuning will probably produce the most accurate output, but it is not always feasible, since it takes a lot of data and compute to fine-tune a model to industry-level quality. Few-shot prompting is not flexible enough to fit most use cases. RAG provides a nice middle ground: it doesn't require any additional training of the model, and it is flexible enough to handle a wide range of use cases.
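To make the contrast concrete, here is a rough sketch of how a few-shot prompt and a RAG prompt are typically assembled. The wording and the helper functions are illustrative assumptions, not any particular library's API:

def few_shot_prompt(question: str) -> str:
    # Few-shot prompting: show the model example input/output pairs so it copies the format.
    return (
        "Convert the movie fact into JSON.\n"
        "Input: Inception was released in 2010.\n"
        'Output: {"title": "Inception", "release_year": 2010}\n'
        f"Input: {question}\n"
        "Output:"
    )

def rag_prompt(question: str, retrieved_docs: list) -> str:
    # RAG: prepend context retrieved from a knowledge base to the question.
    context = "\n".join(retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )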

The process of RAG can be understood as follows:

[Figure: Retrieval Augmented Generation pipeline]

When a question is asked, the pipeline first retrieves data relevant to the question from the knowledge base. The question, along with the retrieved documents, is then passed to the LLM, which generates an answer for us. By providing relevant information alongside the question, the LLM gets some much-needed context to answer it appropriately, and this additional context significantly reduces hallucinations (a term for when an LLM generates incorrect text).
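A minimal sketch of this pipeline in Python may help make it concrete. The vector_store client and the embed()/generate() callables below are hypothetical stand-ins for whichever embedding model, vector database and LLM you use:

def answer_with_rag(question: str, vector_store, embed, generate, top_k: int = 4) -> str:
    # 1. Retrieve: embed the question and fetch the most similar chunks
    #    from the knowledge base (vector_store.search is a hypothetical API).
    query_vector = embed(question)
    relevant_chunks = vector_store.search(query_vector, k=top_k)

    # 2. Augment: build a prompt containing the retrieved context.
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the LLM produces the final, grounded answer.
    return generate(prompt)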

So, if RAG is so effective, then why are we still looking for ways to improve the quality of the generated text? Well, the short answer is that while RAG is good, it’s not good enough.

To start off, there are a lot of hyperparameters, such as chunk size, chunk overlap and the number of chunks retrieved per question, and the right values vary with the type and size of the documents we want to retrieve information from. No single combination of these parameters works for every use case, and the only way to find the right one is through experimentation. This tends to be quite time consuming, because depending on the size and volume of the documentation, rebuilding the knowledge base for each experiment can take a long time.
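To make the hyperparameter problem concrete, here is a sketch of a naive character-based chunker. The default values below are arbitrary assumptions, and they (together with the number of chunks retrieved at query time) are exactly the kind of knobs that need per-use-case experimentation:

def chunk_document(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    # Split a document into overlapping character windows. The right values
    # depend heavily on the type and size of the documents being indexed.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]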

Secondly, RAG works well only for simple, single-fact questions; it is inefficient when it comes to drawing inferences from the retrieved information and using those inferences to answer questions. In many cases, RAG-based pipelines are also unable to handle multi-hop or multi-document queries.

To take an example, say we have multiple documents that each contain information about an individual movie, such as its summary, cast, ratings and release date. If we use RAG and ask a question like ‘Who acted in XYZ movie?’, our model would be able to answer it easily.

But if we instead ask a question like ‘How many movies was X the lead actor in?’, we might not get a correct answer, because LLMs have a limited context window and we are passing entire movie documents each time. In this case the model will only receive information about a few movies that mention ‘X’, including their summaries, cast lists and release dates. This is very inefficient, since for this particular question we do not care about the synopsis or the release date; we only need the cast details.

Most of these problems arise from the way relevant documents are retrieved: similarity search. Similarity search looks for matching keywords and/or semantic similarity to acquire relevant information. This is where graph databases come into the picture.
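For reference, the semantic half of similarity search usually boils down to comparing embedding vectors, most often with cosine similarity; a bare-bones version looks like this:

import math

def cosine_similarity(a, b) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); values closer to 1 mean more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)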

If we can narrow down the kind of information we will need for our use case, we can store it in a graph format in a graph database. Graph databases are flexible: they can store all kinds of data, related or not, in the form of nodes, and each node can carry its own attributes, while relationships between nodes capture how they are connected. Neo4j, OrientDB, ArangoDB and TigerGraph are a few popular graph database providers; I personally use Neo4j.
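Sticking with the movie example, here is a sketch of how such data could be written to Neo4j using the official Python driver. The node labels, property names, relationship type and connection details are assumptions made for illustration:

from neo4j import GraphDatabase

# Connection details are placeholders; replace them with your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_movie(tx, title, release_date, cast):
    # MERGE keeps nodes unique; ACTED_IN relationships connect actors to the movie.
    tx.run(
        "MERGE (m:Movie {title: $title}) SET m.release_date = $release_date",
        title=title, release_date=release_date,
    )
    for actor in cast:
        tx.run(
            "MATCH (m:Movie {title: $title}) "
            "MERGE (a:Actor {name: $actor}) "
            "MERGE (a)-[:ACTED_IN]->(m)",
            title=title, actor=actor,
        )

with driver.session() as session:
    # Example data only.
    session.execute_write(add_movie, "XYZ", "2021-05-01", ["X", "Another Actor"])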

The first step is to get our data into a graph format. There are multiple ways to go about it, but the most effective way I know of is to use an LLM to extract the entities and relations we want to store in the database. With some careful prompting, we can get all the entities, their attributes and their relations in one pass every time a document is ingested into our system.
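A sketch of that extraction step is shown below; call_llm() is a hypothetical stand-in for your LLM client, and the JSON shape is just one possible convention:

import json

EXTRACTION_PROMPT = """Extract entities and relations from the text below.
Return JSON of the form:
{{"entities": [{{"label": "...", "name": "...", "attributes": {{}}}}],
 "relations": [{{"from": "...", "type": "...", "to": "..."}}]}}

Text:
{document}"""

def extract_graph(document: str, call_llm) -> dict:
    # One LLM call per ingested document returns the entities, their
    # attributes and their relations, ready to be written to the graph.
    response = call_llm(EXTRACTION_PROMPT.format(document=document))
    return json.loads(response)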

Once we have a graph ready, there are two ways in which we can extract data from it:

1) Using an LLM to generate Cypher queries that fetch relevant data based on the question asked.

2) Using LLM integrations with knowledge graphs to extract relevant nodes directly.

The first method is simple enough: we provide the LLM with the structure of our graph database as context and ask it to generate Cypher queries that fetch the information needed to answer the question. We then pass the query results back to the LLM to generate the final answer.
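Here is a sketch of this text-to-Cypher flow, again with a hypothetical call_llm() and the Neo4j Python driver; the schema string assumes the movie graph sketched earlier:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema for the movie graph used above.
GRAPH_SCHEMA = "(:Actor {name})-[:ACTED_IN]->(:Movie {title, release_date})"

def answer_over_graph(question: str, call_llm) -> str:
    # 1. Ask the LLM to translate the question into Cypher, given the schema.
    cypher = call_llm(
        f"Graph schema: {GRAPH_SCHEMA}\n"
        f"Write a Cypher query that answers: {question}\n"
        "Return only the query."
    )
    # 2. Run the generated query against the graph.
    with driver.session() as session:
        results = [record.data() for record in session.run(cypher)]
    # 3. Let the LLM compose the final answer from the query results.
    return call_llm(f"Question: {question}\nQuery results: {results}\nAnswer:")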

The drawback of this method is that we are completely reliant on the model's ability to generate valid Cypher. If it generates a bad query, the entire pipeline fails.

The second method varies by graph database provider. Neo4j, for example, lets users run similarity search directly on the graph through its recent APOC integrations, although these currently support only OpenAI and VertexAI embeddings.

We first call the APOC embedding procedure using a query like the following:

CALL apoc.ml.openai.embedding([$question], $apiKey) YIELD embedding

Here, $question is a placeholder for the user's question and $apiKey is a placeholder for the OpenAI API key. We can then compute the similarity metric with gds.similarity.cosine(embedding, m.embedding) in a WITH clause, and use a MATCH clause to choose which nodes to compare and what data to retrieve.
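Putting those pieces together, a complete retrieval query might look like the sketch below, run here through the Python driver. It assumes the APOC and GDS plugins are installed and that each Movie node already stores a precomputed embedding property; the connection details and API key are placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

RETRIEVAL_QUERY = """
CALL apoc.ml.openai.embedding([$question], $apiKey) YIELD embedding
MATCH (m:Movie)
WITH m, gds.similarity.cosine(embedding, m.embedding) AS score
ORDER BY score DESC LIMIT 5
OPTIONAL MATCH (m)<-[:ACTED_IN]-(a:Actor)
RETURN m.title AS title, score, collect(a.name) AS cast
"""

with driver.session() as session:
    # Embed the question, rank Movie nodes by cosine similarity,
    # then pull in the neighboring Actor nodes for the top matches.
    rows = session.run(
        RETRIEVAL_QUERY,
        question="Who acted in XYZ movie?",
        apiKey="<your OpenAI API key>",
    ).data()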

This method is quite effective for retrieving the nodes (and their neighboring nodes) related to the question and then generating answers on top of them.

The obvious drawback of the graph approach is that accuracy depends largely on the structure of the graph and on the queries that are generated. If the required data cannot be represented cleanly in the graph, the pipeline fails completely.

In conclusion, graph databases can be used to boost the accuracy of RAG pipelines and can help to solve complex problems like multi-hop questions, but only when the data can be represented in a graph format. With how fast things are changing, we’ll probably have an efficient way of dealing with these drawbacks quite soon.
