
Know-it-all: the curse of knowledge (graphs)

7 min read · Nov 21, 2024

I used Microsoft GraphRAG with a Streamlit UI for a research project on ethnographic data. The code for the Streamlit interface and for extracting source citations can be found on GitHub, here. The results exceeded expectations but got me thinking about how to meaningfully cite sources when an LLM has potentially drawn on every source for an answer.

Querying private data with GraphRAG and Streamlit

Lost in the middle

My day job often calls for collecting and making sense of large amounts of loosely structured and qualitative economic and social data. I am also a bit of a Python and GenAI hobbyist. So lately I have been thinking a lot about how to use these tools to assist and enrich research projects.

I was excited to discover Microsoft’s open source GraphRAG project. For the uninitiated, GraphRAG is a fairly user-friendly framework for transforming your private data into a knowledge graph and querying it. It plugs a gap in conventional RAG: the ability to answer questions about your data as a whole, rather than just about the isolated top-k chunks that conventional retrieval returns. That is to say, it helps solve the ‘lost in the middle’ problem.

A little hard graph

How did GraphRAG help me answer big questions about my whole data? With something they call ‘global’ question answering. Indexing begins by instructing an LLM of your choice to extract entities and the relationships between them: the nodes and edges that are the building blocks of a knowledge graph. The LLM can optionally extract claims at this point. It then generates summary descriptions of closely related elements, capturing the holistic significance of entities, relationships and claims that might be expressed differently, and in different contexts, throughout the data. The pipeline then detects graph communities, that is, clusters of closely related entities, relationships and claims.

Lastly, the LLM is instructed to produce community reports, or narrative summaries of each graph community. This indexing allows rich causal and contextual information to be captured at each point in the knowledge graph. Ultimately, it greatly enhances the retrieval of globally relevant context to pass to an LLM with your query.
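For the curious, kicking off that whole pipeline is a single indexing run. Here is a minimal sketch of how I drive it from Python, assuming the module-style CLI of the 0.3.x releases (newer versions have reorganised the commands), with illustrative paths:

```python
# Minimal sketch: scaffold a project and run GraphRAG indexing (assumes the
# graphrag 0.3.x module-style CLI; paths and folder names are illustrative).
import subprocess

ROOT = "./ragtest"  # project folder, with source documents under ./ragtest/input

# One-off: generate settings.yaml, .env and the default prompt templates
subprocess.run(["python", "-m", "graphrag.index", "--init", "--root", ROOT], check=True)

# Build the knowledge graph: entity/relationship (and optional claim) extraction,
# summary descriptions, community detection and community report generation
subprocess.run(["python", "-m", "graphrag.index", "--root", ROOT], check=True)
```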

Problem solved… or is it?

Fantastic so far. I used GraphRAG 0.3.6 (there have been newer releases since, which I have yet to explore) to create a knowledge graph from 240,000 tokens’ worth of transcripts of ethnographic research. The transcripts were largely in colloquial Angolan Portuguese, with a smattering of local languages. I used GraphRAG’s Prompt Tuning feature to generate indexing prompts adapted to the source data. The outputs of my subsequent queries, using OpenAI’s gpt-4o-mini model, vastly exceeded my expectations for such esoteric data.
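Prompt tuning itself was a one-liner of the same flavour. A sketch, again assuming the 0.3.x CLI (the exact module name and flags have moved around between releases, so check the help output for your version):

```python
# Sketch: generate indexing prompts adapted to the source data.
# Assumes the graphrag 0.3.x prompt tuning CLI; flags differ in later releases.
import subprocess

subprocess.run(["python", "-m", "graphrag.prompt_tune", "--root", "./ragtest"], check=True)
```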

A particular gem was being able to pass a free-text instruction to the search engine to format the answers (for instance, ‘multiple paragraph’ or ‘bulleted report’). For such a simple method, it gave superb control not just over the formatting but also over the structure of the response. A personal favourite instruction was ‘matrix table’, with which the LLM generated robust answers to really complex questions like: “In focus groups with adolescents, segmented by males and females, identify the number of references to [concept]. What are the similarities and differences between the data for [location x] and [location y]?”
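That formatting instruction is simply a free-text response type passed alongside the query. A sketch of how I run a global query, assuming the 0.3.x query CLI and its response-type option (the flag name is from memory, so treat it as an assumption):

```python
# Sketch: a 'global' query with a free-text formatting instruction.
# Assumes the graphrag 0.3.x query CLI and its --response_type option.
import subprocess

question = (
    "In focus groups with adolescents, segmented by males and females, "
    "identify the number of references to [concept]. What are the similarities "
    "and differences between the data for [location x] and [location y]?"
)

subprocess.run(
    [
        "python", "-m", "graphrag.query",
        "--root", "./ragtest",
        "--method", "global",
        "--response_type", "matrix table",  # or 'multiple paragraph', 'bulleted report', ...
        question,
    ],
    check=True,
)
```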

There are so many layers in that question that I didn’t really expect the LLM to do much with it. Yet it did, presenting the answer in a hierarchical, tabular structure. And the answers really stood up to corroboration with the source documents. But this now brings me to the tricky bit: corroboration.

“Are these answers just too good to be right?”

“How can I check them?”

Checking the LLM’s homework

The first few times I used GraphRAG’s search methods I was amazed by the results. But then came the troubling thoughts: “Are these answers just too good to be right?” and “How can I check them?”. That’s when I stumbled over what seemed, initially, to be a straightforward problem. Credible academic and professional writing needs diligent citations, not only to give credit where it is due but also to demonstrate that one’s arguments have solid foundations. However, this was not something the GraphRAG project took care of out of the box. Without giving it much more thought, I set about cobbling together some code to append post-processed citations to the answers.

I had noticed that, with the global search method, GraphRAG did include some sort of citation in the answers. These were references to the community reports produced during indexing: the starting point for subsequent RAG. I think of community reports as a collection of short essays that the LLM composes about themes in the data. Interesting and useful, sure, but not conventional sources. The community reports, in turn, referenced the entities and relationships extracted from the source data.

Mapping citations (with Diagrams: Flowcharts & Mindmaps and Blocks and Arrows)

Everything everywhere all at once

A mapping emerged: generated response, to community reports, to entities and relationships, to snippets of original text (text units), to source document titles. “Bingo!” I thought. A few left joins later I had traced the paths from the generated answers back to the source documents and could append them to each answer as citations. I was rather bemused, then, when the first run returned every source document in the data as a citation. I assumed I must have written some janky code. Still, if I had, I could not find the problem.
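Here is a condensed sketch of those joins over the parquet artifacts that indexing leaves behind. The file and column names are the 0.3.x ones as I remember them, and the run folder is illustrative, so treat the details as assumptions rather than gospel:

```python
# Condensed sketch of the citation mapping: community reports -> entities ->
# text units -> source documents. Artifact and column names assume graphrag
# 0.3.x outputs; the run folder name is hypothetical.
import pandas as pd

ART = "./ragtest/output/20241121-120000/artifacts"  # hypothetical run folder

reports    = pd.read_parquet(f"{ART}/create_final_community_reports.parquet")
nodes      = pd.read_parquet(f"{ART}/create_final_nodes.parquet")       # entity -> community
entities   = pd.read_parquet(f"{ART}/create_final_entities.parquet")    # entity -> text_unit_ids
text_units = pd.read_parquet(f"{ART}/create_final_text_units.parquet")  # text unit -> document_ids
documents  = pd.read_parquet(f"{ART}/create_final_documents.parquet")   # document -> title

docs = documents[["id", "title"]].rename(
    columns={"id": "document_id", "title": "document_title"}
)

# Walk the whole chain. In practice you would first filter `reports` down to
# the community reports actually referenced in a given answer.
chain = (
    reports[["community"]]
    .merge(nodes[["title", "community"]], on="community", how="left")      # reports -> entities
    .merge(entities[["name", "text_unit_ids"]],
           left_on="title", right_on="name", how="left")                   # entities -> text units
    .explode("text_unit_ids")
    .merge(text_units[["id", "document_ids"]],
           left_on="text_unit_ids", right_on="id", how="left")             # text units -> documents
    .explode("document_ids")
    .merge(docs, left_on="document_ids", right_on="document_id", how="left")
)

citations = chain["document_title"].dropna().unique()
print(citations)  # on my first run: every document in the corpus
```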

Then I stumbled onto an intriguing thought. Perhaps the script was working exactly as intended. Perhaps the LLM, with GraphRAG, really was a great deal more thorough than most fleshy researchers could ever be. Perhaps it had, in a manner of speaking, read the entire bibliography?

A diligent human is bounded by time, effort, and bias. An LLM with a knowledge graph is unbounded.

Consider this. A diligent human can only read, recall, and objectively cite so much of any body of knowledge. They are bounded by time, effort, and bias. They might want to read those twenty books, reports, and scholarly sources for that report they are writing. But they also need to sleep and eat. They might also have, at the outset, a good idea of what they are going to write, and why they are going to write it. So, let’s face it, they might cherry-pick the morsels of information that reinforce their message and disregard those that do not. Or they might artfully compare their ideas with some easily discardable alternatives, in a simulacrum of debate.

When I, a fickle human, cite my sources, I am effectively exposing the biases in my analysis. I am communicating that “I have concluded such and such based on these curated snippets that I found time to read and digest. If I haven’t cited some other nuggets, it is probably because I ran out of time and/or motivation to consider them. If I’m being completely honest, anything I didn’t cite is a glaring blind spot. But I’ve cited some compelling bits of the stuff that I did read, and I hope it’s enough to convince you too.”

Compared to a person, an LLM with a knowledge graph is unbounded. It can, in a sense, read, recall, and reference all parts of a body of knowledge at once. Let us also assume that (at least until AGI arrives) it will not have an especially novel take on the information. The LLM will effectively generate a highly refined and inhumanly thorough summary, in which potentially every sentence might derive simultaneously from a specific snippet and the entirety of a corpus. If that is the case, is there even a point in it diligently citing its sources? I don’t have an answer to that question yet, but I still had to do something to corroborate the GraphRAG answers.

Good enough for graphing work

How to corroborate the GraphRAG answers? I tweaked the post-processing script to make the referencing a bit more like how I might do it: imperfectly, but well enough. The script calculated the frequency distribution of matches to each source document and returned only references to documents whose frequency was at or above the 50th percentile (the median) of that distribution. In a sense, it simulated citing the documents most frequently drawn on to build the LLM’s context for a query. Lastly, I wrapped GraphRAG’s search methods and the post-processing scripts in a Streamlit user interface, to make the analysis a little easier for myself and for colleagues more comfortable with a browser than a Python IDE. You can check out the repository here.
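In pandas terms, the filter itself boils down to a few lines. A sketch, with hypothetical document titles standing in for the real transcripts:

```python
# Sketch: keep only the source documents most frequently matched for an answer.
# The titles below are hypothetical stand-ins for the real transcript files.
import pandas as pd

matched_titles = [
    "focus_group_01.docx", "focus_group_01.docx", "focus_group_01.docx",
    "focus_group_02.docx", "focus_group_02.docx",
    "key_informant_03.docx",
]

counts = pd.Series(matched_titles).value_counts()   # matches per source document
threshold = counts.quantile(0.5)                    # 50th percentile (median) of the distribution
cited = sorted(counts[counts >= threshold].index)   # keep documents at or above the median

print("Sources: " + "; ".join(cited))
```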

A global answer to a complex query

A graph-ifying conclusion

GraphRAG is fantastic. The quality of responses on a complex and obscure corpus greatly exceeded my expectations. It was also very economical for a research project: the combined cost of three rounds of indexing, of testing, and of many, many queries was $3. I suspect it could get quite expensive if scaled to enterprise-level volumes of data, though.

It was a major productivity booster, compressing potentially three months’ worth of analysis into a couple of weeks. Nevertheless, research projects that need conventional scholarly references might need to tweak the default outputs. For my purposes I scripted some post-processing that approximates conventional citations, wrapped in a Streamlit app. I’d love to see the improvements other people could make to it.

Still, when all is said and done, I find myself wondering whether conventional referencing is a retrogressive step for ‘global’ RAG. It feels a bit like trying to make GenAI do things more like we do, rather than getting it to do things better.


Written by Dara Castello

Working hard to be usefully wrong in everything from social entrepreneurship to human-centered design. Enjoying spending more time outside with the kids.
