Context is Everything: Logging Your LLM Conversations in a Graph Database

Alex Gilmore
6 min read · Oct 26, 2023

Note: This article and the underlying LLM application were developed with Daniel Bukowski, Customer Success Architect at Neo4j.

Introduction

This blog is part of our Context is Everything series about using knowledge graphs and graph data science to ground LLM applications. In our prior blog we covered creating a high-quality grounding data set on a knowledge graph, including traditional NLP methods as well as graph-based approaches for understanding the grounding data set and identifying errors and outliers that may negatively impact LLM performance.

In this blog we will explore why logging user interactions within your application’s knowledge graph can create deep and meaningful insights that would otherwise be unavailable. As we’ll see, this is especially useful in retrieval augmented generation (RAG) applications.

To learn more about these traditional NLP and graph-based approaches for improving your grounding data set, along with how knowledge graphs and graph data science (GDS) can help you build a grounded LLM application, please watch our Road to NODES 2023 Workshop: GDS and Generative AI, presented on Thursday, October 5, 2023.

The Data Model

Knowledge Graph Data Model

Our knowledge graph has a basic data model, shown below. The details of creating this graph are covered in our second blog, but in short: each scraped website page is broken into chunks, and each chunk has a relationship connecting it to its source URL. Chunks also have KNN similarity relationships with each other based upon their text embeddings. Using these similarity relationships, we can apply graph data science algorithms to detect communities of text chunks and calculate PageRank scores that measure text importance.
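To make the KNN similarity relationships concrete, here is a minimal pure-Python sketch of what the algorithm does conceptually: for each chunk, find its k nearest neighbours by cosine similarity of embeddings and create an edge to each. In practice this runs inside Neo4j via the GDS KNN procedure; the function and data shapes below are illustrative only.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_similarity_edges(embeddings, k=2):
    """For each chunk, emit an edge to its k most similar neighbours.

    embeddings: {chunk_id: embedding vector} (illustrative shape).
    Returns a list of (source, target, score) tuples.
    """
    edges = []
    for src, vec in embeddings.items():
        scored = [
            (cosine_similarity(vec, other), dst)
            for dst, other in embeddings.items()
            if dst != src
        ]
        scored.sort(reverse=True)
        edges.extend((src, dst, round(score, 3)) for score, dst in scored[:k])
    return edges
```

The resulting edges are what become the SIMILAR-style relationships that community detection and PageRank then run over.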

Knowledge Graph Data Model

Knowledge Graph + Logging Data Model

Logging is implemented following the data model below. It adds nodes to track user sessions, conversation chains, and individual messages, on which we can store LLM parameters, message timestamps, and LLM response ratings, among many other properties. This lets us look under the hood of our chatbot and better understand both its behavior and how our users are interacting with it.
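As a rough sketch of what a logging write might look like, the function below builds a parameterized Cypher statement for one chat turn. The node labels (Session, Message), relationship type, and property names are hypothetical placeholders, not the exact schema from our application.

```python
def build_log_query(session_id, role, content, model, temperature, timestamp):
    """Build a parameterized Cypher write for one chat message.

    Labels and property names here are illustrative; adapt them to
    your own logging schema.
    """
    query = (
        "MERGE (s:Session {id: $session_id}) "
        "CREATE (m:Message {role: $role, content: $content, "
        "model: $model, temperature: $temperature, createdAt: $timestamp}) "
        "MERGE (s)-[:HAS_MESSAGE]->(m)"
    )
    params = {
        "session_id": session_id,
        "role": role,
        "content": content,
        "model": model,
        "temperature": temperature,
        "timestamp": timestamp,
    }
    return query, params
```

In a running application the query and parameters would be passed to a Neo4j driver session; keeping the parameters separate from the query string avoids injection issues and lets the database cache the query plan.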

Knowledge Graph + Logging Data Model

Analyzing The Logs

Visualizing Conversations

Let’s first visualize a conversation with our LLM chat agent. Below we can see a two-question interaction, with “user” indicating a question node and “assistant” indicating an LLM response node. Each “assistant” node has a HAS_CONTEXT relationship to the documents that were used to construct the response. Nearly all documents used in the first response belong to the same graph community (13065), identified by running the Label Propagation algorithm from the GDS library. In this message chain the user asks to clarify the first response, resulting in some documents being swapped out in the second response.

Graph of a conversation between a user and the ChatGPT-4 LLM.
Context Documents are labeled with their GDS Community.

We can also view document occurrences in a heat map like the one below. The x-axis represents the sequential LLM responses and the y-axis represents the most used documents. Here we see that most documents are accessed only once, while a select few are used multiple times. Interestingly, the x-axis shows that when documents are used multiple times, they tend to be used in non-sequential responses.
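The underlying data for such a heat map is just a document-by-response occurrence matrix. A small sketch of how one might be assembled from the logged HAS_CONTEXT relationships (here represented as plain Python sets, one per response):

```python
from collections import defaultdict

def document_usage_matrix(response_docs):
    """Build a document-by-response occurrence matrix.

    response_docs: list where item i is the set of document ids used
    by the i-th LLM response (an illustrative stand-in for the
    HAS_CONTEXT relationships queried from the graph).
    Returns {doc_id: [0/1 flag per response]}.
    """
    matrix = defaultdict(lambda: [0] * len(response_docs))
    for i, docs in enumerate(response_docs):
        for doc in docs:
            matrix[doc][i] = 1
    return dict(matrix)
```

Each row of the result is one y-axis entry of the heat map; scanning a row for multiple 1s that are not adjacent is exactly the non-sequential reuse pattern described above.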

Graph of a conversation about tuning a Cypher query between a user and the ChatGPT-4 LLM

Response Ratings: Messages

Users can rate each LLM response they receive. These ratings can then be analyzed to gain insight into response and document quality. For example, we can track the percent of positive ratings among all rated responses, as well as the percent of responses that have actually been rated. The former offers a broad picture of how well the LLM is operating, while a low value for the latter suggests there may be ways to make rating more intuitive for users.
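These two dashboard metrics reduce to a couple of lines of aggregation. A minimal sketch, assuming ratings come back from the graph as a flat list with None for unrated responses:

```python
def rating_metrics(responses):
    """Compute the two headline rating metrics.

    responses: list of ratings, one per LLM response, where each item
    is "positive", "negative", or None (unrated) -- an illustrative
    encoding of the rating property stored on response nodes.
    Returns (percent positive among rated, percent rated overall).
    """
    rated = [r for r in responses if r is not None]
    pct_rated = len(rated) / len(responses) if responses else 0.0
    pct_positive = (
        sum(r == "positive" for r in rated) / len(rated) if rated else 0.0
    )
    return pct_positive, pct_rated
```

In practice the same aggregation can be pushed down into a single Cypher query, but the logic is the same.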

Response Ratings section from a monitoring dashboard

Response Ratings: Documents

Since we log user interactions and context documents in the same knowledge graph, we can calculate implied document ratings based upon their connections to rated LLM responses. This is done by aggregating the ratings of all messages connected to a given document. Over the long term, documents that provide quality information will tend to have higher ratings, which can aid in optimizing the knowledge graph. For example, we can set thresholds for the number of times a document is used and its percent positive rating, then flag documents that may be contributing poor information. This is shown in the “Document Warnings” section of the dashboard below.
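The threshold-based flagging logic can be sketched as follows; the threshold values and the input shape are illustrative, not the ones from our dashboard:

```python
def document_warnings(doc_ratings, min_uses=5, min_positive=0.5):
    """Flag heavily used but poorly rated documents.

    doc_ratings: {doc_id: list of 1 (positive) / 0 (negative) ratings
    aggregated from the responses connected to that document}.
    A document is flagged only if it has been used at least min_uses
    times AND its positive-rating fraction is below min_positive.
    """
    flagged = []
    for doc, ratings in doc_ratings.items():
        if len(ratings) >= min_uses and sum(ratings) / len(ratings) < min_positive:
            flagged.append(doc)
    return flagged
```

Requiring a minimum usage count keeps a single unlucky rating from condemning an otherwise fine document; only documents with consistent negative signal surface as warnings.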

Document related sections from a monitoring dashboard

Community Analysis

The community labels we previously created with the GDS library allow us to gain additional insight into how our users are interacting with the LLM chat agent. These labels provide numeric topics that can help identify what our users are most interested in.

Community Analysis: Frequency

We can view the prevalence of specific communities in our LLM responses by analyzing document usage frequency grouped by community. Communities that are frequently accessed should be investigated further to ensure they don’t contain junk data, while those at the other end of the spectrum could be removed to streamline the knowledge graph. This step helps ensure that your data remains relevant to the tasks your LLM needs to handle.
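The grouping itself is a simple frequency count over the usage log. A sketch, assuming each document use is recorded as a (document, community) pair:

```python
from collections import Counter

def community_frequency(usage_log):
    """Rank communities by how often their documents were used.

    usage_log: list of (doc_id, community_id) pairs, one entry per time
    a document appeared in an LLM response's context (an illustrative
    flattening of the graph data).
    Returns [(community_id, count), ...] sorted most-used first.
    """
    counts = Counter(community for _, community in usage_log)
    return counts.most_common()
```

The head of this ranking is where to look for junk data; the tail is the candidate list for pruning.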

Most frequently used communities by the amount of times their documents were used to generate an LLM response

Community Analysis: LLM Responses

We can dig even deeper into our response messages by running the GDS FastRP embedding algorithm over our data. This lets us visualize communities across our LLM response messages.

  • First the LLM response nodes and context document nodes are projected into a bipartite graph.
  • Then GDS Node Similarity is used to create a similarity graph of the LLM response nodes.
  • Finally we can use the GDS FastRP algorithm to generate an embedding of the LLM response similarity graph.
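The middle step, Node Similarity, is essentially Jaccard similarity over shared context documents. The sketch below is a plain-Python stand-in for what GDS computes on the bipartite projection, useful for building intuition (the real algorithm runs in the database and scales very differently):

```python
def response_similarity(response_docs):
    """Jaccard similarity between LLM responses via shared documents.

    response_docs: {response_id: set of context document ids}, an
    illustrative encoding of the bipartite response->document graph.
    Returns {(response_a, response_b): similarity} for overlapping pairs.
    """
    ids = list(response_docs)
    sims = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            inter = len(response_docs[a] & response_docs[b])
            union = len(response_docs[a] | response_docs[b])
            if union and inter:
                sims[(a, b)] = inter / union
    return sims
```

The resulting weighted similarity graph is what FastRP then embeds, so responses that drew on overlapping context end up close together in the 2D plot.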

The visualization below is a 2D plot of the LLM response node FastRP embeddings, with:

  • Icon Color: the response node’s community
  • Icon Size: the response node’s PageRank

2D plot of LLM Response Similarity Graph FastRP Embeddings

This shows that our LLM’s responses have relatively distinct communities and are solid candidates for topic modeling.

Conclusion

In this blog we explored some of the many benefits of logging user interactions in the same database as the knowledge graph for retrieval augmented generation applications. The Neo4j Graph Data Science library allows us to gain unique insights from the resulting graph that are unobtainable in other database types. These insights reveal strategies for optimizing the knowledge graph itself, shed light on the behavior of the LLM, and allow us to better understand our users.
