Context is Everything: Analyzing Your Context Data using Knowledge Graphs and Graph Data Science

Daniel Bukowski
13 min read · Oct 2, 2023

Note: This article and the underlying LLM application were developed with Alexander Gilmore, Associate Consulting Engineer at Neo4j. Follow me on LinkedIn for daily posts.

Introduction

This is the second blog in our Context is Everything series about using knowledge graphs and graph data science to ground LLM applications. In our first blog we explained the importance of developing a high-quality grounding data set and outlined five important characteristics: relevant, augmenting, reliable, clean, and efficient. In this blog we will dig into creating a high-quality grounding data set on a knowledge graph. This includes traditional NLP methods as well as graph-based approaches you can use to understand the grounding data set while identifying errors and outliers that may negatively impact LLM performance.

To learn more about these traditional NLP and graph-based approaches for improving your grounding data set, along with how knowledge graphs and graph data science (GDS) can help you build a grounded LLM application, please join our Road to NODES 2023 Workshop: GDS and Generative AI on Thursday, October 5, 2023.

Creating the Initial Grounding Data Set

Graph Data Model

When building a grounding data set on a knowledge graph you will need to define a graph data model. Thankfully, the model for a grounding data set can be both simple and effective. One viable option is to simply load the chunks of data into Neo4j as individual nodes, without any predefined relationships. However, while viable, this approach does not take advantage of the benefits graphs offer.

Text Chunks as Unconnected Nodes

The next step up is to define relationships in the graph that trace where source documents originated from. For example, to create our grounding data set we scraped public Neo4j documentation along with relevant articles from the Neo4j Developer Blog and an unofficial Graph Data Science Support Github repo. We scraped websites using LangChain because we wanted our LLM application to provide specific citations with answers. Therefore, when we divided each document into chunks, we kept the original website associated with that chunk so that we could load it into the knowledge graph with a HAS_SOURCE relationship.

Data Model Connecting Documents to Source URLs

While simple, this data model begins to take advantage of some of the benefits of a graph. This is the model we used when building our grounded LLM application, and later in this blog we will demonstrate how we enhance it with GDS-based relationships.
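For readers who want to see what this looks like in practice, a minimal loading sketch is below. It assumes the Python driver, a local Neo4j instance, and a placeholder `chunks` list; your connection details and chunk schema will differ.

```python
# Minimal sketch: load text chunks and their source URLs into Neo4j
# with a HAS_SOURCE relationship. Connection details and the `chunks`
# list are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_CHUNKS = """
UNWIND $rows AS row
MERGE (u:URL {url: row.url})
MERGE (d:Document {id: row.id})
SET d.text = row.text
MERGE (d)-[:HAS_SOURCE]->(u)
"""

chunks = [
    {
        "id": "gds-docs-0",
        "text": "The Neo4j Graph Data Science library contains ...",
        "url": "https://neo4j.com/docs/graph-data-science/current/",
    },
]

with driver.session() as session:
    session.run(LOAD_CHUNKS, rows=chunks)
driver.close()
```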

There are more sophisticated graph data models that we could use for grounding data. For a discussion of these approaches we recommend Tomaz Bratanic’s article Knowledge Graphs & LLMs: Multi-Hop Question Answering.

Source Documents and Chunking

As mentioned above, we used LangChain to scrape public Neo4j documentation along with relevant articles from the Neo4j Developer Blog and an unofficial Graph Data Science Support Github repo. All of this content was created by Neo4j, so there were no copyright or other issues using it. We will not go into detail about scraping in this article other than to mention three points:

  • Make sure that the API or other tool you are using is appropriate for the data type. For example, a scraper set up for HTML pages may not work as well for Markdown files or Jupyter Notebooks.
  • You often end up with extra text and duplicate content that adds noise and reduces efficiency in the eventual grounding data. Graphs and GDS provide tools to address this, which we will discuss later in this article.
  • Confirm you are permitted to use scraped data for your application from both the perspective of the source content (i.e., copyright) and your organization's own data policies.
Photo by Yancy Min on Unsplash

For many organizations, grounding data will come from internal, non-public sources. For an example of ingesting these types of source documents we recommend our colleague Adam Cowley’s article where he started with Asciidoc format and wrote his own code to ingest this data into a knowledge graph. Adam also implemented a similar graph data model, as he is also referencing Neo4j documentation.

How you “chunk” the documents into smaller portions for embeddings is another factor to consider. Many other resources exist about this topic, including approaches and ideal chunk size, so we will only address a few points here:

  • Different types of source documents likely require different strategies. For example, unstructured text may be fine with a fixed-length chunking strategy. However, highly structured websites may be broken up by sections, while code files or notebooks may be broken up by class, function, or markdown headers. If you are ingesting data from multiple types of files, you may need a combination of approaches, as in the sketch after this list.
  • Grounded conversations, especially when “chained” together, can quickly consume even 32k or larger LLM token limits. Carefully consider the limit of the model you will be using, along with how many pieces of context you anticipate your users attaching, when deciding on a text chunk size. As with chunking strategies, the chunk size may also fluctuate depending on the type of source document.
  • Different models produce text embeddings of different lengths. If you are using an embedding model that outputs smaller embedding lengths, you may want to consider smaller text chunks to better capture nuance within the text.
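As a rough illustration of mixing strategies, the sketch below uses LangChain's RecursiveCharacterTextSplitter with a 512-character target. The splitter choices, separators, and overlap are examples rather than the exact configuration we used.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

raw_page_text = "..."      # scraped HTML page text (placeholder)
raw_markdown_text = "..."  # scraped Markdown file text (placeholder)

# Prose-like pages: fixed-length chunks with a small overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

# Markdown files: prefer header and paragraph boundaries before falling back to characters
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)

prose_chunks = text_splitter.split_text(raw_page_text)
markdown_chunks = markdown_splitter.split_text(raw_markdown_text)
```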

Statistical Analysis of Text Chunks

Exploring and understanding your data is a critical step in any data science project. It becomes even more important when the data is fundamentally changed in any way, whether chunked for embeddings or converted from unstructured or tabular into graph format. Almost every customer project I consult on has run into data quality issues that could have been identified and mitigated had traditional or graph EDA been performed up-front. This is no different when developing data sources to ground LLM applications.

Traditional NLP statistics are extremely useful for understanding the text chunks and identifying potential issues or outliers that could impact the quality of your grounding data. And as we will demonstrate, traditional NLP statistics become even more useful when combined with the output of GDS algorithms.

Text Chunk Character Length

One of the most useful and straightforward statistics to generate is the character lengths of the text chunks. We recommend running this immediately after implementing your chunking strategy (or strategies) and before generating embeddings. The statistic serves two main purposes:

  • Help you understand the statistical distribution(s) of your text chunks. If you implemented multiple chunking strategies, we recommend analyzing the output of each strategy individually, along with analyzing the entire corpus together.
  • Confirm that the chunking strategy produced the expected outputs. For example, when we were first generating text chunks we mistakenly used the same chunking strategy for different source document types which resulted in a distribution with significant outliers.
Document Chunk Length Distribution with Outliers

If we dig into the distribution above we see that multiple documents have a character length of over 15,000, and as high as 32,000 characters. This is because we applied a strategy that split on newline characters to markdown files that did not have any newline characters. Therefore, each of those documents was chunked into a single chunk. Running these statistics helped us immediately identify the error and correct the strategy.

We later re-ran our chunking strategy on the same corpus, but with a target length of 512. We also ensured that we split on appropriate characters, which resulted in the following distribution.

Text Length Distribution without Outliers

While there are still issues to investigate, including text chunks of a single character, this distribution is much improved and does not contain any chunks larger than our target of 512.
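A minimal sketch of this check, assuming `chunks` is the list of chunk dictionaries from the loading sketch earlier (each with a `text` key):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(chunks)
df["textLen"] = df["text"].str.len()

print(df["textLen"].describe())  # min / median / max quickly expose outliers
df["textLen"].hist(bins=50)
plt.xlabel("Chunk character length")
plt.ylabel("Count")
plt.show()
```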

Text Chunk Word Count

Another useful statistic is the word count per text chunk. This statistic can be particularly useful when analyzing different chunk types (i.e., text versus code) and for identifying outliers in the data. It is also a useful statistic to use when evaluating LLM performance on pieces of context of different types. When we ran the statistic and generated the distribution on our grounding data we identified the following:

Distribution of Text Word Counts

Overall the distribution looks generally healthy. However, it indicates that we have text chunks that contain a single word. We will keep this in mind as we generate and analyze GDS-based statistics on our text chunks.

Chunk Average Word Length

The final statistic we will calculate here is the average word length per chunk. As with the other statistics, this is highly useful for understanding different chunk types and for identifying errors and outliers in our data. When we calculate it on our context data it produces the following distribution:

Distribution of Text Average Word Length

The distribution is remarkably similar to the initial character text length distribution above that identified errors in our chunking strategy. What we see here is that most of our text chunks have an average word length of 10 or fewer characters. However, there are some at or near the maximum of 512, which further signals errors or other issues in the data that we will need to address.
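Continuing the hypothetical DataFrame from the character-length sketch above, the word count and an approximate average word length can be computed as follows:

```python
# Word count and (approximate) average word length per chunk,
# continuing the DataFrame from the character-length sketch above.
df["wordCount"] = df["text"].str.split().str.len()
df["avgWordLen"] = df["textLen"] / df["wordCount"].clip(lower=1)  # guard against empty chunks

df[["wordCount", "avgWordLen"]].hist(bins=50)
plt.show()
```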

GDS Algorithms and Grounding Data

As mentioned above, we will load the text chunks into the Neo4j knowledge graph using the simple graph data model: (d:Document)-[:HAS_SOURCE]->(u:URL). We can then use algorithms from the Graph Data Science (GDS) Library to gain further insights into the entire corpus. The three traditional statistics we calculated above were certainly helpful for identifying potential issues in our data, and combining them with outputs from GDS algorithms will help us gain a much richer understanding of our data, particularly how the individual text chunks relate to each other.

Persist K-Nearest Neighbors Relationships

One of the first GDS algorithms we will apply to the text data is K-Nearest Neighbors (KNN). Rather than use node embeddings like we normally do in GDS workflows, we will use the chunk text embeddings that we attached as properties to the Document nodes and set as an index using the new Neo4j Vector Index capability. After calculating the nearest neighbor similarity relationships we can write them to the graph along with the similarity score as a relationship property. As a result, each Document node will have K similarity relationships to its most similar Document ‘neighbors’ in the graph.

Note that writing these relationships to the graph is not an essential step. In large graphs this can result in tens or hundreds of millions of new relationships. Rather than writing to the graph, we can stream the relationships to a separate file or simply analyze them in-memory via the GDS projection.

Context Document KNN Similarity Graph

One factor to consider is the K to set in KNN. While there is no single answer to the best ‘K’, we recommend running the algorithm in ‘stats’ mode to see the distributions for different Ks. Our grounding data set is relatively small (14,500 nodes) and we set our K to 25. We will discuss this choice more in the context of running a Community Detection algorithm on the similarity graph.
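A sketch of the KNN step with the GDS Python client is below. The graph name, the `textEmbedding` property, and the parameter values are assumptions; adjust them to your own schema.

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Project Document nodes with their text embeddings into the in-memory graph catalog
G, _ = gds.graph.project(
    "documents",
    {"Document": {"properties": ["textEmbedding"]}},
    "*",
)

# Write a SIMILAR relationship to each node's 25 nearest neighbors,
# storing the similarity score as a relationship property
gds.knn.write(
    G,
    nodeProperties=["textEmbedding"],
    topK=25,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)
```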

Community Detection

Next we will run a Community Detection algorithm on the graph of Document similarities to find natural communities of closely connected documents. The Neo4j GDS Library contains several options, but Label Propagation (LPA) often performs well on this type of dense similarity graph particularly when the KNN similarity scores are written to the similarity relationships. After calculating the LPA communities we will write them back to the Document nodes for use in later analysis.
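A sketch of this step, continuing the assumptions and the `gds` client from the KNN example above:

```python
# Project the Document similarity graph, including the KNN scores as weights
G_sim, _ = gds.graph.project(
    "document-similarity",
    "Document",
    {"SIMILAR": {"properties": ["score"]}},
)

# Weighted Label Propagation; community ids are written back to the Document nodes
gds.labelPropagation.write(
    G_sim,
    relationshipWeightProperty="score",
    writeProperty="community",
)
```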

With LPA in particular there can be an inverse relationship between the K relationships and the number of identified communities. In our experimentation we found:

  • With K = 10 relationships, LPA identified approximately 575 communities.
  • With K = 25, LPA identified 160 communities.
  • With K = 100, LPA identified seven communities.

Given the size of our grounding data, K = 25 with 160 communities was an appropriate starting point. It also produced a reasonable distribution of community sizes, with most containing between 50 and 100 nodes:

Distribution of LPA Community Sizes

We encourage you to experiment to identify results that work best for your data and use case.

Centrality

We will also calculate a Centrality score for each Document node. It does not make sense to use Degree Centrality because each Document node would have 26 outgoing relationships (25 Similarity and one Source). Instead, we will calculate a Weighted PageRank score for each Document. This will identify the most influential Documents based upon the strength of their relationships to other influential Document nodes. The score can help us identify the most influential documents in the overall corpus (based upon relationships), as well as the most influential documents within each community. The PageRank scores will also be written to the Document nodes as new properties.
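Continuing the same similarity projection, the Weighted PageRank call might look like this (property names are assumptions):

```python
# Weighted PageRank over the similarity projection; scores are written
# back to the Document nodes as a `pageRank` property
gds.pageRank.write(
    G_sim,
    relationshipWeightProperty="score",
    writeProperty="pageRank",
)
```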

Node Embeddings

Finally we will calculate Node Embeddings for each of the Documents in our similarity graph using the FastRP algorithm. Node embeddings are different from the text chunk embeddings because they capture the topology and contents of the graph itself. They will also allow us to visually inspect our Documents, along with the other GDS-based statistics we calculated above. Because our graph is small we can write these embeddings to the nodes without issue, though with a larger graph we might stream them directly into Python or another environment.
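A sketch of the FastRP step, with an illustrative embedding dimension:

```python
# FastRP node embeddings on the similarity projection.
# The embedding dimension of 128 is an illustrative choice.
gds.fastRP.write(
    G_sim,
    embeddingDimension=128,
    relationshipWeightProperty="score",
    writeProperty="fastRP",
)
```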

Analyzing Grounding Data with GDS

After generating these statistics and embeddings using GDS algorithms, we can export them to our Python notebook to analyze in combination with the traditional NLP statistics we generated earlier.

Visualize Embeddings, Communities, and PageRank Together

First, we will use the FastRP node embeddings to visualize the LPA Communities and PageRank scores for each Document. Using T-SNE to compress the FastRP embeddings into two dimensions, we will represent the LPA Communities via color and the PageRank scores via size.
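A sketch of how such a plot could be produced, reusing the `gds` client and the properties written in the earlier examples:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Pull the GDS outputs back into a DataFrame
docs = gds.run_cypher("""
    MATCH (d:Document)
    RETURN d.fastRP AS fastRP, d.community AS community, d.pageRank AS pageRank
""")

# Compress the FastRP embeddings to two dimensions
coords = TSNE(n_components=2, random_state=42).fit_transform(
    np.array(docs["fastRP"].tolist())
)

plt.scatter(
    coords[:, 0],
    coords[:, 1],
    c=docs["community"],
    s=docs["pageRank"] * 100,  # scale PageRank so size differences are visible
    cmap="viridis",
    alpha=0.7,
)
plt.show()
```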

Document Node Embeddings with Community as Color and PageRank Score as Size

Note: The LPA Communities are integers due to how the algorithm operates. This results in a color gradient which can be more efficient to display than individual community colors alone. As the gradient shows, there is a spread among colors and communities which is sufficient for this high-level visualization.

The resulting plot provides several important insights into our data.

  • Clusters: We have good separation among communities. Document nodes are tightly clustered within the communities, which themselves are generally well separated from each other.
  • Communities: There is distinct color separation across the visualization. Where communities overlap, the colors are generally of a similar hue, indicating some similarity between those communities.
  • Influence: There is clear differentiation among the influence of Documents, as indicated by the size of the icons in the plot.

While we still need to dig into the communities more, this visualization indicates that we have meaningful, distinct communities in our data set and there is clear differentiation in the importance of individual text chunks within the data. If the visualization looked scattered, like a bowl of colorful cereal, we would have to re-evaluate the quality of our text chunks to understand whether errors or other issues were reducing their meaningfulness.

Analyze Communities

Next we will construct a DataFrame that enables us to use the traditional and GDS-based statistics to further investigate the document communities. We will do so by aggregating on the individual community and calculating the following features:

  • Size: Count of Documents in the community
  • med_textLen: Median text length of Documents in the community
  • med_wordCount: Median word count of Documents in the community
  • med_avgWordLen: Median average word length of Documents in the community
  • med_pageRank: Median PageRank score of Documents in the community

Once we have this DataFrame we can filter and sort it to provide additional insights into the communities. For example, the following is a table of the communities and their median statistics sorted by size.
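A sketch of this aggregation, assuming the property names from the earlier examples:

```python
# Pull per-chunk text and GDS outputs, then aggregate by community
stats = gds.run_cypher("""
    MATCH (d:Document)
    RETURN d.text AS text, d.community AS community, d.pageRank AS pageRank
""")
stats["textLen"] = stats["text"].str.len()
stats["wordCount"] = stats["text"].str.split().str.len()
stats["avgWordLen"] = stats["textLen"] / stats["wordCount"].clip(lower=1)

community_df = (
    stats.groupby("community")
    .agg(
        size=("community", "count"),
        med_textLen=("textLen", "median"),
        med_wordCount=("wordCount", "median"),
        med_avgWordLen=("avgWordLen", "median"),
        med_pageRank=("pageRank", "median"),
    )
    .sort_values("size", ascending=False)
)
print(community_df.head(10))
```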

Document Communities Sorted by Size

Communities of Outliers

If we instead sort by median Average Word Length we get a different perspective on the communities.

Document Communities Sorted by Median Average Word Length

Based upon the above table, text in six of our communities has an average word length of 511 or 512 characters, which almost certainly indicates an error or other quality issue in the data. If we investigate one of the communities we can see that this is the case.
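To inspect a suspect community, we can pull a sample of its chunks and their source URLs; the community id below is a placeholder.

```python
suspect = gds.run_cypher(
    """
    MATCH (d:Document)-[:HAS_SOURCE]->(u:URL)
    WHERE d.community = $community
    RETURN d.text AS text, u.url AS url
    LIMIT 10
    """,
    params={"community": 1234},  # placeholder community id
)
print(suspect)
```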

Community of Outlier Text

Digging deeper we can see this text originates from a Markdown file posted on a Github page. If using similar files in the future we will want to adjust our scraping and chunking strategy or explore other options to reduce the likelihood of generating this messy text.

Conclusion

In this blog we explored how to build a knowledge graph of grounding text and explore it using traditional statistics and the GDS library. We also detailed how to evaluate the quality of the text embeddings and use communities to identify potential data quality issues. When building a grounded LLM application, the quality of the grounding text provided to the model is critical to its performance. As this blog demonstrated, Neo4j and Graph Data Science are uniquely suited to build and maintain high-quality grounding data sets. In future posts we will also demonstrate how Neo4j and Graph Data Science can enable you to visualize, understand, and analyze your entire LLM application in ways that are not possible with other databases.

To learn more about these traditional NLP and graph-based approaches for improving your grounding data set, along with how knowledge graphs and graph data science (GDS) can help you build a grounded LLM application, please join our Road to NODES 2023 Workshop: GDS and Generative AI on Thursday, October 5, 2023.


Daniel Bukowski

Graph Data Science Scientist at Neo4j. I write about the intersection of graphs, graph data science, and GenAI. https://www.linkedin.com/in/danieljbukowski/