Building Knowledge Graph over a Codebase for LLM

Zimin Chen
7 min read · Jun 5, 2024

Enabling Large Language Models (LLMs) to reason over selected code snippets within their context size is relatively straightforward. By copying and pasting the code snippet with some prompt engineering, LLMs often handle the request quite well. For example, GitHub Copilot Chat already allows you to generate unit tests, fix bugs, or explain a manually selected code snippet. However, how can we extend this to the scope of the whole codebase? Can we also skip the manual selection of code snippets and let the LLM figure it out? Building a knowledge graph over the codebase has the potential to tackle these challenges.

In this post I would like to discuss why and how we can build a knowledge graph over a codebase for an LLM to reason over. The knowledge graph can be used in Retrieval-Augmented Generation (RAG) or coding agents for better context retrieval or for understanding large codebases. I will discuss the concepts with the help of the following example: a simple Python codebase with “requirements.txt” defining the dependencies, “README.md” describing the project, and a “src/” directory containing the source code. The application converts time between different time zones, and the codebase is structured as follows:

TimeUtils/
├── src/
│   ├── time_app.py
│   └── time_utils.py
├── requirements.txt
└── README.md

First I will discuss the Why, followed by the How.

Why Build a Knowledge Graph over a Codebase?

The short answer is that source code differs fundamentally from natural language — it has a specific structure and is executable. Treating source code merely as text and using approaches designed for textual data may not be optimal. Files within a codebase have inherent structures and dependencies among them. Moreover, it is challenging to reason over an entire codebase where a single file might exceed the context size limit of an LLM.

Let me illustrate the point by building a RAG chatbot (a deliberately oversimplified one) over the simple codebase that we have. I will not cover the basics of how RAG works.

RAG with TextSplitter

In this section, we will only discuss the indexing step of the RAG chatbot, i.e., how we split the data and store it in a database. Splitting the data is necessary because the retrieved chunks must fit within the LLM’s context size during the generation step.

In the time_utils.py file, we have:

from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

If we simply treat source code as text, we can use a text splitter (such as CharacterTextSplitter in LangChain) that splits the text based on characters. It might split the source code into:

############################ Chunk 1 ############################
from datetime import datetime
import pytz

def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime
############################ Chunk 2 ############################
.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = date
############################ Chunk 3 ############################
time.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

Because the splitter has no notion of what a function is or of how programming languages are structured, each chunk does not represent any meaningful source code structure. Therefore, if we ask a RAG chatbot based on TextSplitter “What does the convert_time_to_timezone function do?”, it will not be able to answer the question correctly, because the “convert_time_to_timezone” function is divided into two chunks.
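For reference, here is a minimal sketch of such character-based splitting, assuming LangChain’s CharacterTextSplitter (the exact chunk boundaries depend on the separator, chunk_size, and chunk_overlap you choose; the values below are purely illustrative):

from langchain_text_splitters import CharacterTextSplitter

# Read the source file as plain text.
with open("src/time_utils.py") as f:
    code = f.read()

# Split purely by character count; the splitter has no notion of Python syntax.
# An empty separator makes it cut anywhere, even in the middle of an identifier.
splitter = CharacterTextSplitter(separator="", chunk_size=200, chunk_overlap=0)
for i, chunk in enumerate(splitter.split_text(code), start=1):
    print(f"#### Chunk {i} ####")
    print(chunk)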

RAG with CodeSplitter

Of course, there are also splitters specifically designed for source code, such as CodeSplitter in LlamaIndex. They are built upon parsers that generate an Abstract Syntax Tree (AST) of the source code. An AST is a tree data structure that represents a program. Below is an example taken from Wikipedia:

while b ≠ 0:
    if a > b:
        a := a - b
    else:
        b := b - a
return a

The Abstract Syntax Tree (AST) of the shown code snippet.
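You can inspect such a tree yourself with Python’s built-in ast module (Python 3.9+ for the indent argument); here is a small sketch using a Python version of the snippet above:

import ast

# A Python version of the Euclidean-algorithm snippet shown above.
code = """
def gcd(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a
"""

# Parse the source into an AST and pretty-print its tree structure.
tree = ast.parse(code)
print(ast.dump(tree, indent=2))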

For splitting the source code, the AST can be used to identify the exact boundaries of each source code component (classes, functions, …). Using the CodeSplitter, we get the following chunks:

############################ Chunk 1 ############################
from datetime import datetime
import pytz

############################ Chunk 2 ############################
def get_current_utc_time():
    """Returns the current UTC time as a formatted string."""
    utc_now = datetime.now(pytz.utc)
    return utc_now.strftime("%Y-%m-%d %H:%M:%S")

############################ Chunk 3 ############################
def convert_time_to_timezone(timezone_str):
    """Converts the current UTC time to a specified timezone."""
    try:
        local_zone = pytz.timezone(timezone_str)
        local_time = datetime.now(local_zone)
        return local_time.strftime("%Y-%m-%d %H:%M:%S")
    except pytz.exceptions.UnknownTimeZoneError:
        return "Invalid timezone."

Now if we ask a RAG chatbot based on CodeSplitter about “What does the convert_time_to_timezone function do?”, it should be able to answer it correctly.

However, even when splitting source code into structurally meaningful chunks, we face the challenge of losing information about the relationships within and between these chunks. A RAG system that processes source code in this fragmented manner will struggle with queries such as “How many functions are defined in the time_utils.py file?” or “List all files where the variable local_zone is used”. The answers to these questions are not necessarily located in a single chunk, function, class, or file. To effectively answer such queries, we must be able to reason across the entire codebase. This is precisely where constructing a knowledge graph over the codebase becomes invaluable.

How Do We Build a Knowledge Graph over a Codebase?

Building a graph over a codebase is not a novel concept in software engineering. Various static and dynamic analyses already build graphs over source code for machine code optimization or vulnerability detection. Even building graphs for machine learning models on source-code-related tasks was explored long before LLMs were introduced [1–3].

Simple Knowledge Graph

Here, we showcase how a simple knowledge graph over a codebase can be built that allows an LLM to reason over the whole codebase. In the example graph, blue nodes represent files/directories and green nodes represent AST nodes. Between file nodes, HAS_FILE edges connect a parent directory to its child files. Between file nodes and AST nodes, HAS_AST edges connect a source code file to its root AST node. Between AST nodes, PARENT_OF edges connect parent and child AST nodes. We get the following graph:

Knowledge graph for our example codebase. Blue nodes represent files/directories; green nodes represent AST nodes.
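As a rough sketch of how such a graph could be constructed, the following uses Python’s built-in ast module and emits a single Cypher CREATE statement for Neo4j. The helper names are my own, and because Python’s ast module is used here, node types come out as names like FunctionDef and Name; a tree-sitter based parser would instead produce names like function_definition and identifier, which is what the generated queries below assume.

import ast
from pathlib import Path

def graph_patterns(root: Path):
    """Build Cypher patterns mirroring the schema above:
    (:file)-[:HAS_FILE]->(:file), (:file)-[:HAS_AST]->(:ast),
    (:ast)-[:PARENT_OF]->(:ast)."""
    patterns, counter = [], 0

    def next_var():
        nonlocal counter
        counter += 1
        return f"n{counter}"

    def add_ast(parent_var, node):
        var = next_var()
        props = f"type: '{type(node).__name__}'"
        if isinstance(node, ast.Name):  # keep identifier text so it can be queried
            props += f", text: '{node.id}'"
        patterns.append(f"({var}:ast {{{props}}})")
        patterns.append(f"({parent_var})-[:PARENT_OF]->({var})")
        for child in ast.iter_child_nodes(node):
            add_ast(var, child)

    def add_path(parent_var, path: Path):
        var = next_var()
        patterns.append(f"({var}:file {{filename: '{path.name}'}})")
        if parent_var is not None:
            patterns.append(f"({parent_var})-[:HAS_FILE]->({var})")
        if path.is_dir():
            for child in sorted(path.iterdir()):
                add_path(var, child)
        elif path.suffix == ".py":
            module = ast.parse(path.read_text())
            root_var = next_var()
            patterns.append(f"({root_var}:ast {{type: 'Module'}})")
            patterns.append(f"({var})-[:HAS_AST]->({root_var})")
            for child in ast.iter_child_nodes(module):
                add_ast(root_var, child)

    add_path(None, root)
    return patterns

# One CREATE statement that can be pasted into the Neo4j browser.
print("CREATE\n  " + ",\n  ".join(graph_patterns(Path("TimeUtils"))))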

This graph can be stored in a graph database such as Neo4j, and our previous two questions, “How many functions are defined in the time_utils.py file?” and “List all files where the variable local_zone is used”, can be answered by an LLM capable of generating Cypher queries like the following:

Generated Cypher:
MATCH (f:file {filename: 'time_utils.py'})-[:HAS_AST]->(root:ast)
MATCH (root)-[:PARENT_OF*]->(func:ast {type: 'function_definition'})
RETURN count(func) AS NumberOfFunctions
Full Context:
[{'NumberOfFunctions': 2}]

{'input': 'How many functions are defined in the time_utils.py file?',
 'output': 'There are 2 functions defined in the time_utils.py file.'}

Generated Cypher:
MATCH (f:file)-[:HAS_AST]->(root:ast)
MATCH (root)-[:PARENT_OF*]->(func:ast {type: 'identifier', text: 'local_zone'})
RETURN f
Full Context:
[{'f': {'filename': 'time_utils.py'}}]

{'input': 'List all files where the variable local_zone is used.',
 'output': 'The variable local_zone is used in the time_utils.py file.'}

In this way, the LLM generates Cypher queries based on user questions and uses the results to respond. These queries can be generated by the LLM using LangChain’s GraphCypherQAChain. Moreover, if you build a RAG agent on top of the knowledge graph and prompt it to query the graph database multiple times when necessary, it can chain multiple Cypher queries, allowing you to ask even more complex questions about the codebase. This shows the power of a knowledge graph, which stores the relationships between and within files and source code, allowing us to answer questions that a text-splitting-based RAG chatbot cannot.
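For completeness, here is a minimal sketch of that wiring, assuming LangChain’s Neo4jGraph and GraphCypherQAChain (connection details and model name are placeholders):

from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

# Placeholder connection details for the Neo4j instance holding the graph.
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

# The chain asks the LLM to translate a question into Cypher, runs it against
# the graph, and phrases the query result as the final answer.
chain = GraphCypherQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # required by recent LangChain versions, since generated Cypher is executed
)

print(chain.invoke({"query": "How many functions are defined in the time_utils.py file?"}))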

More Advanced Knowledge Graph

The knowledge graph can be extended with other types of graphs used in static analysis, such as data flow or control flow graphs [4]. Even more interesting is to incorporate run-time data (or data from dynamic analysis), such as coverage from testing [5]. Incorporating this information gives LLMs even greater potential to solve more difficult problems at the codebase level.
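As a small illustration of that direction, here is a sketch that extracts direct call relations with Python’s ast module; the CALLS edge type is my own naming, added on top of the schema described earlier, and this is only a rough approximation of a static call graph:

import ast

def call_edges(source: str):
    """Yield (caller, callee) pairs for direct calls inside each function."""
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    if isinstance(node.func, ast.Name):
                        yield func.name, node.func.id
                    elif isinstance(node.func, ast.Attribute):
                        yield func.name, node.func.attr

with open("src/time_utils.py") as f:
    code = f.read()

# Each pair could become a CALLS edge between the corresponding AST nodes.
for caller, callee in call_edges(code):
    print(f"{caller} -[:CALLS]-> {callee}")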

But one thing to keep in mind is that while these tools bring the benefit of additional information in the knowledge graph, they come at the cost of making the approach language-dependent. Most of these tools only work for a limited set of programming languages and require complex environment setups, making the approach less general.

Use cases for Knowledge Graph over Codebase

If we had a knowledge graph over a codebase, in what ways could we utilize it? Here are some of my thoughts:

  • Better context retrieval for RAG over a codebase, which I tried to illustrate with the example.
  • Large codebase understanding for coding agents. The knowledge graph can be included in coding agents to improve generation [6], editing [7], and completion [8] performance at the codebase level.
  • In general, I think it is useful for enabling or improving many codebase-related tasks, which is currently an emerging topic in this area (you can find many papers published in 2023/2024 by searching “repository level coding” or “repository level code llm” on Google Scholar).

References

[1] Allamanis, Miltiadis, Marc Brockschmidt, and Mahmoud Khademi. “Learning to represent programs with graphs.” arXiv preprint arXiv:1711.00740 (2017).

[2] Yasunaga, Michihiro, and Percy Liang. “Graph-based, self-supervised program repair from diagnostic feedback.” International Conference on Machine Learning. PMLR, 2020.

[3] Chen, Zimin, et al. “PLUR: A unifying, graph-based view of program learning, understanding, and repair.” Advances in Neural Information Processing Systems 34 (2021): 23089–23101.

[4] Wang, Xin, et al. “CODE-MVP: Learning to represent source code from multiple views with contrastive pre-training.” arXiv preprint arXiv:2205.02029 (2022).

[5] Lou, Yiling, et al. “Boosting coverage-based fault localization via graph-based representation learning.” Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2021.

[6] Luo, Qinyu, et al. “RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation.” arXiv preprint arXiv:2402.16667 (2024).

[7] Bairi, Ramakrishna, et al. “CodePlan: Repository-level coding using LLMs and planning.” arXiv preprint arXiv:2309.12499 (2023).

[8] Phan, Huy N., et al. “RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion.” arXiv preprint arXiv:2403.06095 (2024).
