How to generate accurate LLM responses on large code repositories: Presenting CGRAG, a new feature of dir-assistant

Chase Adams
16 min read · May 25, 2024


Have you ever wanted to use an LLM to create code for a large repository but realized that the LLM would have to know about a huge web of intricate inner workings and dependencies to create a useful code snippet? I certainly have. Below is my story of creating a better way to handle this problem for large projects.

When large context windows changed everything

I now use Codeium (like GitHub Copilot but better) and ChatGPT every day to improve my productivity writing code for a medium-sized Python/Django-powered website. However, the sheer amount of contextually important code, and how widely it is dispersed across the codebase, often prevent these tools from being useful, and I must write code by hand (woe is me) to implement a feature.

When CodeQwen 1.5 came out in April with its highly accurate 64k token context length, I became curious about how well it could answer questions on codebases that fit within its context. I created a small experiment called dir-assistant using the fantastic llama-cpp-python package that would recursively traverse a directory’s files and add them all to the context of a local LLM (in this case, CodeQwen 1.5). The results were excellent for projects that fit inside the context. However, with only 64k tokens, this would amount to no more than a few thousand lines of code, so it was only a good fit for small projects.
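
For the curious, the core of that first experiment can be sketched in a few lines. This is an illustrative reconstruction rather than dir-assistant’s actual source; the model path, prompt format, and helper names are assumptions.

```python
import os
from llama_cpp import Llama  # llama-cpp-python

# Placeholder model path; CodeQwen 1.5 supports a 64k-token context.
llm = Llama(model_path="codeqwen-1_5-7b-chat.Q8_0.gguf", n_ctx=65536)

def read_repo(root):
    """Recursively collect the text of every readable file under `root`."""
    parts = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    parts.append(f"--- {path} ---\n{f.read()}")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
    return "\n\n".join(parts)

def ask(root, question):
    # Only works when the whole repository fits inside n_ctx.
    context = read_repo(root)
    response = llm.create_chat_completion(messages=[
        {"role": "system", "content": f"Repository contents:\n{context}"},
        {"role": "user", "content": question},
    ])
    return response["choices"][0]["message"]["content"]
```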

Despite this, I was so impressed by how excellent the results were when it did work (better and easier to attain than with my existing tools) that I began thinking about how to make the approach useful for larger projects.

RAG to the rescue?

The obvious answer was RAG, or Retrieval Augmented Generation, a well-loved technique that creates an index of snippets of your text files (or any corpus of text) and then uses the magic of LLM embeddings to find the sections of text most closely associated with the user’s prompt. This allows you to stuff a limited context window with only the most important pieces of information, and hopefully those pieces give the LLM enough context to solve the user’s prompt.
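
As a rough sketch of what that looks like in practice (not dir-assistant’s exact implementation; the chunking granularity, helper names, and model choice are assumptions), the index is just an array of embedding vectors and retrieval is a nearest-neighbor search:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# GTE Large is used here because it is the embedding model mentioned later
# in this post; any sentence-embedding model would work the same way.
embedder = SentenceTransformer("thenlper/gte-large")

def build_index(chunks):
    """Embed every text chunk once, up front."""
    return np.asarray(embedder.encode(chunks, normalize_embeddings=True))

def retrieve(prompt, chunks, index, top_k=20):
    """Return the chunks whose embeddings are closest to the prompt's."""
    query = embedder.encode([prompt], normalize_embeddings=True)[0]
    scores = index @ query  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]
```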

I quickly implemented RAG, and the new methodology did allow dir-assistant to become useful on projects roughly quadruple the size of the context window, but it was still highly limited on large codebases.

The ultimate problem is that in RAG, the embedding model doesn’t know what you haven’t provided it. In other words, if you ask it “write code to make all selected dates red on calendars at checkout”, the embedding model is aware of the words in that prompt and nothing else.

The embedding vector of that prompt will be close only to the words in the prompt itself, such as “dates”, “red”, “calendars”, and “checkout”, but not necessarily to the contextually relevant concepts needed to solve the problem.

In this case, the RAG might find HTML with dates, the checkout page, and some server-side code related to the calendar, but these aren’t where the change needs to happen. It might find several other CSS classes that use a lot of “red”. It might also find a completely different datepicker implementation than the one in question.

The specific bit of code we need in this example is the CSS class used in the HTML rendered by the JavaScript datepicker on the checkout page template. That CSS class doesn’t have as close an embedding proximity to “dates”, “red”, “calendars”, and “checkout” as other bits of code do, so RAG doesn’t find it and the LLM isn’t able to give you an accurate result.

Introducing CGRAG

Fortunately, we already have all the tools to solve this issue at our fingertips. Given an initial shot in the dark by the RAG at finding contextual information for the LLM, the LLM likely then has enough awareness of the context of the user’s prompt to identify what information it doesn’t have that it needs to solve the problem.

In the example above, perhaps the RAG was able to identify the datepicker JS, and perhaps the datepicker JS mentions the CSS class used to represent the selected dates. If so, the LLM will be able to identify that it needs the definition of that CSS class to give a good response. One simply needs to ask it.

CGRAG stands for Contextually-Guided Retrieval Augmented Generation. What a mouthful, but in essence the process extends normal RAG by running the LLM completion twice.

The first time it runs, the prompt sent to the LLM asks it to identify which concepts are needed to answer the user’s question. An initial batch of RAG snippets generated from the user’s prompt alone is included to help the LLM identify which concepts are relevant.

The second time the LLM completion is run, the prompt includes RAG results generated from the response of the first run.

The basic gist is that we first use the LLM to generate better, more precise keywords that the RAG’s embedding model can use to create an embedding vector closer to relevant matches. The LLM is then run again with the more relevant information the RAG found, hopefully producing a more accurate response.
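
Put in code, the whole loop is only a few lines. This sketch reuses the `retrieve` helper from the earlier RAG snippet and a generic `llm_complete` callable standing in for whichever completion API is in use; both names and the guidance wording are illustrative, not dir-assistant’s exact prompts.

```python
GUIDANCE_PROMPT = (
    "List the concepts, files, functions, and identifiers that would be "
    "needed to fully answer the following request:\n\n{prompt}"
)

def cgrag_answer(prompt, chunks, index, llm_complete):
    # Pass 1: seed the LLM with a rough batch of RAG snippets retrieved
    # from the user's prompt alone, and ask it what information it needs.
    first_snippets = retrieve(prompt, chunks, index)
    guidance = llm_complete(
        context="\n\n".join(first_snippets),
        question=GUIDANCE_PROMPT.format(prompt=prompt),
    )

    # Pass 2: retrieve again using the guidance as richer, more precise
    # keywords, then answer the original prompt with those snippets.
    second_snippets = retrieve(prompt + "\n\n" + guidance, chunks, index)
    return llm_complete(
        context="\n\n".join(second_snippets),
        question=prompt,
    )
```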

Using CGRAG to solve real problems

After coming up with this idea, it didn’t take long to implement in dir-assistant. Here are some of my first results. They were generated using Gemini 1.5 Flash with a 500,000-token context and GTE Large with an 8,096-token context as the embedding model. This means each of the RAG’s artifacts is around 5,000 words long, and around 80 of them can be included in the LLM’s context.
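
As a back-of-the-envelope check on those numbers (the average chunk size and the words-per-token ratio here are assumptions; real chunks vary):

```python
llm_context_tokens = 500_000   # Gemini 1.5 Flash context used for these runs
chunk_limit_tokens = 8_096     # embedding model's per-chunk limit
avg_chunk_tokens = 6_000       # assumed average; most chunks are not maximal

print(llm_context_tokens // avg_chunk_tokens)   # roughly 80 chunks fit
print(int(avg_chunk_tokens * 0.75))             # ~4,500-5,000 words per chunk
```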

Example #1: Fixing a Django Project Bug Ticket

Being a lover of Django, I decided one of my first tests of dir-assistant’s CGRAG on a complex issue in a large codebase with verifiable solutions would be asking it to solve real bug tickets from the Django project.

Django’s repository has 10,020 files, is 41 MB in size, and contains around 800,000 lines of code. This means we can include only roughly 0.8% of the repository in Gemini 1.5 Flash’s huge 500k-token context, so accurate RAG results are critical to the LLM giving a good answer.

Here is the first ticket I picked: High CPU/memory consumption when a 5XX is raised with large local variables

Below is my first and only run of generating a solution for this ticket. I simply ran dir-assistant from the Django git directory, and pasted the ticket description in with a couple extra sentences to define what I want the LLM to do. The rest is magic. In the output below, “CGRAG Guidance” is the output of the first LLM completion and “Assistant” is the final response.

You: 

I have a bug in the django source code. In a view with a large variable in its
scope, if an exception is raised, the worker will freeze and its memory usage
will increase. Here is a minimal example:

    import sys
    from django.urls import path

    # 300MB
    large_data = list(range(40 * 1024 * 1024))

    def compute_lines(request):
        local_data = large_data
        raise ValueError

    urlpatterns = [path("compute-lines/", compute_lines)]

When calling /compute-lines/ you will notice the issue: the server will take a
few seconds to respond while its memory rises by ~1GB. How can I modify the
django source code to fix this issue?

CGRAG Guidance:

Here is a list of information and concepts that are relevant to answering the user's prompt:

* **Exception Handling in Django:** How Django handles exceptions in views and how it generates error responses, including the use of `ExceptionReporter` and `technical_500_response`.
* **Django's `ExceptionReporter`:** The `ExceptionReporter` class is responsible for collecting and formatting traceback information for exceptions that occur in views. It includes functionality for filtering sensitive data and generating HTML and text representations of the traceback.
* **Django's `technical_500_response`:** This function is used by Django to generate a detailed error response when an exception occurs, providing information about the traceback, request details, and the server environment.
* **Memory Usage in Django Traceback Generation:** The user is specifically concerned about high memory usage during traceback generation, particularly when dealing with large variables in the view's scope.
* **Django's `SafeExceptionReporterFilter`:** The `SafeExceptionReporterFilter` class is responsible for filtering sensitive data from the traceback before it is displayed in the error response. This class should be examined to understand how filtering is implemented and how it might be impacting memory usage.
* **Python Memory Management:** Understanding how Python manages memory, including object references, garbage collection, and the potential for memory leaks due to circular references or holding onto large objects unnecessarily.
* **`SimpleLazyObject`:** This object is used by Django to delay the evaluation of objects until they are needed. It could be involved in the memory leak scenario if the object contains a reference to a large variable that is never garbage collected.
* **`repr()` and Object Representation:** The `repr()` method is used to generate a string representation of an object. Understanding how this method works is important for debugging memory usage, as large object representations can contribute to memory bloat.
* **Django's Template Rendering:** How Django's template rendering process works and whether it could be contributing to the memory issue, particularly if the traceback involves templates.
* **Django's `ExceptionReporter`'s `get_traceback_html()` method:** This method is responsible for generating the HTML representation of the traceback, which is displayed in the error response.
* **Django's `ExceptionReporter`'s `get_traceback_frames()` method:** This method is responsible for iterating through the traceback frames and extracting relevant information for each frame. This is a potential point of contention for memory leaks if circular references are present in the traceback frames.
* **Django's `sys.exc_info()`:** The `sys.exc_info()` function returns information about the current exception, including the type, value, and traceback.
* **Django's Test Client:** The user is using the Django test client to reproduce the issue. Understanding how the test client handles exceptions is important for isolating the problem.
* **Django's `technical_500_response` Function:** This function is responsible for generating the detailed error response for uncaught exceptions.

**Important Notes:**

* **Large Variable Handling:** The user's example shows a large variable in the view's scope, potentially leading to memory issues when this variable is included in the traceback.
* **Circular References:** Circular references between objects in the traceback can prevent garbage collection and lead to memory leaks. The `ExceptionReporter`'s `get_traceback_frames()` method should be examined for potential circular references.
* **Memory Profiling:** The user is observing a memory increase. The debugging process may involve using Python's memory profiling tools (e.g., `tracemalloc`) to identify the source of the memory leak.


Assistant:

You've identified a potential memory leak in Django's exception handling when dealing with large variables. This is indeed a problem, and it's not specifically addressed by Django's current exception handling logic. The memory usage is likely due to how the `traceback` module captures the local variables in the stack frame where the exception occurs.

Here's how you can modify the Django source code to potentially fix this issue:

**1. Identify the culprit:** The core of the problem lies within the `django.views.debug.ExceptionReporter` class, specifically the `get_traceback_frames` method. Here's a simplified version:

```python
def get_traceback_frames(self):
    frames = []
    exceptions = list(self.iter_exception_cause_chain())
    # ... rest of the code ...
    while tb is not None:
        # ... code to get frame information ...
        # This is the culprit!
        variables = self.filter.get_traceback_frame_variables(
            self.request, tb.tb_frame
        )
        # ... rest of the code ...
```

The issue is that `self.filter.get_traceback_frame_variables` is capturing all local variables, including the large one (in your case, `large_data`), leading to memory bloat.

**2. Implement a solution:**

* **Selective variable capture:** You can modify the `get_traceback_frame_variables` method to selectively capture only variables relevant to debugging (e.g., function names, file names, line numbers).
* **Variable size limits:** You could add logic to the `get_traceback_frame_variables` method that checks the size of each variable. If a variable exceeds a certain size threshold, you can skip it.
* **Object references:** Instead of directly capturing the entire `large_data` list, you could capture only a reference to it. This would prevent the entire list from being copied into the traceback.

**3. Apply the change:**

* Make your changes to the `get_traceback_frames` method in the `django.views.debug.ExceptionReporter` class.

**Example (Selective variable capture):**

```python
def get_traceback_frame_variables(self, request, tb_frame):
    # Only capture the following variables:
    # "locals" will contain all local variables (this is still a problem)
    locals_ = tb_frame.f_locals.copy()
    return [
        ('locals', locals_),
        ("func_name", tb_frame.f_code.co_name),
        ("filename", tb_frame.f_code.co_filename),
        ("lineno", tb_frame.f_lineno),
    ]
```

**Note:**

* Be very careful when modifying Django's core code. It's important to test your changes thoroughly to ensure they don't break existing functionality.
* This problem might not be specific to your example. Any situation where you have a very large variable in the scope of a function that raises an exception could trigger this behavior.

**Alternative approaches:**

* **Consider using a more lightweight traceback library:** You could look for alternatives to the `traceback` module that are more memory efficient when handling large variables.
* **Profile your code:** Use profiling tools to identify other areas of your Django application that might be contributing to excessive memory usage.

Remember that the best solution might depend on your specific needs and the size of the variables involved. The modifications above are just potential starting points.

If you scroll down the ticket’s comments, the actual solution is listed. The LLM’s answer is wrong, but the real fix is only 5 lines away from the lines the LLM pointed out.

My own approach to fixing this bug would be to profile the code and find the function that is blocking. I would never in a million years try to read through all the code and think about each line and whether it might be the culprit.
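
For reference, here is roughly what that profiling pass might look like, using the standard library’s tracemalloc against the reproduction from the ticket (the URL and test-client setup come from the bug report; the rest is an illustrative sketch you would run from a Django shell or test with the project configured):

```python
import tracemalloc
from django.test import Client

tracemalloc.start()

# Trigger the 500 error without letting the test client re-raise it, so the
# full error-reporting path (and its memory cost) is actually exercised.
Client(raise_request_exception=False).get("/compute-lines/")

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # the heaviest allocations point at the suspect functions
```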

It’s incredible to me that the LLM was able to pinpoint the problem area in one shot out of such a large repository while being provided only a tiny fraction of it. The result would certainly be a time saver in narrowing down a suspect area if I were actually fixing this bug. To me, it is an impressive accomplishment for CGRAG to achieve such high accuracy on such a large corpus of documents, and with a relatively low-power LLM at that.

The results are so good that I wondered whether Gemini Flash was cheating because it was trained on the Django source code, so I asked Flash to show some Django source code without the repo’s files in context. It is pretty clear Flash doesn’t want to admit it was trained on the inner workings of Django:

You: 

show me the source code of django.models.Model

Assistant:

I do not have access to the source code of Django's internal libraries, including `django.models.Model`. This kind of information is typically not publicly available due to:

* **Internal implementation details:** The source code of Django's core is complex and may change without notice. Sharing it would create confusion for

If you ask Gemini 1.0 Pro the same thing via https://gemini.google.com you get something similar. Perhaps it is trained on the Django source code but won’t tell you? Maybe it wasn’t trained on its source at all? I honestly don’t know, so let’s leave it as an open question.

Example #2: Fixing Another Django Bug Ticket

Let’s see if it can go 2 for 2. Here is the next ticket: https://code.djangoproject.com/ticket/35331

For this one, the solution is actually in the ticket’s description, so I removed that section to see if the LLM could find it on its own. I also added the description of the related ticket mentioned in the ticket; it too contained an attempted solution, so I removed that part as well.

You: 

Here is a bug report for the django framework project that I need you to solve.
Identify the changes to the django project source code necessary to solve it.

Title: 'Adding a new related entry using the "+" sign from M2M field doesn't
update lists.'

Description:

Models:

    from django.db import models

    class State(models.Model):
        label = models.CharField(max_length=255)

        def __str__(self):
            return self.label

    class Transition(models.Model):
        source = models.ManyToManyField(State, related_name="transition_source")
        target = models.ForeignKey(State, related_name="transition_target",
                                   on_delete=models.CASCADE)

Admin:

    from django.contrib import admin
    from .models import State, Transition

    class TransitionAdmin(admin.ModelAdmin):
        filter_horizontal = ['source']

    admin.site.register(State)
    admin.site.register(Transition, TransitionAdmin)

Steps to Reproduce: Add a State via the "+" on the M2M field to add to the
Chosen Source column. We would expect the dropdown for the Target to contain
the new value, but it does not.

Related Ticket: Hi there, I noticed a minor frontend bug related to the
filter_horizontal tables in the admin console. Here is what I have: There is
a model called Transition with two fields, source and target:

    source = models.ManyToManyField(State, related_name="transition_source")
    target = models.ForeignKey(State, related_name="transition_target",
                               on_delete=models.CASCADE)

For the admin console, we set filter_horizontal to include 'source'. See
"First Screenshot" for an example of a given state of a Transition, with
"test" selected as Source and "test2" selected as Target. Now, if I add a new
State by clicking the "+" Icon next to the Target field (let's call it test3),
it will add it to both the Target AND the "Chosen Source" (see "Second
Screenshot" for an example). I will note that upon saving the Transition, it
will not actually save "test3" to the Chosen Source, meaning that this is an
issue only from a UI perspective. After any refresh or save, it will also no
longer show in the Chosen Source column. Additionally, there is no issue going
the other direction (if I add directly using the "+" next to Chosen Source, it
will not appear to set the "Target" field as well, and it will save properly).
Finally, if I were to select another available source (ex. "test2") and add it
to the Chosen Source, it will successfully add "test2" and remove "test3"
without needing a save or refresh. I believe this is due to the admin
conveniently updating all related fields to include the added object, and
therefore displaying it as if it was selected temporarily until refreshed.

CGRAG Guidance:

Here is a list of information and concepts related to the provided files that may be helpful in addressing the user prompt:

* **Django Admin:** The user prompt describes a bug in the Django admin interface, specifically related to `filter_horizontal` fields and the `+` button for adding related objects.
* **ManyToManyField:** The `source` field in the `Transition` model is a `ManyToManyField`, which is used to represent a relationship where multiple instances of one model can be associated with multiple instances of another model.
* **ForeignKey:** The `target` field in the `Transition` model is a `ForeignKey`, which represents a one-to-many relationship between `Transition` and `State`.
* **filter_horizontal:** This attribute is used in `ModelAdmin` to display `ManyToManyField` relationships in a horizontal table format for easier selection and management of multiple related objects.
* **Chosen Source:** The user refers to a "Chosen Source" column, likely a visually distinct part of the admin interface that displays the selected `State` instances for the `source` field.
* **Admin JavaScript:** The bug appears to be in the client-side JavaScript that manages the `filter_horizontal` field, as the new `State` instance is displayed as selected temporarily but then disappears.
* **RelatedObjectLookups.js:** The file `RelatedObjectLookups.js` is likely where the relevant JavaScript code for managing related object lookups in the admin interface is located.
* **UpdateRelatedSelectsOptions:** This JavaScript function within `RelatedObjectLookups.js` likely handles updating the related fields (like the `Target` dropdown) after a new related object is added.
* **HTML Elements:** The admin interface uses HTML elements like `<select>` for dropdowns and `<input>` for text fields to display and manage data.
* **DOM Manipulation:** The JavaScript code likely manipulates the Document Object Model (DOM) to update the related field selections after a new related object is added.
* **Data Binding:** The client-side JavaScript must correctly bind the user selections in the UI to the data that will be submitted to the server to save the `Transition` object.
* **Admin Save Functionality:** The process of saving a new `Transition` object in the admin involves sending the updated data to the server, where it is validated and processed to update the database.
* **Database Integrity:** The bug report highlights that the `State` added to the "Chosen Source" is not actually saved to the database, indicating a potential issue with how the JavaScript updates the form data or how the server processes the form submission.
* **Refresh/Save Actions:** The user mentions that the bug is only evident on the UI and disappears after a save or refresh. This suggests that the issue might be related to how the UI is updated after a page reload, potentially a JavaScript issue.
* **Event Handling:** The `+` button click is likely handled by an event listener in the JavaScript code.
* **State Change:** The bug is tied to the addition of a new `State` instance and its impact on related fields.

By analyzing this information, we can better identify the relevant parts of the Django source code (particularly the admin interface and related JavaScript files) to investigate the bug.


Assistant:

```diff
--- a/home/chase/Projects/django/django/contrib/admin/static/admin/js/admin/RelatedObjectLookups.js
+++ b/home/chase/Projects/django/django/contrib/admin/static/admin/js/admin/RelatedObjectLookups.js
@@ -71,6 +71,14 @@
         }
     }
     function updateRelatedSelectsOptions(currentSelect, win, objId, newRepr, newId) {
+        // For ManyToMany fields, objId will be an array of chosen ids
+        if (Array.isArray(objId)) {
+            // If the new option was added to the Chosen list, we need to
+            // remove the temporary addition.
+            objId.splice(objId.indexOf(newId), 1);
+        }
+        // If the new option was added to the Chosen list, we need to
+        // remove the temporary addition.
         // After create/edit a model from the options next to the current
         // select, update the options available in the select itself.
         const newOption = document.createElement('option');

```

I have not tested whether this change works or not, but the LLM’s solution is in the same function as the related ticket’s first stab at a solution, and 20 or so lines away from the solution listed in the ticket. Pretty incredible once again!

Moving in the right direction

This sample of use cases is too small to form a definitive opinion about how good CGRAG in dir-assistant is, but it is a strong positive indicator. I am personally not concerned about 100% accurate solutions, as even partially correct solutions or hints at the correct solution are valuable time savers.

I can say that a major limitation of CGRAG right now is the availability of state-of-the-art models with large context windows and high rate limits. I feel a context limit of around 200k tokens is where you start to find a lot of value using dir-assistant on large repos, and after looking through the models that meet all of those criteria, the pickings are slim. Many API models advertise large context limits, but realistically you can’t use them due to rate limiting, and running a 200k context on a local model would require an amount of VRAM far outside my budget.

For instance, while GPT-4o’s context limit is advertised as 128k, usage tier 1 (the tier available after providing credit card info) limits you to 30k tokens per minute. For CGRAG, that means you can only provide around 15k tokens per request, since two requests happen back to back.

Going down the list of top models, Anthropic is next. Opus has an advertised 200k context, but the tier 1 rate limit is only 20k tokens per minute, and Haiku’s is 50k. Neither is going to cut it for CGRAG.

Going further down the list, Google is next. Gemini 1.5 Pro advertises up to a 1M-token context but is rate limited to 32k tokens per minute. Finally, we arrive at Gemini 1.5 Flash, which advertises a 1M-token context and actually delivers it, with a rate limit of 1M tokens per minute.

However, Flash is second-rate at best in terms of quality, so the results would undoubtedly be even better with a stronger model. I am curious how one of the other models above would perform; even a large-context Llama 3 70B would probably beat Flash. If you get a better model running with dir-assistant, I would appreciate the feedback.
