Why can't LLMs code?

Jakub Narloch
10 min read · Apr 5, 2024


A couple of months ago, someone posted a comment on one of our feature announcements: "I don't believe LLMs can code." Most of us would agree with that sentiment; we have all experienced it when interacting with AI tools and chatbots. We assess them based on the results they produce, which poses an intriguing challenge because we lack a clear understanding of what drives them to generate specific outputs.

Coding is interesting in its own right. It has a creative side, which requires overcoming problems and finding solutions, and a repetitive side. The first requires problem-solving skills; the second is a matter of replicating a solution once it has been found.

General-purpose problem-solving remains a hard problem in machine learning and is likely to stay unsolved for some time. Leaving that aside, can we explain why models perform well on benchmarks like HumanEval yet fail completely in real-life applications? Are the benchmarks wrong? Do they not map to the real world well enough, or is there another underlying problem?

Part 1: Problems

Through our experience applying large language models to software development, we have identified three key reasons, beyond problem-solving itself, why real-life attempts to apply LLMs to software development fail. The list may not be exhaustive, and additional problems may be identified in the future.

Problem 1: Non-transferable knowledge.

I use the term non-transferable knowledge even though it is not commonly associated with machine learning; it is more often found in epistemology or in the context of the job market. I could alternatively state that source code is an example of data with poor generalizability.

One of the breakthroughs that led to the current AI revolution was the application of Transfer Learning to language models built on the Transformer architecture; previously, the method had been used successfully mainly in Computer Vision. Problems arise when the data you train on has very little internal similarity, which is exactly the case with source code. Anecdotally, I could state that “no two codebases in the world are the same”. This is almost true by definition: if two codebases were identical, they would simply be copies of each other, not two distinct codebases.

This means that Transfer Learning, or the training process in general, will not help us address even known problems in an unknown codebase. Simply put, training the model on N different codebases will not help it generate code in codebase N + 1, or will help only in limited applications. The reason is twofold. First, each codebase strictly fulfills a particular set of requirements that may not overlap with anyone else's. Second, there is an almost unlimited number of ways a given concept can be expressed in code, and some of them are nothing more than the personal preference of the author or authors. Even software libraries that solve the same problem may use completely different terms to describe the same capabilities. The amount of variation is simply too large. This is not to say that there is no overlap across code. If we could establish a hypothetical metric defining the similarity of two codebases, I would expect the results to cluster primarily around the common runtime they share, such as Python, NodeJS, or the JVM, and secondarily around a common set of dependencies, and that is roughly where the similarity would end.

If Transfer Learning is not a workable solution, are GPTs doomed in such applications?

To answer that question, let’s first see how humans solve this problem. Software developers experience the same problem firsthand each time they join a new project. It does not matter how many years of experience they have; each time, they need to go through a learning curve.

The ability to learn appears to be the most reasonable way to address this problem. AGI would check that box, but in its absence, is there a way to “teach” the model? For LLMs, the only way of doing that at scale is to fine-tune the model, even though fine-tuning itself was never intended as a means of passing knowledge to the model. For large problems like codebases, which require operating on tens of millions of tokens, there is no other viable alternative. For smaller-scale problems, prompting or a RAG approach can be applied successfully.
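
To make the smaller-scale option concrete, here is a minimal sketch of a naive RAG-style flow: split the codebase into snippets, pick the ones most relevant to the task with a toy token-overlap score, and place them in the prompt. Everything in it (the class name, the scoring heuristic, the sample snippets) is hypothetical and only illustrates the idea; it is not how any production retriever works.

import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of a naive RAG-style flow for the smaller-scale case:
// retrieve the most relevant code snippets and place them in the prompt
// instead of fine-tuning. The token-overlap scoring is a toy heuristic.
public class NaiveCodeRag {

    // Tokenize a piece of text into a set of lowercase word tokens.
    static Set<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toSet());
    }

    // Score a snippet by the fraction of query tokens it contains.
    static double overlapScore(String snippet, Set<String> queryTokens) {
        Set<String> snippetTokens = tokenize(snippet);
        long shared = queryTokens.stream().filter(snippetTokens::contains).count();
        return queryTokens.isEmpty() ? 0.0 : (double) shared / queryTokens.size();
    }

    // Assemble a prompt from the top-k most relevant snippets plus the task.
    static String buildPrompt(List<String> snippets, String task, int topK) {
        Set<String> queryTokens = tokenize(task);
        String context = snippets.stream()
                .sorted(Comparator.comparingDouble(
                        (String s) -> overlapScore(s, queryTokens)).reversed())
                .limit(topK)
                .collect(Collectors.joining("\n---\n"));
        return "Relevant code from the repository:\n" + context + "\n\nTask: " + task;
    }

    public static void main(String[] args) {
        List<String> snippets = List.of(
                "public void filterStackTrace() { /* ... */ }",
                "public class ConditionalStackTraceFilter { /* ... */ }",
                "public int add(int a, int b) { return a + b; }");
        System.out.println(buildPrompt(snippets,
                "implement a constructor that filters the stack trace", 2));
    }
}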

Problem 2: Knowledge drift.

We may not think about it very often, but the amount of general knowledge is continuously expanding. There is a set of well-known facts, like the definition of the Solar System or the biography of Isaac Newton, that are fairly static and do not change over time. Yet humanity is continuously making discoveries. This problem of knowledge expansion is also present in the software development space, and it is even more complex there, because every codebase that is actively worked on changes over time due to bug fixes, improvements, refactoring, and so on. This introduces a natural drift of the code over time. What might be surprising is the rate at which this change happens: even a well-established codebase that has been in development for over a decade, with millions of lines of code, might be changing at a rate of 2–3% a month. Projected over a year, that amounts to a drift in the range of 24–36%. A model trained once on a given codebase will therefore see the similarity of its generated output trend downward as the actual codebase continuously drifts away from it.
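
The yearly figure above is a simple linear projection of the monthly change rate. The sketch below spells out that arithmetic, assuming a constant monthly rate; the compounded variant is included only for comparison and is not a claim made in this post.

// A minimal sketch of the drift arithmetic above, assuming a constant
// monthly change rate. The article uses the simple linear projection;
// the compounded figure is shown only for comparison.
public class DriftProjection {
    public static void main(String[] args) {
        double[] monthlyRates = {0.02, 0.03}; // 2-3% of the codebase changes per month
        for (double monthly : monthlyRates) {
            double linearYearly = monthly * 12;                      // 24% and 36%
            double compoundedYearly = 1 - Math.pow(1 - monthly, 12); // ~21.5% and ~30.6%
            System.out.printf("monthly %.0f%% -> linear yearly %.1f%%, compounded yearly %.1f%%%n",
                    monthly * 100, linearYearly * 100, compoundedYearly * 100);
        }
    }
}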

This means that training the model once on the codebase is not sufficient; each model will have to be re-trained so that it does not “fall behind”. How often this should be done depends only on the pace of the introduced changes and on the rate of model quality deterioration one is willing to accept.

Problem 3: Coding is an ambiguous process

I recall from my NLP class back at university that there is general agreement that natural language can often be ambiguous, and that this ambiguity can manifest itself on multiple levels. Code itself is not considered ambiguous; the process of writing it, however, is. There are at least two reasons for this. First, ambiguity appears when trying to map specific parts of the code back to specific requirements, especially when the requirements that existed when the code was written are no longer known; recreating the entire thought process that led to a particular implementation may simply not be possible. The second reason, which is often overlooked, is that the code itself is a manifestation of countless small decisions made by the author or authors. Once the code is written, it becomes an “axiom”: it now defines exactly how certain aspects have been implemented and becomes “written in stone,” so to speak. There are countless, equally valid ways of expressing the same concept in code, but once one has been chosen, the rest of the code takes a hard dependency on it.

It is possible to visualize this problem with a simple example. Here is real-life production code:

public MockitoException(String message) {
    super(message);

    unfilteredStackTrace = getStackTrace();

    ConditionalStackTraceFilter filter = new ConditionalStackTraceFilter();
    filter.filter(this);
}

That code was at some point refactored into:

public MockitoException(String message) {
    super(message);
    filterStackTrace();
}

Considering that no other part of the file changed, there is no way to tell from the code alone what the intended outcome is; the new code simply becomes the “axiom.” When asked to implement a constructor in this class, an untrained model repeatedly implements it by simply invoking the base class constructor, not knowing any better, while a model fine-tuned on the previous code repeatedly restores the original implementation.

This is also a problem you may experience yourself when using chatbots. When the result does not match your expectations, you have to add context to the prompt to get the desired outcome. We might not think of it as such, but the need for that additional context is precisely an issue of ambiguity.

This may prove to be a fundamental problem: it can be mitigated, and approximate solutions may exist, but it may remain unsolvable in principle.

Part 2: Experiment

We uncovered those problems once we started conducting a large-scale test of using LLMs for code generation.

Experiment structure

One of the unique capabilities of the CodeMaker AI architecture is the ability to perform autonomous operations at scale. We leveraged that capability to execute a task against an entire codebase. For the experiment, we chose the popular Java testing library Mockito. We first erased all of the code from the src/main/java directory, leaving only the types and method stubs, and then used the CodeMaker AI batch code generation capability to re-create the implementation code.
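
To illustrate what the input to that step looks like, here is a hypothetical stubbed file in the spirit of the MockitoException example from earlier. It is not the actual experiment input, just a sketch of a source file reduced to types and method stubs.

// Hypothetical illustration (not the actual experiment input) of a source file
// reduced to types and method stubs: signatures stay, implementation bodies
// are erased before asking the model to regenerate them.
public class MockitoException extends RuntimeException {

    private StackTraceElement[] unfilteredStackTrace;

    public MockitoException(String message) {
        // implementation removed, to be regenerated by the model
    }

    public StackTraceElement[] getUnfilteredStackTrace() {
        // implementation removed, to be regenerated by the model
        return null;
    }
}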

Methodology

We had to establish a measure for comparing two codebases. Comparing source code is a complex problem: code with a different syntactic or even semantic structure may be logically the same. Although we are familiar with alternatives such as the Pass@k metric, we chose a straightforward approach and defined two metrics:

  1. Error rate: the Levenshtein distance between the two compared files, giving an absolute measure of how far apart they are. The metric was averaged across all of the files.
  2. Similarity rate: defined as sim(a, b) = 1 - dist(a, b) / max(|a|, |b|), answering the question of how similar two files are to each other. The metric was averaged across all of the files.
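
As a concrete reference, here is a minimal sketch of the two metrics, assuming a plain character-level Levenshtein distance over file contents; the exact granularity and averaging used in the experiment may differ.

// Minimal sketch of the two metrics defined above, assuming a plain
// character-level Levenshtein distance over file contents.
public class CodeSimilarity {

    // Classic dynamic-programming Levenshtein distance: the "error rate" for one file pair.
    static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    // sim(a, b) = 1 - dist(a, b) / max(|a|, |b|)
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String original = "filterStackTrace();";
        String generated = "super(message);";
        System.out.println("error rate: " + distance(original, generated));
        System.out.println("similarity: " + similarity(original, generated));
    }
}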

Experiment 1: Evaluating foundation models.

We first re-created the code using two different foundation models that had not been fine-tuned on the codebase. We chose models at different parameter scales to quantify what difference using a 7B versus a 70B parameter model would make.

Foundation 7B Parameter Model

  • Average error rate 1042.46
  • Average similarity 0.83

Foundation 70B Parameter Model

  • Average error rate 1063.14
  • Average similarity 0.80

Keep in mind that since we only erased the method implementations, we are not starting the comparison from zero, though this does not stop us from comparing the results against each other.

Notably, the larger 70B parameter model performed slightly worse than the 7B parameter model. This was a clear indication to us that increasing model complexity is not going to solve this category of problems. The data is clear.

Experiment 2: Fine-Tune 7B Parameter Model

We then repeated the process using a fine-tuned 7B parameter model. We ran multiple iterations to see how pre-processing the training data impacted the end results. We are publishing the final results for the model fine-tuned for 5 and 10 epochs, respectively.

Fine-Tuned 7B Parameter Model (5 Epochs)

  • Average error rate 616.01
  • Average similarity 0.90

Fine-Tuned 7B Parameter Model (10 Epochs)

  • Average error rate 474.00
  • Average similarity 0.92

The created code can be found here.

These tests showed that fine-tuning the model on the codebase improves the similarity of the generated code. However, despite being fine-tuned on the entire codebase, the model was not able to recreate the code perfectly. We will look into this more closely in the future.

We noticed a couple of interesting choices made by the model when we inspected the code. For instance, when generating constructor code, it would skip fields that had already been initialized, avoiding unnecessary code. We never trained it to perform such optimizations.

We haven’t yet validated how much and whether the fine-tuning impacted the model's ability to generalize or its overall capabilities. This remains an open question. From manual tests, we were able to tell that the model remained pretty functional, but quantifying that by re-running the benchmarks would be an interesting experiment.

As we publish this post, we have not yet completed the tests for the fine-tuned 70B version of the model. It would be interesting to verify whether it makes any difference.

We plan to recreate this experiment on an even larger codebase, aiming for an order-of-magnitude increase in size, and assess whether this impacts the model's ability to recreate the code.

Solution: Specialization

From our perspective, the solution is clear: problems like coding will force a different approach than building ever larger and more complex models. That solution is specialization. By specialization, we mean fine-tuning the model for a specific codebase or task. There is no point in fine-tuning LLMs on all of the codebases in the world, because as soon as the model faces the “unknown,” it will not be able to give meaningful results. That does not mean we don't need the model's emergent abilities. We also see a potential benefit in expanding the model's knowledge with the code of its direct dependencies, but we haven't yet validated that idea.

This is not to say that it is impossible to train and apply models for certain specialized tasks; this has clearly been accomplished. However, those applications are limited in scope and generally work when applied pointwise. As soon as you consider operating a model across an entire codebase, it becomes necessary to make the model aware of it in its entirety.

Isn’t fine-tuning simply overfitting the model and brute-forcing the solution? I don’t believe so; I wrote this post to state that the underlying problems are way more fundamental and that what limits the applications of LLMs is not their capabilities, or lack thereof, but rather the constraints of the real world.

Learnings

We use our model architecture together with a fine-tuning pipeline and model evaluation to process source code at scale. This allowed us to recreate the implementation of an entire library with reasonable similarity to the reference implementation.

The experimentation shows that we have found a clear recipe for building AI products, whether chat, code completion, or autonomous coding agents, that can operate within an existing codebase. An interesting side effect of the architecture we have built is that it allows us to run large-scale benchmarks for evaluating model performance in real-life applications.

Key takeaways

Models that can learn are useful for tackling already-solved problems. It will be crucial to also develop models with general-purpose problem-solving skills, since coding demands both. Only then will it be possible to build a truly versatile AI coding model, possibly even before AGI arrives.
