Where LLMs shine and fall short in my daily life as a software engineer
My experience writing a memory leak analysis algorithm with ChatGPT and Claude
I’ve been coding with GitHub Copilot for a few months now, and I regularly ask coding-related questions to ChatGPT and Claude. I’ve been both amazed and frustrated by these LLMs’ ability to answer questions in my daily workflow. Sometimes they have saved me potentially hours of work, while other times they have led me astray, trying things that ultimately would not work. Why are these models so good in some cases and so bad in others? When are they helpful, and when are they not?
Note: the specific models I’m referring to in this article are ChatGPT o1 and Claude 3.5 Sonnet.
Overall, my experience using these AI tools has led me to the conclusion that:
The more context-dependent a problem is, the less likely AI will help.
What makes a problem context-dependent is exactly what it sounds like: how much context a real person would need in order to answer the same question.
Here are some examples of problems that are NOT context-dependent.
- Most algorithm interview questions you will encounter, such as the ones on LeetCode.
- Asking it to write simple React components from scratch. For example, asking it to “write a Todo app in React that has an input field to enter new Todos, and the ability to delete a Todo.”
- Translating some code from one programming language to another.
LLMs can still produce buggy code in these cases, but generally, with some guidance, or by pointing out that something doesn’t work, the models will arrive at the right solution.
These context-independent problems are the types of questions LLMs usually excel at.
Let me illustrate with an example.
I recently worked on a project where I had to write an algorithm to analyze whether a graph is trending upward.
An upward-trending graph can take many shapes: a linear increase, an exponential increase, logarithmic growth, staircase-like growth, or other nonlinear growth.
I first asked ChatGPT and Claude to do a linear regression to determine if there exists a linear upward trend. Linear regression is a technique used to find the line of best fit for a linear set of data, and to determine how well that line actually fits the data.
Basically, I needed to translate the following formula for computing the slope and y-intercept into code:
as well as the formula for computing the R-Squared value, or how well the line fits the data.
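For reference, these are the standard least-squares estimates. For n data points (xᵢ, yᵢ), the slope m, intercept b, and R-squared are:

```latex
m = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
         {n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - m\sum x_i}{n},
\qquad
R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}
               {\sum_i \left(y_i - \bar{y}\right)^2}
```

where ŷᵢ = m·xᵢ + b are the fitted values and ȳ is the mean of the y-values.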
Writing a linear regression algorithm is a very context-independent problem, and both ChatGPT and Claude did it perfectly.
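A minimal sketch of what such a solution looks like (my own Python rendering of the standard formulas, not the models’ verbatim output):

```python
def linear_regression(xs, ys):
    """Least-squares fit y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    # R-squared: 1 - (residual sum of squares / total sum of squares)
    mean_y = sum_y / n
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 1.0
    return m, b, r_squared
```

On a perfectly linear input like `xs = [1, 2, 3, 4]`, `ys = [2, 4, 6, 8]`, this returns a slope of 2, an intercept of 0, and an R-squared of 1.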
However, as the problem got more context-specific, they started to falter.
A real-world application of detecting an upward trend is determining whether a graph is indicative of a memory leak. For example, if we want to determine whether there is a goroutine leak in a Go program, we need to look at the number of active goroutines over time.
The example below is indicative of a goroutine leak, because the number of goroutines increases exponentially until the machine eventually runs out of memory and experiences an out-of-memory kill.
To continue on the previous example of analyzing whether a graph is trending upward, I asked ChatGPT to detect whether a graph is growing exponentially. This time, I gave it more context:
“I want to analyze a timeseries graph to determine if the graph displays exponential growth. Can you write a function that takes in 2 arrays. The first is list of timestamps in milliseconds since epoch. The second is a list of y-values at those timestamps. The function should return whether the graph is growing exponentially, and how confident it is.”
Note that I provided the context that the list of x-coordinates are in milliseconds since epoch.
How did ChatGPT do?
ChatGPT merely gave me an algorithm for exponential regression.
Exponential regression is a technique to determine whether a graph is increasing exponentially. It works by transforming the data into linear form by taking the natural log of the y-values: if y = a·e^(bx), then ln(y) = ln(a) + b·x, so an exponential relationship becomes a linear one.
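The transform can be sketched like this (again my own Python rendering; the function name and return values are illustrative, not what ChatGPT produced):

```python
import math

def exponential_regression(xs, ys):
    """Fit y = a * e^(b*x) via linear regression on ln(y).

    Returns (a, b, r_squared); requires all y-values > 0.
    """
    log_ys = [math.log(y) for y in ys]  # exponential -> linear
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(log_ys)
    sum_xy = sum(x * y for x, y in zip(xs, log_ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    # R-squared is computed in log space, against the transformed values
    mean_y = sum_y / n
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, log_ys))
    ss_tot = sum((y - mean_y) ** 2 for y in log_ys)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 1.0
    return math.exp(intercept), slope, r_squared
```

Feeding it data generated from y = e^x recovers a = 1, b = 1, and an R-squared of 1, as expected.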
ChatGPT did give me the correct algorithm for exponential regression. Similar to linear regression, doing a simple exponential regression is a context-independent problem.
However, here, we have the additional context that this is not just any graph, but a timeseries graph where the x-axis values are in milliseconds since epoch. ChatGPT failed to take this context into account.
As a result, when I ran the algorithm against real timeseries data, like the data from the goroutines example, it did not work. This is because the x-axis values are in milliseconds since epoch, and thus the numbers are HUGE!
A one-hour time difference on the x-axis is 60 * 60 * 1000 = 3,600,000 milliseconds. However, if the y-axis values go from 200 to 20000, the difference between the logs of the min and max is only 4.6:
ln(200) = 5.3
ln(20000) = 9.9
As a result, when running a linear regression on the transformed y-axis values, the slope returned could be extremely small, and would not pass a reasonable slope threshold for linear increase like 0.01.
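A back-of-the-envelope check makes the failure concrete (the numbers mirror the example above; the starting timestamp is arbitrary):

```python
import math

# One hour of data, timestamps in milliseconds since epoch
x_start = 1_700_000_000_000          # an arbitrary epoch-ms timestamp
x_end = x_start + 60 * 60 * 1000     # one hour later: +3,600,000 ms

# y grows 100x, from 200 to 20000, yet the rise in log space is only ~4.6
rise = math.log(20000) - math.log(200)
run = x_end - x_start                # 3,600,000

slope = rise / run
print(slope)  # on the order of 1e-6, far below a 0.01 threshold
```

The y-values grew by two orders of magnitude, yet the regression slope is a millionth — any fixed threshold tuned for “ordinary” axes will reject it.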
Luckily, I still remembered some math from college and recognized that the horizontal axis might need to be rescaled. Despite being told that the timestamps were in milliseconds since epoch, ChatGPT did not think of scaling on its own.
Additionally, even when I deliberately asked about scaling, the model gave the wrong answer.
The model said “either strategy can help”. However, simply shifting the timestamps will not affect the slope obtained from linear regression; the timestamps must also be scaled down. Thus, the correct answer is “both steps must be used,” not “either strategy can help.”
Some prompts did give the right answer of using both steps, but the fact that some prompts work and others don’t is indicative of the difficulty LLMs have when it comes to analyzing context.
Claude did slightly better: it included scaling in its algorithm:
However, its answer did not explain the reasoning behind the scaling, or why it chose 1000. Even when I asked about it, Claude failed to mention the real relevance: making sure the linear regression on the log values produces a slope that can be meaningfully positive.
Claude’s answer was mostly related to seconds being more conventional and intuitive:
Timestamps are in milliseconds since epoch, but working in seconds is more conventional and intuitive for growth rate analysis
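The fix itself is small. Here is a sketch of the kind of rescaling that makes the slope meaningful (the choice of hours as the unit is mine and purely illustrative; Claude divided by 1000 to get seconds):

```python
def rescale_timestamps(timestamps_ms):
    """Shift timestamps to start at zero and convert milliseconds to hours.

    Shifting alone does not change a regression slope; scaling does.
    The unit chosen decides how large a "real" slope looks.
    """
    t0 = timestamps_ms[0]
    ms_per_hour = 60 * 60 * 1000
    return [(t - t0) / ms_per_hour for t in timestamps_ms]
```

With the one-hour example above, the run becomes 1 instead of 3,600,000, so a log-space rise of 4.6 produces a slope of 4.6 — comfortably above any reasonable threshold.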
So what does this all mean in the day-to-day life of an engineer like myself?
Most hard problems I encounter in my day-to-day life are very context-dependent. A bug or issue that appears as you’re coding in a codebase is usually very context-dependent. For example, I’ve rarely had AI help me fix a styling issue or achieve a tricky-to-implement user interaction. This is because whether a CSS or JavaScript suggestion works is very dependent on what else is on the page and how the many existing components are structured.
Therefore, in today’s world, LLMs are more like assistants than replacements. The more domain knowledge one needs to do their job, the more this statement becomes true.
If a question (e.g. write an algorithm that does so-and-so) has been answered before, and answered many times by many people, chances are the LLMs will be able to answer it accurately.
However, if a problem has not been answered before (e.g. implementing a tricky user interaction within the context of your existing codebase and its existing layout), chances are the current generation of LLMs will struggle as well.