Challenges in code generation with GPT-4

James Murdza
3 min read · Jul 20, 2023


At this point it is well established that large language models do pretty well in a variety of code generation scenarios, such as generating code snippets from instructions. If you work with these systems you’ll develop a gut feeling for these abilities, but it’s difficult to put a finger on their exact strengths and weaknesses.

Full-stack code generation with GitWit

Going deeper with HumanEval

I used HumanEval (the standard set of 164 Python problems from OpenAI) to evaluate GPT-3.5 and GPT-4. As others have found, GPT-4 scored about 85% compared to GPT-3.5's 77%.

However, the overall averages alone are not enough for a deeper analysis. That’s why I collected and published the 3,260 generated code snippets from my evaluations.

Note: Since LLMs are stochastic, each problem should be run a number of times on each model to measure the spread. I found that n=10 was a large enough sample size for my purposes.
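As a rough sketch of how that spread can be summarized per problem, assume each sample is recorded as a pass/fail boolean. The `results` layout and the name `pass_rates` are hypothetical and are not the format of the published data.

```python
from statistics import mean, pstdev


def pass_rates(results):
    """Map each problem ID to (mean pass rate, population std dev).

    `results` is a hypothetical dict: problem ID -> list of booleans,
    one per generated sample (True if the sample passed the tests).
    """
    summary = {}
    for problem_id, outcomes in results.items():
        scores = [1.0 if passed else 0.0 for passed in outcomes]
        summary[problem_id] = (mean(scores), pstdev(scores))
    return summary


# Made-up outcomes for illustration only (n=10 samples per problem).
example = {
    "HumanEval/0": [True] * 10,
    "HumanEval/1": [True] * 5 + [False] * 5,
}
summary = pass_rates(example)
```

A problem that passes on every run has zero spread, while one that passes half the time has the widest possible spread, which is why a single run per problem can be misleading.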

Why GPT-4 makes programming mistakes

Most solutions to HumanEval are short (fewer than 20 lines of code), so it’s easy to see where GPT-4 is making mistakes. I put these mistakes into three categories:

Edge cases

For some problems, GPT-4 generates mostly valid code that still gives incorrect responses in edge cases. For example, in HumanEval/120 GPT-4 is tasked with creating a function that gives the top k sorted integers in an array. However, the simple solution it gives only works if k is greater than zero.

GPT-4’s incorrect solution (left) and a correct solution (right).
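The k = 0 failure is easy to reproduce in a few lines. This is an illustrative reconstruction of the kind of mistake described, not GPT-4’s actual output, and the function names are invented for this sketch.

```python
def maximum_naive(arr, k):
    # Pitfall: when k == 0, arr[-0:] is the same as arr[0:],
    # so the whole sorted list comes back instead of an empty one.
    return sorted(arr)[-k:]


def maximum_fixed(arr, k):
    # Handle the k == 0 edge case explicitly before slicing.
    if k == 0:
        return []
    return sorted(arr)[-k:]
```

For any k greater than zero the two functions agree, so a handful of ordinary test inputs would not reveal the difference.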

Brain farts

Sometimes a small mistake cascades into a bug that causes code to behave incorrectly. For example, in HumanEval/91, GPT-4 is tasked with writing code to count the number of sentences that start with the word “I” in a given string. Its solution misses the fact that there are many words aside from “I” that start with the letter I.

GPT-4’s incorrect solution (left) and a correct solution (right).
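A minimal sketch of this kind of slip, assuming sentences are split on ".", "?" and "!". It is not GPT-4’s actual output, and the function names are invented for illustration.

```python
import re


def count_naive(text):
    sentences = re.split(r"[.?!]\s*", text)
    # Bug: this matches any sentence whose first letter is "I",
    # such as "Is it raining" or "It might".
    return sum(1 for s in sentences if s.startswith("I"))


def count_fixed(text):
    sentences = re.split(r"[.?!]\s*", text)
    # Compare the whole first word instead of the first character.
    return sum(1 for s in sentences if s.split()[:1] == ["I"])


sample = "Is it raining? I hope not. It might."
```

On `sample`, the naive version counts all three sentences, while the fixed version counts only "I hope not".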


In the worst cases, GPT-4 fails to understand the problem as described and solves a different problem. For example, in HumanEval/115, GPT-4 must calculate the number of times a bucket with a given capacity must be lowered into a given set of wells in order to remove all water. Common sense says that a bucket cannot take water from multiple wells in one go, a fact that GPT-4’s solution misses.

GPT-4’s incorrect solution (left) and a correct solution (right).
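The difference between the two readings of the problem can be sketched as follows, assuming each row of `grid` is one well and each 1 is one unit of water. The pooled version is a guess at the kind of misreading described, not GPT-4’s actual output, and both function names are invented for this sketch.

```python
import math


def max_fill_pooled(grid, capacity):
    # Misreading: pools water across all wells, as if one bucket
    # could draw from several wells in a single lowering.
    total = sum(sum(row) for row in grid)
    return math.ceil(total / capacity)


def max_fill_per_well(grid, capacity):
    # Each well is emptied on its own, so round up per well.
    return sum(math.ceil(sum(row) / capacity) for row in grid)


# Three wells with one unit of water each, bucket capacity of two.
wells = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```

With `wells` and a capacity of 2, the pooled version reports 2 lowerings, but 3 are needed because every well requires its own trip.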

GPT-4 performs worse than GPT-3.5 in some situations

Perhaps the most interesting finding from the HumanEval runs is that while GPT-4 performed better than GPT-3.5 on 38 problems, it performed measurably worse on 21 problems.

For example, in HumanEval/55, GPT-4 should write a function to give the n-th Fibonacci number, starting with fib(1) = 1. GPT-3.5 solved this problem. GPT-4, however, fails to follow the instructions and shifts all of the indices by one. There isn’t an obvious explanation as to why or when this and other “regressions” occur, causing GPT-4 to do worse than GPT-3.5 on some tasks. It shows at least that improving foundational models for code generation is a bumpy ride.
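To make the HumanEval/55 index shift concrete, here is one way an off-by-one version could look next to a version that follows the fib(1) = 1 convention. The direction of the shift and the name `fib_shifted` are assumptions for illustration, not GPT-4’s actual output.

```python
def fib(n):
    # Follows the convention fib(1) = 1, fib(2) = 1, fib(3) = 2, ...
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_shifted(n):
    # Off by one: runs one iteration too few, so every result is
    # the previous Fibonacci number (fib_shifted(1) is 0, not 1).
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a
```

Both functions produce the same sequence of values, which makes the shift easy to overlook when reading the code rather than running it against the stated examples.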

To wrap up this analysis: GPT-4’s HumanEval score of 85% says more about the evaluation dataset than about GPT-4 itself. The most interesting problems are, of course, the 15–20% of cases where GPT-4 struggles or fails. Future problem sets for evaluation will need to dig deeper into these areas.

(My full analysis of GPT-4 can be found at jamesmurdza/humaneval-results.)