DumbGPT: Is GPT-4 Getting Worse Over Time But Not Better?
The Paradox of Progress: How GPT-4’s Capabilities Might Be Fading
Recently, I’ve been seeing a surprising trend: GPT-4 appears to be getting worse over time, not better. Plenty of people have shared observations about the declining quality of the model’s responses, but until recently those reports were purely subjective.
Now, it’s official.
Recent research indicates that the June version of GPT-4 performs notably worse than the March version on certain tasks. To illustrate, a test of 500 problems, each asking the model to determine whether a given integer is prime, produced alarming results: the March model answered 488 problems correctly, while the June model managed only 12.
That’s a plunge from an impressive 97.6% accuracy rate to a disappointing 2.4%!
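For context, the ground truth for this task is trivial to compute. Here’s a minimal trial-division check — my own sketch for illustration, not anything from the study — that settles primality for integers of this size instantly:

```python
# Trial-division primality check; a quick way to verify the ground truth
# for the benchmark. (Illustrative sketch, not code from the study.)
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:  # only need to test divisors up to sqrt(n)
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: 17077 is prime
```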
But wait, there’s more! The researchers tried to aid the model’s reasoning with the Chain-of-Thought technique: for instance, prompting it to work out step by step whether 17077 is a prime number. This technique has repeatedly been shown to improve response quality. Yet the newer version of GPT-4 skipped the intermediate steps entirely, responding with an incorrect “No.”
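To make the setup concrete, here’s roughly what a Chain-of-Thought prompt for this question looks like in code. This is a minimal sketch using the OpenAI Python SDK, not the researchers’ actual evaluation harness, and the pinned model name is illustrative:

```python
# Minimal Chain-of-Thought prompt sketch using the OpenAI Python SDK (v1+).
# Assumes OPENAI_API_KEY is set in the environment; model name is illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Is 17077 a prime number? "
    "Think step by step, then answer with a final Yes or No."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study compared dated snapshots (March vs. June)
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # reduce randomness so runs are comparable
)

print(response.choices[0].message.content)
```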
Additionally, the model’s ability to generate code has deteriorated.
To evaluate this, the team built a dataset of 50 straightforward LeetCode problems and measured how often GPT-4’s solutions ran without any modification. The March version succeeded on 52% of the problems; the June version managed a mere 10%.
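A toy version of that “runs without modification” check might look like the following. It assumes the model’s reply embeds a fenced Python block; the function names are my own, not the paper’s:

```python
# Toy harness in the spirit of the study's "directly executable" metric.
# Assumes the model reply contains a fenced Python block; names are illustrative.
import re
from typing import Optional

FENCE = "`" * 3  # literal triple backtick, built here to keep the source readable

CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(reply: str) -> Optional[str]:
    """Pull the first fenced Python block out of a model reply."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None

def runs_unmodified(reply: str) -> bool:
    """True if the extracted code compiles and executes without raising."""
    code = extract_code(reply)
    if code is None:
        return False  # e.g., code buried in prose, or no code at all
    try:
        exec(compile(code, "<model-reply>", "exec"), {})
        return True
    except Exception:
        return False
```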
So, what’s the root cause?
OpenAI presumably rolls out updates continuously, but the procedures it uses to assess whether the model has improved or regressed are unclear. There are rumors that OpenAI has deployed multiple smaller, specialized GPT-4 models that together mimic a single large model at a lower operational cost; when a user submits a query, a routing layer decides which specialist model should handle it.
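Purely to illustrate the rumored idea — and to be clear, this is speculation on my part, not OpenAI’s published design — a routing layer could be as simple as:

```python
# Speculative illustration of a query router; NOT OpenAI's actual system.
# Model names and heuristics below are entirely hypothetical.
def route(query: str) -> str:
    """Pick a hypothetical specialist model using crude keyword heuristics."""
    q = query.lower()
    if any(tok in q for tok in ("def ", "class ", "compile", "bug")):
        return "gpt-4-code-specialist"   # hypothetical
    if any(ch.isdigit() for ch in query):
        return "gpt-4-math-specialist"   # hypothetical
    return "gpt-4-general"               # hypothetical
```

If each specialist is cheaper to serve but weaker on edge cases, a router like this could easily trade answer quality for cost.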
This new method is certainly cost-effective and faster, but could it be contributing to the degradation in quality?
Personally, I see this as a cautionary signal for developers who build applications on top of GPT-4. An LLM whose behavior silently fluctuates over time is very hard to rely on in production.