DumbGPT: Is GPT-4 Getting Worse Over Time But Not Better?
The Paradox of Progress: How GPT-4’s Capabilities Might Be Fading
Recently, I’ve been seeing a surprising trend: GPT-4 appears to be getting worse over time, not better. Plenty of people have shared observations about the declining quality of the model’s responses, but until recently those reports were purely subjective.
Now, it’s official.
Recent research indicates that the June version of GPT-4 performs notably worse than the March version on certain tasks. To illustrate, a test of 500 problems, each asking the model to determine whether a given integer is prime, produced alarming results: the March model answered 488 problems correctly, while the June model managed only 12.
That’s a plunge from an impressive 97.6% accuracy rate to a disappointing 2.4%!
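For context, the ground truth for this task is trivial to compute. Here’s a minimal trial-division check — my own sketch for illustration, not anything from the study — that settles primality for integers of this size instantly:

```python
# Trial-division primality check; a quick way to verify the ground truth
# for the benchmark. (Illustrative sketch, not code from the study.)
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:  # only need to test divisors up to sqrt(n)
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: 17077 is prime
```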
But wait, there’s more! The researchers tried to aid the model’s reasoning with the Chain-of-Thought technique: for instance, prompting it to work out step by step whether 17077 is a prime number. This technique has repeatedly been shown to improve response quality. Yet the newer version of GPT-4 skipped the intermediate steps entirely, responding with an incorrect “No.”
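To make the setup concrete, here’s roughly what a Chain-of-Thought prompt for this question looks like in code. This is a minimal sketch using the OpenAI Python SDK, not the researchers’ actual evaluation harness, and the pinned model name is illustrative:

```python
# Minimal Chain-of-Thought prompt sketch using the OpenAI Python SDK (v1+).
# Assumes OPENAI_API_KEY is set in the environment; model name is illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Is 17077 a prime number? "
    "Think step by step, then answer with a final Yes or No."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study compared dated snapshots (March vs. June)
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # reduce randomness so runs are comparable
)

print(response.choices[0].message.content)
```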
Additionally, the model’s ability to generate code has deteriorated.
To evaluate this, the team built a dataset of 50 straightforward LeetCode problems and measured how often GPT-4’s solutions ran without any modification. The March version succeeded on 52% of the problems; the June version managed a mere 10%.
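A toy version of that “runs without modification” check might look like the following. It assumes the model’s reply embeds a fenced Python block; the function names are my own, not the paper’s:

```python
# Toy harness in the spirit of the study's "directly executable" metric.
# Assumes the model reply contains a fenced Python block; names are illustrative.
import re
from typing import Optional

FENCE = "`" * 3  # literal triple backtick, built here to keep the source readable

CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(reply: str) -> Optional[str]:
    """Pull the first fenced Python block out of a model reply."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None

def runs_unmodified(reply: str) -> bool:
    """True if the extracted code compiles and executes without raising."""
    code = extract_code(reply)
    if code is None:
        return False  # e.g., code buried in prose, or no code at all
    try:
        exec(compile(code, "<model-reply>", "exec"), {})
        return True
    except Exception:
        return False
```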
So, what’s the root cause?
OpenAI presumably rolls out updates continuously, but the procedures it uses to assess whether the model has improved or regressed are unclear. There are rumors that OpenAI has deployed multiple smaller, specialized GPT-4 models that together mimic a single large model at a lower operational cost; when a user submits a query, a routing layer decides which specialist model should handle it.
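Purely to illustrate the rumored idea — and to be clear, this is speculation on my part, not OpenAI’s published design — a routing layer could be as simple as:

```python
# Speculative illustration of a query router; NOT OpenAI's actual system.
# Model names and heuristics below are entirely hypothetical.
def route(query: str) -> str:
    """Pick a hypothetical specialist model using crude keyword heuristics."""
    q = query.lower()
    if any(tok in q for tok in ("def ", "class ", "compile", "bug")):
        return "gpt-4-code-specialist"   # hypothetical
    if any(ch.isdigit() for ch in query):
        return "gpt-4-math-specialist"   # hypothetical
    return "gpt-4-general"               # hypothetical
```

If each specialist is cheaper to serve but weaker on edge cases, a router like this could easily trade answer quality for cost.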
This new method is certainly cost-effective and faster, but could it be contributing to the degradation in quality?
Personally, I see this as a cautionary signal for developers who build applications on top of GPT-4. An LLM whose behavior silently fluctuates over time is very hard to rely on in production.