Let’s figure out why ChatGPT is turning lazy

Thomas Latterner
5 min read · Jan 23, 2024


Have you noticed a change in ChatGPT’s performance lately? If you’re a daily user of this gen AI tool, you may have observed some differences over the last few months. It’s not just your imagination: there has been a noticeable shift in how GPT-4, the latest model, performs. While nobody knows for certain why, the community has clearly noticed it. Let me walk you through the most likely hypothesis.

Lazy office robots — Generated by DALL·E 3

Ever heard of the LMSYS Chatbot Arena Leaderboard on the Hugging Face website? It’s like a scoreboard showing which AI models are at the top of their game.

This leaderboard is very useful for users and developers alike, as it indicates which models are excelling and which are falling behind. Interestingly, the newer versions of OpenAI’s and Anthropic’s models, which one might expect to dominate the leaderboard, are actually ranked lower than their predecessors.

Chatbot Arena Leaderboard — snapshot of Jan 18, 2024

Newer versions of GPT-4 do not rank in chronological order, as we might expect:

  • GPT-4-Turbo — Release date: November 2023 — Elo: 1249
  • GPT-4-0314 — Release date: March 2023 — Elo: 1191
  • GPT-4-0613 — Release date: June 2023 — Elo: 1160

The same is true of Anthropic's models:

  • Claude-1 — Elo: 1150
  • Claude 2.1 — Elo: 1131
  • Claude 2 — Elo: 1119

If you’re wondering what the Elo rating system is: it’s a method used to calculate the relative skill levels of players in zero-sum games such as chess, where each player’s rating increases or decreases based on the outcome of games against other players.

This unexpected ranking reversal suggests that the latest updates might not be improvements in every aspect, and raises a crucial question: why isn’t the latest necessarily the greatest?

I’d expect the most recent versions to be the most powerful, but the non-chronological order of their ratings does not reflect that. It suggests that the changes in newer versions might not align with what users value or expect. This could be due to various factors, including changes in the model’s training data, updates to its algorithms, or shifts in its intended use cases. Understanding these versions is key to grasping the nuances of GPT-4’s development journey.

Hypothesis

Everything below is speculation based on my daily use of ChatGPT, on conversations, and on reading. As the leaderboard above shows, it is not just a feeling. Even OpenAI themselves have publicly acknowledged the issue.

If you use it to generate code, this is particularly striking. A few weeks ago, when I asked ChatGPT to generate a function, it would produce the whole code; I was stunned by how long and complex its output could be. Today, unless I explicitly ask it to generate everything in detail, 90% of the time it will, instead of producing the whole code, add comments such as “handle the logic here”, “do the same thing as above”, or “to implement”… I added a concrete example I recently got below 👇

// Implement this function based on how you store user IDs in your context
func getUserId(c *gin.Context) (uuid.UUID, bool) {
    ... [rest of your function code] ...
    return uuid.UUID{}, false
}

I think OpenAI is doing this on purpose, to reduce the cost of inference (inference is the process of generating text by predicting each next word given all the previous ones). In a previous article, I explained in detail how much ChatGPT’s infrastructure could cost to run; recent estimates put it at around $700,000 per day.

Behind the simple ChatGPT interface lies a complex infrastructure of high-powered computing resources, which comes with substantial costs. These costs can influence decisions about the model’s design and operation. To manage expenses, developers might opt for changes that slightly reduce the model’s responsiveness or complexity.

There’s a delicate balance between maintaining high quality and ensuring efficient performance in AI models. Techniques like quantization help manage this, but they come with their own trade-offs. Quantization reduces the precision of the numerical values inside a model (because yes, under the hood, everything is represented by numbers). Normally, LLMs such as GPT use high-precision floating-point numbers, which require a significant amount of memory and computational power. Quantizing converts these values into lower-precision formats, roughly like keeping fewer digits after the decimal point, which reduces the model’s size, memory footprint, and power consumption, increasing its efficiency with minimal loss in quality. This could be what the “turbo” versions of GPT are.

To make it simpler with an example, it can be compared to image compression: saved as a .jpg, a photo you took will look almost the same as the original, but it will take up less space on your computer and will be faster to display.

Sleepy office robots — Generated by DALL·E 3

Lastly, GPT-4 seems to behave differently on mobile than on desktop. The mobile version tends to give shorter responses, likely a deliberate choice: on mobile, users generally want quick, concise answers. This adaptation may improve the mobile experience, but it also shows how much the same model’s behavior can vary across platforms.

Conclusion

ChatGPT and its iterations like GPT-4 are continuously evolving. While the latest version might not always be the best, it reflects the ongoing journey of AI development. The race is no longer only about which model is the most powerful, but also about which is the most efficient, which I find very good for sustainability and ecology 🌿 More and more new models claim to be as good as GPT while needing less power and fewer resources to run, such as Mistral 7B and Mixtral 8x7B from the French company Mistral AI.

If you have concrete examples of ChatGPT becoming lazy, please share! And if you have other hypotheses, let’s discuss them in the comments!

Thank you for reading. If you liked this article, or if you want to encourage me to write more, feel free to leave some 👏


Thomas Latterner

Tech lover, LLM enthusiast, Entrepreneur, Co-Founder & Chief Technology Officer at Jus Mundi https://jusmundi.com/