why has GPT-4’s accuracy been declining so much!?

Dev Shah · Published in AI Mind · Aug 1, 2023 · 15 min read

The emergence of ChatGPT, the groundbreaking AI-powered chatbot by OpenAI, has sparked a surge of fascination and interest in the realm of artificial intelligence. The allure of these conversational wonders extends not only to the broader field of AI but also to the intricate class of technologies that underpins them. Large Language Models (LLMs), like ChatGPT and Google Bard, have taken the spotlight, demonstrating their remarkable ability to generate text across an astonishing array of subjects. These advanced chatbots promise to transform various aspects of our lives, from web search to boundless creative writing, and even to serve as repositories of global knowledge.

As the AI-powered chatbot landscape continues to captivate the world, one particular member of the LLM family has raised intriguing questions. GPT-4, the latest iteration in the series, has been drawing attention due to an unexpected and perplexing trend: its accuracy has been declining noticeably. Amid the excitement surrounding chatbots like ChatGPT and Google Bard, this decline has become a topic of discussion in the AI community, and given the capabilities promised by LLMs, it warrants deeper exploration.

Recently, a new paper titled “How Is ChatGPT’s Behavior Changing over Time?” has been published, authored by Lingjiao Chen, Matei Zaharia, and James Zou. This research work delves into the dynamic behavior of ChatGPT, exploring how its performance evolves over time. The insights provided by this paper shed light on the fascinating world of AI-driven conversational technology and its ongoing developments.

paper review.

The main focus of this paper is understanding why GPT-4’s accuracy has been dropping so sharply over the last couple of months. Prior to this paper, we had a very opaque understanding of how data and feedback are used to update a Large Language Model such as GPT-4. These unknowns made it extremely difficult to integrate these models into workflows, as there is large uncertainty in how the LLM will respond to a given prompt. Moreover, this uncertainty makes it very difficult to reproduce results from the “same” LLM. There’s an emphasis on the same, but I’ll get back to that later in this article.

The bigger issue does not lie in integration, but rather in the performance of these LLMs over time. To better understand whether or not an LLM such as GPT-4 is getting better over time, the models were tested on 4 main tasks:

  1. Solving math problems
  2. Answering sensitive/dangerous questions
  3. Generating code
  4. Visual reasoning
4 categories visualized.

The GPT-3.5 and GPT-4 models were compared against each other, and each model’s behavior was evaluated across two snapshots: the March 2023 version and the June 2023 version. Before jumping into the thick of this article, let’s understand what a Large Language Model is and how it works!
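
If you’d like to poke at this comparison yourself, here is a rough sketch of querying the two dated snapshots side by side. It assumes the 2023-era openai Python package (pre-1.0) and the dated model names OpenAI exposed at the time; the prompt is just a placeholder.

```python
import openai  # 2023-era (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment

PROMPT = "Is 131 a prime number? Think step by step, then answer Yes or No."  # placeholder prompt

def ask(model: str, prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # reduce sampling noise when comparing versions
    )
    return resp["choices"][0]["message"]["content"]

# Dated snapshots corresponding to the March and June versions studied in the paper.
for model in ("gpt-4-0314", "gpt-4-0613", "gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    print(model, "->", ask(model, PROMPT)[:120])
```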

understanding LLMs.

To put it simply, a large language model is an AI model that is capable of understanding human-language text input and generating human-like responses. It’s able to do so with the assistance of massive text data → in the case of ChatGPT, the internet. The model is trained on this massive text data so that it is able to recognize patterns in language and generate coherent responses. More specifically, these models are built on a specific type of neural network, called the Transformer. These neural networks have multiple layers which are organized into a hierarchical structure. The first set of LLMs were based on the Recurrent Neural Network architecture; the input would be a string of text and the model would predict what the next word would be. A prime example of this is when you go to draft an email on Gmail and it starts predicting what your next 3–4 words might be. However, the field has since shifted toward Transformers, so let’s jump into that.

Let me give a quick overview of Transformers.

Transformers were introduced by a group of researchers back in 2017, in the paper ‘Attention Is All You Need’. The Transformer architecture is built around the concept of self-attention → this mechanism allows the LLM to consider all the different parts of the text input together. This allows the model to place greater significance on the “more important” parts of the text input; in doing so, the model is able to identify relationships between words and, as a result, generate a more accurate output.

Transformers visualized!

The general idea of the attention mechanism is to compute a score for each word relative to the other words in the input, depending on the task. The model uses these scores to create a weighted representation of the input → this representation is then passed through a feed-forward neural network. This weighted representation plays a crucial role in enhancing the model’s ability to focus on relevant parts of the input when performing various tasks. By assigning higher scores to certain words or tokens, the attention mechanism effectively prioritizes the information that is most relevant to the task at hand. This selective attention allows the model to filter out noise and irrelevant details, leading to more accurate and contextually informed predictions.

One of the key advantages of the attention mechanism is its ability to capture long-range dependencies in the input data. Traditional neural network architectures often struggle with maintaining information across distant elements in a sequence, which can be a limitation in tasks involving long sentences or sequences of data. However, the attention mechanism enables the model to look back at any position in the input sequence and weigh its importance according to the current context, providing a solution to this problem.
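
To make the score-then-weight idea above concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The toy dimensions and random projection matrices are purely illustrative; real Transformers use learned projections, multiple heads, masking, and many stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # one score per pair of positions
    weights = softmax(scores, axis=-1)       # each row of weights sums to 1
    return weights @ V                       # weighted representation of the input

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))                    # 5 toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per token
```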

The Transformer architecture exhibits remarkable parallelizability, enabling it to process multiple pieces of information concurrently. As a result, large language models (LLMs) can efficiently handle vast volumes of data simultaneously. This characteristic has paved the way for the creation of ever-expanding language models like OpenAI’s GPT-3, boasting an astonishing 175 billion parameters.

Now let’s understand how these LLMs are trained!

Large language models undergo two primary stages in their operation:

  1. Pre-training: During pre-training, the model is exposed to an extensive dataset containing diverse text from the internet, encompassing books, articles, and websites. This phase enables the model to grasp the intricate patterns of language, encompassing grammar, syntax, and semantics, through unsupervised learning.

Pre-training can be executed in various ways depending on the model. For instance, OpenAI’s GPT models predict subsequent words in partially complete sentences, while Google’s BERT employs masked language modeling, where the model guesses randomly blanked words in a sentence. The model continually updates the weights of its parameters to minimize prediction errors, thereby learning to generate coherent and contextually relevant text.

Pre-training is the most resource-intensive and time-consuming phase in developing a large language model. To offer perspective, a single run of GPT-3 is estimated to cost over $4 million.

  2. Fine-tuning: After pre-training, the model undergoes fine-tuning using a smaller, task-specific dataset. In this phase, supervised learning is employed, providing the model with labeled examples of the desired output for the target task, such as translation, summarization, or sentiment analysis.

Fine-tuning allows the model to adapt its pre-learned knowledge to the specific demands of the given task. Techniques like gradient descent and backpropagation are often utilized to update the model’s parameters and optimize its performance on the task at hand. This process refines the model’s capabilities and enhances its proficiency in addressing specialized tasks.
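
To make the next-word objective and the gradient-based update described above a little more concrete, here is a toy sketch in PyTorch. The “model” is deliberately tiny and the token ids are random, so this is an illustration of the shape of the training step rather than anything resembling a real LLM.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

# Toy "language model": an embedding table plus a linear layer over the vocabulary.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A fake "sentence" of random token ids; inputs are tokens 0..n-2 and targets
# are tokens 1..n-1, so the model is trained to predict each next token.
tokens = torch.randint(0, vocab_size, (1, 16))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                                        # (1, 15, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                               # backpropagation
optimizer.step()                                              # gradient-based parameter update
print(float(loss))
```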

Now that we have a strong understanding of how these LLMs work, let’s try to understand why the accuracy of GPT-4 has been on the decline over the course of the last 4–5 months.

understanding the shift in accuracy.

Before jumping into the thick of this paper, let’s understand the evaluation tasks at hand. As I mentioned earlier, there are 4 main tasks being evaluated for the benchmark: solving math problems, answering sensitive questions, code generation, and visual reasoning.

The selection of these tasks is driven by two main factors. Firstly, these tasks are diverse and commonly used to evaluate Large Language Models (LLMs) in existing literature. Secondly, they are relatively objective, making them easier to evaluate in a standardized manner. For each task, a single dataset is employed, either sampled from existing datasets or constructed specifically for monitoring purposes. It is important to acknowledge that using just one benchmark dataset may not provide comprehensive coverage of a task’s complexity.

Importantly, the aim of this evaluation is not to provide an exhaustive analysis but to demonstrate the existence of substantial ChatGPT performance drift on relatively simple tasks. The presence of performance drift implies that the model’s behavior may vary over time or in different contexts, emphasizing the importance of continuous monitoring and evaluation.

Looking ahead, future evaluations will incorporate additional benchmarks, expanding the scope of assessment as part of a comprehensive, long-term study of LLM behavior. This approach enables researchers and developers to gain deeper insights into the capabilities and limitations of these models across diverse tasks and scenarios. By including more benchmarks, the evaluation process becomes more robust, offering a more complete and nuanced understanding of LLM service behavior. This holistic assessment contributes to a more informed and reliable perspective on the performance and reliability of Large Language Models, further advancing the state of the art in natural language understanding and processing.

Now let’s shift our attention to the metrics being used to evaluate these models. How can we quantitatively model and measure the LLM drifts across different tasks?

In this evaluation, a systematic approach is adopted, considering one main performance metric tailored to each specific task, along with two common additional metrics applicable across all tasks. This comprehensive assessment allows us to capture different aspects of Large Language Model (LLM) performance and monitor potential drift over time.

  1. For math problem solving, the primary metric is accuracy, which measures how frequently an LLM service generates correct answers. This metric is crucial for evaluating the model’s ability to solve mathematical challenges accurately.
  2. Addressing sensitive questions, the main metric is answer rate. It quantifies the frequency at which an LLM service provides direct answers to questions without evasion or obfuscation, making it a crucial metric for measuring transparency and reliability.
  3. In the context of code generation, the main metric focuses on the fraction of generated code that is directly executable. This metric evaluates how effectively the LLM service generates code that is functional and passes unit tests, indicating the model’s capability in producing reliable programming solutions.
  4. For visual reasoning tasks, the primary metric is the exact match, which assesses whether the LLM-generated visual objects precisely match the ground truth. This metric is fundamental for evaluating the model’s ability to reason visually and produce accurate results.

As part of the evaluation, we also consider two additional common metrics. The first is verbosity, which measures the length of the LLM’s generated outputs. This metric helps us understand the model’s conciseness and efficiency in producing responses.

The second additional metric is overlap, which compares the extracted answers from two versions of the same LLM service for the same prompt. It examines whether the answers differ and quantifies how much an LLM service’s desired functionality deviates over time, rather than merely focusing on textual output differences. For instance, in math problems, overlap is 1 if the generated answers are the same, even if the intermediate reasoning steps differ.

To measure the extent of performance drift, we compute the population mean for each metric in both the March and June versions of the LLM service and analyze their differences. This approach allows us to monitor changes in performance over time and understand how the model’s behavior evolves between different versions.
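
As a rough illustration of that bookkeeping, here is a small Python sketch of the verbosity and overlap metrics and the population-mean drift. The function names and toy records are my own, not taken from the paper’s code.

```python
from statistics import mean

def verbosity(generation: str) -> int:
    return len(generation)  # generation length in characters

def overlap(answer_march: str, answer_june: str) -> int:
    # 1 if the *extracted* final answers match, even when the reasoning differs
    return int(answer_march.strip().lower() == answer_june.strip().lower())

def drift(metric_march: list, metric_june: list) -> float:
    # difference of population means between the June and March versions
    return mean(metric_june) - mean(metric_march)

# Toy records: (extracted answer, full generation) per prompt, per version.
march = [("Yes", "Step 1 ... Step 4 ... [Yes]"), ("No", "Reasoning steps ... [No]")]
june = [("No", "[No]"), ("No", "[No]")]

print("verbosity drift:", drift([verbosity(g) for _, g in march],
                                [verbosity(g) for _, g in june]))
print("mean overlap:", mean(overlap(a, b) for (a, _), (b, _) in zip(march, june)))
```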

By employing a range of metrics specific to each task and additional common metrics, the evaluation provides a comprehensive and nuanced assessment of LLM service behavior, enabling us to monitor and analyze any potential drift that may occur over time.

solving math problems.

To understand the accuracy of the two LLMs being assessed (GPT-3.5 & GPT-4), a simple question was asked: whether or not a certain integer is prime. This task is simple, yet it’s a good one to focus on because it’s easy for us to understand while still requiring reasoning. The dataset used in this paper consisted of 500 questions; to help the LLMs reason, Chain-of-Thought prompting was leveraged, which is a common approach for reasoning-heavy tasks.
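
As a hypothetical sketch of this setup, here is how one might build the chain-of-thought prompt and the ground-truth check used to score the extracted [Yes]/[No] answer; the exact prompt wording in the paper may differ.

```python
def cot_prompt(n: int) -> str:
    # A chain-of-thought style prompt; the paper's exact wording may differ.
    return f"Is {n} a prime number? Think step by step and then answer [Yes] or [No]."

def is_prime(n: int) -> bool:
    # Ground truth used to score the model's extracted answer.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_correct(extracted_answer: str, n: int) -> bool:
    return ("yes" in extracted_answer.lower()) == is_prime(n)

print(cot_prompt(17077))
print(is_prime(17077))  # True: 17077 is prime, matching the example discussed below
```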

Surprisingly, a significant performance disparity was observed in Large Language Models (LLMs) for a seemingly straightforward task. As shown in Figure 2(a), GPT-4’s accuracy drastically dropped from 97.6% in March to a mere 2.4% in June, while GPT-3.5 demonstrated substantial improvement, with accuracy surging from 7.4% to 86.8%. Additionally, GPT-4’s responses became much more concise, with average verbosity decreasing from 821.2 characters in March to a mere 3.8 characters in June. In contrast, GPT-3.5’s response length grew by about 40%. Despite these changes, the answer overlap between their March and June versions remained small for both services.

One potential explanation for the significant difference lies in the effects of the chain-of-thoughts approach. Figure 2(b) presents an illustrative example where GPT-4 in March followed the chain-of-thought instruction effectively. It systematically decomposed the task into four steps, executed each step, and arrived at the correct answer that 17077 is a prime number. However, in June, the chain-of-thought approach failed, as the service did not generate any intermediate steps and simply produced a simple and incorrect answer (“No”). A similar drift pattern was observed for GPT-3.5 in March, where it tended to generate the answer “No” first and then performed the reasoning steps, resulting in incorrect nominal answers, despite the steps and final conclusion being correct. However, the June update appeared to rectify this issue, as GPT-3.5 started by presenting the reasoning steps before generating the correct answer (“Yes”). This observation highlights how even widely used prompting approaches like chain-of-thought can lead to significantly different performances due to LLM drifts.

answering sensitive questions.

The second task used to evaluate the models was answering sensitive questions. The reason for posing these questions was to address a concerning issue: when Large Language Models (LLMs) are prompted with sensitive questions, they have the potential to generate harmful outputs, including social biases, personal information, and toxic text. Thus, the goal here was to understand how LLMs’ responses to these questions change over time.

A dataset of 100 queries that LLMs aren’t supposed to answer directly was curated. Due to the complexity of automatically evaluating direct responses, all the responses were manually annotated. Observations revealed two significant trends in this task. Firstly, as depicted in Figure 3, GPT-4’s rate of directly answering sensitive questions decreased from 21.0% in March to 5.0% in June, while GPT-3.5 provided more direct answers, rising from 2.0% to 8.0% over the same period. This suggests that GPT-4 may have implemented a stronger safety layer in its June update, while GPT-3.5 became less conservative in its responses. Additionally, there was a reduction in GPT-4’s generation length (measured in characters), which dropped from over 600 to about 140.

The change in generation length is attributed not only to answering fewer questions but also to GPT-4 adopting a more concise approach in its refusal to answer certain queries. An illustrative example, shown in Figure 3(b), demonstrates this change. In both March and June, GPT-4 refused to answer an inappropriate query, but in March, it provided a detailed paragraph explaining the reasons for the refusal, whereas in June, it simply responded with “Sorry, but I cannot assist with that.” A similar trend was observed for GPT-3.5. This indicates that while these LLM services might have become safer in their responses, they also offer fewer rationales for refusing to answer certain questions.
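
The paper annotated these responses by hand. Purely as a hypothetical illustration, here is the kind of keyword heuristic one might use to approximate the answer-rate metric automatically; the marker list and function names are my own.

```python
# Crude heuristic for flagging refusals vs. direct answers (illustrative only;
# the paper's annotation was manual, not keyword-based).
REFUSAL_MARKERS = (
    "i cannot assist", "i can't assist", "i'm sorry", "as an ai",
    "i cannot provide", "i will not",
)

def is_direct_answer(response: str) -> bool:
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

responses = [
    "Sorry, but I cannot assist with that.",
    "Sure, here is a detailed plan ...",
]
answer_rate = sum(is_direct_answer(r) for r in responses) / len(responses)
print(f"answer rate: {answer_rate:.0%}")  # 50%
```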

code generation.

One of the major applications of LLMs is generating code; despite there being many code generation datasets, using them to assess LLMs may lead to a data contamination issue. To overcome this, the paper constructs a new code generation dataset. It contains the latest 50 problems from the ‘easy’ section of LeetCode (at the time of writing). “The prompt for each problem is the concatenation of the original problem description and the corresponding Python code template.”

To assess the generated outputs, each Large Language Model’s (LLM) generation was directly sent to the LeetCode online judge for evaluation. A generation is considered “directly executable” if the online judge accepts the answer as valid code. The evaluation revealed a decline in the number of directly executable generations from March to June. In Figure 4(a), it is shown that over 50% of GPT-4’s generations were directly executable in March, whereas this number dropped to only 10% in June. A similar trend was observed for GPT-3.5. Additionally, there was a small increase in verbosity for both models.

The big question here is, why did the directly executable generations decline?

The decrease in directly executable generations can be attributed to the June versions consistently adding extra non-code text to their outputs. Figure 4(b) provides an illustrative example of this phenomenon. Comparing the generations of GPT-4 in March and June, they appear almost the same except for two parts. First, the June version wrapped the code snippet in a markdown code fence, adding a triple-backtick line tagged python before it and a closing triple-backtick line after it. Second, it generated a few more comments. Though seemingly minor changes, the extra fence characters rendered the code non-executable. This issue can be particularly challenging to identify when the LLM’s generated code is used within a larger software pipeline, potentially leading to unintended consequences and errors.
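
Since the June failures came down to markdown fences wrapped around otherwise-correct code, a small post-processing step along these lines (my own sketch, not something from the paper) would make such generations executable again before they reach the judge or a larger pipeline.

```python
import re

FENCE = "`" * 3  # the markdown fence that the June versions started adding

def strip_code_fences(generation: str) -> str:
    # Pull out the fenced block if present; otherwise return the text as-is.
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, generation, re.DOTALL)
    return match.group(1) if match else generation

raw = FENCE + "python\nclass Solution:\n    def add(self, a, b):\n        return a + b\n" + FENCE
print(strip_code_fences(raw))  # prints just the code, without the fences
```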

visual reasoning.

The last area explored was visual reasoning, a task that requires abstract reasoning. To assess visual reasoning ability, the ARC dataset was employed, commonly used for this purpose. The task involves creating an output grid based solely on a few similar examples provided in the input grid.

Representing the input and output grids as 2-D arrays with element values denoting colors, we fed the LLM services 467 samples from the ARC dataset, tailored to fit within all services’ context window. We then measured the exact match between their generated output and the ground truth. Figure 5(a) demonstrates marginal performance improvements for both GPT-4 and GPT-3.5. However, the most noteworthy finding is that for over 90% of visual puzzle queries, the March and June versions produced the exact same generation. Nonetheless, the overall performance of these services remained relatively low, with a success rate of 27.4% for GPT-4 and 12.2% for GPT-3.5.
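
As a quick sketch of the exact-match metric on these grid representations (with invented toy grids):

```python
# Grids are 2-D arrays of color ids; a generation only counts as correct if it
# reproduces the ground-truth grid exactly, cell for cell.
def exact_match(generated: list, ground_truth: list) -> bool:
    return generated == ground_truth

ground_truth = [[0, 1], [1, 0]]
print(exact_match([[0, 1], [1, 0]], ground_truth))  # True
print(exact_match([[0, 1], [1, 1]], ground_truth))  # False: one cell differs
```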

An essential observation is that LLM services did not uniformly improve their generations over time. Despite better overall performance, GPT-4 in June made mistakes on queries that it had answered correctly in March. Figure 5(b) provides an example of such an occurrence. This highlights the necessity for fine-grained drift monitoring, especially in critical applications, to ensure the model’s performance remains reliable and consistent over time.

final thoughts.

The study has unearthed significant fluctuations in the behavior of GPT-3.5 and GPT-4, all unfolding within a surprisingly short timeframe. This underscores the importance of continuous evaluation of Large Language Models (LLMs) when they are deployed in real-world applications: vigilantly monitoring their performance and behavior is a crucial step toward ensuring their reliability and safety and preventing harmful outputs. As we unravel the dynamic behavior of these models, it becomes evident that the AI community stands at a crucial juncture.

This pivotal moment calls for a collective effort to shape the future trajectory of AI and machine learning. The findings from this study, combined with other research in the field, offer valuable insights that can guide the responsible development and deployment of LLMs. Emphasizing ethical considerations, transparency, and interpretability, we can pave the way for AI technologies that inspire trust and confidence among users and society at large.

If you have any questions regarding this article or just want to connect, you can find me on LinkedIn or my personal website :)
