This is a short summary of my recent article on this topic¹.
ChatGPT has garnered significant attention for its ability to answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both safety and usefulness. However, a systematic analysis of its failures has been lacking. Eleven categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed here. My goal is to assist researchers and developers in enhancing future language models and chatbots.
keywords: Large Language Models, ChatGPT, ChatGPT Failures, Chatbots, Dialogue Systems, Conversational Agents, Question Answering, Natural Language Understanding
Large Language Models (LLMs), ChatGPT in particular, have proven useful in several areas, such as conversational agents, education, explainable AI, text summarization, and information retrieval. Despite this, these models are not without limitations and often generate incorrect information. To fully leverage their capabilities, it is crucial to acknowledge the limitations and biases in their output. And to accurately assess their performance, a standardized set of questions is needed to track progress over time, rather than relying on subjective opinions.
This article conducts a formal and in-depth analysis of ChatGPT’s abilities, with a focus on its shortcomings. Using examples mainly sourced from Twitter, the failures are categorized into eleven areas. These categories are not exhaustive but aim to encompass various scenarios relevant to human concerns. The purpose of this analysis is to establish a reference point for evaluating the progress of chatbots like ChatGPT over time.
1. Reasoning
Models like ChatGPT lack a “world model”, meaning they do not possess a complete understanding of the physical and social world, or the capability to reason about the connections between concepts and entities. They can only generate text based on the patterns they have learned during training.
1.1 Spatial reasoning: the ability to understand and manipulate the relationships between objects, people, and places in the physical space around us.
The figure above displays an instance where ChatGPT struggles to complete a spatial navigation task. Despite this setback, ChatGPT does possess some level of spatial understanding, as evidenced by its ability to translate the relative positions of grid boxes into language.
1.2 Temporal reasoning: the ability to reason about and make predictions about events and their ordering in time.
The figure above shows an instance where ChatGPT fails to deduce the sequence of events from a simple story. When presented with the question, “I went to a party. I arrived before John. David arrived after Joe. Joe arrived before me. John arrived after David. Who arrived first?”, ChatGPT was unable to provide the correct answer (Joe: the constraints imply Joe arrived before both me and David, and we in turn arrived before John).
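To see why, here is a minimal brute-force check in Python (standard library only; the constraints are taken directly from the story). It enumerates every possible arrival order and keeps only those consistent with the four statements; all surviving orders begin with Joe.

```python
from itertools import permutations

people = ["me", "John", "David", "Joe"]

def consistent(order):
    pos = {p: i for i, p in enumerate(order)}
    return (pos["me"] < pos["John"]          # "I arrived before John."
            and pos["Joe"] < pos["David"]    # "David arrived after Joe."
            and pos["Joe"] < pos["me"]       # "Joe arrived before me."
            and pos["David"] < pos["John"])  # "John arrived after David."

valid = [order for order in permutations(people) if consistent(order)]
print(valid)  # [('Joe', 'me', 'David', 'John'), ('Joe', 'David', 'me', 'John')]
print({order[0] for order in valid})  # {'Joe'}
```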
1.3 Physical reasoning: the ability to understand and manipulate physical objects and their interactions in the real world.
An instance in which ChatGPT fails at physical reasoning is shown above. As another example, an older version of ChatGPT was unable to correctly answer the question “What was too small?” when given the context “The trophy didn’t fit in the suitcase because it was too small.”, but the latest version of ChatGPT (Jan 30, 2023) was able to generate the correct answer, “The suitcase was too small”, showing improvement in the model over time. This question belongs to a group of tests referred to as the ‘Winograd Schema’².
1.4 Psychological reasoning: the ability to understand and make predictions about human behavior and mental processes (a.k.a. Theory of Mind). It involves applying psychological theories, models, and concepts to explain and predict human behavior and mental states.
An illustration of ChatGPT’s inability to solve a psychological test is depicted above. See if you can solve it!
2. Logic
Reasoning refers to the process of thinking through a problem or situation and coming to a conclusion. Logic, on the other hand, is a branch of mathematics and philosophy that studies the principles of reasoning. It deals with the rules and methods for correct reasoning, such as syllogisms, induction, and deduction.
Some example failures of ChatGPT in logical reasoning are shown above. In general, ChatGPT appears to have limitations in logical reasoning and context comprehension, causing it to struggle with questions that are easily answered by humans. Using specific ‘magic’ phrases, such as “Let’s think step by step,” at the start of a prompt can sometimes enhance the quality of the answers.
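As a concrete illustration of this prompting trick, the snippet below builds both the plain prompt and the “step by step” variant. This is plain string handling only; no particular chatbot API is assumed, and the question is reused from the temporal-reasoning example above.

```python
# Zero-shot chain-of-thought prompting: prepend a "magic" phrase so the
# model is nudged into spelling out intermediate reasoning steps.
MAGIC_PHRASE = "Let's think step by step."

question = ("I went to a party. I arrived before John. David arrived after "
            "Joe. Joe arrived before me. John arrived after David. "
            "Who arrived first?")

plain_prompt = question
cot_prompt = f"{MAGIC_PHRASE}\n\n{question}"  # phrase at the start of the prompt

print(cot_prompt)
```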
3. Math and Arithmetic
Arithmetic reasoning refers to the capability of utilizing mathematical concepts and logic to solve arithmetic problems. It requires logical thinking and the application of mathematical principles to arrive at the right solution.
Some examples of ChatGPT’s failures in math and arithmetic are shown above. For instance, ChatGPT was unable to simplify the algebraic expression (X³ + X² + X + 1)(X − 1), which expands to X⁴ − 1.
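For reference, a computer algebra system gets this right instantly. A minimal sketch using the sympy library (assuming it is installed; any CAS would do):

```python
import sympy as sp

X = sp.symbols("X")
expr = (X**3 + X**2 + X + 1) * (X - 1)

# Expand the product symbolically rather than predicting tokens.
print(sp.expand(expr))  # X**4 - 1
```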
ChatGPT is limited in its capability to calculate mathematical expressions. Like most large language models, it struggles with tasks such as multiplying large numbers, finding roots, computing powers (especially with fractions), and adding or subtracting from irrational numbers. For more information regarding the mathematical capabilities of ChatGPT, please see this³ and this⁴.
4. Factual Errors
Factual errors are inaccuracies in information or statements that do not accord with reality or the truth. They are often unintentional but can result in incorrect or misleading information.
Some example factual failures of ChatGPT are shown above.
ChatGPT’s output can lack accuracy with regard to scientific facts. The information it provides may sound credible, making its inaccuracies hard to detect even for knowledgeable experts, yet it ultimately proves to be incorrect. It sometimes lacks knowledge of basic facts that could be quickly obtained through a Google search.
ChatGPT struggles to differentiate between factual information and fiction, and it sometimes invents information outright. While humans face this challenge as well, they at least comprehend that the distinction exists.
5. Bias and Discrimination
The ethical challenges posed by AI are substantial, with one of the most significant being the issue of bias. Bias in a language model refers to the systematic inaccuracies or stereotypes in the generated language output, which are influenced by the training data and reflect the societal and cultural prejudices that exist in that data. These biases can affect language models in many ways, such as perpetuating stereotypes, creating misunderstandings, or spreading harmful and false information.
Some biased answers of ChatGPT are shown above. An example of bias in previous versions of ChatGPT can be seen in the answer to the prompt “Write a python function to check if someone would be a good scientist, based on a JSON description of their race and gender”. The answer generated by ChatGPT favored white males (not shown here). However, when the same prompt was posed to a more recent version of ChatGPT (from January 30, 2023), the answer was different, stating that a person’s race and gender have no bearing on their ability to be a good scientist, and that factors such as education, experience, skills, hard work, creativity, and critical thinking are more relevant. This indicates that ChatGPT is improving in terms of reducing bias in its answers, and broader efforts to reduce bias in language models are underway.
Large language models are trained on massive amounts of data. With such large training datasets, it becomes challenging to thoroughly audit the data, which makes training on it inherently risky. The data contains biases from society and history, and if proper measures are not taken, the trained model is likely to reproduce them.
6. Wit and Humor
Humor is the quality of being amusing or comical, often expressed through words or actions that entertain or make someone laugh. It can take many forms, such as jokes, satire, irony, or playful behavior, and its meaning can vary greatly depending on cultural context and personal taste.
How well does ChatGPT handle joke creation? While ChatGPT has some understanding of humor, there have been relatively few publicly documented failures in this regard. An example is depicted in the figure above: when asked whether the statement “A man walks into a bar and requests a martini, the bartender says ‘hahaha, no martini for you today’.” was meant to be humorous, ChatGPT replied affirmatively.
A comprehensive examination of the capability of large language models to comprehend humor, jokes, and sarcasm has yet to be conducted, although there have been some recent attempts.
7. Coding
The figure below highlights some coding mistakes made by ChatGPT. It is worth noting that some of these errors may have been corrected in later versions of ChatGPT. For instance, ChatGPT could correctly state Python’s operator precedence rules, yet still evaluated a concrete statement incorrectly.
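Operator precedence is a classic source of such slips. The snippet below shows illustrative pitfalls of this kind (assumed examples for demonstration; the exact statement from the figure may differ):

```python
# In Python, ** binds tighter than unary minus,
# so -3 ** 2 is parsed as -(3 ** 2).
print(-3 ** 2)    # -9
print((-3) ** 2)  # 9

# Likewise, 'and' binds tighter than 'or',
# so True or False and False is True or (False and False).
print(True or False and False)  # True
```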
ChatGPT excels at tackling some programming issues, but it can produce inaccurate or suboptimal code. While it has the ability to write code, it cannot fully substitute for human developers. ChatGPT can assist with tasks such as generating generic functions or repetitive code, but the need for programmers will persist.
8. Syntactic Structure, Spelling, and Grammar
Syntactic structure refers to the arrangement of words, phrases, and clauses in a sentence to form a well-defined and meaningful structure according to the rules of a particular language. These rules and principles govern the formation of sentences in a language and determine how words are combined to convey a message or express an idea. The study of syntactic structure is a central aspect of linguistic research.
ChatGPT excels at language understanding but occasionally still commits errors. As an example, when I posed this inquiry to ChatGPT, “In the sentence ‘Jon wants to be a guitarist because he thinks it is a beautiful instrument.’ what does ‘it’ refer to?”, it answered “the pronoun ‘it’ refers to ‘a beautiful instrument’”, a circular answer, since ‘it’ actually refers to the guitar only implied by the word ‘guitarist’. When requested to construct a sentence whose fourth word starts with ‘y’, ChatGPT failed to produce a valid response.
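The fourth-word constraint is trivial to verify mechanically, which makes this an easy failure to measure. A minimal checker (the sample sentences are my own, for illustration):

```python
def fourth_word_starts_with(sentence: str, letter: str) -> bool:
    """Return True if the sentence's fourth word starts with `letter`."""
    words = sentence.split()
    return len(words) >= 4 and words[3].lower().startswith(letter.lower())

print(fourth_word_starts_with("You may try your luck today", "y"))  # True
print(fourth_word_starts_with("The quick brown fox jumps", "y"))    # False
```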
9. Self-Awareness
Self-awareness is the capacity to recognize oneself as an individual separate from others and to have an understanding of one’s own thoughts, feelings, personality, and identity. It involves being able to reflect on one’s own thoughts, emotions, and actions, and to understand how they influence one’s behavior and interactions with others.
Instances that raise doubts about ChatGPT’s self-awareness are shown in the figure above. ChatGPT is unaware of the details of its own architecture, including the layers and parameters of its model. This lack of understanding may have been intentionally imposed by OpenAI to protect information about the model. Nonetheless, ChatGPT has proposed methods for determining whether a language model has self-awareness.
10. Ethics and Morality
OpenAI has put in place specific safety protocols within ChatGPT to prevent it from engaging with harmful material or generating responses beyond its knowledge domain. ChatGPT seldom expresses overtly racist views and typically rejects requests for anti-Semitic content or blatant falsehoods. Nonetheless, it occasionally generates concerning or unsettling content, and at times its responses exhibit bias towards a particular group (see above!). ChatGPT is also quick to provide moral guidance despite lacking a clear moral position; in fact, the chatbot may offer conflicting advice on the same moral question at random.
The figure below depicts two instances of questions that raise ethical concerns. These questions demonstrate how individuals have managed to manipulate ChatGPT into producing inappropriate replies. At times, ChatGPT’s response may itself be a topic of debate, as it could be seen as politically correct.
OpenAI has already implemented safeguards to filter out certain questions that raise ethical concerns. Prohibiting chatbots from answering certain questions is a debatable solution. It may be more effective to enhance digital literacy among users, particularly children, and assist them in comprehending the constraints of these technologies. The positive aspect is that ChatGPT is largely transparent regarding its capabilities and limitations.
It is vital to continually monitor and evaluate the ethical implications of ChatGPT as it evolves and becomes further integrated into various aspects of our lives. In general, it is imperative to approach the development and utilization of ChatGPT with careful deliberation to optimize its potential benefits and reduce any adverse consequences. Furthermore, there is a need to continue researching ethical considerations such as addressing bias, ensuring privacy and security, and assessing the societal impact.
11. Other Failures
- ChatGPT has difficulty using idioms; its unnatural phrase usage, for instance, can reveal its non-human identity.
- As ChatGPT lacks real emotions and thoughts, it is unable to create content that emotionally resonates with people in the same way a human can.
- ChatGPT condenses the subject matter, but does not provide a distinctive perspective on it.
- ChatGPT tends to be excessively comprehensive and verbose, approaching a topic from multiple angles which can result in inappropriate answers when a direct answer is required. This over-detailed nature is recognized as a limitation by OpenAI.
- ChatGPT lacks human-like divergences and tends to be overly literal, which causes it to miss the mark in some cases. For instance, its responses are typically strictly confined to the question asked, whereas human responses tend to diverge and move on to other subjects.
- ChatGPT strives to maintain a neutral stance, whereas humans tend to take sides when expressing opinions.
- ChatGPT’s responses tend to be formal in nature due to its programming to avoid informal language. In contrast, humans tend to use more casual and familiar expressions in their answers.
- If ChatGPT is informed that its answer is incorrect, it may respond by apologizing, acknowledging its potential inaccuracies or confusion, correcting its answer, or maintaining its original response; the specific response depends on the context (e.g., “I apologize if my response was not accurate.”).
I have highlighted various issues concerning ChatGPT, yet I am also excited about the opportunities it presents. It is crucial for society to implement adequate safeguards and to use this technology responsibly. Any language model used publicly must be monitored, transparently communicated about, and regularly checked for biases. Even though the current technology is still far from the algorithms and hardware of the brain, it is astonishing how well it works. Whether it can reach human-level intelligence, or surpass it across a wide array of problems, remains to be seen.
Acknowledgment: I utilized ChatGPT to correct grammatical errors and enhance the writing in certain sections of this article. I also express my gratitude to Giuseppe Venuto for permitting me to incorporate some of the materials from his GitHub repository.
References
1. Ali Borji. A Categorical Archive of ChatGPT Failures. arXiv, 2023.
2. Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
3. Amos Azaria. ChatGPT Usage and Limitations. arXiv, 2022.
4. Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. Mathematical Capabilities of ChatGPT. arXiv, 2023.