Large Language Models are not good at logic

Large Language Models (LLMs) tend to make mistakes in logic. This comes from a difficulty in separating false from true. As one example, it is common when using Bing with chatGPT that it searches many websites and answers based on the information found, without judging that information. As a human, I can judge how trustworthy a piece of information is.

As I was writing my latest paper, “SnakeChat: a conversational-AI based app for snake classification” (which I invite you to read for more context), I noticed something interesting: LLMs do not seem to be good at logic. I mean deductive logic. Suppose I give you a sequence of facts, and you have the knowledge to check them. You will use deductive logic to reach a conclusion. That is: if this premise is true, and this is an established fact, then this conclusion is true.
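To make concrete what I mean by deductive logic, here is a minimal sketch in Python; the fact and the rule are hypothetical, chosen only to illustrate the "if this is true, and this is a fact, then this is true" chain:

```python
# Minimal sketch of deductive (modus ponens) reasoning over a set of facts.
# The fact and the rule below are hypothetical, used only for illustration.

facts = {"the photo shows red, black and white rings"}   # what we take as established

rule = (
    "the photo shows red, black and white rings",         # premise: if this is a fact...
    "the snake may be a coral snake",                      # conclusion: ...then this follows
)

premise, conclusion = rule
if premise in facts:
    # Modus ponens: the premise is among the facts and the rule holds,
    # so the conclusion can be accepted.
    print(conclusion)
```

This is the kind of check a human reader performs almost automatically; my point here is that the LLM often skips it.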

Consider the example below from my paper:

“Based on the description provided and the refinement from our specialized model, the snake in the photo appears to be a Micrurus frontalis, also known as the frontal coral snake or São Paulo coral snake, with a high probability of 76.48%. This snake species is characterized by a grey and black coloration with distinct black patterns edged with lighter borders [this is a well-known trait from the Bothrops]. Please note that coral snakes can be highly venomous, so if this is a real-world situation, it is important to maintain a safe distance and avoid any interaction with the snake.” (SnakeChat: a conversational-AI based app for snake classification)

Some context: there are models under the hood that make predictions from images. Those models are called by chatGPT; when called wrongly, they return strange predictions, which was the case here. The wrong call created a wrong prediction, which induced the LLM to draw its final conclusion from wrong information.
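As an illustration of how that propagation works, here is a minimal, hypothetical sketch of such a pipeline. The function names, the crop argument, and the Bothrops label are my own inventions for the example; this is not the SnakeChat implementation:

```python
# Hypothetical sketch of an LLM-plus-tool pipeline: the image classifier is a
# "tool" the chat model calls, and the chat model writes fluent prose around
# whatever the tool returns, without checking it for consistency.

def classify_snake(image_path: str, crop_box=None) -> dict:
    """Stand-in for the image model. A wrong call (here, a missing crop) makes
    it return a confident but wrong label."""
    if crop_box is None:  # wrong call: the classifier never saw the snake properly
        return {"species": "Micrurus frontalis", "confidence": 0.7648}
    return {"species": "Bothrops jararaca", "confidence": 0.91}

def llm_answer(prediction: dict) -> str:
    """Stand-in for the chat model: it wraps the tool output into a confident
    answer, but never questions whether that output makes sense."""
    return (f"The snake in the photo appears to be a {prediction['species']} "
            f"with a high probability of {prediction['confidence']:.2%}.")

# A wrong call produces a wrong prediction, and the LLM turns that wrong
# prediction into a confident final conclusion.
print(llm_answer(classify_snake("photo.jpg")))
```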

Micrurus frontalis is a snake that does not strike. What is curious: I entered the same conclusion on Bing, which has chatGPT under the hood, and it confirmed the conclusion, using websites as citations. I also asked BARD to assess the correctness of the paragraph, and it confirmed it was correct. When I confronted BARD, it apologized and fixed the misinformation.

In fact, when chatGPT gained momentum, a joke going around on social media was about a person confronting chatGPT on a math problem on which chatGPT was correct. After being confronted, chatGPT actually agreed with the person, who cited his wife as the source. It seems to be the same behavior: a wrong function call triggered a wrong prediction, which triggered a wrong argument as the conclusion. This is a real risk with those models, also present on Bing. Another reported behavior was chatGPT explaining how a slug undergoes metamorphosis: it actually gave details and explanations of how that happens, which makes no sense. It is not always possible to replicate those behaviors, but such senseless conclusions can happen. Even though they were reported at the beginning of the hype around chatGPT, it seems the problem is still present, as we can see from this case.

These observed behaviors suggest that LLMs may not be good at confronting information. They are excellent for data mining, but can be problematic when you need to make sure a piece of information is consistent. As we learn about these models, it is natural that we eventually find flaws.
