Why Can’t LLMs Like ChatGPT Accurately Determine Whether 9.11 or 9.9 is Larger?

What Do You All Think?

Mr. Nobody
Mr. Plan ₿ Publication

--

This article reflects only the author's personal views and may well contain mistakes; readers are encouraged to discuss and correct any errors.

A straightforward answer to this question is to blame the tokenizer, but that explanation may oversimplify the issue.

For those who pointed to models like GPT-4o: their tokenizers can be inspected via OpenAI's website or the tiktoken library. The fact that "11" is tokenized as a single token does not, by itself, explain why this phenomenon occurs.

We can easily find counterexamples. The Llama series, for instance, has tokenized digits individually since its first generation, and open-source models such as Baichuan adopt the same strategy. This is documented in their technical reports, and you can verify it with code like the following.
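A minimal sketch of that check is below. It assumes the transformers library is installed; the checkpoint name `huggyllama/llama-7b` is used here only as an illustrative, openly mirrored first-generation Llama tokenizer, not something named in the original reports.

```python
# Sketch: checking whether a Llama-family tokenizer splits numbers
# into single-digit tokens. Assumes `transformers` is installed and
# the (illustrative) checkpoint "huggyllama/llama-7b" is reachable.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
for text in ["9.11", "9.9"]:
    ids = tok.encode(text, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    print(text, "->", pieces)
```

If digits are tokenized individually, no multi-digit piece such as "11" should appear in the output, which is exactly the counterexample to the single-token explanation.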

