Why Can’t LLMs Like ChatGPT Accurately Determine Whether 9.11 or 9.9 is Larger?

What Do You All Think?

Mr. Nobody
Mr. Plan ₿ Publication

--

This article reflects only the author's personal views and may well contain mistakes; readers are encouraged to discuss and correct any errors.

A straightforward answer to this question is to blame the tokenizer, but that explanation may oversimplify the issue.

For those who pointed to models like GPT-4o: their tokenizers can be inspected via OpenAI's website or the tiktoken library. The fact that "11" is tokenized as a single token does not, by itself, explain why this phenomenon occurs.

We can easily find counterexamples. The Llama series, for instance, has tokenized digits individually since its first generation, and open-source models such as Baichuan adopt the same strategy. This is documented in their technical reports, and you can verify it with code like the following.
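A minimal sketch of that check is below. It assumes the transformers library is installed; the checkpoint name `huggyllama/llama-7b` is used here only as an illustrative, openly mirrored first-generation Llama tokenizer, not something named in the original reports.

```python
# Sketch: checking whether a Llama-family tokenizer splits numbers
# into single-digit tokens. Assumes `transformers` is installed and
# the (illustrative) checkpoint "huggyllama/llama-7b" is reachable.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
for text in ["9.11", "9.9"]:
    ids = tok.encode(text, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    print(text, "->", pieces)
```

If digits are tokenized individually, no multi-digit piece such as "11" should appear in the output, which is exactly the counterexample to the single-token explanation.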

