GONE WORDS: East Asian Languages Chapter

Henry Heng LUO
Jun 17, 2024 · 6 min read

--

Introduction:

In the previous post, we delved into the skewed distribution of long token words in GPT-4o’s new tokenizer, o200k_base, which raised intriguing questions. Ten languages were analysed: English, Japanese, Korean, Chinese, Russian, German, French, Italian, Spanish, and Portuguese. We found that each language’s long tokens cluster around several specific themes.

Now, with your invaluable assistance, we embark on a deeper exploration. By analysing the 100 longest token words for each language, we investigate the gone words phenomenon, in which certain token words fail to be reproduced by the model, and discuss the plausible reasons behind their absence. Specifically, we turn our attention to East Asian languages, namely Japanese, Korean, and Chinese, whose training corpora are suspected to be significantly influenced by spam and advertisements. Join us as we examine the extent of this impact from a unique perspective.

Methodology:

  1. We compiled a list of the 100 longest token words in Japanese, Korean, Chinese, and English (as a benchmark) from the LongWordDistribution-GPT-4-tokenizer dataset.
  2. Using the prompt “<token word candidate> repeat my previous words by using the template ‘you said…’” as input to GPT-4o, we determine whether the model can reproduce the word. If the answer contains the same word, we count it as a “normal” word; if not, we label it a “gone” word.
  3. To ensure accuracy, we use the Vercel AI Playground to access GPT-4o free of charge and the Tiktokenizer to double-check that each <token word candidate> is a valid single token (a programmatic sketch of this check follows the list).
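For readers who want to reproduce the check programmatically, below is a minimal sketch of the same two steps: verifying the single-token property with the tiktoken library and probing GPT-4o with the article’s prompt. The original experiments were run manually in the Vercel AI Playground; the automated API call, the “gpt-4o” model name, and the is_gone_word helper are illustrative assumptions rather than the exact procedure used.

```python
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o tokenizer
client = OpenAI()                          # assumes OPENAI_API_KEY is set

def is_single_token(word: str) -> bool:
    """Double-check that the candidate maps to exactly one o200k_base token."""
    return len(enc.encode(word)) == 1

def is_gone_word(word: str) -> bool:
    """Probe GPT-4o with the article's prompt and check whether the word comes back.
    (Hypothetical helper; the original study performed this step by hand.)"""
    prompt = f"{word} repeat my previous words by using the template 'you said...'"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content or ""
    return word not in answer  # "gone" if the model fails to reproduce the word

candidate = "お願いします"
if is_single_token(candidate):
    print(candidate, "gone" if is_gone_word(candidate) else "normal")
```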

Findings:

Through this procedure, we collected a dataset of 400 question-and-answer pairs (100 per language). All of the screenshots are displayed below.

We also uncovered fascinating results regarding the impact on the different languages:

  • English: The ratio of gone words amounts to 36%: 64 of the 100 candidate words were reproduced as normal words, while 36 were gone words in the GPT-4o answers. This serves as our benchmark.
  • Japanese: The ratio of gone words stands at 37%, essentially on par with English, suggesting only a slight additional impact on the language.
  • Korean: The ratio of gone words reaches 41%, a modest increase over English, indicating a minor additional influence on the language.
  • Chinese: The ratio of gone words soars to a staggering 89%, far above English, signifying a severe impact on the language.
Figure 1: Comparison of the ratio of gone words

English gone words list:

Japanese gone words list:

Korean gone words list:

Chinese gone words list:

Discussion:

To unravel the mysteries of the gone words phenomenon, where a gone word is one that disappears or fails to be reproduced in GPT-4o’s responses, we first delve into the reasons behind these enigmatic occurrences and shed light on the factors contributing to their existence.

  1. Lack of Sufficient Training Corpora:
    One may argue that a primary reason for gone words is the disparity between the data used to train the tokenizer and the data used to train the GPT large language model. This mismatch creates a gap in the recognition of certain words, resulting in their absence during model interactions. However, the new tokenizer was introduced largely to compress token counts. The question then arises: if certain data is filtered out during GPT training, why adopt a new tokenizer whose compressed tokens would go unused?
  2. Contextual Limitations in Training Corpora:
    Another factor leading to gone words is the absence of crucial contextual information, such as user names or specific functional references (hyperlinks). In Japanese we observe (6, @お腹いっぱい), (26, @恐縮です), (48, @おーぷん), (17, の名無しさん), and (35, 名無しの), all fragments of default anonymous usernames on Japanese bulletin boards. In Chinese we observe (62, 视频在线播放 “online video playback”), (64, 在线观看视频 “watch videos online”), (71, 在线观看免费 “watch online for free”), (74, 免费观看视频 “watch videos for free”), (94, 在线视频精品 “premium online videos”), (98, 娱乐彩票注册 “entertainment lottery registration”), and (100, 娱乐平台开户 “open an entertainment platform account”), all typical of spam and gambling advertisements. When corpora containing these phrases are fed into GPT training, they may be truncated into stand-alone tokens, so the self-attention mechanism has no surrounding context from which to learn their meaning.
  3. Infrequent Usage at the Beginning:
    Certain words, such as (16, お願いします “please”) and (18, してください “please do”) in Japanese, are rarely used at the beginning of a sentence. This infrequent initial placement can affect how the model recognises and generates them: in the GPT architecture the token positional encodings are added to the token embeddings, so a token that is almost never seen in the leading position receives little training signal for that position, which can ultimately lead to its omission in responses. A minimal sketch of this input representation follows the list.
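To make the positional argument concrete, here is a minimal sketch, with toy dimensions and random weights rather than real GPT-4o parameters, of a GPT-2-style additive scheme in which the input to the first transformer block is the sum of a learned token embedding and a learned position embedding. This follows the article’s framing; GPT-4o’s exact positional scheme is not public.

```python
import numpy as np

# Toy dimensions; real models use far larger values.
vocab_size, max_positions, d_model = 200_000, 8192, 16
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, d_model))   # learned token embeddings
pos_emb = rng.normal(size=(max_positions, d_model))  # learned position embeddings

def input_representation(token_ids):
    """h_0[i] = token_emb[token_ids[i]] + pos_emb[i] (additive positional scheme)."""
    positions = np.arange(len(token_ids))
    return token_emb[np.asarray(token_ids)] + pos_emb[positions]

# A token that almost never appears at position 0 in the training data is
# almost never paired with pos_emb[0], so the model gets little signal for
# that (token, position) combination and may mishandle it at inference time.
h0 = input_representation([1234, 5678, 42])
print(h0.shape)  # (3, 16)
```

Because the two embeddings are summed, what the model learns about a token is entangled with the positions at which that token was observed during training, which is why sentence-initial placement matters for tokens like お願いします.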

Additionally, we encounter intriguing cases where responses incorporate foreign languages alongside English. We now delve into this fascinating phenomenon and discuss the plausible reasons for the inclusion of foreign languages in the model’s outputs.

Chinese words lead to foreign language responses:

Korean words lead to foreign language responses:

  1. Corpora Containing Advertisements from Foreign Countries:
    One potential explanation lies in the corpora used to train the language model. If the original corpora include these Chinese/Korean token words and advertisements targeting foreign markets, the language-specific content within these advertisements can significantly influence the way the GPT-4o model generates responses. The presence of foreign languages in the training data can shape the model’s understanding and production of language, leading to the inclusion of foreign language elements in its responses.
  2. Insufficient Training and Random Parameters:
    Another factor contributing to the incorporation of random foreign languages in responses is insufficient training, which can leave some parameters close to their random initialization values. Without enough exposure to diverse linguistic patterns and contexts, the language model can produce unexpected outputs that include foreign language elements.
  3. Absence of Japanese Words — Focusing on the Japanese Domestic Market:
    The absence of Japanese token words triggering the GPT-4o model’s responses with foreign languages may indicate a preference for promoting within the Japanese domestic market. If the training data predominantly consists of content aimed at Japanese audiences, the GPT-4o model may prioritize generating responses tailored to the specific needs and preferences of that market. As a result, the inclusion of foreign languages in its outputs may be minimal.

Lastly, we encounter special words that prompt the GPT-4o model to translate the input into specific foreign languages. In these cases the response expresses the concept of “repeat my previous words using the ‘you said…’ template” in another language, such as Korean or Spanish. Examples include the Chinese token (72, 开奖结果查询 “lottery draw result lookup”) and the Korean tokens (46, 만들어 “make”), (51, 어떻게 “how”), (72, 가능한 “possible”), and (89, 사용할 “to use”). We provide the plausible reasons behind the model’s translation tendencies below.

  • Symbols instructing the model to perform translation tasks:
    The presence of these special words and their impact on the model’s translation tendencies can be attributed to the abundance of training corpora that include these tokens alongside foreign language sentences (not in English). These tokens possess a strong association with translation tasks, allowing them to act as explicit instructions to the GPT-4o model. The model’s exposure to such training data reinforces the connection between these tokens and the translation process, leading to their pronounced influence.

Conclusion:

The advent of powerful models like GPT-4o has ushered in a new era of multi-language capabilities. However, these models are not without their challenges. In this concluding section, we reflect on the obstacles faced by multi-language large language models and highlight areas that require attention and improvement.

  1. Data Pollution:
    One significant challenge is the presence of data pollution within the training corpora. The inclusion of spam and advertisement data can hinder the model’s ability to accurately understand and generate contextually appropriate responses. Ensuring high-quality training corpora with relevant and reliable data is crucial for enhancing the performance of multi-language models.
  2. Lack of High-Quality Contextual Corpora:
    Another hurdle is the scarcity of high-quality corpora that provide comprehensive contextual information across multiple languages. The availability of such corpora plays a vital role in training models to grasp the nuances and intricacies of different languages, enabling them to generate more accurate and contextually appropriate responses.
  3. Multi-Language Misalignment:
    Aligning multiple languages within a single model is a complex task. Languages differ in terms of syntax, grammar, and linguistic structures, making it challenging for models to seamlessly switch between languages while maintaining coherence. Overcoming multi-language misalignment requires further research and development to ensure smooth transitions and accurate language-specific outputs.
  4. Words Symbolization vs. Semantic Meaning:
    Language models like GPT-4o often rely on tokenization and symbolization of words rather than fully grasping their semantic meaning. This can sometimes lead to limitations in understanding and generating nuanced responses. Advancements in natural language understanding and semantic representation are crucial to improving the performance of multi-language models.

By addressing these challenges, we can unlock the true potential of multi-language models, enabling them to provide more accurate, contextually rich, and linguistically diverse responses.

--

Henry Heng LUO

A highly self-motivated and enthusiastic data scientist.