Bias Alignment: Atypical Stereotypical Nationality Analysis

Henry Heng LUO
10 min read · May 15, 2024

Watch your thoughts for they become words. Watch your words for they become actions. Watch your actions for they become habits. Watch your habits for they become your character. And watch your character for it becomes your destiny. What we think, we become.

By Margaret Thatcher

Greetings, dear readers! Today, I want to emphasize the importance of the words you share on the internet. They have the power to leave a lasting impact and serve as nourishing fuel for powerful language models like ChatGPT. These incredible tools can greatly influence not only you but also your family, neighbors, society, and even your country. They have the ability to shape your thoughts and behaviors, which, when combined with certain aligned biases, can give rise to some rather unusual and stereotypical nationalities.

As many of you might already know, OpenAI is widely believed to have collected an enormous amount of free online content through web scraping and to have used this vast collection to train its state-of-the-art flagship model, GPT-4o. This cutting-edge model possesses real-time reasoning capabilities across audio, vision, and text. Today, we’ll embark on a fascinating journey to uncover the hidden gems behind the scenes of GPT-4o.

First, let’s delve into the technical side of things. Before text is fed to a large language model built on the transformer architecture, it undergoes a process called tokenization. This involves breaking the text down into smaller units, known as tokens, instead of treating it as one continuous string, so a single string usually consists of multiple tokens. This approach saves computation by letting the model process a fixed vocabulary of tokens efficiently, and it enhances semantic coherence by establishing relationships between them. How we split a string into sub-word tokens depends on the level of detail we wish to explore in terms of semantic meaning and, importantly, on the distribution of usage frequency: tokens tend to form from frequently used sub-word strings.
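To make the last point concrete, here is a minimal sketch of a byte-pair-encoding style merge loop, written for illustration only. It is a toy stand-in and not how o200k_base was actually built; the function name `learn_bpe_merges` and the sample words are assumptions of mine.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE-style training: repeatedly merge the most frequent adjacent pair.

    Illustrative only; real tokenizers work on bytes and far larger corpora.
    """
    # Each distinct word becomes a tuple of single-character symbols with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # max returns the first pair with the highest count.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


if __name__ == "__main__":
    # Hypothetical mini-corpus: "thanks" is frequent, "tank" is rare.
    sample = ["thanks"] * 6 + ["thank"] * 4 + ["tank"] * 2
    merges, corpus = learn_bpe_merges(sample, num_merges=4)
    print(merges)
    print(dict(corpus))
```

The more often a character sequence appears, the earlier it gets fused, which is why a long token in the final vocabulary hints at a string that was very common in the training text.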

Now, GPT-4o introduces a new tokenizer called o200k_base. We hypothesize that longer sub-word tokens contain more information and are likely to represent the most frequently used concepts. With this in mind, we wrote a script to collect the 100 longest tokens for each language and analyzed them to uncover the atypical and stereotypical nationalities associated with them. (To determine the language of these tokens, we employed tools like langid and langdetect, with a probability threshold set at 0.5.)
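The collection step might look roughly like the sketch below. This is not the original script: `longest_tokens_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is a crude Unicode-script guesser standing in for langid or langdetect so that the example runs without any downloads. With the real vocabulary, the token list would come from the tokenizer library (already decoded to text) and `detect` would wrap an actual language detector.

```python
from collections import Counter, defaultdict


def toy_detect(text):
    """Hypothetical stand-in for a language detector.

    Returns (language_code, share_of_characters_in_that_script).
    """
    counts = Counter()
    for ch in text:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:        # hiragana and katakana
            counts["ja"] += 1
        elif 0xAC00 <= code <= 0xD7A3:      # hangul syllables
            counts["ko"] += 1
        elif 0x0400 <= code <= 0x04FF:      # cyrillic
            counts["ru"] += 1
        elif ch.isascii() and ch.isalpha():  # plain Latin letters
            counts["en"] += 1
    if not counts:
        return None, 0.0
    lang, hits = counts.most_common(1)[0]
    return lang, hits / len(text)


def longest_tokens_by_language(tokens, detect, top_n=100, min_prob=0.5):
    """Group tokens by detected language and keep the top_n longest of each."""
    buckets = defaultdict(set)
    for token in tokens:
        text = token.strip()  # vocabulary entries often carry a leading space
        if not text:
            continue
        lang, prob = detect(text)
        # Keep only confident guesses, mirroring the 0.5 threshold in the text.
        if lang is not None and prob >= min_prob:
            buckets[lang].add(text)
    return {
        # Longest first; ties are broken alphabetically for reproducibility.
        lang: sorted(items, key=lambda t: (-len(t), t))[:top_n]
        for lang, items in buckets.items()
    }


if __name__ == "__main__":
    # Hypothetical mini-vocabulary in place of the real o200k_base tokens.
    vocab = [" Telecommunications", "the", "ありがとうございました", "ありがとう", "감사합니다", "1234"]
    print(longest_tokens_by_language(vocab, toy_detect, top_n=1))
```

Note that short or mixed-script tokens are hard for any detector, so a threshold like 0.5 still lets some tokens land in the wrong language bucket.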

So, my friends, it’s crucial to be mindful of your online expressions. You never know what kind of unexpected stereotypes may emerge when these powerful language models come into play. Stay tuned as we share intriguing insights into the o200k_base tokenizer and the fascinating nationalities it reveals. Get ready for an exciting revelation!

In the upcoming discussion, we will explore ten languages: English, Japanese, Korean, Chinese, Russian, German, French, Italian, Spanish, and Portuguese. The countries where these languages are spoken stretch from east to west, forming a captivating mix of cultures.

Specialization, Variety and Dominant Technology in the English

The word rank table reflects the prevalence of technical and specialized vocabulary and the influence of technology on modern discourse, and also underscores the need for individuals to adapt and familiarize themselves with specialized terminology to effectively communicate and navigate the rapidly changing world.

Topping the list is “Telecommunications,” indicating the significance of this field in our interconnected world. The subsequent ranks are dominated by words associated with technology, internet infrastructure, and information sharing, such as “.githubusercontent,” “Redistributions,” and “internationally.” This trend underscores the increasing importance of digital connectivity and global collaboration.

Further down the list, we encounter terms that reflect advancements and progress, such as “transformations,” “technologically,” and “specifications.” These words highlight the rapid evolution and constant innovation within various sectors. Additionally, the presence of terms like “investigations,” “electronically,” and “systematically” suggests a focus on research, analysis, and methodical approaches.

While many of the words in the ranking exhibit technical or professional connotations, there is also an acknowledgment of human connections and interactions. Words like “relationships,” “companionship,” and “interactions” remind us that amidst technological advancements, interpersonal connections remain essential.

Politeness, Gratitude, and Anonymity in the Japanese

The word rank table suggests an active and engaged online community that values politeness, gratitude, and specific cultural references, and reflects the unique dynamics and interests of this particular group, making it an intriguing snapshot of their communication patterns.

At the top of the ranking are expressions of gratitude, with “ありがとうございました” (Thank you very much) and “ありがとうございます” (Thank you) taking the first and second positions. This highlights the importance of politeness and appreciation in interactions within this community. Other polite expressions and fragments also appear, including ございました, お願いします, してください, ありがとう, こんばんは, ください, よろしく, and りまして.

Additionally, phrases like “VIPがお送りします” (Sent by VIP) and “風吹けば名無し” (When the wind blows, an anonymous person) suggest the presence of specific community dynamics or inside jokes; both are recognizable as default poster names on Japanese anonymous message boards. The inclusion of terms such as “スーパーコピー” (Super copy) and “トラックバック” (Trackback) indicates a focus on topics related to replicas and online content sharing or referencing.

Throughout the list, there is a mix of expressions commonly used in online communication, such as “お問い合わせ” (Inquiry), “続きを読む” (Read more), and “コメント” (Comment), indicating a high level of user engagement and interaction. Recurring phrases like “風吹けば名無し” and “名無しさん” (Anonymous) suggest ongoing discussions within the community and reflect the nature of online anonymity, where individuals can freely express their opinions without revealing their true identities. Just as the wind makes a sound when it blows, anonymous users can make their voices heard without attaching their names or profiles to their statements.

Kindness, Business and Online Service in the Korean

The word rank table relates to gratitude, possibilities, services, technology, data, society, and the internet. It suggests a community that discusses various aspects of daily life, including digital platforms, information management, societal affairs, and online interactions.

At the top of the ranking, we see “가나다라마바사” (the first seven syllables of the conventional Korean alphabetical sequence), indicating the primary sequence of letters in the language. The presence of “감사합니다” (thank you) and “가능합니다” (it is possible) suggests a community or context that values gratitude and acknowledges possibilities.

There are words related to business and services, such as “출장안마” (business trip massage), “프로그램” (program), “제공합니다” (provide), “서비스를” (service). This indicates a community that discusses the provision and utilization of various business services.

Additionally, there are words associated with technology and digital platforms, such as “프로그램” (program), “업데이트” (update), “다운로드” (download), “온라인” (online), “사이트” (website), “인터넷” (internet), and “홈페이지” (homepage). This suggests a community that engages in discussions of software, websites, and internet-related topics.

The list includes terms connected to data and information, like “데이터를” (data), “개인정보” (personal information), and “정보를” (information). This indicates a community concerned with data management, privacy, and the dissemination of information.

There are words linked to societal aspects, such as “사람들이” (people), “대한민국” (Republic of Korea), and “대통령” (president). This suggests a community that discusses individuals, the country, and political leadership.

Furthermore, the presence of words like “카지노” (casino), “온라인” (online), and “바카라” (baccarat) suggests a community that engages in discussions related to gambling, online platforms, and specific casino games.

Adult Material, Gambling and Overflowing Skewed Advertisements in the Chinese

The word rank table seems to consist of frequently used terms or phrases, but it appears to be heavily skewed towards content related to adult material, gambling, and lottery.

Many of the words and phrases include explicit content or references to adult videos, such as “_日本毛片免费视频观看” (free Japanese adult videos), “久久免费热在线精品” (free hot online adult videos), and “无码不卡高清免费v” (uncensored high-definition free adult videos).

Additionally, there are references to various forms of gambling and lotteries, including “中国福利彩票天天” (China welfare lottery every day), “大发快三大小单双” (literally “Dafa Quick 3 big, small, odd, even,” which reads as a set of betting options for a lottery game), and “彩神争霸邀请码” (literally “color god hegemony invitation code,” apparently an invitation code for a lottery platform). These phrases suggest a focus on lottery games and gambling opportunities.

Governance, Legislation and State Affairs in the Russian

The word rank table suggests a context that revolves around the functioning of the state, legislation, responsibility, and public services. It reflects language usage related to governance, legal matters, and various aspects of state affairs in Russian-speaking contexts.

The table includes terms like “государственного” (state), “законодательства” (legislation), “ответственность” (responsibility), and “государственной” (state, the same adjective in a different grammatical form), which indicate a strong presence of governmental and legal terminology.

Additionally, words related to administration and public services are also present, such as “администрации” (administration), “правительства” (government), and “представитель” (representative). These terms highlight the importance of governance and public institutions.

Furthermore, there are words related to construction, education, research, and safety, such as “строительства” (construction), “образователь” (educational), “исследований” (research), and “безопасность” (safety). These indicate a focus on infrastructure development, educational initiatives, scientific studies, and maintaining a secure environment.

Tenacity, Professional Responsibilities, and Sharing in the German

The word rank table highlights a community or context where individuals are engaging in discussions related to challenges, services, events, publications, and professional responsibilities, and also suggests an openness to diverse perspectives and an engagement with a global audience.

The top ranks feature German words such as “selbstverständlich” (of course), “Wahrscheinlichkeit” (probability), and “unterschiedlichen” (different). These words suggest a focus on various subjects, including general concepts, challenges, and services.

Terms related to challenges and difficulties, such as “Herausforderungen” (challenges) and “Schwierigkeiten” (difficulties), appear prominently in the list. This indicates a community or context where individuals are likely discussing and addressing obstacles in different domains.

The presence of words related to events and publications, such as “Veranstaltungen” (events) and “Veröffentlichung” (publication), suggests an engagement with sharing information, organizing gatherings, and disseminating knowledge. This indicates an active and dynamic community involved in knowledge exchange.

Additionally, there are words associated with professional roles and responsibilities, such as “Geschäftsführer” (manager/director) and “Verantwortung” (responsibility). This implies a focus on leadership, decision-making, and accountability within a professional context.

Professionalism, Entrepreneurship, and Science in the French

The word rank table showcases a collection of words that revolve around various professional and entrepreneurial contexts, and suggests an environment where individuals are seeking to enhance their understanding, refine their skills, and navigate the intricacies of professional and entrepreneurial endeavors.

The top ranks include “caractéristiques” (characteristics) alongside terms related to entrepreneurship, such as “entrepreneurship” (an English spelling that the language detector appears to have assigned to French) and “entrepreneurial.” This suggests a focus on the qualities, concepts, and activities associated with starting and running businesses.

Throughout the list, there is a prevalence of words related to professional settings and responsibilities. Phrases like “professionnelles” (professional), “fonctionnalités” (features), and “responsabilités” (responsibilities) indicate an emphasis on professional conduct and roles within organizations. The inclusion of terms like “collaborateurs” (collaborators), “propriétaires” (owners), and “consommateurs” (consumers) points to the importance of teamwork, ownership, and customer-centric approaches.

Additionally, there is a presence of words related to regulations and compliance, such as “réglementation” (regulation) and “conformément” (in accordance with). This suggests an awareness of legal and ethical considerations in professional environments.

The list also incorporates words associated with research and academic contexts, including “scientifiques” (scientific), “universitaire” (academic), and “neuroscience.” This implies an interest in evidence-based approaches, knowledge acquisition, and intellectual exploration.

Furthermore, the presence of terms like “questionnaires,” “commentaires” (comments), and “référencement” (referencing) suggests an engagement with data collection, feedback, and information dissemination. This indicates an interest in gathering insights and fostering communication within professional communities.

Collaboration, Scheduling and Traditional Learning in the Italian

The word rank table displays a collection of words that encompass a variety of contexts and industries, and indicates an inclusive and diverse community.

The top ranks feature words like “caratteristiche” (characteristics), “professionelle” (professional, in what looks like a German rather than an Italian spelling), and “collaborazione” (collaboration), indicating a focus on professional qualities, teamwork, and cooperation.

The list includes words related to various sectors, such as “internazionale” (international), “traditionellen” (traditional, a German form), and “commissioning” (an English word), which shows that the language detection is imperfect for long Latin-script tokens. Taken at face value, this suggests a diverse community or context where individuals discuss topics spanning international affairs, traditional practices, and project commissioning.

There is a presence of words related to cancellations and installations, such as “cancellations,” “installation,” and “cancellation.” This indicates a community that deals with matters of scheduling, project implementation, and potential changes or disruptions.

Furthermore, the inclusion of terms like “apprentissage” (apprenticeship), “informazioni” (information), and “direttamente” (directly) suggests an engagement with learning, data, and direct communication. This implies a community that values knowledge acquisition, information exchange, and effective communication channels.

The list also incorporates words associated with appreciation, positioning, and scalability. This implies a community that emphasizes recognizing achievements, strategic placement, and the ability to grow and adapt.

Digital Currency, Internationalization and Research Attitude in the Spanish

The word rank table covers a wide range of topics and areas of interest, including cryptocurrencies, establishments, sustainability, international affairs, research, communication, collaboration, characteristics, and legal matters. The presence of words from different domains suggests a diverse and multidisciplinary community engaged in discussions that span various fields and interests.

The top ranks include words such as “cryptocurrencies” (an English word), “establecimientos” (establishments), and “sustentabilidade” (sustainability, in its Portuguese spelling), indicating a community or context that is engaged in discussions related to digital currency, business establishments, and environmental sustainability.

There is a presence of words associated with international matters, such as “internacionales” (international), “estadounidenses” (American), and “organizaciones” (organizations). This suggests a community with a global perspective, discussing international affairs, American entities, and organizational dynamics.

The list includes words related to characteristics, functionality, and performance, such as “características” (characteristics), “funcionalidades” (functionalities), and “funcionamiento” (functioning). This indicates an interest in analyzing and understanding the features, capabilities, and operational aspects of various subjects.

Additionally, there are terms connected to research, such as “investigaciones” (research), “investigadores” (researchers), and “investigación” (investigation). This implies a community that values scholarly inquiry, scientific studies, and knowledge generation.

The presence of words related to communication, collaboration, and participation, such as “recomendaciones” (recommendations), “colaboradores” (collaborators), and “participación” (participation), suggests an environment where individuals engage in sharing ideas, working together, and actively taking part in various activities.

Furthermore, there are words associated with legal and governmental contexts, such as “constitucional” (constitutional), “declaraciones” (statements), and “presidencial” (presidential). This indicates a community that discusses legal frameworks, official statements, and matters related to government and leadership.

Responsibilities, Opportunities and Individualized Development in the Portuguese

The word rank table highlights a community or context where individuals discuss responsibilities, development, personalized approaches, possibilities, and opportunities. The presence of words from education, administration, and organizational contexts suggests a community that values growth, individualization, and the exploration of potential outcomes. The emphasis on careful attention and tailored experiences indicates a community that values attentive and mindful approaches.

The top ranks include words such as “responsabilidades” (responsibilities), “desenvolvimento” (development), and “acompanhamento” (monitoring/support), indicating a community or context that emphasizes individual and collective duties, growth, and attentive guidance.

There is a presence of words related to possibilities and opportunities, such as “possibilidades” (possibilities), “oportunidades” (opportunities), and “possibilidade” (possibility). This suggests a community that explores potential outcomes, prospects for advancement, and various options available for consideration.

The list includes terms associated with personalized approaches and attention to detail, such as “personalizados” (personalized), “cuidadosamente” (carefully), and “personalizado” (customized). This implies a community that values tailored and individualized experiences, paying close attention to specific needs and preferences.

Furthermore, there are words connected to education and academia, such as “universidades” (universities) and “Universidade” (University). This suggests a community engaged in discussions related to higher education, research, and academic institutions.

The presence of words related to administration, institutions, and organizations, such as “administração” (administration), “instituições” (institutions), and “organizações” (organizations), points to a community that addresses topics of management, governance, and the functioning of various entities.

Henry Heng LUO

A highly self-motivated and enthusiastic data scientist.