Breaking the Language Barrier: Unleashing the Full Potential of Generative AI Systems

ReadyAI.org
Nov 3, 2023

By: Rooz Aliabadi, Ph.D.

Generative AI technologies, designed for tasks ranging from drafting emails to shaping research papers, are being deployed rapidly across many sectors. However, the linguistic diversity of our world is not yet reflected in these tools: non-standard languages and dialects are often overlooked, which risks reinforcing societal inequalities.

The idea that language defines the bounds of one's world is especially pertinent in the digital age, where language determines the extent of our interaction with technology and where the limits of digital communication can restrict equitable access to it.

Currently, speakers of languages other than dominant ones like English or Mandarin find themselves at a disadvantage: most of the internet's content is in English, followed by a handful of other European and Asian languages. Even within English, the many dialects beyond General American English are rarely well-served by the large language models (LLMs) that power AI tools, perpetuating a divide between standard and non-standard language speakers.

The role of language as a societal force is well-acknowledged among sociologists, anthropologists, and linguists. It not only strengthens community bonds but also perpetuates inequalities, as historically seen when language and literacy have been used as tools of oppression. For example, during the transatlantic slave trade, literacy became a method of control, leading to the prohibition of literacy for enslaved people in the Confederate states in the U.S. in the 19th century.

Given this legacy, and given instances where bilingualism has been discouraged through "English-only" policies, the replication of such linguistic biases in digital spaces cannot be ignored. As generative AI models have advanced in recent months, it is vital to examine how these historical patterns extend into the digital realm, where they threaten to widen the digital language divide in AI-driven systems.

The Digital Divide Begins with Linguistic Differences

Language differences contribute to the digital divide and shape the development of generative AI systems and LLMs. These systems depend predominantly on online data, in which only a few hundred languages are represented at all and English dominates the landscape. This abundance of English data has led to a proliferation of English-centric datasets and models.

Historically, even before the advent of generative AI, NLP systems were mainly built for and tested on "high-resource" languages such as English. Only about 20 languages worldwide hold "high-resource" status, meaning they have abundant data with which to train language-based systems effectively. This stark contrast arises partly because speakers of less-resourced languages have reduced access to digital platforms, resulting in a smaller digital presence and thus less representation in datasets derived from web scraping. This lack of data means language-based AI applications fail to represent the linguistic diversity of billions of people globally.
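The skew described above is easy to make concrete. The sketch below uses invented toy data and an assumed `lang` metadata field (not from any real corpus) to tally the share each language holds in a sample of scraped documents, the kind of simple audit that exposes English dominance in web-derived training data:

```python
from collections import Counter

# Hypothetical metadata for a small web-scraped corpus sample;
# each record carries an ISO 639-1 language tag (assumed field).
documents = [
    {"lang": "en"}, {"lang": "en"}, {"lang": "en"}, {"lang": "en"},
    {"lang": "en"}, {"lang": "en"}, {"lang": "en"},
    {"lang": "zh"}, {"lang": "es"}, {"lang": "sw"},
]

# Count documents per language tag.
counts = Counter(doc["lang"] for doc in documents)
total = sum(counts.values())

# Fraction of the corpus each language represents.
shares = {lang: n / total for lang, n in counts.items()}
print(shares)  # {'en': 0.7, 'zh': 0.1, 'es': 0.1, 'sw': 0.1}
```

In real pipelines the language tag would come from an automatic language-identification step, which itself performs worst on exactly the low-resource languages at issue, compounding the undercount.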

Furthermore, even within high-resource language communities, regional dialects remain underrepresented. Most online content, including literature, blogs, news, ads, and social media, is produced in Standard American English, which is then gathered to train NLP systems and generative AI tools. For example, ChatGPT's training corpus of roughly 300 billion words likely includes only minimal representation of non-standard English dialects.

Speakers of such dialects, like AAVE (African-American Vernacular English, the variety of English natively spoken by most working- and middle-class African Americans, particularly in urban communities, and by some Black Canadians) or Chicano English (also called Mexican-American English, a dialect of American English spoken primarily by Mexican Americans, particularly in the Southwestern United States from Texas to California, as well as in Chicago), also often face digital access challenges due to a lack of essential internet infrastructure or devices. This deepens their underrepresentation in LLM training datasets, producing generative AI that is ill-equipped to serve a diverse array of communities. The digital divide therefore not only restricts access to technology but also limits the inclusivity and representativeness of AI systems.

The Divide in Digital Linguistic Representation

These trends contribute to the digital language divide. English is merely one example of how speakers of non-standard dialects within a high-resource language can face exclusion. The pattern is not unique to English; languages like Mandarin and German also have "standard" and non-standard varieties that may be poorly represented online or in research datasets, such as Kiezdeutsch, an urban youth dialect in Germany. While the resource gap among language speakers often stems from uneven digital access and infrastructure, including AI scientists and developers who themselves embody linguistic diversity is also essential to creating generative AI tools that are genuinely inclusive.

The Significance of Digital Linguistic Disparity

Our language shapes our interaction with the world and determines the communities we can access. History has repeatedly shown language being used as a means of exclusion and control, from the prohibition of literacy among enslaved Black people in the U.S. to the denial of educational resources to Japanese-American children in internment camps. Today, this exclusion continues as some political factions oppose bilingual education for native Spanish speakers in the United States.

Language access has historically been leveraged to marginalize vulnerable groups, and now, language-centric technologies like generative AI are emerging as modern gatekeepers.

For generative AI to be a force for equity, it must be trained on a fair distribution of language data. Generative AI holds promise for bridging equity gaps for those with communication impairments, low literacy rates, or disparities in educational resources. Yet it risks erasing cultural identities if it cannot authentically represent the linguistic nuances of diverse populations. One example is ChatGPT's inadequate attempt to mimic the narrative voice of "The Hate U Give," which demonstrated a superficial and misinformed approach to African-American Vernacular English. If generative AI's benefits are not universally accessible and inclusive, it may deepen existing disparities.

Prioritizing "standard" language varieties in AI training discriminates against other dialects and their speakers. For instance, AI tools designed to detect plagiarism or AI-generated text have been shown to misclassify writing by non-native English speakers as AI-generated, as a Stanford study of TOEFL essays demonstrated. This perpetuates a cycle in which "standard" English is treated as prestigious and any deviation is deemed inferior, pressuring non-standard speakers to conform in order to benefit from AI technology.

This bias within the digital language divide is harmful not only to speakers of non-standard dialects but also to the creators of generative AI. For AI tools to be genuinely inclusive and practical, they must account for the full spectrum of linguistic expression, including code-switching and non-standard varieties, which marginalized groups often employ to navigate mainstream society.

As AI developers strive to rectify these biases, incorporating more open-source data and contributions from diverse linguistic backgrounds could improve A.I.’s understanding and handling of language nuances. Open-sourced language data will likely be more inclusive, providing a richer, more varied representation of language use and context than proprietary datasets with limited scope. Addressing these issues will not only enhance the performance of generative AI but also ensure it serves all communities equitably.

Looking Ahead

Reducing bias in generative AI must be a proactive and deliberate process, one that incorporates regional and linguistic specificity into model development and dataset creation. This means engaging a broad array of "humans in the loop" from the start, drawing on diverse community input to ensure that their dialects and idiomatic expressions are represented in LLMs.

Efforts to involve underrepresented communities in developing training data should be transparent and respectful, recognizing that language is not just a communal asset but deeply personal. This approach acknowledges the importance of cultural nuances that more homogenous LLMs may overlook.

Pioneering efforts are already underway to localize training data. Confronting the digital language divide means recognizing the underlying disparities in internet access, often influenced by gender, geography, and socioeconomic factors. These differences not only shape who is represented online but also affect the datasets available for A.I. training. To craft genuinely inclusive digital environments, we must reassess the prevailing language norms and address the gaps in internet accessibility, ensuring our technological ecosystems reflect the full spectrum of our global linguistic tapestry.

This article was written by Rooz Aliabadi, Ph.D. (rooz@readyai.org). Rooz is the CEO (Chief Troublemaker) at ReadyAI.org

To learn more about ReadyAI, visit www.readyai.org or email us at info@readyai.org.



ReadyAI is the first comprehensive K-12 AI education company to create a complete program to teach AI and empower students to use AI to change the world.