Can Large Language Models (LLMs) Largely Help Our Languages?

Apoorv Mohit
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
7 min read · May 3, 2023
Photo by Andrew Neel on Unsplash

Ever since ChatGPT was made available for public use by OpenAI in November 2022, the internet has been accumulating more and more data about it for its fifth iteration to train on. Looking at the sheer number of articles, discoveries, limitations, capabilities and warnings that have flooded the internet regarding ChatGPT (GPT-3) and GPT-4, it is safe to say that GPT-5 will be able to learn more about itself from its training data than from its developers.

GPT-3 and GPT-4 are instances of what is known as a Large Language Model (LLM). A large language model is a type of machine learning model that uses deep neural networks to process and understand natural language. These models are trained on vast amounts of text data, such as books, articles, and web pages, and are designed to generate human-like responses to natural language input. The primary goal of large language models is to understand language at a deep level and use this understanding to generate text that is coherent, relevant, and contextually appropriate. They can be used for a wide range of tasks, including language translation, sentiment analysis, text classification, and question answering.

Months after its initial release, ChatGPT is still regarded as a creature with magical powers, by virtue of its ability to understand human language, math, science, morals, logic and values just like an ideal human. History and mythology have given us evidence time and again that not every creature with magical powers belongs to the family of angels; some are followers of the devil. It is yet to be seen whether GPT-4 and its following versions will have a halo around their heads or horns poking out of their skulls. Either way, we need to brace for impact.

We are in the early stages of assessing the new A.I.’s impact on society and our lives. Its impact on a variety of issues, processes and tasks such as automation, writing, editing, visual arts, programming, research and much more is highlighted with every passing day. While we are exploring a range of topics that A.I. can affect, we might be missing out on some. One such topic is languages.

A.I., or more specifically Large Language Models, will compel us to digitize languages more extensively. Moreover, they can help us preserve our languages in a way that has never been possible before.

Large Language Models and their Languages

ChatGPT, arguably the most widely known free Large Language Model, was trained on 570 gigabytes of data¹. That is a fairly small amount compared to the total size of the internet, which can easily be estimated at a few thousand petabytes. According to BBC Science Focus, Google, Amazon, Microsoft and Facebook (Meta) alone store at least 1,200 petabytes (1.2 million terabytes) of data²; 570 gigabytes is roughly 0.00005% of that, about one part in two million.

The Common Crawl, WebText2, Books1 & Books2, and Wikipedia are the data sources for ChatGPT; in other words, these are the repositories of text on which the model was trained.

The table above is from the paper Language Models are Few-Shot Learners³ and reveals the data sources for GPT-3. The data sources for GPT-4 have not been made public as of now.

Let us consider the Common Crawl as the primary dataset and set the other datasets aside. The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling: raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

Looking at the statistics published by the Common Crawl, we see that English makes up most of the dataset (46%), followed by German (5%), Russian (5%), French (4%) and Chinese (4%). This means that GPT-3 was trained on English more than on any other language and, consequently, is better at English than at any other language. Comparing this digital statistic with the real world, English is the most spoken language, owing to the colonization and “Civilization” of a major part of the globe in the 19th and 20th centuries, followed by Mandarin, Hindi, Spanish and French.

Taking a look at the top 20 languages spoken throughout the world and their share of the Common Crawl, we can easily see that there is a major disparity between the real and virtual presence of languages.

Global Speakers percentage vs. Common Crawl percentage of Languages
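To put rough numbers on that disparity, here is a minimal sketch in Python. The Common Crawl percentages are the ones quoted above; the global speaker percentages are approximate, illustrative estimates of each language’s share of the world’s population (my assumptions, not figures from the sources cited in this article).

```python
# Rough illustration of the real-world vs. web-presence gap.
# Common Crawl shares: the figures quoted earlier in this article.
# Global speaker shares: approximate, illustrative estimates (assumptions).

common_crawl_share = {    # % of Common Crawl content
    "English": 46.0,
    "German": 5.0,
    "Russian": 5.0,
    "French": 4.0,
    "Chinese": 4.0,
}

global_speaker_share = {  # % of world population, rough estimates
    "English": 18.0,
    "German": 1.7,
    "Russian": 3.2,
    "French": 3.4,
    "Chinese": 14.0,
}

print(f"{'Language':<10}{'Speakers %':>12}{'Crawl %':>10}{'Crawl/Speakers':>16}")
for lang, crawl in common_crawl_share.items():
    speakers = global_speaker_share[lang]
    print(f"{lang:<10}{speakers:>12.1f}{crawl:>10.1f}{crawl / speakers:>16.2f}")
```

A ratio above 1 means the language is overrepresented on the web relative to its speaker base; a ratio below 1 means it is underrepresented.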

Even though other major languages are drastically underrepresented in the Common Crawl dataset (and the other datasets), GPT-3 has proven to be really good at those languages as well.

However, for a Large Language Model to become better at any given language, it first needs to “learn” that language: it needs to look at a great many words, sentences and phrases and identify sentence-formation patterns, the genders of different entities and the meanings of words in different contexts, among other intricacies of the language and its structure. To make Large Language Models better and more coherent in different languages, developers will need to gather more sources of text in those particular languages and so increase the size of the training data. A huge dataset needs to be prepared for developing an LLM that “speaks” and “understands” various languages fluently.
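As a toy illustration of that collection step, the sketch below detects the language of each raw document and tallies how much text has been gathered per language. It assumes the third-party langdetect package is installed; a real pipeline would work over far larger corpora and use more robust language identification.

```python
# Tally how much text we have per language -- one small step in
# assembling a multilingual training dataset.
from collections import Counter

from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # make language detection deterministic

# Placeholder documents; in practice these would come from web crawls,
# digitized books, transcripts and so on.
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le renard brun rapide saute par-dessus le chien paresseux.",
]

chars_per_language = Counter()
for doc in documents:
    lang = detect(doc)  # e.g. 'en', 'de', 'fr'
    chars_per_language[lang] += len(doc)

for lang, n_chars in chars_per_language.most_common():
    print(f"{lang}: {n_chars} characters collected")
```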

Languages, Large Language Models, the Internet and Eternity

Increasing the size of the dataset means increasing the presence of different languages on the internet, or at least in some digital format. Once a language is recorded and stored on the internet (or in a database), it can safely be assumed that the language will never be lost.

Accents can sometimes act as a distinguishing factor within a single language. An accent can be regarded as a “style” element that considerably changes how a language is spoken and understood. Apart from accents, some languages comprise various dialects that change the vocabulary, usage and a few stylistic elements while keeping the structure of the original language. Such elements of a language need to be incorporated into the datasets as well, to make them, and in turn the Language Models, richer. This will make our existing records of a language richer and can help existing applications such as machine translation become more accurate.

We have seen that a few languages make up most of the training dataset and are thus classified as high-resource languages, while languages with a significantly smaller digital presence are known as low-resource languages. Logically, then, LLMs won’t be as adept at low-resource languages. BLOOM⁴, a Hugging Face-led project, has been trying to resolve this very problem by turning autoregressive, next-token-prediction models of the kind behind GPT-3 and GPT-4 towards lower-resource languages.
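For readers who want to experiment, here is a minimal sketch of querying BLOOM through the Hugging Face transformers library. It assumes transformers and PyTorch are installed; bigscience/bloom-560m is the smallest public BLOOM checkpoint and is used here only to keep the example lightweight.

```python
# Generate text with a small BLOOM checkpoint via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # smallest public BLOOM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# BLOOM was trained on dozens of natural languages, so a non-English
# prompt is a reasonable test. The prompt here is just an example.
prompt = "La lengua española es"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```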

Large Language Models have created a need for more and more languages to leave the pages of books, go beyond mere narration and oration by people, and be captured in the form of 0s and 1s in a dataset. Once a language has been stored in a dataset, it is relieved of the insecurity of ever being declared a dead or extinct language. Therefore, Large Language Models can help our languages survive. Training LLMs on a large number of languages can make them richer and more coherent while also keeping those languages known and used. At any point in the near or distant future, LLMs can ensure that any language always has a living speaker: themselves. LLMs can act as the last standing speaker of a language, thus preventing it from ever going extinct.

It is almost ironic how languages might need to surrender their symbolic forms and acquire a characterless binary shape in order to attain eternity through the internet.

Artificial Intelligence brings with it a plethora of problems and solutions, questions and answers, uncertainty and results, warnings and explanations. As a species that has always been skeptical of change and new notions, we need to tread very carefully in an era that is about to be changed (for good or bad) by the emergence of Artificial General Intelligence. We need to weigh every little advantage and disadvantage that AI has or can bring, and this might just be one of its major advantages: helping something we created long before AI itself, our languages.

References

  1. ChatGPT and Dall-E-2 — Show Me the Data Sources, Dennis Layton
  2. How Much Data is on The Internet, Garreth Mitchell
  3. Language Models are Few-Shot Learners, Tom B. Brown et al.
  4. Why doesn’t AI speak every language, Vox (7:14 — 8:19)
