Transformative Impact of Large Language Models (LLMs) on Our Lives
LLMs bring tremendous promise and peril
By: Rooz Aliabadi
Since OpenAI, the company behind ChatGPT, made the chatbot available to the general public in November 2022, it has been the primary topic of conversation among the tech elite. ChatGPT holds more information than any of us could learn in a lifetime. It can engage in intelligent discussions on issues ranging from African mineral extraction to the geopolitical complexities surrounding Taiwanese semiconductor firm TSMC. Its underlying neural network, GPT-4, has achieved high scores on exams required for US admission into medical and law schools. In addition, it can generate a wide range of creative content, such as songs, poems, and essays. Other “generative AI” models can also produce digital images, drawings, and animations.
Alongside the enthusiasm surrounding generative AI models, there is growing concern, within the tech industry and beyond, that these models are being developed at an alarming pace. GPT-4, categorized as a large language model (LLM), falls under the umbrella of generative AI. Companies like Alphabet, Amazon, and Nvidia have developed their own LLMs.
Even as I actively work with this technology, I continue to feel a strong sense of unease about the potential existential threat posed by AI, a topic I discuss daily with other ed-tech colleagues. In recent months, the US and European governments have begun exploring new regulations. There are also prominent voices advocating for a pause in AI development, fearing the possibility of the software spiraling out of control and causing harm, or even destruction, to human society. To assess the appropriate level of concern or excitement towards this technology, it’s essential first to understand its origins, mechanics, and growth limitations.
The rapid expansion of AI software capabilities can be traced back to the early 2010s, when “deep learning” emerged as a popular software technique. With massive datasets and high-performance computers utilizing Graphics Processing Units (GPUs) to run neural networks, deep learning significantly enhanced computers’ abilities in image recognition, audio processing, and even gaming. By the decade’s end, computers had surpassed human performance in many tasks.
Neural networks were typically embedded inside software with broader functionality, such as email clients (the autocomplete I have been using in my Gmail account, for example). Nontechnical users rarely interacted with these AIs directly. When they did, their experiences were often described in almost mystical terms. Lee Sedol, a world-renowned player of the ancient Chinese board game Go, retired after being defeated in 2016 by Alphabet’s AlphaGo, software built on a neural network. “Even if I become the number one,” he stated, “there is an entity that cannot be defeated.”
By operating in the most human of mediums, conversation, ChatGPT is now giving internet users a comparable experience: a sense of intellectual disorientation caused by software that has suddenly advanced to a level where it can perform tasks previously exclusive to human intelligence.
Despite the sense of amazement that it inspires, an LLM is, in essence, an enormous statistical exercise. If you prompt ChatGPT to complete a sentence like “The potential of LLMs is that they can…” you will receive an instantaneous response. How does this process function?
Initially, the language used in the query is transformed from words, which neural networks cannot process, into a set of representative numbers. GPT-3, which powered an earlier version of ChatGPT, does this by dividing text into chunks of characters, known as tokens, that frequently occur together. These tokens may be whole words, such as “love” or “are”, affixes, such as “dis” or “ised”, or punctuation marks, such as “?”. GPT-3’s vocabulary contains 50,257 such tokens.
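To see tokenization in action, here is a minimal sketch using tiktoken, OpenAI’s open-source tokenizer library; it assumes the “r50k_base” encoding, which matches GPT-3’s 50,257-token vocabulary:

```python
# pip install tiktoken
import tiktoken

# "r50k_base" is the 50,257-token vocabulary used by GPT-3.
enc = tiktoken.get_encoding("r50k_base")

text = "The potential of LLMs is that they can"
token_ids = enc.encode(text)

print(token_ids)                             # numeric ids, one per token
print([enc.decode([t]) for t in token_ids])  # the text each token covers
print(enc.n_vocab)                           # 50257
```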
The processing capacity of GPT-3 is limited to 2,048 tokens at a time, roughly the length of a long article in The New Yorker. GPT-4 surpasses this limit, accepting inputs of up to 32,000 tokens, about the length of a novella. A greater input capacity gives the model a broader context and lets it provide better responses. The catch is that the computation required for longer inputs grows quadratically: the attention mechanism compares every token with every other token, so slightly lengthier inputs need significantly more computing power.
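A back-of-envelope sketch of why this is so, assuming standard self-attention in which every token is scored against every other:

```python
# Standard self-attention builds a score matrix with one entry per pair
# of tokens, so its size grows with the square of the input length.
for n_tokens in (2048, 32000):
    print(f"{n_tokens:>6} tokens -> {n_tokens**2:>13,} pairwise scores per head")

# A 15.6x longer input (2,048 -> 32,000 tokens) requires roughly 244x
# more attention computation per layer.
```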
The tokens are then given meaning by positioning them in a “meaning space”, where words with similar meanings sit near one another.
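A toy sketch of the idea, with hand-made three-dimensional vectors standing in for the thousands of learned dimensions a real model uses:

```python
import numpy as np

# Invented 3-dimensional "meaning space"; real models learn embeddings
# with thousands of dimensions during training.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Close to 1.0 means similar direction (similar meaning).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```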
The LLM then employs its “attention network” to establish connections between different parts of the query. A person reading our prompt, “The potential of LLMs is that they can…”, would already understand English grammar and the meanings of the words; they would know, for instance, that “they” refers back to the LLMs. An LLM, by contrast, must learn all such associations from scratch during its training. Over billions of training passes, its attention network gradually encodes the structure of the language it sees as numbers (“weights”) within its neural network. If it comprehends language at all, an LLM does so in a statistical rather than a grammatical manner. It is more akin to an abacus than a mind.
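Here is a minimal numpy sketch of scaled dot-product attention, the core operation of such a network; the tiny dimensions and random weights stand in for the billions of learned weights in a real model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Project each token's vector into a query, a key and a value.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Score every token against every other token, then normalize.
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Each output is a blend of all tokens' values, weighted by attention.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)  # (5, 8): one updated vector per token
```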
After processing the prompt, the LLM generates its response by assigning every token in its vocabulary a probability of being the best next word. The highest-probability token is not always the one selected: a setting chosen by the model’s operators, commonly called “temperature”, controls how much randomness, and hence creativity, enters the choice.
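A sketch of how temperature reshapes those probabilities; the four-word vocabulary and raw scores below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["transform", "improve", "automate", "are"]  # toy vocabulary
logits = np.array([2.0, 1.5, 1.0, -2.0])             # model's raw scores

def sample_next_token(logits, temperature):
    # Low temperature sharpens the distribution (safe, repetitive);
    # high temperature flattens it (more "creative", riskier).
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(len(vocab), p=probs)

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, t)] for _ in range(5)]
    print(f"temperature {t}: {picks}")
```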
The LLM starts by generating a single word, then feeds that word back in as input for generating the next one. The first word is produced from the prompt alone; each subsequent word is produced from the prompt plus everything generated so far. This iterative process, called autoregression, continues until the response is complete.
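In code, the loop looks roughly like this; the `generate` function and the canned stand-in model are hypothetical, not OpenAI’s implementation:

```python
def generate(prompt_tokens, model, max_new_tokens=20, stop="<end>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = model(tokens)   # predict the next token from everything so far
        if token == stop:
            break
        tokens.append(token)    # feed the prediction back in as input
    return tokens

# Stand-in "model" that simply replays a canned reply.
reply = iter(["transform", "how", "we", "work", "<end>"])
print(generate(["LLMs", "can"], model=lambda tokens: next(reply)))
# ['LLMs', 'can', 'transform', 'how', 'we', 'work']
```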
Although the rules governing LLMs’ operations can be written down, their outputs are only somewhat predictable: these very large abacuses can do things that smaller ones cannot, in ways that surprise even their creators. One Google researcher has catalogued 137 such “emergent” abilities across various LLMs. These abilities are not magical; they are represented in some form within the LLMs’ training data (or the prompts they are given) but only become apparent once the models reach a certain, substantial size. Below that size, an LLM struggles to write gender-inclusive sentences in Portuguese, doing no better than chance. Increase the model’s size a little, however, and a new ability appears. GPT-4, for instance, scored in the 90th percentile on the American Uniform Bar Examination, designed to test the skills of aspiring lawyers, whereas the slightly smaller GPT-3.5 failed it.
The emergence of new abilities is fascinating because it suggests that LLMs still have untapped potential. Jonas Degrave, an engineer at DeepMind, Alphabet’s AI research company, demonstrated that ChatGPT can behave like a computer’s command-line terminal, convincingly simulating the compilation and running of programs. This leads experts to believe that, with a modest increase in size, LLMs may gain the ability to perform many new and valuable tasks. But the same unpredictability that makes emergence exciting also raises concerns: studies have shown that certain social biases surface as models grow, making it difficult to predict what harmful behaviors could emerge if LLMs continue to scale.
The remarkable achievements of LLMs in producing compelling text and demonstrating surprising abilities result from the convergence of three things: massive amounts of data, algorithms that can learn from them, and the computational resources to do so. Although the specifications and mechanisms of GPT-4 remain undisclosed, those of GPT-3 have been documented in a paper titled “Language Models are Few-Shot Learners”, published by OpenAI in 2020.
Before encountering its training data, the neural-network weights in GPT-3 are random, and any text it generated would be nonsense. To produce readable text, GPT-3 was trained on multiple data sources, the primary one being snapshots of the internet from 2016 to 2019 taken from a database named Common Crawl. This initial corpus of 45 terabytes was refined using a separate machine-learning model to filter out low-quality text, yielding a 570-gigabyte dataset small enough to fit on a laptop. GPT-4 was additionally trained on an unknown quantity of images, probably several terabytes’ worth. For comparison, AlexNet, the neural network that sparked the image-processing excitement of the 2010s, was trained on 1.2 million labeled images totaling 126 gigabytes, less than a tenth of the likely size of GPT-4’s image dataset.
During training, the LLM quizzes itself: it takes a segment of text, hides the words at the end, and attempts to predict them, then compares its guesses with the actual text. Because the answers are contained in the training data itself, these models can be trained in a “self-supervised” manner on massive datasets, with no human labeling required.
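A toy illustration of how training pairs fall out of raw text with no labeling step; splitting on words here is a simplification of the token-level reality:

```python
# Self-supervision: raw text already contains its own answers.
text = "the potential of large language models".split()

# Each prefix of the text becomes a context; the next word is the label.
training_pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in training_pairs:
    print(f"given {context!r}, predict {target!r}")
```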
The LLM’s objective is to minimize its errors and improve the accuracy of its predictions. Not all mistakes are equally bad, however: guessing “I love ice hockey” instead of “I love ice cream” is a smaller error than guessing “I love ice are”. The magnitude of each error is quantified as a “loss” value, which is used to adjust the neural network’s weights, nudging it toward more accurate predictions the next time around.
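A numerical sketch of the loss idea using cross-entropy, the standard loss for next-word prediction; the probabilities below are invented:

```python
import numpy as np

vocab = ["cream", "hockey", "are"]

def cross_entropy(probs, target_index):
    # Negative log of the probability assigned to the right answer:
    # confident-and-correct -> small loss; confident-and-wrong -> large loss.
    return -np.log(probs[target_index])

# The true next word is "cream" (index 0). Two hypothetical predictions:
good_guess = np.array([0.70, 0.25, 0.05])   # favors "cream"
bad_guess  = np.array([0.05, 0.15, 0.80])   # favors "are"

print(cross_entropy(good_guess, 0))  # ~0.36, small loss
print(cross_entropy(bad_guess, 0))   # ~3.00, large loss
```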
The attention network is critical to the LLM’s ability to learn from vast amounts of data. It lets the model acquire and use associations between words and concepts even when they sit far apart in a text, and it makes processing large volumes of data tractable: a typical LLM runs several attention networks in parallel, allowing the work to be spread across many GPUs. Older language models without attention mechanisms could not have handled such data volumes efficiently; the scaling is only computationally feasible with attention.
The recent expansion of LLMs has been fueled primarily by this capacity to absorb data. GPT-3 has 96 layers and 175 billion weights and was trained on hundreds of billions of words; the first version of GPT, developed five years ago, was less than a thousandth of that size.
There are valid reasons to believe that the expansion of LLMs cannot go on forever, because the inputs on which they rely (data, electricity, computational power, and skilled labor) are costly. Training GPT-3, for example, used 1.3 gigawatt-hours of electricity, enough to power about 121 American homes for a year, and cost OpenAI around $4.6m. Training GPT-4, a much larger model, cost disproportionately more, likely around $100m. Because computational requirements escalate faster than the input data, the cost of training LLMs rises faster than their performance improves. OpenAI’s CEO, Sam Altman, has suggested that an inflection point has already been reached: we are at the end of the era of these enormous models, and they will be improved in other ways.
The availability of training data is the primary constraint on the future advancement of LLMs. With GPT-3, virtually all the high-quality text accessible on the internet had already been used for training. A research paper from October 2022 concluded that “the supply of high-quality language data is likely to be depleted soon, possibly before 2026.” More text certainly exists, but it is locked away in corporate databases or on personal devices, inaccessible at the scale and low cost of Common Crawl.
Although computers will continue to become more powerful, there is unlikely to be a hardware breakthrough as significant as the arrival of GPUs in the early 2010s, which means training ever-larger language models will probably keep getting more expensive. This is one reason to temper enthusiasm about endless scaling. Improvements are still possible, such as new kinds of specialized chips like Google’s Tensor Processing Unit, but Moore’s law, the exponential gain that came from shrinking circuits, is no longer delivering as it once did.
Legal concerns will also arise as AI-generated content becomes more prevalent. Getty Images, a photography agency, has sued Stability AI, the company behind an image-generating model called Stable Diffusion. Stable Diffusion draws its training data from Common Crawl, the same source used for GPT-3 and GPT-4, and relies on similar attention-based techniques. Images are among the most impressive examples of AI’s generative capabilities, and people on the internet are regularly fooled by photos of events that never occurred, such as the Pope wearing a Balenciaga jacket or Donald Trump being arrested.
Getty’s suit claims that Stability AI used its copyrighted material without permission. Copyrighted material is harder to spot in ChatGPT’s text output, but it is undeniable that the model was trained on copyrighted content. OpenAI is likely relying on the “fair use” provision in copyright law, which permits limited use of copyrighted material for transformative purposes. That approach may eventually face legal scrutiny of its own.
Even if LLM development hit a ceiling this year, and even if OpenAI were bankrupted by a high-profile lawsuit, the potential of LLMs would persist. The data and the technology needed to process it are readily accessible, albeit at a higher cost and smaller scale than OpenAI has achieved.
With careful and selective training, open-source implementations can approach the performance of proprietary models like GPT-4. This is beneficial: putting the capabilities of LLMs into many hands allows innovative new applications to be developed, improving fields from medicine to law.
However, it also means that the potentially catastrophic risks worrying the tech community have become more conceivable. LLMs have already demonstrated immense power, and their rapid progress has made many of those building them apprehensive. The capabilities of the most advanced models have outrun their creators’ comprehension and control, posing risks of many kinds.
This ChatGPT lesson plan and others are available FREE to all educators at edu.readyai.org
This article was written by Rooz Aliabadi (rooz@readyai.org). Rooz is the CEO (Chief Troublemaker) at ReadyAI.org
To learn more about ReadyAI, visit www.readyai.org or email us at info@readyai.org.