Approaching Human: Reflections on Large Language Models
“Life is the ultimate technology. Machine technology is a temporary surrogate for life technology. As we improve our machines they will become more organic, more biological, more like life, because life is the best technology for living.”
— Kevin Kelly, founding executive editor of Wired magazine.
A Blast from the Past
ChatGPT, GPT-3, Stable Diffusion, BLOOM, DALL-E, Midjourney, Copilot, ... The list goes on and on. These models are masterful creative multipliers, enabling rapid and productive content generation unlike anything before. With applications in dev tools, education, and many more verticals, maximizing productivity has never been easier. However, many developers and practitioners may be missing a bigger-picture vision of what these models are just beginning to offer.
Let’s rewind to Germany around 1440. With its sleek, mechanical design and ability to quickly and efficiently print books and other materials, the Gutenberg press revolutionized the world of printing and allowed for the widespread dissemination of knowledge and ideas. As the first printing press in Europe to use movable metal type, the Gutenberg press was a marvel of engineering and a glimpse into the future of content production. Its ability to quickly and accurately print large volumes of text allowed for the proliferation of books, pamphlets, and other materials, making information more accessible than ever before. It was a symbol of innovation and progress, and its impact on the world of printing and content delivery can still be felt today.
Simply put, what made the Gutenberg press so remarkable was the concept of mass-producing text via ink. If you had an idea, you could spread it by printing it. Thus, Gutenberg’s innovation enabled mass communication with a single medium: text. Even today, text is still the universal medium for the largest models out there, but the original invention carries a lesson that many seem to miss: it fundamentally changed the way humans interact with knowledge and how knowledge proliferates.
Large Language Models
The rise of large language models (LLMs) has been one of the most significant developments in machine learning in recent years. These models, which are trained on massive amounts of text data, can generate (or, in the spirit of Gutenberg, print) human-like text and have been used in a variety of applications. What makes them so impressive? Transformers. As conceived by Vaswani et al., transformers are able to capture long-range dependencies in sequential data like natural language. This is possible because they use self-attention mechanisms, which let the model weigh every part of the input against every other part when processing it. The model can therefore learn to pay more attention to certain words or phrases in a sentence, capturing subtle nuances and relationships that other models might miss. Additionally, transformers process all positions of the input in parallel, which makes them more efficient and faster to train than models that process input sequentially, such as recurrent networks.
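To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not how any production model is written: the toy 2-d embeddings are made up, and the same vectors are reused as queries, keys, and values, whereas real transformers learn separate projections for each.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-d token embeddings (an assumption for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how strongly one position attends to every other position. Because no row depends on another, all rows can be computed at the same time, which is the parallelism referred to above.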
One of the most well-known large language models is GPT-3, developed by OpenAI. GPT-3, an autoregressive decoder LLM, has been praised for its ability to generate text that is difficult to distinguish from text written by a human, and it has been used in a range of applications, from chatbots and machine translation to text summarization and content generation. Another popular large language model is BERT, developed by Google. BERT, a bidirectional LLM, is designed to understand the context and meaning of words in a sentence, and it has been used in a variety of natural language processing tasks, including sentiment analysis and question answering.
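The difference between an autoregressive model like GPT-3 and a bidirectional one like BERT can be sketched as a difference in which positions each token is allowed to see. The snippet below is a simplified illustration; the function names and the example sentence are hypothetical and not taken from either model’s code.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def visible_tokens(tokens, position, causal):
    """Tokens a model may look at when processing `position`."""
    mask = attention_mask(len(tokens), causal)
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

sentence = ["the", "press", "printed", "books"]
gpt_style = visible_tokens(sentence, 1, causal=True)    # left context only
bert_style = visible_tokens(sentence, 1, causal=False)  # both directions
```

With the causal mask, the word "press" only sees "the" and itself, which is what lets a decoder generate text one token at a time. Without the mask, it sees the whole sentence, which suits understanding tasks such as sentiment analysis.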
The rise of large language models has led to both excitement and concern among experts. On the one hand, these models have the potential to greatly improve natural language processing and many existing applications. On the other hand, there are concerns about the potential misuse of these models, such as in the generation of fake news. The purpose of this article is to put recent developments like ChatGPT and Stable Diffusion into perspective and understand how the ongoing rapid growth in LLM capability fits into the larger vision of Artificial General Intelligence (AGI).
The 2nd Order Perspective
With the recent release of ChatGPT, there has been more existential discussion surrounding the capabilities of machine learning than ever before. What makes ChatGPT in particular so impressive? Tools like AlphaCode, Copilot, and Ghostwriter enable code autocomplete and the writing of entire segments from scratch. Stable Diffusion, DALL-E, and DreamBooth enable the generation of images for whatever one can dream up. But ChatGPT feels more human than any LLM we have encountered so far. The robust conversational style and smart responses across a variety of prompts are simply mind-blowing. From working through classical thought experiments like the trolley problem to writing well-documented code for nontrivial algorithmic questions, ChatGPT is a science fiction dream come true. Its impact is also only starting to be felt. Education will slowly be changed forever: drafts, essays, papers, and even books can be written in minutes, not hours. For developers, the software engineering industry will shift towards adaptive problem solvers who can ask the right questions and ideate better products, since a model like ChatGPT can write much of the code for many workflows. Prompt engineering is becoming a craft in itself: GitHub-style repositories of successful prompts and prompt playgrounds like Everyprompt have already begun to emerge. Moreover, query-based approaches that use models like GPT-3 and ChatGPT for search have also started sprouting, as seen with projects like GPT Index. These parallel applications are just scratching the surface of what can be unlocked with GPT-N models.
With GPT-4 on the way (within the next few months), you may be wondering: what’s the difference between GPT-2 and GPT-3? Moreover, how will GPT-4 build on the wonders of GPT-3?
GPT-3 is trained on a much larger dataset and has significantly more parameters than GPT-2 (175 billion versus 1.5 billion), so it is capable of generating more realistic and diverse text. GPT-3 also has a larger context window than GPT-2. The context window is the number of tokens that the model can consider at once when generating text. A larger context window allows the model to generate more coherent and realistic text because it has more information to work with. The original GPT-3 has a context window of 2048 tokens, while GPT-2 has a context window of 1024 tokens.
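A rough way to picture a context window is as a cap on how much recent text the model gets to see. The sketch below assumes whitespace-separated words as "tokens" purely for readability; real GPT models use subword tokenizers, and the helper name `fit_to_context` is hypothetical.

```python
def fit_to_context(tokens, context_window):
    """Keep only the most recent tokens that fit in the window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return tokens[-context_window:]

# Whitespace splitting is a stand-in for a real subword tokenizer.
history = "a long conversation that keeps growing with every turn".split()
small = fit_to_context(history, 4)      # a tiny window forgets the opening
large = fit_to_context(history, 1024)   # a GPT-2-sized window keeps it all
```

With the tiny window, the model would only see the last four words and lose the start of the conversation. A larger window keeps everything, which is why longer windows tend to produce more coherent output over long exchanges.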
Let’s now consider the original GPT model. GPT-2 in turn has a larger context window than the original GPT: 1024 tokens versus only 512. We can now extend this idea to a bigger-picture theme: the performance improvements of GPT-N models are facilitated by larger context windows alongside a larger number of parameters within the model. Given this, we can extrapolate that GPT-4 will enable even richer text generation with more in-depth relational and semantic power. But what about ChatGPT? ChatGPT is close to a GPT-3.5, but with a twist: it is fine-tuned with reinforcement learning (RL). In reinforcement learning, the model is trained to maximize a reward signal by taking a series of actions. In the case of ChatGPT, the actions are the tokens of a response, and the model is trained to generate responses that are relevant, coherent, and engaging. The reward reflects how highly human evaluators would rate a response.
To train ChatGPT using reinforcement learning from human feedback, the model is first fine-tuned on example conversations written by human trainers. It is then presented with a conversational context and a prompt and generates several candidate responses, which human evaluators rank by quality. These rankings are used to train a separate reward model that predicts which responses people prefer, and the language model is then optimized against that reward model. Over time, this reinforcement learning process allows the model to improve its ability to generate high-quality responses to conversational prompts.
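One common way to turn human judgments into a training signal is to fit a reward model on pairs of responses where a person preferred one over the other. The sketch below shows the pairwise ranking loss often described for this step. It is an assumption-laden illustration, not ChatGPT’s actual training code: `toy_reward` is a made-up stand-in for a learned neural reward model, and the example responses are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

def toy_reward(response):
    """Hypothetical stand-in for a learned reward model (not a real one)."""
    return 0.1 * len(set(response.lower().split()))

chosen = "The printing press made books cheap and widely available"
rejected = "press press press"

# Low loss when the reward model agrees with the human preference...
agrees = preference_loss(toy_reward(chosen), toy_reward(rejected))
# ...and a higher loss when it scores the rejected response above the chosen one.
disagrees = preference_loss(toy_reward(rejected), toy_reward(chosen))
```

Minimizing this loss pushes the reward model to score preferred responses higher. The language model can then be optimized with reinforcement learning to produce responses that this reward model rates well.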
Based on this discussion, GPT-N models will likely keep getting better with larger context windows and greater scale (billions more parameters), which together give the model more knowledge to draw on. However, ChatGPT, a ~GPT-3.5 of sorts, presents a new frontier to be explored: treating LLMs as agents through applications of techniques like reinforcement learning and meta-learning.
How History Ties into the Next Frontier
Let’s now step back in time. Prior to the deep learning revolution of the 21st century, there was the time of GOFAI: Good Old-Fashioned Artificial Intelligence. At the dawn of our modern age of machine learning, GOFAI was predominantly composed of rule-based or logical agents. Significant contributions of the GOFAI era include search algorithms, automated scheduling, constraint-based reasoning, the semantic web, knowledge graphs, etc. Deep Blue, a classic system of the GOFAI era, was a chess-playing computer developed by IBM in the 1990s. It was designed to play and win against human chess champions, and in 1997 it defeated world champion Garry Kasparov in a six-game match.
Deep Blue used a combination of specialized hardware and software to analyze and evaluate chess positions: an optimized brute force. It was capable of evaluating up to 200 million positions per second, which allowed it to search deeper into the game tree and consider a wider range of possible moves than a human chess player could. At its core, Deep Blue was a massively parallel implementation of alpha-beta search.
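The alpha-beta pruning idea can be sketched in a few lines. This is a textbook-style, single-threaded version on a made-up game tree, not IBM’s implementation: a node is either a number (the evaluation of a leaf position) or a list of child nodes.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of a game tree, skipping branches that cannot matter.

    A node is either a number (leaf evaluation) or a list of child nodes.
    """
    if isinstance(node, (int, float)):
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing opponent will never allow this line
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizer already has a better option elsewhere
    return best

# A hypothetical two-ply tree: we pick a branch, then the opponent picks a leaf.
tree = [[3, 5, 10], [2, 12, 1], [0, 7, 4]]
value = alphabeta(tree, maximizing=True)
```

In the second branch, the search stops as soon as it sees the leaf 2, because the opponent could then hold us below the 3 already guaranteed by the first branch. Skipping such branches is what lets a brute-force searcher look deeper in the same amount of time.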
Entering the 2010s, we begin to see the advent of modern deep learning systems. With more powerful compute and more data than ever before, deep neural networks begin to emerge and enable models that fit highly nonlinear distributions of data. AlexNet, VGG, ResNet, etc. were multi-million-parameter neural networks trained with backpropagation and built on then-novel techniques like dropout, batch norm, and residual connections. A relevant example of the marvels of modern deep learning is AlphaGo. Developed by Google DeepMind, AlphaGo is a computer program that uses machine learning to play the board game Go. It combines neural networks, trained first on expert Go games and then through self-play reinforcement learning, with a Monte Carlo tree search algorithm to evaluate the current board position and select the best move. This allows it to defeat even highly skilled human players. Monte Carlo tree search is itself a search procedure rather than a neural network: it simulates many possible continuations and uses the networks’ move and position estimates to decide which branches deserve more exploration.
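The heart of Monte Carlo tree search is deciding which move to explore next, balancing moves that have looked good so far against moves that have barely been tried. The sketch below uses the classic UCB1 rule for that choice. This is a deliberate simplification: AlphaGo uses a variant that also folds in move probabilities from its neural network, and the statistics here are invented for illustration.

```python
import math

def ucb1(total_value, visits, parent_visits, exploration=1.4):
    """Upper confidence bound used to pick which child to explore next."""
    if visits == 0:
        return math.inf  # always try unvisited moves first
    average = total_value / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return average + bonus

def select_move(stats):
    """stats maps move -> (total_value, visits); returns the move to explore."""
    parent_visits = max(sum(visits for _, visits in stats.values()), 1)
    return max(stats, key=lambda m: ucb1(stats[m][0], stats[m][1], parent_visits))

# Made-up statistics: move "c" has never been simulated yet.
stats = {"a": (6.0, 10), "b": (3.0, 4), "c": (0.0, 0)}
first = select_move(stats)

# After "c" has been tried a few times without success, attention shifts.
stats["c"] = (0.0, 3)
second = select_move(stats)
```

Here `first` is the untried move, and `second` is the move with the best mix of average result and remaining uncertainty. Repeating this selection many times concentrates simulations on the most promising lines of play.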
The Pseudo Singularity
As we progress through the 2020s, the pieces are in place. From GOFAI to modern deep learning, the last few years set the stage for LLMs: multi-billion-parameter models essentially trained on the internet to capture long-range dependencies. However, there’s a major caveat to all the magic of LLMs: nothing that a GPT-N model produces is truly original relative to the data it was trained upon. Current models are essentially great listeners and idea diffusion mechanisms: thoughtful, expressive, and multi-modal echo chambers of the internet.
We’ve reached the crux of the discussion: how do we make the leap beyond GPT-N models? History is a good teacher, and based on the developments of the last few decades, the leap might well take the next transformer, that is, another architectural breakthrough of similar magnitude. Let’s generalize: reasoning with different media (text, image, video, etc.) while putting together various ideas in a latent space to form novel, nonobvious arguments may be the next frontier. Part of the reason why ChatGPT is so amazing is that its reinforcement learning from human feedback may serve as the earliest form of this idea. Imagine a generation of LLMs that can generate completely novel images or text and explain the reasoning process behind them: one could even think of this as a limited form of consciousness. Essentially, an LLM of the future could map out why it produced the concepts or arguments that it did rather than just act as a stochastic parrot of the internet like today’s diffusion and generation mechanisms. If such an LLM were possible, it would, in principle, be far harder to trick into providing factually incorrect outputs, since those would not align with the model’s internal reasoning. What would it take to create this seemingly magical reasoning LLM? Based on historical trends, likely a lot of fundamental R&D innovation in scaling meta-learning and multi-task frameworks alongside reinforcement learning. Perhaps the next transformer will be an agent that, at a high level, rewards cross-domain knowledge.
Another interesting idea in regard to the future of LLMs is sparsity. GPT-3 and similar existing models are typically dense: every parameter takes part in processing every input. A sparse model instead activates only a subset of its parameters for a given input, or removes unneeded connections altogether, which can make it cheaper to run and more efficient to train. Sparsity may also help interpretability: when fewer components are active for a given input, it can be easier to understand and explain what the model is doing. Lastly, sparsity may improve generalization, since a model with fewer effective parameters can be less likely to overfit the training data and better able to handle new, unseen data.
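One way sparsity can be realized is by routing each input to only a few specialized sub-networks, in the style of a mixture of experts. The sketch below is an assumption about what such a mechanism could look like, not a description of any particular model: the four "experts" are trivial made-up functions, the gate scores are hard-coded rather than learned, and the selected outputs are simply averaged.

```python
def top_k_route(gate_scores, k):
    """Indices of the k experts with the highest gate scores."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and average their outputs."""
    active = top_k_route(gate_scores, k)
    outputs = [experts[i](x) for i in active]
    return sum(outputs) / len(outputs), active

# Four hypothetical "experts"; a real model would use learned sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
result, active = sparse_forward(3.0, experts,
                                gate_scores=[0.1, 0.7, 0.05, 0.15], k=2)
```

Only two of the four experts run for this input, so half of the computation is skipped. The list in `active` also gives a small hint about interpretability, since it records which components were responsible for the output.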
If sparse LLMs with novel reasoning abilities one day exist, they could perhaps unlock answers to deep questions in fields like computational complexity and biology. Perhaps this could be the first instance of AGI?
AGI & Biology
Given we’ve discussed the future of LLMs in regard to what could one day be AGI, I want to ask a fundamental question: what does it mean to be human? Emotion, consciousness, and morals are all good starting points. Going back to the opening example, Johannes Gutenberg and his printing press revolutionized how humans mass-digest information. Over decades and centuries, this certainly would have created new behavior in how our brains process information. Today’s billion-parameter models/knowledge diffusers are similarly on track to change how we access and interact with information. This raises a deeper question: are we approaching an asymptote where these AI systems will mimic humans themselves? Currently, text produced by GPT-3 is often hard to distinguish from that of human writers. However, a recent paper by Geoffrey Hinton casts doubt on whether current deep learning, even with large LLMs, learns or reasons the way humans do:
It seems very unlikely that the human brain uses back propagation to learn. There is little evidence of backprop mechanics in biological brains (no error derivatives propagating backwards, no storage of neuron activities to use in a backprop pass, …). — Geoffrey Hinton
Simply put, humans likely don’t learn by backpropagation. This idea prompts us to question our current machine-learning techniques: for example, how would we enable pattern recognition similar to humans without explicitly backpropagating gradients? Given this, it makes sense for researchers to dive deeper and essentially question the rules of the game. My personal view is that nature is simple, and if nature is the paradigm we are striving towards, we should aim for AI systems that similarly capture complex ideas with simplicity. With the amazing folks and projects at institutions like OpenAI, DeepMind, and Google Brain, we’re certainly going somewhere in regard to questions of model consciousness, cross-domain reasoning, and the grand task of AGI. As my mentor, Prof. Bernie Widrow, said, “Nature is the best machine of them all.”
Short-term Thoughts and Ideas
Given this discussion of GPT-N models, AGI, and nature as a model, it’s certainly difficult to predict where we will be in 5 years, 10 years, 20 years, 50 years, etc. Lots of potential exists, and ChatGPT could be just a taste of what LLMs can bring to the world. This could be more than just an iPhone moment. As Kevin Kelly references in his books, technology exists to create new forms of understanding. Moreover, technology often mimics the grandest system of them all, life:
“There is no communication without the nerves of electricity. There is no electricity without the veins of coal mining, uranium mining, or even the mining of precious metals to make solar panels. There is no metabolism of factories without the ingest of food from domesticated plants and animals, and no circulation of goods without vehicles. This global-scaled network of systems, subsystems, machines, pipes, roads, wires, conveyor belts, automobiles, servers and routers, institutions, laws, calculators, sensors, works of art, archives, activators, collective memory, and power generators — this whole grand system of interrelated and interdependent pieces forms a very primitive organism-like system.”
Today, developers and enthusiasts around the world are fortunate enough to start building out vertical applications through APIs that access the GPT-N models. These generative models are creativity maximizers: new forms of content can be produced, new ways to iterate on content may become relevant, and human-computer interaction will change. However, whether it be text, audio, or images, humans will for now still need to find creative applications of the LLMs. With the coming release of GPT-4 and additional models across multiple modalities, I couldn’t be more excited to see what’s in store next. From trillion-parameter models to multimodal SOTA performance, the world is changing week by week, and with OpenAI leading the way as the modern Bell Labs, there’s so much more to come.
*ChatGPT also helped relieve writer’s block and aided with multiple sections as I wrote this article :)