The Emergence of the Transformer. The Architectural Breakthrough, History, and the Future of The Model Behind ChatGPT

Aug 15, 2023

Originally published on Substack on June 7, 2023

It is becoming increasingly clear that the complexities of human intelligence are not as insurmountable as once thought. Tasks once believed to be exclusive to human cognition, such as language generation and problem-solving, are now being replicated by machines with startling accuracy. With the advent of transformer models like GPT-3, the next technological renaissance in Silicon Valley is beginning to be unleashed. Since its launch in November 2022, ChatGPT has experienced parabolic growth, surpassing the 1 million user mark in just five days. As of March 2023, ChatGPT reigns as the king of generative AI, attracting over 1 billion visits per month. But it’s not alone in this arena: other generative AI products like Google Bard, GitHub Copilot (AI pair programmer), Jasper (content creator), and text-to-image generators such as DALL-E and Midjourney have skyrocketed in popularity as well. As these generative models continue to transform the way we communicate, write, learn, and develop code, it’s clear that generative AI is here to stay and will create tremendous economic value.

So What is ChatGPT Doing Exactly

One look at the basic structure of a transformer, and it is hard to believe that the model behind ChatGPT is generating text with any sort of human-like capability. From a very high level, the fundamental task that ChatGPT performs is this: given an input sequence, predict the most probable next word. (This is next-token prediction; some other transformer models are instead trained with masked language modeling.)

Specifically, ChatGPT takes in a sequence of textual input and is able to contextualize the information. It is able to understand how much each word relates to the others in the input, and from that knowledge and prior experience from training data, it creates a probability distribution over all possible words and adds the most probable next word to the sequence. The resulting phrase is then fed back through the model iteratively until the desired output length is reached.

As a few caveats, instead of handling words, the model behind ChatGPT actually handles “tokens,” which could be just a part of a word, a letter, or a prefix/suffix, which is why it can sometimes make up new words. Additionally, the model doesn’t always pick the word with the highest probability. Rather, the model uses a ‘temperature’ parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that not always predicting the most probable word leads to better text results. This randomness is why if ChatGPT is given the same prompt, it does not always produce the same result.
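The effect of temperature can be sketched in a few lines. This is a minimal illustration, not how ChatGPT is actually implemented: the candidate words, their scores (logits), and the function names are all hypothetical, and only the Python standard library is assumed.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the prompt "The cat sat on the"
vocab = ["mat", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.0, -1.0]

cold = softmax_with_temperature(logits, temperature=0.2)  # top word dominates
warm = softmax_with_temperature(logits, temperature=1.5)  # flatter, more varied
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
print(sample_next_token(vocab, logits, temperature=1.5, rng=random.Random(0)))
```

With a low temperature the top-ranked word is chosen almost every time; with a higher one, lower-ranked words are picked more often, which is where the run-to-run variation comes from.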

Part 1 — Preceding NLP Models — The Era of CNNs and RNNs

Natural Language Processing (NLP) has long been a challenging subfield of Artificial Intelligence that focuses on developing algorithms to understand, process, and generate human-like language. Before the release of the Transformer in 2017, NLP models that could understand and generate text with human-like capabilities appeared to be an insurmountable mathematical task. Human language is rich and nuanced, with many levels of meaning and context, and despite the advancements in CNNs and RNNs, it appeared the industry was still decades away from NLP-optimized models with advanced text generation and comprehension abilities.

Convolutional Neural Networks (CNNs) were initially designed for, and are still most commonly used for, image-processing tasks like recognizing objects in pictures and videos. CNNs analyze images by splitting them into smaller segments and extracting local patterns and structures from each segment. These features are then combined to build a higher-level understanding of the image. When it comes to natural language processing, CNNs treat text much like an image and similarly split the textual input into smaller segments. Specifically, CNNs for NLP operate on a 1-dimensional sequence of words (i.e., the text) instead of the 2-dimensional matrix of pixels in an image and apply convolutional filters to each segment to extract local features such as specific word combinations or patterns. These local features are then synthesized to build a higher-level understanding of the entire textual input. Given their ability to divide textual inputs into independent segments, Convolutional Neural Networks offer a highly parallelizable computational approach. This parallelizability extends to both the training process and the post-training inference stage, enabling CNNs to distribute the computational tasks of processing input text across multiple GPUs. As a result, CNNs are highly scalable and efficient to train on large datasets.
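As a rough sketch of the idea, a single convolutional filter sliding over a sentence could look like the following. The word vectors and the filter weights are hypothetical toy values; a real CNN learns many filters and works with far larger vectors.

```python
def conv1d(sequence, kernel):
    """Slide a filter over a sequence of word vectors; one feature per window."""
    width = len(kernel)
    features = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        features.append(sum(
            w * x
            for kernel_row, word_vector in zip(kernel, window)
            for w, x in zip(kernel_row, word_vector)
        ))
    return features

# Hypothetical 2-dimensional word vectors for a 5-word sentence
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# Hypothetical filter spanning 2 adjacent words
kernel = [[1.0, 0.0], [0.0, 1.0]]

features = conv1d(sentence, kernel)
print(features)
```

Each window is computed without reference to any other window, which is why the work can be split across many processors.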

While Convolutional Neural Networks (CNNs) have shown impressive results in tasks such as sentiment analysis and text classification, they have severe limitations in comprehending and generating language. Text is sequential, and the meaning of sentences often depends on the context created by previous words. Since CNNs analyze the input in parallel through small local windows, they capture little of the sequential nature of the text beyond each window and struggle to capture the relationships between distant words. This makes CNNs a poor choice for large language modeling and specifically for the task of “predicting the next word given an input”.

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) were introduced to address this limitation by allowing information to flow from previous inputs to the current one. Recurrent Neural Networks process words individually and maintain a memory of previous inputs to improve subsequent predictions. At each time step, the RNN recurrently processes a word and considers the output from the previous time step. The resulting output is then fed to the next time step, and the model iteratively performs this process on the entire input sequence. By considering previous inputs when processing the current word, NLP-optimized RNNs build an understanding of the relationships between words, allowing them to contextualize information very well.
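The recurrence described above can be sketched as follows. This is a minimal illustration with hypothetical weights and toy 2-dimensional word vectors; it assumes the common formulation h_t = tanh(W_x x_t + W_h h_(t-1)), which is one of several RNN variants.

```python
import math

def rnn_step(x, h_prev, w_x, w_h):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_(t-1))."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(w_x[i], x))
            + sum(wh * hj for wh, hj in zip(w_h[i], h_prev))
        )
        for i in range(len(h_prev))
    ]

def run_rnn(inputs, w_x, w_h, hidden_size):
    """Process the sequence one word at a time, carrying the hidden state forward."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:  # each step needs the previous step's result
        h = rnn_step(x, h, w_x, w_h)
        states.append(h)
    return states

# Hypothetical weights and word vectors
w_x = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [0.0, 0.2]]
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(sentence, w_x, w_h, hidden_size=2)
print(states[-1])
```

Because the hidden state is threaded through the loop, the final state depends on word order, and no step can begin before the previous one has finished.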

However, since RNNs process words sequentially, they have a limited ability to capture long-term dependencies and struggle in scenarios where the current word being analyzed depends on text far back in the input sequence. While LSTMs and GRUs are forms of RNNs that are capable of building longer-term dependencies, the inherent structure of recurrence in these networks makes them prone to the same problem as traditional RNNs, only at a longer range. Furthermore, RNNs, LSTMs, and GRUs do not parallelize well and therefore cannot be trained efficiently on large datasets. Because each time step depends on the previous one, the computation cannot be spread across multiple GPUs in the way a CNN’s can, creating an economic and time-capacity ceiling on the amount of data that can be used to train the network.

Transformer-ing NLP, The Rise of GPT

Enter the transformer, a neural network-based architecture that was originally introduced in a 2017 paper written by Google researchers called “Attention Is All You Need”. Initially introduced for language translation, this model aimed to tackle the limitations of traditional RNNs and CNNs by processing all the words in an input in parallel (like CNNs) while still accounting for word positioning (like RNNs). The proposed transformer was revolutionary because it could parallelize across multiple GPUs, making it possible to train larger language models on enormous training datasets while still having the ability to learn the complex relationships between words. Following the release of the paper, OpenAI and other corporations began to experiment with scaling the transformer in both the number of parameters and the size of the training dataset.

Specifically, OpenAI scaled the transformer from 117 million parameters in the original GPT-1 to the groundbreaking GPT-3 model, which boasts a staggering 175 billion parameters. The training dataset for GPT-3 is estimated to have consisted of almost a trillion words, over 40 terabytes of text, and was so large that it required an estimated 10,000 Nvidia A100 GPUs (each costing around $10,000–15,000) and still took several days to train.

To get a sense of the scale of just how big these models are, the image below depicts a simple neural network called a multi-layer perceptron (MLP). Similar to the structure of the brain, which is composed of neurons interconnected by dendrites, neural networks structurally mimic the brain by replacing neurons with nodes (still often referred to as neurons) and dendrites with connections. Each of these connections has an associated numerical value (weight/parameter) representing the intensity of the relationship between the respective neurons. GPT-4 is estimated to have 1 trillion parameters.

Part 2 — Inside the Transformer: Unveiling The Inner Workings of ChatGPT’s Architecture

To understand how the transformer works, let’s first explore one of its original applications: language translation. Consider the task of translating a sentence from English to Spanish. Simply translating each word individually would produce entirely nonsensical output. A more effective approach would be to understand the sentence contextually, taking into account word positions and their relationships with one another, then using this understanding to generate the Spanish phrase word by word. For all language modeling tasks, this is precisely what the transformer does. It takes in an input sequence, comprehends word positioning, and numerically scores the relationships between each word in the input. Armed with this higher-level understanding, the transformer decodes this information to predict the most probable next word.

From a high level, the transformer architecture can be divided into two key steps: the encoder and the decoder. The encoding layers take the input sequence (English text) and construct an abstract continuous representation of its features. This higher-dimensional encoded representation encapsulates the complex relationships between words, such as syntactic structure or co-occurrence. Then, the decoder attends to the encoded input and uses it to understand the relationships between each word in the input and the current word being generated. Based on this understanding and the previously generated words, the decoder produces a probability distribution over all the possible next words. The decoder then selects the most probable (or near-most probable) word and appends it to the input sequence.

The Idea of a Semantic Meaning Space

The encoding layers model the relationships between words by turning words into vectors. To understand this concept, we can think of mapping words to points (2d vectors) on a coordinate plane. For instance, the word cat can be represented by the vector [2,8], while the word dog is represented by [1.8,7.9]. The distance between points reflects the closeness of meaning or intensity of the relationship between words. In this way, the task of numerically scoring the relationships between words is now simply achieved by measuring the distance between the vector representations within a semantic meaning space.
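Using the two vectors from the example, the distance calculation can be sketched as below. The vectors for cat and dog come from the text; the vector for car is a hypothetical addition for contrast.

```python
import math

embeddings = {
    "cat": [2.0, 8.0],  # from the example above
    "dog": [1.8, 7.9],  # from the example above
    "car": [9.0, 1.0],  # hypothetical unrelated word
}

# math.dist returns the Euclidean distance between two points
cat_dog = math.dist(embeddings["cat"], embeddings["dog"])
cat_car = math.dist(embeddings["cat"], embeddings["car"])

print(round(cat_dog, 3))
print(round(cat_car, 3))
```

The small distance between cat and dog and the large distance between cat and car is the numerical counterpart of "related" and "unrelated".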

Encoding Layers in Depth (Technical)

Vector Embeddings

Before the encoding layers process an input, ChatGPT starts by converting the text sequence into numerical token representations. GPT-3 employs a vocabulary of 50,257 unique tokens, each assigned a specific numerical value. By mapping the input sequence to these numerical values, the text is transformed into a sequence of numbers. These tokens are then further encoded into higher-dimensional vectors. Specifically, in GPT-2, the vectors have a length of 768, while in GPT-3, the vectors have a length of 12,288. This transformation allows the model to work with numerical data, as each token is now represented by a coordinate within a semantic meaning space of dimension 12,288 for GPT-3. The dimensionality of the vectors is set so high, rather than 2 or 3, because higher-dimensional vectors provide a rich and expressive space in which the model can learn complex linguistic features and relationships.

Alongside word embeddings, the model also generates positional vector encodings for each token. Positional encodings are a significant advancement over previous NLP models because they allow the model to account for the order of words without needing to sequentially process the input. Armed with these vectors representing the position of each token, the transformer then sums the two sets of vectors to create a single encoding vector for each token. The resulting sequence of encoding vectors now contains critical information on both the meaning and position of each token in the input.
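A minimal sketch of this summing step is shown below. It assumes the sinusoidal positional encoding from the original transformer paper (GPT models instead learn their positional vectors during training), and the vocabulary, vector size, and random embedding table are hypothetical toy values.

```python
import math
import random

D_MODEL = 8  # toy size; the text cites 12,288 for GPT-3

def positional_encoding(position, d_model=D_MODEL):
    """Sinusoidal encoding for one position: sine on even indices, cosine on odd."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Hypothetical 4-token vocabulary with randomly initialised embedding vectors
rng = random.Random(0)
token_ids = {"the": 0, "cat": 1, "sat": 2, "down": 3}
embedding_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in token_ids]

def encode(tokens):
    """Look up each token's embedding and add the encoding of its position."""
    encoded = []
    for position, token in enumerate(tokens):
        word_vector = embedding_table[token_ids[token]]
        pos_vector = positional_encoding(position)
        encoded.append([w + p for w, p in zip(word_vector, pos_vector)])
    return encoded

vectors = encode(["the", "cat", "sat", "the"])
# The same word at two different positions gets two different encoding vectors
print(vectors[0] != vectors[3])
```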

Multi-Headed Self-Attention

Following the transformation of words into vector embeddings, the transformer processes the input by feeding it through a sequence of encoding layers. Each encoding layer consists of two key components: a multi-headed self-attention mechanism and a position-wise fully connected feed-forward neural network. Self-Attention, the transformer’s most revolutionary component, is a mechanism designed to address the computational challenge associated with capturing contextual information. Rather than treating all words equally, self-attention allows the model to focus on the most relevant words for each prediction, considering their relationships and dependencies within the sequence. This allows the transformer to focus not only on nearby words but also on words that may appear in sentences far back in the input.

The multi-headed self-attention mechanism within each of GPT-3’s layers is composed of 96 attention heads. Rather than dividing the words among the heads, every head processes every embedding in the sequence, but each head works in its own lower-dimensional slice of the representation: with 12,288-dimensional embeddings and 96 heads, each head operates on 128-dimensional vectors. For the first encoding layer, the input is the original set of embedding vectors following the transformation of words into vectors. For subsequent encoding layers, the input to the attention mechanism is the output from the previous encoding layer (still a sequence of embedding vectors, yet transformed).

Each attention head then compares each embedding to the embeddings of all the words in the sequence. This is achieved by generating query, key, and value vectors for every embedding, using projection weights that the head learns during training. Because each head learns its own projections, different heads can specialize in different kinds of relationships between words.

To evaluate the relationship between the current embedding being attended to and all other embeddings in the input, the attention head calculates the dot product between the query vector of the respective embedding and the key vectors of all the other embeddings in the input sequence. This dot-product operation, called dot-product attention, measures the similarity or relevance between the query and key vectors.

To ensure that the resulting scores are appropriately scaled, they are divided by the square root of the dimensionality of the key vectors. The scaled scores are then passed through a softmax function, which converts them into attention weights. These attention weights represent the relevance of each word in the context of the current word being attended to. Finally, the attention weights are multiplied by the value vectors to obtain the weighted sum of the value vectors, which captures the contextual information from the attended words. This process is repeated for each attention head, allowing the model to capture different types of relationships and dependencies within the input sequence.
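The steps above (dot product, scaling, softmax, weighted sum) can be sketched for a single attention head as follows. The query, key, and value vectors here are hypothetical hand-picked numbers; in a real transformer they are produced by learned projections of the embeddings.

```python
import math

def softmax(scores):
    """Convert scores to weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention for one head (no masking)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical query/key/value vectors for a 3-token input
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the input, including itself.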

Finally, the outputs of all the attention heads are concatenated to produce the terminal encoded representation of the input sequence. This representation is then fed through a feed-forward neural network that allows the model to capture the higher-level features and contextual information supplied by the self-attention layer. The output of the neural network is then passed as the input to the next encoding layer. This process continues iteratively through all the encoding layers of the transformer.

The Decoding Stage and Generating the Target Sequence

Following the encoding layers, the decoder is responsible for generating the target sequence, word by word, based on the encoded representation produced by the final encoding layer. In each decoding layer, the transformer similarly contains self-attention and neural network layers as in the encoding stage to compare and score the components of the decoder’s input sequence. However, the decoder also incorporates an additional attention mechanism called encoder-decoder attention. This attention mechanism allows the decoder to consider the encoded input representation while generating each word of the target sequence. It attends to the encoded input, understanding the relationships between each word in the input and the word currently being generated. This enables the decoder to make informed predictions based on the contextual information from both the input sequence and the previously generated words.

Once the attention mechanisms have captured the relevant contextual information, the decoder passes the output through a feed-forward neural network resulting in a probability distribution over the possible next words in the output sequence. This distribution represents the likelihood of each token in the vocabulary being the next token in the sequence. The decoder then selects the most probable (or near-most probable) token based on this distribution and appends it to the input sequence. This iterative process continues until the desired length of the output sequence is reached, or a special end-of-sequence token is generated.
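The generation loop itself is simple once a model can produce next-token probabilities. The sketch below substitutes a hypothetical lookup table for the decoder (a real decoder conditions on the whole sequence, not only the last token) and always picks the most probable token.

```python
# Hypothetical stand-in for the decoder: maps the last token to next-token probabilities
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}

def toy_decoder(sequence):
    """Return a probability distribution over the next token."""
    return NEXT_TOKEN_PROBS[sequence[-1]]

def generate(prompt, max_new_tokens=10):
    """Append the most probable token until <end> or the length limit is reached."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_decoder(sequence)
        next_token = max(probs, key=probs.get)  # greedy choice
        if next_token == "<end>":
            break
        sequence.append(next_token)
    return sequence

print(generate(["<start>"]))
```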

Part 3 — The Role of Neural Networks

So, we’ve just explored the nitty-gritty details of the transformer architecture, with its encoder and decoder components and all the fancy math involved in scoring word relationships. But let’s take a step back and ask the big question: why does this whole thing actually work? How does the model learn word embeddings? And how does it use that higher-level understanding to predict what comes next? Well, the answers to these burning questions lie in the inner workings of neural nets.

A Quick and Dirty Guide to Neural Nets

Consider the data above. This arrangement of points visually suggests a linear model could be a good fit. We can derive a function of the form f(x) = ax + b to fit this dataset. With this function, we can predict the output of future data points, for example, f(100) = a(100) + b. Through a simple linear regression, one can find the constants (a,b) so that the function f(x) = ax+b is the best-fit line for the given data points.
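As a small sketch of such a fit, the closed-form least-squares solution for a line can be computed directly. The data points below are hypothetical, chosen to lie roughly along y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b for f(x) = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data points
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)

def predict(x):
    return a * x + b

print(round(a, 2), round(b, 2), round(predict(100), 2))
```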

This process of fitting a model to input-output pairs holds true for more complex relationships as well. For instance, we can use polynomials like f(x) = a_2 x^2 + a_1 x + a_0 to model more complicated relationships between numbers. Numerical analysis techniques, including polynomial interpolation, help derive the best-fit functions for these relationships.

The overarching concept with these models is that regardless of the set of data points, there’s always a method to fit an equation or, in other words, to adjust a set of parameters {a_0, a_1, …, a_n} so that the model best describes the dataset.

Neural Networks Are Function Approximators Too!

While traditional statistical modeling approaches use a predefined mathematical function to fit a model to input-output pairs, neural networks offer greater flexibility in modeling more complicated relationships. Neural networks consist of neurons and connections with learnable weights (parameters), enabling them to learn complex relationships between inputs and outputs by iteratively fine-tuning the parameters. This allows neural networks to be trained to model nonlinear relationships between variables without any pre-existing mathematical understanding of the association between input and output pairs.

A Simple Neural Network For Image Recognition

Let’s explore how neural networks learn with the example of a model designed for image recognition. Suppose we want the model to analyze images consisting of 256 pixels that depict hand-drawn digits from 0 to 9. Our objective is to train the model so that it can accurately predict the number represented by the input image.

Rather than trying to derive a complicated mathematical function, we can train a multilayer perceptron (MLP) to serve as our prediction function.

Our neural network to solve this problem is structured with multiple layers of interconnected neurons. It begins with an input layer of size 256, followed by two hidden layers with 16 neurons each, and ends with an output layer of size 10. When an image is fed into the network, it is represented as a vector of size 256, denoted as [x1, x2, …, x256], where each xi represents the grayscale value of a specific pixel in the image. The data flows through the network, starting from the input layer, passing through the connections to the hidden layers, and finally reaching the output layer. At the output layer, each node’s value represents the probability (ranging from 0 to 1), indicating how likely the input image corresponds to a specific digit.

Each layer in the network is connected to the previous layer by connections with corresponding weights. In our neural network, each neuron is connected to every single neuron in the subsequent layer. The design choice of having two hidden layers, 16 neurons in each hidden layer, and every pair of neurons in adjacent layers connected is arbitrary. Generally, the more parameters and neurons a model has, the more complex the relationships it is able to represent.

To be more specific on how input images turn into accurate predictions, the values of the neurons following the input layer are calculated via simple weighted sums. At a high level, disregarding bias and activation functions, the value of each neuron in the second layer is determined by taking a weighted sum of all the neurons connected from the input layer, multiplied by their respective weights.

Similarly, the value of each neuron in the third layer is computed by taking a weighted sum of all the connecting neurons from the second layer. This iterative process continues until reaching the output layer, where each neuron’s value represents a probability (ranging from 0 to 1) for each of the possible digits the image represents (0–9).

As a quick caveat, the weighted sums for each neuron in both the hidden and output layers are run through a sigmoid function, which guarantees that the value of the neuron is between 0 and 1. This is important because the output layer represents a probability distribution, therefore, requiring a bound on the value of each output neuron to be between 0 and 1.
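A forward pass through this 256-16-16-10 network can be sketched as below. The random weights, zero biases, and random "image" are hypothetical placeholders, so the output is meaningless; the point is only the flow of weighted sums and sigmoids from layer to layer.

```python
import math
import random

LAYER_SIZES = [256, 16, 16, 10]  # input, two hidden layers, output

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(layer_sizes, rng):
    """Random weights and zero biases for each pair of consecutive layers."""
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(network, pixels):
    """Propagate pixel values through every layer: weighted sum, then sigmoid."""
    activations = pixels
    for weights, biases in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

rng = random.Random(42)
network = init_network(LAYER_SIZES, rng)
image = [rng.random() for _ in range(256)]  # hypothetical grayscale values in [0, 1]

output = forward(network, image)
print([round(p, 3) for p in output])
```

Note that the sigmoid bounds each output between 0 and 1 but does not by itself make the ten outputs sum to 1; a softmax on the final layer is a common alternative when that is required.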

So Why Does This Work?

When a neural network is instantiated, the model produces entirely nonsensical probability distributions when fed input images. An untrained neural network starts with randomly assigned weights, and thus the output layer’s probability distribution is entirely random. For example, consider feeding a freshly instantiated neural network an image depicting a 3. Recall that the values fed into the input layer, [a_1^(0), …, a_256^(0)], are the grayscale values of the 256 pixels in the image. Thus, when taking weighted sums to get the values of the neurons in the first hidden layer [a_1^(1), …, a_16^(1)], then the second hidden layer [a_1^(2), …, a_16^(2)], and finally the output layer [p_1, …, p_10], the random choice of w’s leaves the neurons in the subsequent layers with seemingly random values (contained between 0 and 1 due to the sigmoid functions). The final output layer might look something like [0.163, 0.239, 0.021, 0.054, 0.163, 0.067, 0.013, 0.121, 0.092, 0.067], representing the random probabilities that the image is each of the digits 0 to 9. This output vector of probabilities tells us nothing reliable about which digit the image represents. However, we can use this output vector to train the model by comparing its values to what the expected output should have been, [0,0,0,1,0,0,0,0,0,0] for the image of a 3, and accordingly adjusting the parameters (w’s) to minimize the error.

Model Training

During the training process of a neural network, images and their corresponding expected output vectors are repeatedly presented to the model. By comparing the produced output vector with the expected one, the model uses the resulting error to adjust its parameters.

Our image recognition neural network, with its 256-pixel input, has (256*16) + (16*16) + (16*10) = 4,512 weights, plus 16 + 16 + 10 = 42 biases (ignore these parameters for now), for 4,554 parameters in total. The figure of 13,002 used in the rest of this section corresponds to the same architecture applied to 784-pixel (28×28) images: (784*16) + (16*16) + (16*10) + 42 = 13,002. The objective of the training is to find the optimal set of w’s, denoted as [w1, …, w13,002], that minimizes the error between the actual and expected output vectors. This task is accomplished using gradient descent, with a technique called backpropagation computing the gradients that each descent step needs.
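The parameter count follows directly from the layer sizes, so it can be checked with a small helper. The function name is hypothetical, and the second call uses a 784-pixel (28×28) input purely for comparison.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected network with these layer sizes."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

for sizes in ([256, 16, 16, 10], [784, 16, 16, 10]):
    weights, biases = count_parameters(sizes)
    print(sizes, weights, biases, weights + biases)
```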

Minimizing the Error through Backpropagation and Gradient Descent

The process starts by defining a cost function that maps a point in the 13,002-dimensional parameter space to a single number representing the network error. The cost function can be expressed as C(w1, …, w13,002) = Σ(error), where the inputs are the 13,002 current parameters of the network and the output is the sum of the mean squared errors (MSE) between the expected vectors and the predicted outputs in a training dataset. Rather than creating a separate cost function for each image, it is more efficient to calculate the MSE for a batch of images, with the resulting errors summed to form a single cost function for the batch. Then, the process of optimizing the network, or finding the best set of parameters, involves finding a minimum for this cost function, C(w1, …, w13,002) = Σ(error). In other words, the objective is to select the best combination of w’s, [w1, …, w13,002], such that C(w1, …, w13,002) = 0 or is sufficiently close to 0.
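For a fixed set of network outputs, the cost calculation can be sketched as follows. The function names are hypothetical, and the prediction reuses the example output vector for the image of a 3 from earlier in the article.

```python
def one_hot(digit, size=10):
    """Expected output vector: 1 at the correct digit, 0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(size)]

def mean_squared_error(predicted, expected):
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)

def batch_cost(predictions, targets):
    """Sum of the per-image mean squared errors over a batch."""
    return sum(mean_squared_error(p, t) for p, t in zip(predictions, targets))

# Output of the untrained network for an image of a 3 (from the example above)
untrained = [0.163, 0.239, 0.021, 0.054, 0.163, 0.067, 0.013, 0.121, 0.092, 0.067]

cost_untrained = batch_cost([untrained], [one_hot(3)])
cost_perfect = batch_cost([one_hot(3)], [one_hot(3)])
print(round(cost_untrained, 4), cost_perfect)
```

Training amounts to changing the weights so that the first number moves toward the second.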

Minimizing Error

To comprehend how training iteratively approaches a minimum for the cost function, let’s examine the process of finding a minimum for a single-variable cost function. For instance, consider a cost function expressed as C(w) = Σ(error), where w represents the weight of the single parameter in a neural network containing two nodes and one edge connecting the two. The summed error still represents the difference between the actual and expected outputs from a labeled dataset.

To find the minimum of this cost function, we can use the second derivative test from calculus. First, we calculate the derivative of C(w), which tells us how the cost function changes as we adjust the value of w. Then, we calculate the second derivative, which indicates the curvature of the cost function. By evaluating the second derivative at a critical point (where the derivative equals zero), we can determine whether it corresponds to a local minimum or maximum. If the second derivative is positive, it means the cost function is “bowl-shaped,” and the critical point is a local minimum. With this local minimum value wmin, we can restructure our neural net using the value as the weight of the sole parameter to guarantee that the error of our model is minimized with respect to the training dataset.

Now, let’s extend this idea of single variable cost functions to the realm of multivariable functions. We can think of the gradient of the function, ∇C, as a generalization of the derivative to multivariable functions. Just as the derivative measures the rate of change of a function with respect to a single variable, the gradient provides us with a vector of partial derivatives that capture the rate of change with respect to each variable. Thus, the positive gradient of a function represents the direction of steepest ascent, indicating how the function increases with respect to each variable. In our case, by computing the negative gradient of the cost function, C(w1, …, w13,002), we can determine the direction of the greatest decrease in the cost function (minimizing the error). Thus, starting at the initial point [w1, …, w13,002], which was the random selection of parameters for our neural net, we can take a step in the direction of the negative gradient which leads to a new point (or new set of parameters) in which the cost function produces a smaller error output.

This step is repeated many times, each time recomputing the gradient at the new point, until the error is sufficiently low. The final set of parameters, [w1, …, w13,002], can now be used as the weights of our image recognition network, which has become a better predictor of images.
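The loop of "compute the gradient, step against it, repeat" can be sketched on a two-parameter cost function. The cost function, starting point, and learning rate below are hypothetical; a real network would obtain the gradient of its 13,002-parameter cost via backpropagation rather than from a hand-written formula.

```python
def cost(w):
    # Hypothetical bowl-shaped cost with its minimum at w = (3, -1)
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    # Vector of partial derivatives of the cost above
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    history = [cost(w)]
    for _ in range(steps):
        grad = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
        history.append(cost(w))
    return w, history

w_final, history = gradient_descent([0.0, 0.0])
print([round(wi, 4) for wi in w_final])
print(history[0], history[-1])
```

The learning rate controls the size of each step: too small and the descent is slow, too large and the steps can overshoot the minimum.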

So How Does This Work for Language?

Extending the idea of neural networks to large language modeling, the training dataset for ChatGPT is sourced largely from snapshots of the public web captured between 2016 and 2019. This extensive dataset, extracted from a database known as Common Crawl, initially comprised a staggering 45 terabytes of data. To ensure high quality, a machine learning model filtered this dataset down to 570 gigabytes.

During the training process, the transformer’s neural networks allow the model to learn the complex structures of word embeddings, positional encodings, and value vectors. Through this iterative learning process, the final neural network outputs a probability distribution over the 50,257 possible tokens.

Transformers learn by a self-supervised method in which the model quizzes itself on a text dataset. It takes a segment of text, conceals certain words, and attempts to predict the missing words. Subsequently, the model unveils the actual answer and compares it with its own prediction. The resulting error is then used to accordingly adjust the weights, nudging them in a direction that minimizes the error. Since the model conceals certain parts of the text dataset on its own and the answers are inherently in the data, transformers do not require a manually labeled dataset (unlike image recognition datasets). This is critical to the success of the transformer as the self-supervised nature of the training allows these models to be trained on large text datasets from the open web.
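The way raw text supplies its own labels can be sketched for the next-word case. The sentence, the whitespace tokenization, and the context size are hypothetical simplifications; real models work on subword tokens and much longer contexts.

```python
def next_token_examples(tokens, context_size=3):
    """Build (context, target) pairs where the target is the token that follows the context."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

text = "the cat sat on the mat"
tokens = text.split()  # whitespace split stands in for a real tokenizer

examples = next_token_examples(tokens)
for context, target in examples:
    print(context, "->", target)
```

Every pair is a quiz question whose answer was already sitting in the text, which is why no human labeling is needed.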

Final Thoughts

Innovation in the generative AI space seemingly compounds daily as the transformer model is specialized to new domains or improved at the infrastructure level. The remarkable success of the transformer can be attributed, in part, to the simplicity of its fundamental task: predicting the next word. Although simple, this task is how the transformer acquires intelligence, by performing it iteratively at extreme scale. In the process of learning to predict the next word with extreme accuracy, the model is learning our language, our culture, and our behavior at a really deep level and in ways we do not fully understand yet.

As these models continue to evolve, they will undoubtedly have a transformative impact across all sectors. These models will be fine-tuned to companies/industries and will be connected to scripts that automate their generation.

Generative AI will replace many current jobs, but the transformer will not replace workers. Rather, the transformer will enhance the skill set of workers across all industries, as its capabilities can multiply human efficiency tenfold or more. However, going forward, research and innovation must focus on understanding the complexity of neural nets and using their structures to maximize their positive impact. The future of generative AI will be driven by understanding these complexities and using the intelligence to maximize human efficiency.