The Future of AI: The Transformer and How We Got There

Tomohiro Oga
7 min read · May 8, 2024

When you think of the word “Transformer,” what comes to mind? While many may say the robot hero cars from the Transformers franchise of the 1980s, I can only think of one thing: the machine learning architecture that broke the field of Artificial Intelligence.

The recent buzz around Artificial Intelligence came partly from the introduction of ChatGPT: a chatbot that excited the public with its ability to follow instructions and provide relevant, context-aware responses, coherent in a way that had never before been seen, let alone been accessible to the wider population. Now, it seems as though AI is being integrated into every sector imaginable. What many people do not realize, however, is that the technology that powers these chatbots, the Transformer, has been around since 2017, and broke the field at the time.

Since when has AI ever had this level of complexity and skill? My hope is to answer that question and give you a better picture of how we got here. To do that, let’s dive into the history of sequential data processing. First, what exactly is sequential data? Let’s answer that.

What is Sequential Data?

Sequential data refers to a type of data that is logically ordered over time or position, where the order in which these data points are arranged is critical to understanding their dependencies [1]. Its biggest implication has to do with the field of Natural Language Processing (NLP).

NLP is an entire body of practices within machine learning that deals with the understanding and manipulation of human language by computers. Natural language is inherently sequential:

The sentence “The cat jumps quickly over the dog,” with arrows showing how sequential text is interconnected

Each individual word is heavily interconnected with the others. More importantly, sentence structures rely heavily on certain sequential conventions to form coherent language. What if we want to take a sequence of words and have a machine learning model predict the next word? The answer: Recurrent Neural Networks (RNNs).

Memory in Neural Networks?

Recurrent Neural Networks (RNNs), introduced in the 1980s [2], are a type of neural network specifically designed to process sequential data. The output from a neuron at one time-step is fed back into that same neuron at the next. This allows a model to understand and aggregate sequential dependencies, as the previous time-step influences the current time-step. It gives a neuron a sort of memory, and such a neuron is referred to as a memory cell due to its ability to preserve information across time-steps. These models are well suited to handling sequential data, such as text. Similarly to how humans process information, more weight is given to recent information to provide context for predicting the sequence [3].
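
To make that feedback loop concrete, here is a minimal sketch of a single vanilla (Elman-style) RNN step in NumPy; the dimensions and variable names are purely illustrative, not from any particular library:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the previous hidden state h_prev
    is fed back in alongside the current input x_t."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 8-dimensional inputs, 16-dimensional hidden state.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Unroll the same cell over 5 time-steps: the hidden state
# acts as the cell's "memory" of everything seen so far.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (16,)
```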

RNN Cell, rolled and unrolled
Diagram of Simple RNN by Tomohiro Oga

However, as sequences get longer, long-term dependencies get lost through a process known as the Vanishing Gradient Problem (VGP). While the math behind VGP is out of the scope of this article, the key idea can be simplified to this: as new inputs propagate through a single recurrent neuron, older inputs get ‘diluted,’ and thus, important long-term dependencies get diminished.
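
A loose way to see this “dilution” numerically: the gradient that reaches an early time-step is, roughly speaking, a product of one factor per step, and if each factor sits a little below 1, the product collapses as the sequence grows. A toy sketch with an assumed, purely illustrative per-step factor:

```python
# Toy illustration of the vanishing gradient: repeatedly multiplying
# a factor slightly below 1 shrinks the signal from early time-steps.
per_step_factor = 0.9  # assumed value, purely for illustration
for sequence_length in (5, 20, 100):
    gradient_scale = per_step_factor ** sequence_length
    print(f"{sequence_length:>3} steps -> gradient scaled by {gradient_scale:.6f}")
# 5 steps -> ~0.59, 20 steps -> ~0.12, 100 steps -> ~0.000027
```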

Consider the following sentence with the last word removed:

“I traveled to Japan this previous summer, and there I learned how to speak [??]”

For us, it would be easy to conclude that the most logical next word is “Japanese.” However, if a standard (or vanilla) RNN were asked to predict the last word, the distinct gap between “traveled to Japan” and “learned how to speak” might cause the network to guess something like “French.” Because of the VGP, RNNs struggle to learn long-range dependencies across time-steps: earlier tokens that are crucial to the overall context end up carrying little weight. To help fix this short-term memory problem, a better architecture was introduced: the Long Short-Term Memory (LSTM) network.

LSTM: The Solution?

Long Short-Term Memory (LSTM) networks, introduced in 1997 by Hochreiter and Schmidhuber [4], are a type of RNN made to solve the vanishing gradient problem. They use a combination of “input,” “output,” and “forget” gates that work in tandem to discard useless information and preserve the more relevant or important information within the cell state.
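
As a rough sketch of that gating, here is one LSTM step in NumPy following the standard formulation; the parameter names and toy dimensions are illustrative, and real implementations differ in details such as initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold parameters for the four components,
    keyed 'i' (input), 'f' (forget), 'o' (output), 'g' (candidate)."""
    i = sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])  # input gate
    f = sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])  # forget gate
    o = sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])  # output gate
    g = np.tanh(x_t @ W['g'] + h_prev @ U['g'] + b['g'])  # candidate values
    c = f * c_prev + i * g   # forget old info, admit new info into the cell state
    h = o * np.tanh(c)       # expose a filtered view of the cell state
    return h, c

# Toy instantiation: 8-dim inputs, 16-dim hidden/cell state.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_in, d_h)) for k in 'ifog'}
U = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in 'ifog'}
b = {k: np.zeros(d_h) for k in 'ifog'}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```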

While writing this, I made an important connection: the model learns to attend to the information in the network, selectively understanding what is useful to remember and what is not. Hold onto this idea; the concept of attention is the biggest hallmark of the Transformer architecture.

A diagram showing the architecture of a single LSTM cell
Diagram of LSTM by Tomohiro Oga

LSTMs were thought to be a solution to the vanishing gradient problem, and in essence, they were, becoming the de facto standard for sequential data processing. However, LSTMs had their own limitations and complexities.

For one, computational inefficiency: LSTMs are more computationally intensive than other recurrent architectures [4]. This inefficiency comes from the LSTM’s complex architecture, which has many more parameters to tune per cell. Another limitation is that these networks still process data sequentially. You may think this is odd, and you’re not wrong: we as humans process words sequentially, building on the context of what we have read previously to comprehend language. Computers, however, are incredibly good at parallelization; they can “read” an entire sequence at once. Having to process each word one at a time is the biggest bottleneck of LSTMs. They also diminish the vanishing gradient problem more than they solve it, as they still struggle with extremely long-term dependencies due to their sequential nature [5]. Variations of LSTMs, such as the Gated Recurrent Unit (GRU), were created in hopes of alleviating these issues, but they suffered from the same fundamental problems.

This all changed with the introduction of the Transformer.

“Attention is All You Need:” All Hail the Transformer

The Transformer architecture was introduced when researchers from the Google Brain team presented a paper called “Attention Is All You Need” at the 31st Conference on Neural Information Processing Systems (NeurIPS) [6]. It was the first model of its kind that could “compute representations of its input and output without using… RNNs” [6]. The authors introduce a mechanism known as self-attention, which relates different positions of a single sequence to compute a representation of that sequence. The linear algebra behind Scaled Dot-Product Attention and Multi-Head Attention is beyond the scope of this article, but the key takeaway is this: the Transformer is the first architecture of its kind to process sequential data in parallel. That is, it doesn’t have the sequential bottlenecks of its predecessors (RNNs, LSTMs, etc.). Because of this, the Transformer can look at the entire sequence at once and map long-term dependencies far more easily.
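
The heart of self-attention fits in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention with a single head, no masking, and no learned projections; the toy dimensions are just for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani
    et al. [6]. Every position attends to every other position in one
    matrix multiplication -- no sequential loop over time-steps."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights

# Toy self-attention: 6 tokens with 4-dimensional embeddings,
# using the same matrix as queries, keys, and values.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
output, attn = scaled_dot_product_attention(X, X, X)
print(output.shape, attn.shape)  # (6, 4) (6, 6)
```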

Attention Map of a Transformer by Mao et al.

Now it’s easy to see how groundbreaking this discovery was. By being able to “see” all the data at once, the Transformer architecture is robust and powerful in a way its predecessors never achieved. This allows the Transformer to process huge corpora of text quickly, efficiently, and, most importantly, by attending to certain words.

However, as with anything, Transformers are not without their drawbacks. While these models have outperformed all major benchmarks for NLP tasks, training them is incredibly computationally expensive and raises concerns over the high carbon footprint of doing so. Still, further research is steadily addressing these limitations and reducing computational cost.

This single discovery accelerated the machine learning field like nothing before it, transforming Natural Language Processing and enabling human-computer interaction on a level never seen before.

Now that is powerful.

References

[1] Chaubey, Aashish. “Sequential Data — and the Neural Network Conundrum!” Analytics Vidhya, 11 Feb. 2020, medium.com/analytics-vidhya/sequential-data-and-the-neural-network-conundrum-b2c005f8f865

[2] Rumelhart, David E., et al. “Learning Representations by Back-Propagating Errors.” Nature, vol. 323, Oct. 1986, pp. 533–536, https://doi.org/10.1038/323533a0.

[3] Choubey, Vijay. “Undestanding Recurrent Neural Network (RNN) and Long Short Term Memory(LSTM).” Analytics Vidhya, 27 July 2020, medium.com/analytics-vidhya/undestanding-recurrent-neural-network-rnn-and-long-short-term-memory-lstm-30bc1221e80d.

[4] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation, vol. 9, no. 1, Jan. 1997, pp. 1–42, https://doi.org/10.1162/neco.1997.9.1.1.

[5] Gers, Felix A., et al. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation, vol. 12, no. 10, Oct. 2000, pp. 2451–71, https://doi.org/10.1162/089976600300015015.

[6] Vaswani, Ashish, et al. “Attention Is All You Need.” ArXiv.org, 12 June 2017, arxiv.org/abs/1706.03762.

