Transformers — The Frontier Of ML Generalisation
Over the past few years, Machine Learning (ML) has begun converging from separate models across multiple disciplines (such as computer vision, speech recognition, and natural language processing (NLP)) into a single format: language (Center for Humane Technology, 2023). The architecture paving the way is the Transformer, an elegant Deep Learning (DL) model first introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017). This article acts as the roadmap for a series of posts that delve into the components of the Transformer to understand how they work together to enable deep language understanding and generative capabilities.
First, we explore earlier NLP models. Next, we build a general understanding of what Transformers are and discuss their architecture. Lastly, we provide a high-level overview of the architecture’s components, with links to dedicated articles for more information.
RNNs and LSTMs
Before the Transformer’s creation, Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks were the best architectures for solving sequence modelling and transduction problems. These models accept sequential data that follows a particular order (e.g., text or time-series data) and process it using a form of recurrence built from cells (blocks of operations). RNNs’ power comes from their ability to store a small amount of information from their inputs, which is then recycled back into the network with the next set of inputs, providing an element of short-term memory.
Unfortunately, RNNs suffer from exploding and vanishing gradients when processing long sequences of data, causing the network weights to become incredibly large or small, which prevents the model from learning and destabilises its behaviour.
LSTMs expand on the RNN architecture, mitigating these shortfalls by providing an extended memory. However, even with their more advanced capabilities, they are hard to train due to long gradient paths and unreliable for Transfer Learning, a process widely adopted with Convolutional Neural Networks (CNNs) in which a pre-trained model is fine-tuned to specialise on a small dataset. Furthermore, LSTMs and RNNs must process their input one item at a time, a requirement for effectively understanding the context of the input sequence, which drastically increases training time (Seattle Applied Deep Learning, 2019). It was clear that something new was needed.
For more information on RNNs and LSTMs, check out the visual and intuitive blog post by Christopher Olah.
Transformers
In 2017, Vaswani and his team presented the Transformer architecture. Initially, it was intended as a replacement for LSTMs and RNNs, specifically designed to improve how sequence modelling and transduction problems are tackled. However, over the past few years, research has shown that Transformers are beneficial for solving tasks across multiple disciplines (Hu et al., 2023; Dosovitskiy et al., 2021; Verma & Chafe, 2021). We highlight the Transformer architecture in Figure 2.1.
Now I know what you are thinking: ‘Wow, this looks complicated!’ So let’s simplify it a bit. The architecture consists of two blocks: an encoder and a decoder (left and right, respectively). If you look closely, the blocks are nearly identical; the decoder simply adds an extra multi-headed attention module that is masked.
At a high level, the encoder accepts a whole sequence of encoded tokens simultaneously and passes it into an attention mechanism. The mechanism produces a new set of embeddings that capture contextual information about the relationship between each item in the sequence. We combine this embedding with a residual connection (the initial input passed into the mechanism), normalise it, and pass it through a Position-Wise Feed-Forward Network (PFN). We then merge the result with another residual connection (associated with the PFN), normalise again, and return a sequence of hidden representations that store learned information about the entire input sequence.
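To make that flow concrete, here is a minimal PyTorch sketch of a single encoder layer. It is an illustrative assumption rather than the code from the accompanying repository: the class name, default dimensions, and the post-norm ordering (add the residual, then normalise) mirror the original paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """A minimal sketch of one Transformer encoder layer (post-norm, as in Vaswani et al., 2017)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pfn = nn.Sequential(  # Position-Wise Feed-Forward Network (PFN)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention over the whole sequence, then add the residual (the original input) and normalise
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # PFN applied to every position, then the second residual connection and normalisation
        return self.norm2(x + self.pfn(x))
```

Passing a batch of embeddings of shape (batch, sequence_length, d_model) through this layer returns hidden representations of the same shape.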
The decoder performs similar interactions but uses two attention mechanisms: the first masks future tokens in its own input, and the second learns the relationship between those masked embeddings and the encoder’s output. Again, residual connections wrap the attention modules, and the information passes through a PFN. After merging the PFN output with the final residual connection, we produce a set of output probabilities that predict the encoder’s input in the format of the decoder’s input, for example, translating an English sentence into German.
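Continuing the same illustrative sketch (again an assumption, not the repository’s implementation), a decoder layer adds the causal mask and the encoder-decoder attention step:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """A minimal sketch of one Transformer decoder layer: masked self-attention, cross-attention, PFN."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pfn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # Causal mask: True marks the future positions each token is not allowed to attend to
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        self_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self_out)
        # Cross-attention: queries come from the decoder, keys and values from the encoder's output
        cross_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + cross_out)
        # PFN, final residual connection, and normalisation
        return self.norm3(x + self.pfn(x))
```

In the full model, the decoder’s final hidden states then go through a linear projection and a softmax to produce the output probabilities over the vocabulary.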
Why It Matters
Typically, new architectures make only one or two modifications to an existing one. Vaswani et al. (2017) went above and beyond that, replacing the old architectures with an entirely new model and creating a new method for passing data into ML models.
To get a sense of how revolutionary their work is, let’s look at the critical aspects of the architecture:
- They removed the need for recurrent components (required for memory in RNNs and LSTMs) and replaced them with the Attention Mechanism.
- They made the Attention Mechanism as straightforward as possible, utilising fundamental Neural Network concepts and basic matrix operations (a short sketch of the core computation follows this list).
- They converted the input data into embeddings and added positional encoding, allowing the whole sequence to be processed at once rather than one item at a time.
- They used residual connections between modules (component blocks) to mitigate accuracy degradation when increasing model depth (stacking multiple encoders and decoders together).
- They added normalisation layers for gradient stability (mitigating vanishing and exploding gradients) and to reduce training time.
- They split the Attention Mechanism into multiple heads that can run in parallel, further reducing training time.
- They provided a strong baseline set of hyperparameters, allowing the model to work ‘straight out of the box’ without needing much fine-tuning.
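As promised above, here is the core of the Attention Mechanism, the scaled dot-product attention from the paper, softmax(QK^T / sqrt(d_k)) V, written as a short, self-contained PyTorch function (the function name and example shapes are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # similarity of every query with every key
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # hide masked (e.g., future) positions
    weights = torch.softmax(scores, dim=-1)                # attention weights sum to 1 for each query
    return weights @ v                                     # weighted sum of the value vectors

# Example: a batch of 2 sequences, 5 tokens each, 64 dimensions per token
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)                # shape: (2, 5, 64)
```

Multi-head attention simply runs several of these computations in parallel on lower-dimensional projections of Q, K, and V and concatenates the results.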
But it doesn’t stop there! The architecture makes Transfer Learning practical for NLP problems in a way that never effectively existed before. Additionally, the encoder and decoder blocks can be decoupled from each other to solve different tasks without changing the core functionality. For example, GPT (the model family behind ChatGPT) and BERT (Radford et al., 2018; Devlin et al., 2019), two popular architectures, use only decoder or encoder layers, respectively. For one architecture to have this much capability is absolutely mind-blowing!
I imagine now you are thinking, ‘Ryan, this is cool and all, but how does it work, and how do I implement it?’ First, we need to understand what each component is and how it operates. Let’s check that out in the next section.
Components
To keep this article short and provide a better digest of the architecture’s components (in terms of depth and practice), we split them into separate posts. Here’s the article series in list format! Select one of the links below to read about a corresponding component.
- Vector Embeddings — an exploratory look at how embeddings are created in preparation for Transformer architectures.
- Positional Encoding — (⭐) a detailed look at adding positional information into embeddings.
- Attention Mechanism — (⭐) a journey through the heart of the Transformer architecture, interpreting how it understands the context between tokens.
- Residual Learning — (⭐) we explore residual connections, learning why they are important and how they benefit Transformers.
- Layer Normalisation — (⭐) an inspection of a simple trick that improves the Transformer architecture.
The articles are designed to stand independently of one another but together paint a complete picture of how Transformers operate. Thanks for reading!
Interested in programming a Transformer yourself in PyTorch? Check out my Transformer GitHub repository that accompanies this article series. It contains the core code for the Transformer model and provides basic examples for working with the components.
Like My Content?
How about following me on Medium, subscribing to me to keep up to date with my latest content, upvoting this post, or leaving a comment? Every small piece of support means the world to me and helps me to continue my journey exploring the fascinating world of AI.
Interested in supporting me further? How about buying me a coffee?
Your support allows me to continue creating content and explore new ideas to share with you!
References
Center for Humane Technology, 2023. The A.I. Dilemma — March 9, 2023. [online] YouTube. Available from: https://www.youtube.com/watch?v=xoVJKj8lcNQ&t=853s.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv.org. Available from: https://arxiv.org/abs/1810.04805.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv.org. Available from: https://arxiv.org/abs/2010.11929.
Hu, S., Shen, L., Zhang, Y., Chen, Y., and Tao, D., 2023. On transforming reinforcement learning by transformer: The development trajectory. arXiv.org. Available from: https://arxiv.org/abs/2212.14164.
Olah, C., 2015. Understanding LSTM Networks. [online] Available from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I., 2018. Improving Language Understanding by Generative Pre-Training. Available from: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Seattle Applied Deep Learning, 2019. LSTM is dead. Long live Transformers! [online] YouTube. Available from: https://www.youtube.com/watch?v=S27pHKBEp30.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I., 2017. Attention is all you need. arXiv.org. Available from: https://arxiv.org/abs/1706.03762.
Verma, P., and Chafe, C., 2021. A generative model for raw audio using Transformer Architectures. arXiv.org. Available from: https://arxiv.org/abs/2106.16036.