Understanding Transformer Architecture: The Backbone of Modern NLP

Jack Harding · Published in Nerd For Tech · Jun 14, 2024

In recent years, the field of NLP has witnessed a remarkable transformation, largely driven by the advent of the transformer architecture. Admittedly, I’m a little late to the party, given the paper was published in 2017. Since then, transformers have revolutionized the way we approach language modelling, translation, summarization, and other NLP tasks. Unlike traditional sequential models such as RNNs and LSTMs, transformers rely on self-attention mechanisms to capture global dependencies in text, enabling parallelization and scalability previously unattainable. In this article, we delve into the intricacies of the transformer architecture, exploring its key components, capabilities, and recent advancements. From its foundational principles to its diverse applications and future directions, join us on a journey through the backbone of modern NLP.

Background

Before the emergence of the transformer architecture, natural language processing primarily relied on sequential models like RNNs and LSTMs, which I covered in a previous article. These models, while effective for capturing temporal dependencies in text, struggled with long-range dependencies and parallelization, leading to inefficiencies in training and performance bottlenecks. The need for models that could better handle long sequences and leverage parallel processing paved the way for the development of the transformer architecture. Transformers addressed these limitations by utilizing self-attention mechanisms, marking a significant leap forward in the evolution of NLP models.

The Architecture

Transformers’ predecessor is the RNN: at each step, the network takes the previous step’s hidden state as an input, and the LSTM variant adds gating so that information can be carried across longer stretches of a sentence. This sequential approach captures the closeness between neighbouring words but struggles to relate every word in the sentence to every other word.

Attention Map used for weighting in transformers: self-attention

The word to remember for transformers is “attention”, hence the title of the paper: “Attention Is All You Need”. The self-attention mechanism allows transformers to evaluate long-range relationships between all words simultaneously, regardless of their position in the sequence. RNNs process one word after another, which limits efficiency; transformers do this in parallel, increasing performance.

“Attention Is All You Need” paper, 2017

Encoder

One of the most common transformer configurations at the moment is the encoder-decoder LLM. The encoder takes a word, say “Hello”, and converts it into an array of numbers called a vector (or tensor). It is bidirectional, which means it uses not only the word’s own content but also the relationships that word has with every other word in the sequence; this is self-attention. Encoders are used for sequence classification (detecting sentiment in reviews) and masked language modelling (MLM), which fills in missing words in a sentence.
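As a minimal sketch of that masked-language-modelling behaviour (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; any encoder with a fill-mask head would do):

```python
# Masked language modelling with an encoder-only model (illustrative sketch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The encoder looks at the whole sentence (bidirectional self-attention)
# and scores candidate tokens for the [MASK] position.
for prediction in fill_mask("Hello, how are [MASK] today?"):
    print(prediction["token_str"], round(prediction["score"], 3))
```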

Decoder

Decoders differ from encoders in that they are unidirectional: each word can only attend to the words that came before it, not the ones that follow; this is masked self-attention. A decoder would take the word “Hello” from my previous example and generate the rest of the sentence one token at a time, rather than analysing the whole sequence at once.
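That unidirectional behaviour comes from a causal mask applied to the attention scores. A small, purely illustrative sketch in PyTorch of what the mask looks like for a five-token sequence:

```python
# A sketch of the causal (masked) self-attention pattern used by decoders.
# Position i may only attend to positions <= i, so future tokens are hidden.
import torch

seq_len = 5
# Lower-triangular matrix: 1 where attention is allowed, 0 where it is blocked.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)

# In practice the blocked positions are set to -inf before the softmax,
# so they receive zero attention weight.
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```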

Encoder-Decoder

A common example of combining the two is translation. Suppose an encoder-decoder model has been trained to translate from English to German, and our sentence is “Welcome to Berlin”. The encoder passes the decoder a contextual representation of the source sentence, and the decoder generates the translation word by word using that context. When the stop sequence is reached, such as “.”, the decoder returns “Willkommen in Berlin”!
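A hedged illustration of that flow (assuming the transformers library and the Helsinki-NLP/opus-mt-en-de checkpoint, an encoder-decoder translation model):

```python
# Encoder-decoder sketch: the encoder reads the English sentence, the decoder
# generates German tokens until it produces an end-of-sequence token.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
result = translator("Welcome to Berlin.")
print(result[0]["translation_text"])  # e.g. "Willkommen in Berlin."
```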

Self-Attention Mechanism

What makes transformers different from their RNN predecessors is how the hidden state is passed to the decoder. In an RNN, a single hidden state summarising the whole sentence is calculated and sent to the decoder. A transformer instead calculates a hidden state for each word and scores its importance using a softmax function, so every word contributes to the output in proportion to its relevance rather than being squeezed into one fixed-size summary.

Google Cloud offers a great video on this here.
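To make the weighting concrete, here is a small, self-contained sketch of scaled dot-product attention (illustrative only; the sizes are made up):

```python
# Scaled dot-product attention: each word's hidden state is compared with
# every other word's, and softmax turns the scores into weights, so every
# position contributes in proportion to its relevance.
import math
import torch

seq_len, d_model = 4, 8
torch.manual_seed(0)
Q = torch.randn(seq_len, d_model)  # queries
K = torch.randn(seq_len, d_model)  # keys
V = torch.randn(seq_len, d_model)  # values

scores = Q @ K.T / math.sqrt(d_model)    # similarity between all word pairs
weights = torch.softmax(scores, dim=-1)  # each row sums to 1: relative importance
output = weights @ V                     # weighted mix of value vectors
print(weights.shape, output.shape)       # (4, 4) and (4, 8)
```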

Multi-Head Attention

Multi-headed attention is a key feature of the transformer architecture that enhances its ability to capture complex relationships in the input data. This mechanism involves using multiple attention heads, allowing the model to focus on different parts of the sequence simultaneously. Each head independently transforms the input embeddings into three sets of vectors: queries (Q), keys (K), and values (V). The Q and K vectors are used to compute attention scores, which determine how much focus each word should have on others. The V vectors represent the actual data to be aggregated based on these scores.

By processing these Q, K, and V vectors through multiple heads, the model learns distinct representations and patterns in parallel, capturing a wide range of dependencies, both local and long-range. The outputs from all attention heads are then concatenated and passed through a linear layer, which combines the diverse information into a single, cohesive representation. This comprehensive approach significantly enhances the model’s performance and ability to understand intricate NLP tasks, making multi-headed attention a crucial component of the transformer’s success.
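A rough sketch of multi-head attention using PyTorch’s built-in module (the dimensions here are arbitrary, chosen only for illustration):

```python
# Multi-head attention sketch: the sequence is projected into Q, K and V and
# attended to by several heads in parallel; the head outputs are concatenated
# and mixed by a final linear layer (all handled inside the module).
import torch
import torch.nn as nn

d_model, num_heads, seq_len, batch = 64, 8, 10, 2
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                  batch_first=True)

x = torch.randn(batch, seq_len, d_model)   # toy input embeddings
# Self-attention: queries, keys and values all come from the same sequence.
output, attn_weights = attention(x, x, x)
print(output.shape)        # (2, 10, 64): one combined representation per token
print(attn_weights.shape)  # (2, 10, 10): attention weights averaged over heads
```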

Capabilities & Advantages

Parallelisation

One of the advantages of transformers is their ability to process data in parallel. Unlike RNNs, which process tokens sequentially, transformers use self-attention mechanisms to evaluate all tokens simultaneously. This parallelisation significantly speeds up training and inference, making transformers more efficient and scalable.

Transformers leverage matrix operations extensively, which align well with the architecture of modern GPUs. In transformers, the self-attention mechanism involves calculating attention scores using matrix multiplications of the input embeddings transformed into queries (Q), keys (K), and values (V). These operations can be efficiently executed in parallel on GPUs, which are designed to handle large-scale matrix multiplications and other linear algebra operations.

By transforming the input sequence into matrices and performing batched operations, transformers utilize the full computational power of GPUs. This not only accelerates the training process but also allows for handling larger datasets and more complex models. The ability to parallelize these operations across multiple GPUs further enhances the scalability and efficiency of transformers, making them particularly suitable for large-scale NLP tasks.
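A small sketch of why this maps so well onto GPUs: the attention scores for a whole batch of sequences reduce to a single batched matrix multiplication, and the same code runs unchanged on CPU or on a CUDA device if one is available (sizes are illustrative):

```python
# Batched attention scores as one matrix operation. On a GPU the large matmul
# is executed in parallel across the batch and sequence dimensions.
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, seq_len, d_model = 32, 128, 512

Q = torch.randn(batch, seq_len, d_model, device=device)
K = torch.randn(batch, seq_len, d_model, device=device)

# One batched matmul computes attention scores for every sequence at once:
# (32, 128, 512) @ (32, 512, 128) -> (32, 128, 128).
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
print(scores.shape, scores.device)
```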

Transfer Learning

Transformers have been instrumental in the rise of transfer learning in NLP. Pre-trained transformer models like BERT and GPT have set new benchmarks by being fine-tuned on specific tasks with relatively small amounts of data. This process involves training the transformer model on a large corpus of general text data, then fine-tuning it on task-specific datasets. This capability has democratized access to powerful NLP models, allowing researchers and developers with limited resources to leverage advanced language understanding and generation capabilities. Transfer learning with transformers not only improves performance on a wide array of tasks, but also significantly reduces the time and computational resources required to develop high-quality NLP models. I expanded on this in another article.
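As a hedged sketch of that fine-tuning step (assuming the transformers library, the bert-base-uncased checkpoint, and a toy two-example dataset standing in for real task-specific data):

```python
# Fine-tuning sketch: a pre-trained encoder gets a small classification head
# and is trained briefly on task-specific examples. Toy data for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

texts = ["Great film, I loved it.", "Terrible plot and worse acting."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few steps, just to show the training loop
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(outputs.loss.item())
```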

Challenges & Limitations

Computational Resources & Environmental Impact

The computational resources required to train LLMs are substantial, leading to significant environmental concerns. Training state-of-the-art models like GPT-3 or BERT involves billions of parameters and necessitates extensive use of high-performance GPUs over extended periods, consuming vast amounts of electricity and contributing to a considerable carbon footprint. According to a recent article by The Guardian, the energy consumption for training these models is not only immense, but also increasing as models grow larger and more complex. This energy use results in significant carbon emissions, which exacerbate climate change.

the carbon footprint of training a single early large language model (LLM) such as GPT-2 at about 300,000kg of CO2 emissions — the equivalent of 125 round-trip flights between New York and Beijing

As LLMs become more prevalent and their use expands, the demand for computational resources continues to grow, raising urgent questions about sustainability. To mitigate these impacts, the research community is exploring more efficient model architectures, improvements in hardware, and the adoption of renewable energy sources for data centres.

Interpretability

Interpretability remains a significant challenge in the deployment of large language models (LLMs) like transformers. While these models achieve remarkable performance across various NLP tasks, their internal decision-making processes are often opaque. A recent study by Anthropic, detailed in the article “Scaling Monosemanticity,” explores methods to enhance interpretability in transformer models. The researchers focused on identifying “monosemantic” neurons — neurons that respond to a single, clearly defined concept. By scaling models to larger sizes, they found that these monosemantic neurons became more prevalent, suggesting that increasing model size can lead to more interpretable components. However, achieving full transparency in how LLMs make decisions remains an ongoing research endeavour, crucial for ensuring trust and reliability in AI applications. This will become increasingly vital from a regulation standpoint, with the EU’s impending AI Act, which will require high-risk AI systems to explain how they reached a given decision.

Potential Bias

Historically, computer vision systems have grappled with bias, often stemming from skewed or limited training data. For example, facial recognition algorithms have exhibited higher error rates for certain demographic groups, reflecting the biases present in the datasets used for training. Similarly, object detection models have shown disparities in performance across different contexts, reinforcing the need for more diverse and representative training data.

In NLP, bias manifests in nuanced ways, primarily through the language used in training datasets and the cultural contexts embedded within it. Biases in language, such as gender stereotypes or racial prejudices, can inadvertently propagate through NLP models, leading to biased outputs and reinforcing existing societal inequalities. For instance, language models trained on text from the internet may inadvertently learn and perpetuate biased language patterns present in online discourse.

However, unlike computer vision, where bias often manifests in visual representations and object categorizations, bias in NLP is deeply rooted in the semantics and syntax of language itself. This inherent complexity presents unique challenges in identifying and mitigating bias effectively. Furthermore, NLP models often rely on pre-trained embeddings and language representations, which can encode biases present in the training data and propagate them to downstream tasks.

Conclusion

The rapid advancements in NLP brought about by transformers represent a remarkable feat of technical innovation. These models have revolutionised the field, achieving unprecedented levels of performance across a wide range of tasks in a remarkably short time. However, beneath the surface of this technological wonder lies a deep and systemic issue concerning the data and methods used to train these models. While significant scientific rigour has been applied to optimising the metrics and performance of transformers, a matching commitment to equity and fairness has often been overlooked or underprioritised. Over the past 18 months, there has been a surge of excitement and enthusiasm from both technologists and the mainstream about the transformative potential of NLP models. However, as we navigate the Gartner hype cycle, it’s becoming increasingly evident that the trajectory of enthusiasm may be leading us towards the “Trough of Disillusionment.”

Gartner hype cycle

As the hype subsides and the reality of these challenges sets in, it’s imperative that the NLP community confronts these issues head-on and works towards developing more equitable and responsible approaches to AI development. Failure to address these concerns risks undermining the trust and credibility of NLP technologies and could impede their broader societal impact.
