Transformer: Why Don’t We Talk About BERT Anymore, and Even Ignore Model Architectures?

pamperherself
8 min read · Aug 1, 2024


This Week

This week I read a batch of papers, starting by chance with the 2019 BERT paper and gradually expanding to this month’s discussions of BERT and model architectures by a former Google scientist based in Singapore.

Conveniently, I’ll use this topic to organize the three major model architectures: encoder-decoder, decoder-only, and encoder-only. I’ll also cover the two main training objectives/mechanisms: masked LM / denoising / cloze infilling, which help a language model learn context and become more robust, and the ubiquitous causal self-attention / autoregressive mechanism.

BERT was popular around 2018, and it’s now 2024, so I have only a hazy impression of how AI has changed over these six years. That feels different from my usual focus on current, concrete open-source projects and new products; it gives a sense of historical flow.

01 Why Isn’t BERT Popular Anymore?

The main reason people no longer use BERT is that it is task-specific: it shines when fine-tuned for a single task but struggles with broad, multi-faceted problems. Handling multiple tasks at once requires a complex structure with many task-specific classification heads, which makes BERT models complicated to implement and use.

In research from 2018 to 2021, there was a clear shift from single-task models to multi-task models, calling for general models that can solve a variety of tasks.

In 2018 and 2019, people were still exploring the effects of various architectures, including ELMo, GPT, BERT, and XLNet. By 2020 the advantages of the GPT approach began to emerge, and by the end of 2022 GPT-3.5 became popular. The decoder-only architecture used by GPT became the standard for large models.

The encoder-only architecture of the BERT series can only solve specific tasks: its output is not a generated sequence but a vector encoding of the input sequence.
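To make this concrete, here is a minimal sketch using the Hugging Face transformers library (my own illustration, not from the original post): the encoder returns one vector per token, and a separate task-specific head has to be added on top to get an actual prediction. The sentiment classifier head here is hypothetical and untrained.

```python
# Minimal sketch: an encoder-only model yields vectors, not generated text,
# so a task-specific classification head must be bolted on top.
import torch
from transformers import BertModel, BertTokenizer  # assumes `transformers` is installed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]   # [CLS] embedding, shape (1, 768)

# Hypothetical 2-way sentiment head; in practice this is what gets fine-tuned.
classifier_head = torch.nn.Linear(768, 2)
logits = classifier_head(cls_vector)           # untrained, illustration only
```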

02 The Three Major Transformer Architectures: Encoder-Decoder, Decoder Only, Encoder Only

When the Transformer was first proposed, the encoder-decoder structure was used for machine translation tasks, such as translating between English, German, and French. The encoder-decoder split fits this task, with cross-attention passing information between the two halves, which is why the classic Transformer paper “Attention Is All You Need” introduces an encoder-decoder model.

In subsequent research, the decoder-only structure was found to generate content more efficiently, as the input and output are both handled within the decoder, sharing parameters and simplifying the framework.

The cross-attention required in the encoder-decoder can be handled by self-attention in the decoder-only architecture.

In the diagram, the causal self-attention in the decoder is the same idea as the masked multi-head attention described in the decoder section of Beginner’s Guide to Transformers: Understanding the Basic Framework. The mask prevents the model from seeing future content, so attention stays focused on what has already been generated.

Assuming we have an input sequence “The cat sat”:

  • When calculating attention for “The,” the model can only see “The.”
  • When calculating attention for “cat,” the model can see “The cat.”
  • When calculating attention for “sat,” the model can see “The cat sat.”

This left-to-right concept is similar to autoregressive models, where each position can only see current and previous values, not future values. Models like GPT and LLaMA generate content in this way, functioning as autoregressive models.

The two terms come from different contexts but describe the same idea: AR (autoregressive) means generating a sequence from left to right, while causal self-attention is the attention mechanism that masks out future content.
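As a concrete illustration, here is a minimal PyTorch sketch of my own (toy random vectors stand in for learned projections): causal self-attention is just ordinary scaled dot-product attention with a lower-triangular mask, which reproduces the “The cat sat” visibility pattern above.

```python
import torch
import torch.nn.functional as F

tokens = ["The", "cat", "sat"]
seq_len, d = len(tokens), 8

# Lower-triangular mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# [[ True, False, False],
#  [ True,  True, False],
#  [ True,  True,  True]]

# Toy query/key/value vectors standing in for learned projections.
q, k, v = torch.randn(seq_len, d), torch.randn(seq_len, d), torch.randn(seq_len, d)

scores = (q @ k.T) / d ** 0.5
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future tokens
attn = F.softmax(scores, dim=-1)                          # each row sums to 1 over visible tokens
out = attn @ v                                            # contextualized representations
```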

Differences between these architectures:

  • Encoder-decoder: the input and output sides have independent parameters; the encoder is not constrained by autoregression and learns context bidirectionally, and cross-attention is needed to pass its hidden representations to the decoder. The decoder generates content by referencing both the previously generated tokens and the encoder’s hidden representations.

PrefixLM applies this mechanism of referencing additional content to the decoder-only architecture: a prefix is added as extra reference material, which is why it is sometimes considered non-causal (see the mask sketch after this comparison).

  • Decoder-only shares input-output parameters, generating content unidirectionally from left to right.

In the comparison diagram, the encoder-decoder structure has bidirectional self-attention, learning context from both directions, unlike causal self-attention, which only learns from the preceding content.
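To visualize the difference, here is a minimal sketch with made-up sizes (a 6-token sequence whose first 3 tokens act as the prefix); the three attention patterns discussed above reduce to three boolean masks.

```python
import torch

seq_len, prefix_len = 6, 3

# Encoder-style bidirectional attention: every position sees every position.
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-only causal attention: each position sees itself and earlier positions.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# PrefixLM: the prefix attends bidirectionally within itself, the rest stays
# causal, which is why PrefixLM is sometimes described as non-causal.
prefix_lm = causal.clone()
prefix_lm[:prefix_len, :prefix_len] = True

print(prefix_lm.int())
```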

These mechanisms are suitable for different downstream tasks:

  • Bidirectional self-attention excels at understanding word meanings in different sentences and complex sentences, suitable for semantic analysis and sentiment analysis.
  • Causal self-attention is particularly suited for generation tasks like text generation, automatic programming, and dialogue systems, as it generates text sequentially.

Bidirectional mechanisms are mostly used for specific tasks with small-scale data, and their benefit fades for models with large-scale parameters. Familiar LLMs like GPT and LLaMA don’t use bidirectional mechanisms; they are unidirectional, training on the entire input and generating content from left to right based on the learned weights. I explain bidirectional mechanisms here mainly for historical context, as these explorations are somewhat outdated now.

Research on bidirectional models like BERT and CM3 hasn’t seen significant practical applications, remaining in the scientific research stage. However, understanding these mechanisms remains important for understanding AI’s rapid development and historical context.

Encoder-only models like BERT are trained with a masked mechanism (masked LM, or MLM), similar to cloze tests in English: parts of a sentence are masked and the model is trained to infer the masked words from the surrounding context. (CM3, discussed below, uses a causal variant of this masking.)

Models are pretrained by hiding parts of the input: either predicting the next word sequentially from left to right, or masking words within the text and predicting them.

This masking differs from the masked multi-head attention and causal self-attention described earlier, which mask future words.

Masked LM struggles with generation tasks: it cannot generate text continuously, and it needs additional mechanisms just to fill in the masked positions.
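Here is a minimal sketch of what MLM training data looks like (the token list, the 15% rate, and the plain-Python masking are illustrative assumptions, not BERT’s exact recipe, which also swaps in random tokens some of the time):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
MASK, MASK_RATE = "[MASK]", 0.15

inputs, labels = [], []
for tok in tokens:
    if random.random() < MASK_RATE:
        inputs.append(MASK)   # hide the token in the input
        labels.append(tok)    # the model must recover it from context on both sides
    else:
        inputs.append(tok)
        labels.append(None)   # position not scored by the MLM loss

# Output varies per run; roughly 15% of positions end up masked, e.g.
# inputs -> ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
# labels -> [None, None, 'sat', None, None, None]
print(inputs, labels)
```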

The following image is from the 2022 paper CM3: A Causal Masked Multimodal Model of the Internet:

  • Causal Masked LM (first row): masks parts of the sentence and predicts the masked words while still generating from left to right.
  • Masked LM (second row): randomly masks words in the sentence and predicts them from context, without emphasizing left-to-right generation.
  • Language Model (third row): predicts the entire sentence word by word from left to right without masking; typical language models use causal self-attention and the autoregressive mechanism.

Another way to learn context is ELMo’s approach: train one LSTM from left to right and another from right to left, then concatenate their representations to capture context.
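A minimal sketch of that idea in PyTorch (toy embeddings and sizes of my own; real ELMo also uses character-level inputs and multiple layers): run one LSTM forward, one backward over the flipped sequence, and concatenate the per-token states.

```python
import torch
import torch.nn as nn

seq_len, emb_dim, hid_dim = 5, 16, 32
x = torch.randn(1, seq_len, emb_dim)        # toy embeddings for one sentence

fwd = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # reads left to right
bwd = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # reads right to left

h_fwd, _ = fwd(x)
h_bwd, _ = bwd(torch.flip(x, dims=[1]))
h_bwd = torch.flip(h_bwd, dims=[1])         # re-align to the original token order

context = torch.cat([h_fwd, h_bwd], dim=-1) # (1, seq_len, 2 * hid_dim) per-token context
```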

The masked mechanism not only trains models to understand context but also improves robustness: masking acts as added noise, so the model is effectively trained to denoise.

Encoder-decoder models like T5 use cloze-style infilling (a masked-LM objective) as denoising training, which improves T5’s generalization ability and its performance on various downstream tasks.
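Here is a minimal sketch of T5-style span corruption (the spans are hard-coded for illustration; T5 samples them randomly): contiguous spans are replaced by sentinel tokens in the input, and the target asks the model to regenerate only those spans.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
spans_to_drop = [(1, 3), (4, 5)]            # drop "cat sat" and "the"

inp, tgt, sentinel, cursor = [], [], 0, 0
for start, end in spans_to_drop:
    inp += tokens[cursor:start] + [f"<extra_id_{sentinel}>"]
    tgt += [f"<extra_id_{sentinel}>"] + tokens[start:end]
    cursor, sentinel = end, sentinel + 1
inp += tokens[cursor:]
tgt += [f"<extra_id_{sentinel}>"]           # closing sentinel

print(" ".join(inp))  # The <extra_id_0> on <extra_id_1> mat
print(" ".join(tgt))  # <extra_id_0> cat sat <extra_id_1> the <extra_id_2>
```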

Some believe cloze infilling originated in code-focused model training, since filling in a blank in the middle of a file is exactly what coding applications need:

On a side note, infilling could have originated from the world of code LLMs, where filling in the blank was more of a feature desired by coding applications.

03 Insights on Model Architectures

The following image is from around 2018, when online discussions focused mainly on BERT. With no prompt engineering, RAG, or agents yet, the discussions were more general.

In the past year, the discussion has shifted to data > architecture. An OpenAI employee shared on his WordPress blog that, in his experience training models, the same data, given enough training time, yields similar results whether the image model is GAN-based or Transformer-based; the outputs end up resembling the characteristics of the training data.

He believes the differences between GPT, Bard, and Claude are more about training data than model architecture or hyperparameters.

Former Google scientist Yi Tay suggests that model architecture seems less important now because we are standing on the shoulders of giants: many frameworks were explored and failed in past years, and that exploration is what produced today’s mature, effective architectures.

Tiny tweaks to transformers may not matter as much as data/compute. Sure. But it’s also not very accurate to say “architecture research” does not matter and “makes no difference.” I hear this a lot to justify not innovating at the architecture level.

The truth is the community stands on the shoulder of giants of all the arch research that has been done to push the transformer to this state today.

The chosen Transformer framework is excellent and mature enough that slight adjustments don’t significantly improve performance. However, this doesn’t mean we should stop innovating in architecture.

Another OpenAI scientist, Hyung Won Chung, listed a hierarchy of the key factors affecting large models, in which model architecture is less urgent than compute and data:

  1. Compute
  2. Data
  3. Learning objective/algorithmic development
  4. Architecture

His view on AI development highlights the progress of decoder-only Transformer models like GPT over encoder-decoder models.

Adding the optimal inductive bias for the current level of compute, data, algorithmic development, and architecture is critical, but these inductive biases are shortcuts that hinder further scaling later on, so they must be removed in time. As a community, we do the former well but not the latter.

In AI development, while adding frameworks to solve current issues is essential, it’s equally important to remove these extra frameworks at the right time. For example, GPT removed the encoder entirely.

04

Last week, in the Beginner’s Guide to Transformers: Understanding the Basic Framework, I shared my notes on AI papers. This week’s notes have doubled, but I still feel like I’m just getting started, with over 20 more papers to read. As my notes continue to grow, I hope to gain a deeper understanding of the current developments in AI and learn many new perspectives.

by: pamperherself
