Encoder-Decoder vs. Decoder-Only

Minki Jung
May 19, 2024


What is the difference between an auto-regressive transformer and a sequence-to-sequence transformer? The straightforward answer is that the auto-regressive one only features a decoder stack (dec-only), while the sequence-to-sequence one includes both an encoder and a decoder (enc-dec).
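To make the distinction concrete, here is a minimal sketch using the Hugging Face transformers library. The checkpoints gpt2 and t5-small are just small, convenient examples of each family, not models discussed later in this post.

```python
# Decoder-only vs. encoder-decoder in a few lines (Hugging Face transformers).
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Auto-regressive (dec-only): a single stack that continues the prompt token by token.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = gpt_tok("Bats flew into the", return_tensors="pt")
print(gpt_tok.decode(gpt.generate(**prompt, max_new_tokens=5)[0]))

# Sequence-to-sequence (enc-dec): the encoder reads the whole input bi-directionally,
# then the decoder generates the target conditioned on the encoder's output.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
src = t5_tok("translate English to French: Bats flew into the cave.", return_tensors="pt")
print(t5_tok.decode(t5.generate(**src)[0], skip_special_tokens=True))
```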

However, the implications of this architectural difference on their actual performance were unclear to me. For instance, the BloombergGPT paper mentions, “Named-Entity Recognition (NER) is an information extraction task, and a better fit for encoder-decoder or encoder-only architectures. The generative nature of LLMs does not confer an advantage for NER.” Unfortunately, it didn’t go deeper into the underlying mechanics of why this is the case.

Moreover, there has been increased interest in enc-dec architectures recently. For instance, Reka’s Core model is based on an enc-dec architecture rather than a decoder-only one, and it outperformed Claude 3 Opus in third-party human evaluations.

In this blog post, I will dive deep into the differences between enc-dec and dec-only models, providing insights on when to use each model. I’m assuming that you are familiar with the transformer architecture. If not, check out these wonderful resources: 3Blue1Brown, Karpathy, visualization.

How did the dec-only approach become popular?

The dec-only architecture gained popularity primarily due to the success of the GPT model series developed by OpenAI. Starting in 2018, OpenAI released GPT-1, GPT-2, and GPT-3 in consecutive years, impressing many researchers with their remarkable language generation capabilities.

Unfortunately, OpenAI didn’t provide a detailed explanation for choosing a dec-only model in their papers. However, several apparent reasons exist and have been discussed in various resources.

1. It is sufficient.

OpenAI’s mission is to build AGI. As such, they have shown more interest in general models than in task-specific ones. Predicting the next token is currently the most general task in NLP, which is why OpenAI chose it as a pre-training objective, and a decoder stack was sufficient for this task. (A minimal sketch of this objective follows right after this list.)

2. It is simpler.

By focusing on the decoder, the model becomes more streamlined and easier to train. I will further explain how a dec-only approach simplifies the process below.

3. It is omnivorous.

The decoder-only architecture is pre-trained in an unsupervised manner, predicting the next token. This makes its training data more readily available than that of the classic enc-dec setup, which was typically trained on pairs of text (e.g., a sentence and its translation).

4. It is scalable.

Dec-only models can reach greater depths as they don’t have the bottleneck often found in enc-dec models between the encoder and decoder stacks. I’ll elaborate on the bottleneck below.

Dec-only models are also scalable at inference time. Because preceding tokens cannot attend to future tokens, the attention keys and values computed for them can be cached and reused; generating a new token only requires computing attention for that token. An enc-dec model, by contrast, must re-run its bi-directional encoder over the entire input whenever the input changes. This makes dec-only models more efficient and scalable.

5. Bi-directionality doesn’t matter at a sufficient scale.

A research scientist at OpenAI recently said that “at sufficient scale, bi-directionality doesn’t seem to matter much.” He didn’t give a reason and added that this observation is “highly anecdotal.”

6. It just worked so well!

As researchers scaled up the dec-only model, they observed continuous improvements in its performance. This positive result encouraged them to continue training dec-only models, which became a standard approach for many researchers.
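As promised above, here is a toy PyTorch sketch of that next-token prediction objective. The embedding plus a linear head stand in for a full decoder stack (a real model would apply masked self-attention in between); any plain-text corpus provides the targets for free, which is what makes the objective “omnivorous”.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 100, 8, 32
tokens = torch.randint(0, vocab_size, (1, seq_len))   # stand-in for a tokenized sentence

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)        # toy stand-in for the decoder stack

logits = lm_head(embed(tokens))                        # (1, seq_len, vocab_size)

# Predict token t+1 from positions <= t: inputs are tokens[:, :-1], targets are tokens[:, 1:].
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
print(loss.item())
```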

Two main issues with the enc-dec architecture

1. An information bottleneck occurs as the layers get deeper

In the enc-dec architecture, the encoder processes the input tokens to produce a hidden representation. This process occurs layer by layer, moving from low to high levels. The decoder then takes the final layer (the highest level) of representations from the encoder as its input.

The problem arises as the encoder gets deeper. Adding more layers increases the semantic-level gap between the encoder’s final layer and the decoder’s first layer, creating a potential information bottleneck. Consequently, the decoder’s lower layers might not accurately extract information from such high-level representations.
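To make the wiring concrete, here is a schematic PyTorch sketch (a toy illustration, not any particular model’s code). However deep the encoder is, every decoder layer cross-attends to the same tensor: the encoder’s final hidden states.

```python
import torch
import torch.nn as nn

d_model, n_heads, depth = 64, 4, 6
enc_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(depth)])
dec_layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(depth)])

src = torch.randn(1, 10, d_model)    # encoder input embeddings
tgt = torch.randn(1, 5, d_model)     # decoder input embeddings

h = src
for enc in enc_layers:
    h = enc(h)                       # h ends up at the highest semantic level
memory = h                           # only this final layer is handed to the decoder

y = tgt
for dec in dec_layers:
    y = dec(y, memory)               # every decoder layer, low or high, sees only `memory`
```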

Hyung Won Chung recently gave a talk at Stanford CS25 where he said “In my experience this bottleneck didn’t really make any difference because my experience is limited to 25 layers of encoder of T5. But what if we have 10x or 1000x more layers? I’m not really comfortable with that. I think this is unnecessary design that maybe we need to revisit.”

In 2018, Tianyu He et al. attempted to address this bottleneck by proposing layer-wise connections between the encoder and decoder. They asked, in effect: why should the low-level representation of a target token be based on the highest-level representations of the source tokens? Why not let each decoder layer attend to its corresponding encoder layer? They demonstrated that this layer-wise approach improved the encoder-decoder model’s performance.

However, it seems this technique hasn’t been explored further. This could be because, as Chung noted, the bottleneck doesn’t make much difference when the encoder isn’t very deep, so the gains were limited. Additionally, the layer-wise approach adds complexity and requires extra engineering to make it efficient.
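For comparison, here is the same toy setup rewired layer-wise in the spirit of He et al.: decoder layer i cross-attends to the output of encoder layer i instead of the final layer. This is an illustration of the general idea, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, depth = 64, 4, 6
enc_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(depth)])
dec_layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
                            for _ in range(depth)])
src, tgt = torch.randn(1, 10, d_model), torch.randn(1, 5, d_model)

enc_states, h = [], src
for enc in enc_layers:
    h = enc(h)
    enc_states.append(h)             # keep every intermediate representation

y = tgt
for dec, memory_i in zip(dec_layers, enc_states):
    y = dec(y, memory_i)             # low decoder layers read low encoder layers, and so on
```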

2. Inefficient for multi-turn chats

The second issue with enc-dec architecture is its unsuitability for multi-turn chat applications. Chung highlighted this issue in his talk, stating, “Bi-directionality presents engineering challenges for multi-turn chat applications because the new input must be encoded again at every turn.”

To illustrate, consider an example. A user asks question X1 and the AI responds with answer Y1. The user then asks question X2. In a decoder-only model, it’s possible to use the cached attention values of X1 from the first interaction. This is because in these models, tokens can only attend to prior tokens. Consequently, the tokens of X1 won’t attend to the tokens of Y1 or X2.

However, in enc-dec models, attention values need to be recalculated when Y1 and X2 are added to the input. This is because tokens of X1 can now attend to tokens of both Y1 and X2.
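Here is a sketch of that caching argument using the past_key_values mechanism in Hugging Face transformers. gpt2 is just a small stand-in model, and the strings mirror the X1 / Y1 / X2 turns above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

x1 = tok("User: What do bats eat?\n", return_tensors="pt").input_ids
y1_x2 = tok("AI: Mostly insects.\nUser: And where do they live?\n", return_tensors="pt").input_ids

# Turn 1: run X1 once and keep its keys/values. Under a causal mask these never
# change when later tokens (Y1, X2) arrive.
with torch.no_grad():
    out1 = model(x1, use_cache=True)

# Turn 2: feed only the new tokens; X1's cached keys/values are reused as-is.
with torch.no_grad():
    out2 = model(y1_x2, past_key_values=out1.past_key_values, use_cache=True)

# An enc-dec model has no such shortcut on the input side: its bi-directional encoder
# must re-encode X1 together with Y1 and X2, because X1 may now attend to them.
```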

Advantages of enc-dec architecture

Despite these challenges, enc-dec models certainly have their advantages.

1. No missing attentions

In 2019, Colin Raffel, Noam Shazeer, and their colleagues explored the differences between encoder-decoder and decoder-only models in the paper proposing the T5 models. (Refer to Section 3.2 of the paper for more details.)

They broke down the two architectures and pointed out the limitations of the dec-only model.

Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019. (Language model is another name for a dec-only model.)

In this diagram, let’s focus on how the y1 output of the language model receives information from previous layers, as indicated by the red lines.

Red lines added by me

The issue is that X1 doesn’t attend to any other tokens, only to itself. According to the authors of the paper, this information is “unnecessarily limited” compared to the enc-dec architecture.

For instance, consider the English sentence “Bats flew into the cave” being translated into another language. Here, the word “bat” could refer to an animal or a racket. The intended meaning is determined by the context provided by the surrounding words, in this case “flew into the cave”. However, under a uni-directional causal mask, the initial word “Bats” attends only to itself, so the information it passes upward doesn’t capture the intended meaning.

Note that X1 attending to X2 is different from X2 attending to X1, since different Q and K matrices are used. So the information of X1 passed through the second-leftmost block of the middle layer in the diagram would not necessarily contain the intended meaning of the animal “bats”.
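A tiny sketch makes this visible. Treating each word of “Bats flew into the cave” as one token, the causal mask’s first row shows that “Bats” attends only to itself; in an encoder, the mask would be all ones.

```python
import torch

tokens = ["Bats", "flew", "into", "the", "cave"]
causal_mask = torch.tril(torch.ones(len(tokens), len(tokens), dtype=torch.int))
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],   <- "Bats" sees only "Bats"
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])  <- "cave" sees the whole sentence
```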

However, it appears that scaling up the model can resolve this limitation. I haven’t found a plausible explanation for how a decoder-only model can still accurately comprehend the intended meaning of the word “bat”. My theory is that the embeddings of the other words, “flew into the cave”, change when they attend to the word “Bats”, so that their semantics incorporate the information of “Bats”. If my theory holds, this effect would indirectly grant the dec-only model a form of bi-directionality.

However, this indirect bi-directionality may not perform as well as the true bi-directionality of encoder-decoder models. Named Entity Recognition (NER) is a good example. Take the sentence, “summer went to the theater with tom to watch a movie.” (Here “summer” is a woman’s name that is intentionally not capitalized.)

A decoder-only model may not correctly identify “summer” as a named entity. The model can only infer whether “summer” refers to a person or a season from the surrounding tokens, such as the verbs “went” and “watch”, which are not usually associated with a season as their subject. If the embedding of “summer”, signaling “I am a season, not a named entity”, is stronger than the cues from “went” and “watch”, which suggest that it was “summer”, a person, who went and watched something, then the model will get the task wrong.

On the other hand, an encoder-decoder model doesn’t have this issue. The token “summer”, when it attends to “went” and “watch”, shifts its embedding toward personhood, directly indicating to the model that it is a named entity, not a season.

2. When target and input are inherently different

Utilizing two distinct stacks, an encoder and a decoder, can prove advantageous when the input and output targets differ fundamentally. For instance, in English-to-French translation, having two stacks, an encoder for English and a decoder for French, is sensible as the two languages exhibit different regularities.

Additionally, if the target sequences are significantly shorter than the input sequences, the two-stack approach proves to be better. Chung made this point in his talk: “We spend 99% of the time optimizing the PaLM (dec-only), and at the end we just spent like three days on T5. But the performance gain was a lot higher on T5 (enc-dec). (…) The encoder-decoder architecture has an assumption that input and output will be very different, and that structure really shined here.”
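As a small illustration of the “input and output are very different” setting, here is a summarization-style call with t5-small from Hugging Face, used purely as a convenient enc-dec checkpoint: a long, bi-directionally encoded input mapped to a much shorter decoded output.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

long_input = "summarize: " + "The encoder reads the whole document at once. " * 20
inputs = tok(long_input, return_tensors="pt", truncation=True)

# The target is far shorter than the input; the two stacks can specialize accordingly.
summary_ids = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```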

The enc-dec approach might be more suitable for multi-modal models for similar reasons. (This is also why Reka used an enc-dec architecture for their multi-modal model. Here is the part of the interview where Yi Tay, who led the Reka team, acknowledges that multimodality is relevant to the choice of an enc-dec architecture; unfortunately, he refrains from going into more detail.) I haven’t researched this topic yet, but I plan to dive deeper in future posts.

Conclusion

The encoder-decoder architecture, which was prominent in 2018, has largely been overtaken by scaled-up decoder-only models. Encoder-decoder models also present engineering challenges, such as the information bottleneck in deep models and the lack of cacheability in multi-turn chat applications. As a result, they are no longer necessary in some cases.

Nevertheless, there remains a niche for encoder-decoder models. When the input and output targets are significantly different, the encoder-decoder model may have an advantage thanks to its separate parameter sets for the input and the output. For tasks that benefit from bi-directionality, like NER, encoder-decoder models may perform better even at a smaller parameter count.

References

Tianyu He et al. Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation. 2018. link

Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019. link

Hyung Won Chung et al. Scaling Instruction-Finetuned Language Models. 2022. link

Hyung Won Chung’s talk at Stanford CS25

About me

Hello, I’m Minki Jung, currently living in Montreal, Canada. I pursued my undergraduate degree in physics and worked as a specialized science journalist in Korea before moving to Canada.

Initially, I embarked on a full-stack web development program at a college and gained a year of experience as a full-stack developer. In July 2023, I started studying ML/AI, beginning with the fast.ai courses and then a year-long AI program at a college. There, I led a computer vision project detecting pests in sticky-trap images with Agriculture Canada.

Currently, I’m serving as an AI developer for the Canadian government in a co-op role, which is set to conclude at the end of August. I’m interested in joining a startup that is building a customer-facing product using LLMs. If my skills align with your company’s needs, please feel free to contact me through any of the channels listed below.

Resume: link

GitHub: https://github.com/minki-j

Linkedin: https://www.linkedin.com/in/minkijung/

Email: qmsoqm2@gmail.com
