A Comparative Analysis of LLMs like BERT, BART, and T5

Zain ul Abideen
6 min read · Jun 26, 2023


Exploring Language Models

Introduction

In this blog post, I will be discussing large language models like BERT, BART, and T5, whose development was among the major advances in the field up to 2020. BERT and T5 were developed by Google, while BART was developed by Facebook AI (now Meta). I will cover the models in order of their release. In the previous blog post, Autoregressive Models for Natural Language Processing, I discussed the autoregressive nature of generative pre-trained transformers; here I will also compare how these models differ from purely autoregressive models, so if you haven’t checked out the previous post, go check it out. The BERT paper was released in 2018, BART in 2019, and T5 in late 2019 (with the journal version published in 2020). I will cover the papers in that order.

Bidirectional Encoder Representations from Transformers (BERT)

The BERT model is based on a multi-layer bidirectional Transformer encoder. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a range of tasks. BERT uses a masked language modeling (MLM) pre-training objective to overcome the unidirectionality constraint, and it is additionally pre-trained with a next sentence prediction (NSP) objective.

BERT input representation

In contrast to the original Transformer, BERT’s input representation is the sum of token embeddings, segment embeddings, and learned position embeddings. Two special tokens are also added: a classification token ([CLS]) at the start of every sequence and a separator token ([SEP]) between sentences. Token embeddings are WordPiece embeddings with a vocabulary of 30,000 tokens. The datasets used during pre-training are BookCorpus and English Wikipedia.
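To make the input representation concrete, here is a minimal PyTorch sketch of how the three embeddings are summed. The layer names and default sizes (vocab_size, hidden_size, and so on) are illustrative assumptions rather than the exact variables of the BERT codebase.

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (illustrative sketch)."""
    def __init__(self, vocab_size=30000, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)         # WordPiece tokens
        self.segment_embeddings = nn.Embedding(type_vocab_size, hidden_size)  # sentence A / B
        self.position_embeddings = nn.Embedding(max_len, hidden_size)         # learned positions
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = (self.token_embeddings(token_ids)
             + self.segment_embeddings(segment_ids)
             + self.position_embeddings(positions))
        return self.dropout(self.layer_norm(x))

# Example: one sequence of 8 tokens, the first 5 from sentence A and the last 3 from sentence B
emb = BertEmbeddings()
token_ids = torch.randint(0, 30000, (1, 8))
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]])
print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 8, 768])
```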

Masked Language Model

In MLM pre-training, 15% of the tokens in the input sequence are selected at random. Of these, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. For each selected position, the model outputs a distribution over the vocabulary and is trained to predict the original token.

MLM working
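A minimal sketch of this 80/10/10 masking rule is shown below. It assumes the input has already been converted to integer token IDs and uses made-up values for the [MASK] ID and vocabulary size; the real BERT implementation differs in details.

```python
import random

MASK_ID = 103        # assumed ID for [MASK]
VOCAB_SIZE = 30000   # WordPiece vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 for positions the model need not predict."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:        # select ~15% of tokens
            labels[i] = tok                    # the model must predict the original token here
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                      # 10%: replace with a random token
                corrupted[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251]))
```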

Next Sentence Prediction

In the second pre-training task, the model is given two sentences packed into one sequence and has to predict whether the second sentence follows the first (IsNext) or not (NotNext). The pre-training data is constructed so that the two classes are balanced: half the time the second sentence is the actual next sentence, and half the time it is a random sentence from the corpus. This helps the model learn relationships between sentences.

NSP working
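Below is a rough sketch of how such sentence pairs could be constructed from a corpus of documents. The function name and the exact 50/50 sampling are my own simplification of the procedure described in the paper.

```python
import random

def make_nsp_pairs(documents):
    """documents: list of lists of sentences. Yields (sentence_a, sentence_b, label)."""
    all_sentences = [s for doc in documents for s in doc]
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                yield doc[i], doc[i + 1], "IsNext"                      # true next sentence
            else:
                yield doc[i], random.choice(all_sentences), "NotNext"   # random sentence

docs = [["The cat sat.", "It purred.", "Then it slept."],
        ["BERT is a model.", "It was released in 2018."]]
for pair in make_nsp_pairs(docs):
    print(pair)
```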

Training for Downstream Tasks

After pre-training on large unlabeled datasets, the model is trained for downstream tasks on labeled data. Two approaches are followed:

  1. Fine-tuning approach: In this approach, we add a task-specific classification layer on top of the model, which outputs softmax probabilities over the task labels. The model is then trained end-to-end on the labeled dataset for the downstream task, and all of its parameters are updated (a minimal sketch follows this list).
  2. Feature-based approach: In this approach, the weights of the BERT model are frozen after pre-training. BERT embeddings are extracted for the labeled dataset, and a new model is trained on top of these embeddings. In the original paper, a 2-layer BiLSTM applied to the concatenated token representations from the last 4 layers of BERT performed best.
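As an illustration of the fine-tuning approach, here is a minimal sketch using the Hugging Face transformers library; the tiny two-example dataset, the two-label setup, and the hyperparameters are placeholder assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained BERT plus a fresh classification head (2 labels assumed, e.g. positive/negative)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this movie.", "This was a waste of time."]   # placeholder labeled data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # end-to-end: all BERT parameters receive gradients
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```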

Bidirectional and Auto-Regressive Transformers (BART)

BART was developed by Facebook AI (now Meta). It is, in a sense, a combination of Google’s BERT and OpenAI’s GPT. BERT’s bidirectional, autoencoding nature helps with downstream tasks that require information about the whole input sequence, but it is not well suited to sequence generation. GPT models are good at text generation but weaker on tasks that require knowledge of the whole sequence, due to their unidirectional, autoregressive nature. BART combines the two approaches and thus gets the best of both worlds.

BART architecture

The BART model consists of a bidirectional encoder and an autoregressive decoder. During pre-training, noise transformations are applied to the input sequence, for example by replacing spans of text with mask symbols. The corrupted sequence is fed into the bidirectional encoder, and the autoregressive decoder then computes the likelihood of the original document, reconstructing it token by token. Because corruption can change the sequence length, the input and output do not have to be aligned. For fine-tuning, an uncorrupted document is fed to both the encoder and the decoder, and the representations from the final hidden state of the decoder are used.
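To see the encoder-decoder setup in action, here is a small sketch that uses the pre-trained facebook/bart-large checkpoint from Hugging Face to fill in a masked span; the example sentence is mine, and the exact output may vary.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Corrupted input: a span of the original text is replaced with a single <mask> token
text = "BART is a sequence-to-sequence model with a <mask> encoder and an autoregressive decoder."
inputs = tokenizer(text, return_tensors="pt")

# The bidirectional encoder reads the corrupted text; the decoder generates the reconstruction
output_ids = model.generate(inputs["input_ids"], max_length=40, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```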

Noise Transformations

Several noise transformations are applied to the input sequence during pre-training of the BART model: token masking, token deletion, text infilling (replacing sampled spans with a single mask token), sentence permutation, and document rotation. The model is optimized to reconstruct the original sequence from the corrupted one.
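As an example of one of these transformations, here is a rough sketch of text infilling, which the paper found to work particularly well: spans with lengths drawn from a Poisson distribution (λ = 3) are each replaced with a single mask token. The word-level tokenization, the masking probability, and the handling of zero-length spans are my own simplifications.

```python
import random
import numpy as np

def text_infilling(words, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace random spans of words with a single '<mask>' token (illustrative sketch)."""
    corrupted, i, masked = [], 0, 0
    num_to_mask = int(len(words) * mask_ratio)
    while i < len(words):
        if masked < num_to_mask and random.random() < 0.2:
            span = max(1, np.random.poisson(poisson_lambda))  # span length ~ Poisson(3)
            corrupted.append("<mask>")                        # whole span -> one mask token
            i += span
            masked += span
        else:
            corrupted.append(words[i])
            i += 1
    return corrupted

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```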

BART results

Text-to-Text Transfer Transformer (T5)

The architecture of the T5 model is almost the same as the original Transformer proposed by Vaswani et al. In the Base configuration, both the encoder and the decoder consist of 12 blocks, and the model has about 220 million parameters. Only a few changes have been made to the architecture: the layer-norm bias is removed, layer normalization is placed outside the residual path, and relative position embeddings are used instead of the fixed sinusoidal ones. A combination of model and data parallelism is used to train the model, and the same recipe of unsupervised pre-training followed by supervised fine-tuning is applied.

Text-to-Text Framework

T5 casts every problem into a text-to-text format: the input and the output are always text. Even regression problems are modeled this way, with the model producing the numeric target as a string. T5 was trained on the 750 GB C4 (Colossal Clean Crawled Corpus) dataset, which was obtained by cleaning roughly 20 TB of Common Crawl data: text containing JavaScript code, curly brackets, HTML tags, placeholder text, or offensive language was removed.
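The text-to-text framing is easiest to see through task prefixes. Here is a minimal sketch using the t5-small checkpoint from Hugging Face; the prefixes ("translate English to German:", "summarize:") follow the paper, but the example inputs are my own.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as "prefix: input text" -> output text
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: T5 casts every NLP problem, including classification and regression, "
    "into a text-to-text format so that a single model and loss can be used for all tasks.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```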

Masking Method

During unsupervised pre-training, the objective randomly samples and drops out 15% of the tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token. In BERT, each mask corresponds to exactly one token to predict; T5 is more flexible, since a single sentinel can stand for one token or a whole span, and the target sequence consists of the dropped-out spans delimited by their sentinel tokens. This gives the model more flexibility in learning the structure of the language.
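A rough sketch of this span-corruption objective is shown below; the word-level tokenization and the span-selection logic are simplifications, but the sentinel naming (<extra_id_0>, <extra_id_1>, ...) matches the tokens in the released T5 vocabulary, and the example sentence is the one used in the paper.

```python
import random

def span_corruption(words, corrupt_ratio=0.15):
    """Drop random words, collapse consecutive drops into one sentinel, and build the target."""
    drop = set(random.sample(range(len(words)), max(1, int(len(words) * corrupt_ratio))))
    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(words):
        if i in drop:
            token = f"<extra_id_{sentinel}>"
            inputs.append(token)              # the whole dropped span becomes one sentinel
            targets.append(token)
            while i in drop:                  # the target lists the dropped words after it
                targets.append(words[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(words[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel marks the end of the targets
    return " ".join(inputs), " ".join(targets)

src, tgt = span_corruption("Thank you for inviting me to your party last week".split())
print(src)
print(tgt)
```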

Attention mask patterns

The figure above shows the attention mask patterns used in the models. A dark cell at row i and column j indicates that the self-attention mechanism is allowed to attend to input element j at output step i; a light cell indicates that it is not. In the left pattern (fully-visible), the entire input is visible at every output step. In the middle pattern (causal), an output step cannot attend to any input element from the future. In the right pattern (causal with prefix), fully-visible masking is applied to a prefix of the input sequence and causal masking to the rest.
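These three patterns are easy to construct as boolean matrices. Below is a small PyTorch sketch (the sequence length and prefix length are arbitrary choices) where True means "allowed to attend".

```python
import torch

seq_len, prefix_len = 6, 3   # arbitrary illustrative sizes

# Fully-visible: every output step can attend to every input element
fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Causal: output step i can only attend to input elements j <= i
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Causal with prefix: fully-visible over the first `prefix_len` elements, causal afterwards
prefix_lm = causal.clone()
prefix_lm[:, :prefix_len] = True

for name, mask in [("fully-visible", fully_visible), ("causal", causal), ("prefix", prefix_lm)]:
    print(name)
    print(mask.int())
```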

Closing Remarks

In conclusion, language models such as BERT, BART, and T5 differ in their architectures and training objectives. Compared to unidirectional pre-training, the MLM and denoising objectives have driven a major shift in natural language processing and demonstrated the power of large-scale language models. These models have shown remarkable capabilities in producing coherent, contextually relevant text, pushing the boundaries of what is possible in language understanding and generation tasks. In the next blog post, I will cover the transition from GPT-3.5 to ChatGPT and the concept of RLHF in detail.

Thank you for reading!

Follow me on LinkedIn!

