Large Language Models (LLM): Difference between GPT-3 & BERT

Shivika K Bisen
Published in Bright AI
3 min read · Dec 23, 2022

Transformer models are powerful AI models that have transformed the landscape of language understanding. Recently, ChatGPT has amazed the tech world with its impressive language generation.

Both GPT-3 and BERT are Transformer-based pre-trained models widely used in NLP tasks.

BERT

1) Underlying Model: BERT stands for Bidirectional Encoder Representations from Transformers. It is pre-trained with 2 objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Model learning happens from left-to-right and right-to-left context at the same time.

2) Input: token embedding for a word + position embedding for that word (the word's sequence number in the text) + segment embedding for that word (encodes which phrase/sentence the word falls in).

The [CLS] token marks the start of the entire input text, and the [SEP] token marks the end of each segment (sentence).
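A minimal sketch of this input representation, assuming the Hugging Face transformers library (not mentioned in the article); the tokenizer inserts the special tokens and produces the token and segment (token type) IDs, while position embeddings are added inside the model:

```python
# Sketch: BERT's input representation via the Hugging Face transformers library.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two segments (a sentence pair) -> the tokenizer adds [CLS] and [SEP]
encoding = tokenizer("The cat sat on the mat.", "It was very comfortable.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'was', ..., '[SEP]']
print(encoding["token_type_ids"])  # segment IDs: 0 for segment A, 1 for segment B
# Position embeddings are added inside the model based on each token's index.
```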

3) Architecture: Made only of Transformer encoder units (12 in the base model).
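To make the "12 encoder units" concrete, a small sketch (again assuming the Hugging Face transformers library) that inspects the base model's configuration:

```python
# Sketch: inspect BERT-base's encoder-only configuration.
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 encoder blocks in the base model
print(config.num_attention_heads)  # 12 attention heads per block
```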

4) Training

a) Task-specific training: BERT is trained for a specific downstream task. This requires some labeled training data so that transfer learning can be applied to the pre-trained model.

b) Fine-tuning for transfer learning: Compared to GPT, BERT gives more control over fine-tuning of the internal model. It can be fine-tuned to do classification, Q&A, etc. It is good to have at least ~300 training data points for fine-tuning (see the fine-tuning sketch after this list).

c) Comparatively less pre-training data (English Wikipedia and BookCorpus).

d) Comparatively fewer parameters (110M in the base model).
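As a concrete illustration of point b above, a minimal fine-tuning sketch assuming the Hugging Face transformers library and PyTorch; the texts and labels below are hypothetical placeholders, not real training data:

```python
# Sketch: one fine-tuning step of BERT for binary classification.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great product", "terrible service"]   # placeholder training examples
labels = torch.tensor([1, 0])                   # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss on the [CLS] representation
outputs.loss.backward()                  # one gradient step of transfer learning
optimizer.step()
```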

5) Functioning: Random tokens in the input text are masked (15% of all tokens). The multi-head attention layers in the encoders are trained on these masked positions: the task is to predict each masked word with a softmax over the vocabulary, given the surrounding words when looking left to right and right to left. The predicted word is then compared against the actual masked word, and the cross-entropy loss is minimized for optimal learning. Next Sentence Prediction (NSP) is trained similarly, by predicting whether one sentence follows another.
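The MLM behavior is easy to see at inference time with a fill-mask sketch (assuming the Hugging Face transformers library); BERT uses the context on both sides of [MASK] and ranks candidate words with a softmax:

```python
# Sketch: masked-word prediction with a pre-trained BERT model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```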

6) Application: Excellent for NLP tasks that involve long-range dependencies, such as translation and summarization.

7) Commonly known variations: RoBERTa (the NSP objective is removed, and it is trained on more data and for longer than BERT).

GPT-3

1) Underlying Model: GPT-3 is a generative, auto-regressive model. Model learning happens in one direction only (left to right).

2) Input: the input consists of the token embedding of a word + the position embedding of that word (no segment embedding).

3) Architecture: Made only of Transformer decoder units (96 in the largest GPT-3 model).
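GPT-3's weights are not public, so as a stand-in this sketch inspects GPT-2 (the same decoder-only design with fewer layers), again assuming the Hugging Face transformers library:

```python
# Sketch: inspect a decoder-only GPT configuration (GPT-2 as a stand-in for GPT-3).
from transformers import GPT2Config

config = GPT2Config.from_pretrained("gpt2")
print(config.n_layer)  # 12 decoder blocks in GPT-2 small (GPT-3's largest model uses 96)
print(config.n_head)   # attention heads per block
```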

4) Training

a) General tasks: GPT-3 is trained for general tasks.

b) Fine-tuning for transfer learning: fewer options for fine-tuning compared to BERT, but it requires less task-specific training data. GPT-3 uses the "shot learning" concept, i.e. zero-shot learning (no training examples), one-shot learning (1 example), and few-shot learning (a few examples); a few-shot prompt sketch follows this list.

c) A lot more pre-training data than BERT (which helps the model leverage its knowledge of the relationships between known and unknown data).

d) A lot more parameters (175B).
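To make the "shot learning" idea from point b concrete, a hedged sketch of a few-shot prompt; the task, reviews, and labels are made up for illustration, and the API call that would send this prompt to GPT-3 is omitted:

```python
# Sketch: a hypothetical few-shot prompt for GPT-3.
few_shot_prompt = """Classify the sentiment of each review.

Review: "The food was amazing."
Sentiment: positive

Review: "The service was painfully slow."
Sentiment: negative

Review: "I would happily come back again."
Sentiment:"""

# Zero-shot: drop the examples entirely. One-shot: keep only one example.
print(few_shot_prompt)
```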

5) Functioning: The model is trained to predict the next most probable word given the input embeddings. It only sees the previous words (left to right), so the decoder's multi-head attention layers use a softmax to score the most probable next word given the previous words. At inference time, a decoding strategy such as beam search is used to pick the generated sequence.
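A sketch of autoregressive next-word prediction with beam search decoding, using GPT-2 via the Hugging Face transformers library as a stand-in for GPT-3 (an assumption, since GPT-3 itself is only available through an API):

```python
# Sketch: left-to-right generation with beam search (GPT-2 as a stand-in for GPT-3).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The Transformer architecture is", return_tensors="pt")

# The model only attends to previous tokens and scores the next token with a softmax;
# beam search keeps the most probable continuations at each step.
output_ids = model.generate(**inputs, max_new_tokens=20, num_beams=5, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```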

6) Application: Excellent for natural language generation, Q&A, and tasks that leverage common sense.

7) Commonly known variations: ChatGPT
