What are the differences between Pre-Trained Transformer-based models like BERT, DistilBERT, XLNet, GPT, …
This article is a cheat sheet of well-known Transformer-based models and explains what makes each of them unique, even though they are all built on the same architecture.

The combination of the Transformer architecture and transfer learning is dominating the Natural Language Processing world. There are numerous pre-trained models (Hugging Face alone hosts 40+) that might look the same at first glance because they all use the same building blocks. Make sure you have a basic understanding of the Transformer blocks to get the most out of this piece (I highly recommend Jay Alammar's "The Illustrated Transformer" post).
It is tempting to attribute each model's strength to its number of parameters, but there is more to it. I will review several well-known models and discuss how they differ in overall architecture, pre-training objectives, and training process. In short, I will try to answer the question: what makes each of them unique?
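
Part of why these models look interchangeable at first glance is that they are exposed through the same interface. The snippet below is a minimal sketch using the Hugging Face `transformers` library (the checkpoint names are illustrative examples from the model hub, not an exhaustive list): every model loads and runs the same way, and the differences live behind that shared API.

```python
# Minimal sketch: different pre-trained checkpoints, one loading interface.
# The checkpoint names are illustrative examples from the Hugging Face hub.
from transformers import AutoTokenizer, AutoModel

for checkpoint in ["bert-base-uncased", "distilbert-base-uncased", "gpt2", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)

    # Same call pattern for every model; the differences are in the
    # architecture, pre-training objective, and training process behind it.
    inputs = tokenizer("Transformers share the same building blocks.", return_tensors="pt")
    outputs = model(**inputs)
    print(checkpoint, outputs.last_hidden_state.shape)
```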
BERT
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [1]
Can we even talk about NLP and Transformers without mentioning the almighty BERT? This is where it all started. The model is based on the Transformer’s encoder block…
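
To make the encoder-only, bidirectional idea tangible, here is a minimal sketch of BERT's masked-language-modeling behaviour using the Hugging Face `fill-mask` pipeline (the example sentence is my own): the model predicts the masked token by looking at the context on both sides of it.

```python
# Minimal sketch of BERT's masked language modeling (MLM) objective:
# the model predicts the [MASK] token using context from both directions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The Transformer [MASK] block encodes the whole sentence at once.")
for p in predictions:
    print(f"{p['token_str']:>12}  score={p['score']:.3f}")
```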