BERT: Pre-Training of Transformers for Language Understanding

Understanding Transformer-Based Self-Supervised Architectures

Rohan Jagtap
The Startup


Pre-trained language models have taken over a majority of tasks in NLP. The 2017 paper, “Attention Is All You Need”, which proposed the Transformer architecture, changed the course of NLP. Building on it, several architectures such as BERT and OpenAI GPT evolved by leveraging self-supervised learning.

In this article, we discuss BERT (Bidirectional Encoder Representations from Transformers), which was proposed by Google AI in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. It is one of the groundbreaking models that achieved state-of-the-art results on many downstream tasks and is still widely used.

Overview

BERT pre-training and fine-tuning tasks, from the paper. (We will cover the architecture and specifications in the coming sections; for now, just observe that the same architecture is transferred to the fine-tuning tasks with minimal changes to the parameters.)

BERT takes a fine-tuning-based approach to applying pre-trained language models: a common architecture is first pre-trained on a relatively generic task, and is then fine-tuned on specific downstream tasks that are more or less similar to the pre-training task.
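To make this pattern concrete, here is a minimal sketch using the Hugging Face `transformers` library (an assumption for illustration; the paper itself used Google's original TensorFlow implementation). The pre-trained BERT encoder is loaded as-is, a small classification head is added on top, and the whole model is fine-tuned on a hypothetical two-class downstream task:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# 1. Load the generically pre-trained BERT encoder, plus a fresh
#    classification head for the downstream task (here: 2 labels).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# 2. Fine-tune the *same* architecture on task-specific data
#    (a toy, hypothetical sentiment dataset).
texts = ["a great movie", "a waste of time"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=labels)  # loss is computed internally
outputs.loss.backward()
optimizer.step()
```

Note that only the small classification head is new; every other parameter starts from the pre-trained weights and is merely nudged during fine-tuning, which is exactly the "minimal changes" idea in the figure above.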
