Introduction
In this article, I will explain the implementation details of the embedding layers in BERT, namely the Token Embeddings, the Segment Embeddings, and the Position Embeddings.
Overview
Here’s a diagram from the BERT paper that aptly describes the function of each of the embedding layers in BERT:
Like most deep learning models aimed at solving NLP-related tasks, BERT passes each input token (the words in the input text) through a Token Embedding layer so that each token is transformed into a vector representation. Unlike other deep learning models, BERT has additional embedding layers in the form of Segment Embeddings and Position Embeddings. The reason for these additional embedding layers will become clear by the end of this article.
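To make the overall picture concrete before diving into each layer, here is a minimal PyTorch sketch of how the three embedding lookups could be combined. This is not BERT’s actual source; the sizes are assumptions matching BERT-base (a 30,522-entry WordPiece vocabulary, a 768-dimensional hidden size, 2 segment ids, and a maximum sequence length of 512), and the three lookups are summed element-wise:

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sketch of BERT-style input embeddings: token + segment + position, summed."""

    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_positions=512, num_segments=2):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.segment_embeddings = nn.Embedding(num_segments, hidden_size)
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids and segment_ids both have shape (batch_size, seq_len).
        seq_len = token_ids.size(1)
        position_ids = torch.arange(seq_len, device=token_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(token_ids)

        # Each lookup yields a (batch_size, seq_len, hidden_size) tensor;
        # the three are summed element-wise. (BERT's reference implementation
        # additionally applies layer normalization and dropout to this sum.)
        return (self.token_embeddings(token_ids)
                + self.segment_embeddings(segment_ids)
                + self.position_embeddings(position_ids))
```

Given a batch of token ids and matching segment ids (0 for the first sentence, 1 for the second), this produces one 768-dimensional vector per input position, which is what the encoder layers consume.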
Token Embeddings
Purpose
As alluded to in the previous section, the role of the Token Embeddings layer is to transform words into vector representations of fixed dimension. In the case of BERT-base, each word is represented as a 768-dimensional vector.
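Concretely, the Token Embeddings layer can be thought of as a lookup table with one learned row per vocabulary entry. Here is a minimal sketch, assuming the 30,522-entry WordPiece vocabulary of bert-base-uncased; the token ids below are placeholders rather than real WordPiece ids:

```python
import torch
import torch.nn as nn

# Assumed sizes for bert-base-uncased: a 30,522-entry WordPiece vocab, 768-dim vectors.
vocab_size, hidden_size = 30522, 768
token_embeddings = nn.Embedding(vocab_size, hidden_size)

# Placeholder ids for a single 6-token sequence; real ids would come
# from BERT's WordPiece tokenizer.
token_ids = torch.tensor([[101, 7592, 2003, 1037, 3231, 102]])  # shape (1, 6)

vectors = token_embeddings(token_ids)
print(vectors.shape)  # torch.Size([1, 6, 768]): one 768-dimensional vector per token
```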
Implementation
Suppose the input text is “I like strawberries”. Here’s a diagram describing the role of the Token Embeddings layer (there is a…