How the Embedding Layers in BERT Were Implemented

Feb 19 · 5 min read


In this article, I will explain the implementation details of the embedding layers in BERT, namely the Token Embeddings, Segment Embeddings, and the Position Embeddings.


Here’s a diagram from the paper that aptly describes the function of each of the embedding layers in BERT:

Token Embeddings


As alluded to in the previous section, the role of the Token Embeddings layer is to transform words into vector representations of fixed dimension. In the case of BERT, each word is represented as a 768-dimensional vector.


Suppose the input text is “I like strawberries”. Here’s a diagram describing the role of the Token Embeddings layer:

Segment Embeddings


BERT is able to solve NLP tasks that involve text classification given a pair of input texts. An example of such a problem is classifying whether two pieces of text are semantically similar. The pair of input text are simply concatenated and fed into the model. So how does BERT distinguishes the inputs in a given pair? The answer is Segment Embeddings.


Suppose our pair of input text is (“I like cats”, “I like dogs”). Here’s how Segment Embeddings help BERT distinguish the tokens in this input pair:

Position Embeddings


BERT consists of a stack of Transformers (Vaswani et al. 2017) and broadly speaking, Transformers do not encode the sequential nature of their inputs. The Motivation section in this blog post explains what I mean in greater detail. To summarize, having position embeddings will allow BERT to understand that given an input text like:


BERT was designed to process input sequences of up to length 512. The authors incorporated the sequential nature of the input sequences by having BERT learn a vector representation for each position. This means that the Position Embeddings layer is a lookup table of size (512, 768) where the first row is the vector representation of any word in the first position, the second row is the vector representation of any word in the second position, etc. Therefore, if we have an input like “Hello world” and “Hi there”, both “Hello” and “Hi” will have identical position embeddings since they are the first word in the input sequence. Similarly, both “world” and “there” will have the same position embedding.

Combining Representations

We have seen that a tokenized input sequence of length n will have three distinct representations, namely:

  • Segment Embeddings with shape (1, n, 768) which are vector representations to help BERT distinguish between paired input sequences.
  • Position Embeddings with shape (1, n, 768) to let BERT know that the inputs its being fed with have a temporal property.


In this article, I have described the purpose of each of BERT’s embedding layers and their implementation. Let me know in the comments if you have any questions.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Devlin et al. 2018.