Part 1. Input Embeddings and Positional Encodings

Kunal Mishra
5 min read · Jan 6, 2024


Github Repo — Repo Link

Colab notebook — Notebook

Preprocessed Dataset — Dataset

This is the second blog of the series implementing the “Attention is all you need” research paper from scratch. Here are the links to the other parts:

Part 0. Intro blog

Part 2. Multihead attention (in progress)

Part 3. Encoder block (in progress)

Part 4. Decoder block (in progress)

Part 5. Assembling the model and training (in progress)

Part 6. Inference (in progress)

Input Embeddings

First, let's understand what input embeddings are and why they are needed.

In Natural Language Processing (NLP) problems we deal with text data. To feed text as input to any machine learning model, we first need to convert it into numbers.

One way to convert an input sentence to numbers is to first break the sentence down into individual words and then assign a number to each word. This process of breaking text into smaller units (tokens) is called tokenization.

For example:

Text tokenization and vectorization

This is called word-level tokenization. Among the most widely used tokenization techniques are byte-level BPE tokenizers such as OpenAI's tiktoken and subword tokenizers such as Google's SentencePiece.
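To make the idea concrete, here is a minimal word-level tokenization sketch; the toy vocabulary and the resulting IDs are purely illustrative, not the ones used in the repo.

```python
# Minimal word-level tokenization sketch (toy vocabulary, illustrative IDs).
sentence = "I am going to school"
tokens = sentence.lower().split()   # ['i', 'am', 'going', 'to', 'school']

# Map each unique word to an integer ID.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[word] for word in tokens]

print(token_ids)  # [2, 0, 1, 4, 3]
```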

Next, we need to represent these word-level tokens as vectors so that they convey some meaning. We will create these vectors using the nn.Embedding layer. This layer can be thought of as a dictionary (lookup table) of size vocab_size x embedding_size. Each row contains float values and is of size embedding_size. The embedding size is a hyperparameter, referred to as d_model in the original paper, where its value is 512, i.e., each word is represented as a 512-dimensional vector.

Embedding layer

The embedding layer is trainable: as we train the model, its values change to minimize the loss. We'll look at the final values of the embedding layer in the last section.

Size is important — Let's calculate the size after this operation. For the sentence “I am going to school” the number of words is 5. Let's take the embedding dimension (d_model) to be 512. After embedding we'll get a matrix of shape 5 x 512 (as shown in the image above).
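As a quick sanity check of these sizes, here is a sketch using nn.Embedding; the vocabulary size of 10,000 and the token IDs are assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # vocab size is illustrative, d_model as in the paper
embedding = nn.Embedding(vocab_size, d_model)  # lookup table of shape (vocab_size, d_model)

token_ids = torch.tensor([2, 0, 1, 4, 3])      # placeholder IDs for "I am going to school"
word_vectors = embedding(token_ids)

print(word_vectors.shape)  # torch.Size([5, 512])
```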

Size for a batch of inputs — To speed up training, instead of feeding a single sentence it is common practice to feed a batch of input sentences to the model. The sentences in a batch are independent of each other and are processed in parallel.

Problem — The lengths of individual sentences in a batch can differ. The embedded representation of the sentence “I am going to school” will be of size 5 x 512, whereas that of the sentence “Tim plays soccer” will be of size 3 x 512.

Solution — To tackle this, we define two things: a maximum sentence length and a pad token.

  1. If the sentence is longer than the maximum length (in tokens), we truncate it at the maximum length.
  2. If the sentence is shorter than the maximum length, we pad it with the pad token.

Let's say we define the batch size as 8 and the maximum sequence length as 32, so the batch input given to the model will be of size 8 x 32 x 512; we'll call this B x M x d_model.
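Here is a minimal sketch of the truncate/pad step and the resulting batch shape; the pad ID, vocabulary size, and token IDs are assumptions for illustration (and the batch has only 2 sentences for brevity).

```python
import torch
import torch.nn as nn

max_length, pad_id = 32, 0   # assumed maximum length and pad-token ID

def pad_or_truncate(token_ids, max_length, pad_id):
    # Truncate sentences longer than max_length, pad shorter ones with pad_id.
    token_ids = token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

batch_sentences = [[5, 9, 14, 3, 7], [11, 4, 8]]   # token IDs for two sentences
batch = torch.tensor([pad_or_truncate(s, max_length, pad_id) for s in batch_sentences])
print(batch.shape)   # torch.Size([2, 32])  ->  B x M

embedding = nn.Embedding(10_000, 512, padding_idx=pad_id)   # vocab size is illustrative
print(embedding(batch).shape)   # torch.Size([2, 32, 512])  ->  B x M x d_model
```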

Positional Encodings

Now, let's understand what positional encodings are and why they are needed.

In any language, the order of the words matters. If we change the positions of the words in any given sentence, the entire meaning can be altered.

Problem — Transformer models don't use recurrence or convolution operations, so there is no built-in notion of the positions at which words occur.

Solution — To solve this, positional encodings of the words are supplied along with the sentence to provide positional information.

The positional encodings used in the original paper apply sine and cosine functions to the position of the token. The positional encoding matrix is of size max_length x embedding_dim, where each row encodes the information for one position. Here's how the values are calculated in the paper:
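Written out in plain text, the formulas from the paper are:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the token in the sentence and i indexes pairs of embedding dimensions, so sine is used for the even dimensions and cosine for the odd ones.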

Note — The values in the positional encoding matrix are pre-computed; they won't change during training. Here's a snippet to compute the positional encodings:
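Since the original snippet was shared as an image, here is a rough PyTorch equivalent; the function name and variables are my own, but it follows the standard sine/cosine formulation above.

```python
import math
import torch

def positional_encoding(max_length: int, d_model: int) -> torch.Tensor:
    # Pre-compute a (max_length, d_model) matrix of sinusoidal positional encodings.
    position = torch.arange(max_length).unsqueeze(1)                 # (max_length, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions -> sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions  -> cosine
    return pe                                      # fixed values, not trained

pe_matrix = positional_encoding(max_length=32, d_model=512)
print(pe_matrix.shape)   # torch.Size([32, 512])
```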

For a given sentence, the positional encodings are obtained as follows:

  1. Get the max_length of the inputs.
  2. Create a range_vector using torch.arange(max_length), which gives [0, 1, 2, ..., max_length-1].
  3. Get the embeddings of each element in the range vector from the position embeddings matrix defined above.

The output you get will be of size B x M x d_model.
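As a sketch of this lookup (reusing the pe_matrix from the snippet above; the shapes assume max_length = 32 and d_model = 512):

```python
import torch

max_length = 32
# pe_matrix is the (max_length, d_model) matrix computed in the earlier snippet.

positions = torch.arange(max_length)         # tensor([0, 1, 2, ..., 31])
pos_encodings = pe_matrix[positions]         # (M, d_model): one row per position
pos_encodings = pos_encodings.unsqueeze(0)   # (1, M, d_model), broadcasts over the batch to B x M x d_model
```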

Final Inputs

Now we have both the input embeddings and the positional encodings for a batch of inputs. To combine them, we simply sum them, as both are of size B x M x d_model. The input that goes into the encoder is therefore a tensor of size B x M x d_model. The complete process works as follows:

Preparing data to be fed into encoder
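Putting the pieces together, here is a rough end-to-end sketch of preparing the encoder input; it reuses the illustrative positional_encoding function from above, and the random batch of token IDs stands in for a real padded batch.

```python
import torch
import torch.nn as nn

B, M, d_model, vocab_size = 8, 32, 512, 10_000   # batch size, max length, model dim, toy vocab size

token_embedding = nn.Embedding(vocab_size, d_model)    # trainable input embeddings
pe_matrix = positional_encoding(M, d_model)            # fixed (M, d_model) matrix from the snippet above

batch_token_ids = torch.randint(0, vocab_size, (B, M)) # stand-in for a real padded batch of token IDs

input_embeddings = token_embedding(batch_token_ids)    # (B, M, d_model)
pos_encodings = pe_matrix.unsqueeze(0)                 # (1, M, d_model), broadcast over the batch

encoder_input = input_embeddings + pos_encodings       # (B, M, d_model)
print(encoder_input.shape)                             # torch.Size([8, 32, 512])
```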

So far, we have seen how the inputs are prepared and fed to the encoder, as shown in the snippet from the paper below.

Snippet from the Research paper

In the next part, we will look at multi-head attention and the encoder block.

Github Repo — Repo Link

Colab notebook — Notebook

Preprocessed Dataset — Dataset

