Input Embedding Sublayer in the Transformer Model

Sandaruwan Herath
Data Science and Machine Learning
3 min read · Apr 17, 2024

The input embedding sublayer is crucial in the Transformer architecture: it converts input tokens into vectors of a fixed dimension (commonly d_model = 512) using learned embeddings. This sublayer is foundational, preparing the input for further processing by the Transformer’s subsequent layers.

Process of Input Tokenization and Embedding

Tokenization: Initially, a sentence is transformed into tokens using a tokenizer. Different tokenization methods exist, such as Byte Pair Encoding (BPE), WordPiece, and SentencePiece. The choice of tokenizer can influence the model’s performance due to variations in how words are broken down into tokens.

For instance, consider the sentence: “Dolphins leap gracefully over waves.”

A subword tokenizer might split this as: [‘Dolphins’, ‘leap’, ‘grace’, ‘##fully’, ‘over’, ‘waves’, ‘.’] (the ‘##’ continuation marker follows the WordPiece convention; BPE marks subwords slightly differently, but the idea is the same).

This demonstrates how subword tokenization splits “gracefully” into “grace” and “##fully”, helping the model process complex or rare words by breaking them into recognizable segments.
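To make this concrete, here is a minimal sketch using the Hugging Face transformers library with a WordPiece vocabulary (bert-base-cased is an assumed choice). The exact subword splits depend on the tokenizer’s learned vocabulary, so the output shown in the comments is illustrative rather than guaranteed.

# Tokenization sketch (assumes the Hugging Face transformers package is installed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # WordPiece vocabulary

tokens = tokenizer.tokenize("Dolphins leap gracefully over waves.")
print(tokens)
# e.g. ['Dolphin', '##s', 'leap', 'grace', '##fully', 'over', 'waves', '.']

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # integer IDs that will index into the embedding table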

Embedding Sublayer

Embedding: Each token is then mapped to a vector in a high-dimensional space using a learned embedding table. The embedding step converts each token into a dense vector of floats (the embedding vector) that the model can process; each vector captures semantic and syntactic information about the token.

Example of Embedding

Suppose we are embedding the word “waves” from the sentence above. The embedding layer will convert “waves” into a dense vector of size 512. Here is a hypothetical representation of the embedding:

waves = [0.032, -0.024, 0.118, …, 0.021] # A 512-dimensional vector

This vector encodes not only the identity of the word “waves” but also distributional cues learned from how the word is used across the training data seen by the model.
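As a rough sketch of how this lookup works in practice, the snippet below uses PyTorch’s nn.Embedding; the vocabulary size and the token ID chosen for “waves” are illustrative placeholders, since in a real model both come from the tokenizer.

# Embedding-lookup sketch in PyTorch (vocab size and token ID are placeholders).
import torch
import torch.nn as nn

vocab_size = 30000   # assumed vocabulary size
d_model = 512        # embedding dimension used in the original Transformer

embedding = nn.Embedding(vocab_size, d_model)   # learned lookup table

waves_id = torch.tensor([1234])        # hypothetical ID for the token "waves"
waves_vector = embedding(waves_id)     # dense 512-dimensional vector
print(waves_vector.shape)              # torch.Size([1, 512])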

Positional Encoding

Since the Transformer does not inherently encode word order (it has no recurrent structure), positional encodings are added to the embeddings to inject information about the order of tokens. These encodings use sine and cosine functions of different frequencies to encode the position of each word within a sentence:

Positional Encoding (pos, 2i) = sin(pos / 10000^(2i/d_model))

Positional Encoding (pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here pos is the token’s position in the sequence and i indexes the embedding dimension. Adding these encodings to the token embeddings lets the model distinguish where each word sits in the sentence, preserving the sequential nature of the input data.
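The formulas above translate almost directly into code. The following PyTorch sketch builds the full positional-encoding matrix for a short sequence; the maximum length of 50 is an assumption for illustration.

# Sinusoidal positional-encoding sketch (max_len = 50 is an assumed sequence length).
import torch

d_model = 512
max_len = 50

position = torch.arange(max_len).unsqueeze(1).float()                        # (max_len, 1)
div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions: sine
pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions: cosine

print(pe.shape)   # torch.Size([50, 512]) -- one positional vector per position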

Multi-Head Attention Mechanism

The embedded vectors, with positional encodings added, are then fed into the multi-head attention mechanism. Here’s a simple example to illustrate how attention works on the tokenized and embedded sentence:

For the word “leap” in the sentence, the attention mechanism analyzes its relationship with every other word, including itself, to enrich the contextual representation of each word. Concretely, each word’s output is a weighted sum of all token representations, where the weights come from softmax-normalized, scaled dot products between learned query and key projections of those representations, capturing how strongly each word relates to every other word in the sentence.
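A single attention head can be sketched in a few lines of PyTorch. Random tensors stand in for the embedded (and position-encoded) sentence; in the full model, several such heads run in parallel before their outputs are concatenated.

# Single-head scaled dot-product attention sketch (random stand-in inputs).
import torch
import torch.nn.functional as F

seq_len, d_model = 7, 512                  # 7 tokens, as in the example sentence
x = torch.randn(seq_len, d_model)          # stand-in for the embedded tokens

W_q = torch.nn.Linear(d_model, d_model, bias=False)   # query projection
W_k = torch.nn.Linear(d_model, d_model, bias=False)   # key projection
W_v = torch.nn.Linear(d_model, d_model, bias=False)   # value projection

Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / (d_model ** 0.5)        # scaled dot products, shape (7, 7)
weights = F.softmax(scores, dim=-1)        # each row sums to 1
output = weights @ V                       # context-enriched representations, (7, 512)

print(weights[1])   # how strongly "leap" (2nd token) attends to every token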

Summary of the Embedding Sublayer

The input embedding sublayer is a blend of tokenization, embedding, and positional encoding, structured to transform raw text into a format the Transformer can process efficiently. By converting words into rich vector representations and preserving information about their order, this sublayer sets the stage for the attention and feed-forward operations in subsequent layers. The continuous vector space and positional information captured in these inputs allow the Transformer to perform nuanced language understanding and generation effectively.
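Putting the pieces together, the sublayer’s output for one sentence can be sketched as an embedding lookup, a scaling by √d_model (as in Vaswani et al., 2017), and the addition of the positional encodings; the vocabulary size and token IDs below are placeholders.

# End-to-end sketch of the input embedding sublayer (placeholder IDs and vocab size).
import math
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 30000, 512, 7

embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encodings for seq_len positions, as defined above.
position = torch.arange(seq_len).unsqueeze(1).float()
div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)

token_ids = torch.randint(0, vocab_size, (seq_len,))    # stand-in for tokenizer output
x = embedding(token_ids) * math.sqrt(d_model) + pe      # input to the first encoder layer
print(x.shape)                                          # torch.Size([7, 512])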

Next

The positional encoding component of the Transformer model is vital as it adds information about the order of words in a sequence to the model’s inputs. Unlike some older sequence processing models that inherently understand sequence order (e.g., RNNs), the Transformer uses positional encodings to maintain this essential context.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

[2] Rothman, D. (2024). Transformers for Natural Language Processing and Computer Vision. Packt Publishing.
