The Feedforward Network (FFN) in The Transformer Model

Sandaruwan Herath
Data Science and Machine Learning
Apr 19, 2024

The Transformer model revolutionizes language processing with its unique architecture, which includes a crucial component known as the Feedforward Network (FFN). Positioned within both the encoder and decoder modules of the Transformer, the FFN plays a vital role in refining the data processed by the attention mechanisms.

Location within the Transformer Model

Following the multi-head attention sublayer and its post-layer normalization (post-LN) step, the output, which keeps a dimensionality of d_model = 512, enters the Feedforward Network (FFN). This sublayer processes each position in the sequence independently and in an identical manner.

Figure: The Transformer model architecture [1]

Structure and Functionality of the FFN

The FFN within both the encoder and decoder of the Transformer is constructed as a fully connected, position-wise network. This design means that each position in the input sequence is processed separately but in the same manner, which is crucial for maintaining the positional integrity of the input data.

Key Characteristics of the FFN

1. Fully Connected Layers:

The FFN comprises two linear (fully connected) layers that transform the input data. The first layer expands the input dimension from d_model = 512 to a larger inner dimension d_ff = 2048, and the second layer projects it back down to d_model (see the code sketch after this list).

2. Activation Function:

A Rectified Linear Unit (ReLU) activation function is applied between these two linear layers. This function is defined as ReLU(x) = max(0, x) and introduces non-linearity into the model, helping it to learn more complex patterns.

3. Position-wise Processing:

Despite the sequential nature of the input data, each position (i.e., each word’s representation in a sentence) is processed independently with the same FFN. This is akin to applying the same transformation across all positions, ensuring uniformity in extracting features from different parts of the input sequence.
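These three characteristics can be captured in a few lines of code. The following is a minimal PyTorch sketch; the class name PositionWiseFFN and its defaults are illustrative, not taken from the paper's reference implementation:

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Position-wise FFN: two linear layers with a ReLU in between."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand 512 -> 2048
        self.linear2 = nn.Linear(d_ff, d_model)   # project 2048 -> 512
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model). nn.Linear acts on the
        # last dimension only, so every position is transformed with the
        # same weights, independently of its neighbours.
        return self.linear2(self.relu(self.linear1(x)))

ffn = PositionWiseFFN()
out = ffn(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Because nn.Linear operates on the last dimension, the same weights are applied at every position of the sequence, which is exactly the position-wise property described in point 3.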

Mathematical Representation

The operations within the FFN can be described by the following equation:

FFN(x) = max(0, xW1 + b1)W2 + b2

Where:

W1 and W2 are the weight matrices for the first and second linear layers, respectively.

b1 and b2 are the biases for these layers.

The ReLU activation is applied element-wise after the first linear transformation.

Example of FFN Processing

Consider a simplified example where the input x is a vector representing a single word’s output from the post-LN stage:

x = [0.5, −0.2, 0.1, …] (512-dimensional)

The first layer of the FFN projects this vector into the larger 2048-dimensional space, adds a bias, and applies the ReLU activation:

x′ = max(0, xW1 + b1)

The second layer then projects this 2048-dimensional vector back down to the original 512-dimensional space:

FFN output = x′W2 + b2

This output is then normalized by a subsequent post-LN step and either fed into the next layer of the encoder or used as part of the input to the multi-head attention layer in the decoder.
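To make the shape changes in this example concrete, here is a small NumPy sketch. The weights are random and the 0.02 scale is arbitrary; this is purely illustrative, not a trained model:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

x = rng.standard_normal(d_model)              # post-LN output for one position
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

h = np.maximum(0, x @ W1 + b1)                # expand to 2048 dims, apply ReLU
y = h @ W2 + b2                               # project back to 512 dims

print(x.shape, h.shape, y.shape)              # (512,) (2048,) (512,)
```

The printed shapes trace the expand-then-project pattern: 512 in, 2048 inside, 512 out, so the FFN output can flow straight into the next sublayer.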

Summary of Feedforward Network in the Transformer Model

In summary, the Feedforward Network is a cornerstone of the Transformer architecture, enhancing its capability to handle diverse and complex linguistic tasks with remarkable efficiency and effectiveness. By systematically refining the output from the attention layers, the FFN helps maintain the Transformer’s high performance across different natural language processing applications.

Next

The output of the FFN is a refined representation of each input position: the contextual embeddings produced by attention, transformed non-linearly. Following the FFN, the Transformer applies another post-LN step to prepare the processed data for the next encoder layer or for the attention mechanisms in the decoder stack. This repeated attention-plus-FFN structure allows the Transformer to capture and emphasize a wide range of linguistic features, making the FFN a critical component of the model's architecture.

References

[1]. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

[2]. Rothman, D. (2024). Transformers for Natural Language Processing and Computer Vision. Packt Publishing.
