Exploring the Multi-head Attention Sublayer in the Transformer

Sandaruwan Herath
Data Science and Machine Learning
17 min read · Apr 18, 2024

The multi-head attention mechanism is a hallmark of the Transformer model’s innovative approach to handling sequential data. It enhances the model’s ability to process sequences by enabling it to attend to different parts of the sequence simultaneously. This article will delve into the architecture of the multi-head attention sublayer, its implementation in Python, and the role of post-layer normalization.

Architecture of Multi-Head Attention

Figure: (left) Scaled Dot-Product Attention. (right) Multi-head attention consists of several attention layers running in parallel [1].

Input Structure: Each input vector to the multi-head attention sublayer combines the word embedding with its positional encoding, maintaining a dimension of d_model = 512. This comprehensive embedding encapsulates both the semantic meaning of the word and its contextual position within the sequence.
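As a minimal illustration of this input structure (using random numbers in place of real embeddings and sinusoidal positional encodings), the sublayer input can be formed as follows:

import numpy as np

seq_len, d_model = 10, 512

# Toy stand-ins: in a real model these come from the embedding layer
# and the positional-encoding function
word_embeddings = np.random.rand(seq_len, d_model)
positional_encodings = np.random.rand(seq_len, d_model)

# The element-wise sum gives the (seq_len, d_model) input to the sublayer
x = word_embeddings + positional_encodings
print(x.shape)  # (10, 512)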

Operational Detail: The key functionality of this sublayer is to project the embeddings into smaller, more focused subspaces using multiple attention “heads.” Each head captures unique aspects of the sequence, providing a diversified perspective that is crucial for complex understanding and prediction tasks.

For example, in the sentence:

“Innovators create and disrupt in equal measure.”

The word “disrupt” interacts differently with “innovators,” “create,” and “measure,” depending on the focus of each attention head. Some heads might focus on the synergy between “create” and “disrupt,” while others might explore the connections between “disrupt” and “measure.”

Dimensionality Reduction:

Each of the eight heads projects the d_model = 512 dimensions into a smaller subspace of d_k = 64 dimensions.

Query (Q), Key (K), and Value (V) matrices are derived from the input embeddings and are instrumental in the attention computation.

Each matrix has dimensions [sequence_length, d_k], as sketched below.
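Here is a hedged sketch of these shapes, with the per-head projection matrices initialized randomly in place of learned parameters:

import numpy as np

seq_len, d_model, d_k = 10, 512, 64

x = np.random.rand(seq_len, d_model)   # embeddings + positional encodings

# Per-head projection matrices (learned in practice, random placeholders here),
# each of shape (d_model, d_k)
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_k)

Q, K, V = x @ W_Q, x @ W_K, x @ W_V    # each (seq_len, d_k) = (10, 64)
print(Q.shape, K.shape, V.shape)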

Scaled Dot-Product Attention

The core of the attention mechanism is the scaled dot-product attention, calculated as follows:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Dividing the raw dot products between the queries and the keys by the factor √d_k helps stabilize the gradients during training.

Combining Heads

The outputs from all heads (Z_0 to Z_7) are concatenated to form a single matrix that preserves the original dimensionality (d_model):

MultiHead(Q, K, V) = Concat(Z_0, …, Z_7) W_O

Where W_O is another learned weight matrix that combines the insights from all the attention heads.
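The simplified code example in the next section stops at concatenation, so here is a hedged sketch of this final projection, with W_O initialized randomly as a stand-in for the learned matrix:

import numpy as np

seq_len, d_model, d_k, heads = 10, 512, 64, 8

# Stand-ins for the eight per-head outputs Z_0 ... Z_7, each of shape (seq_len, d_k)
Z = [np.random.rand(seq_len, d_k) for _ in range(heads)]

concat = np.concatenate(Z, axis=-1)          # (seq_len, heads * d_k) = (10, 512)
W_O = np.random.rand(heads * d_k, d_model)   # learned in practice, random here
multi_head_output = concat @ W_O             # (10, 512), back to d_model
print(multi_head_output.shape)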

Python Implementation Example

Let’s implement a simplified version of the multi-head attention using Python. This will help in visualizing how inputs are transformed through this layer.

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    matmul_QK = np.dot(Q, K.transpose())  # raw query-key dot products
    scale = np.sqrt(K.shape[-1])          # square root of d_k
    logits = matmul_QK / scale
    weights = softmax(logits)             # apply softmax to obtain attention weights
    output = np.dot(weights, V)           # weighted sum of the value vectors
    return output

# Example variables
d_model = 512
d_k = 64
length = 10  # sequence length
heads = 8

# Randomly initialized matrices for Q, K, V (a real model learns separate
# projections per head; one shared random set is used here for simplicity)
Q = np.random.rand(length, d_k)
K = np.random.rand(length, d_k)
V = np.random.rand(length, d_k)

# Multi-head attention output: concatenating 8 heads of d_k=64 already
# yields shape (length, d_model) = (10, 512), so no further reshape is needed
outputs = np.concatenate([scaled_dot_product_attention(Q, K, V) for _ in range(heads)], axis=-1)

print("Output shape:", outputs.shape)

In this example:

We use random matrices for Q, K, and V to simulate the operation.

The scaled_dot_product_attention function demonstrates the basic attention mechanism for one head.

The output from all heads is concatenated to form a final output that matches the original embedding size (d_model).

Post-layer Normalization and Residual Connections

After processing the input through the multi-head attention mechanism, the output undergoes post-layer normalization. This step involves adding a residual connection (the input of the sublayer) to the output of the attention mechanism before applying layer normalization:

layer_output = layer_normalization(outputs + input_embeddings)

This process helps in mitigating the vanishing gradient problem and promotes faster convergence.
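A minimal NumPy sketch of this Add & Norm step (assuming simple per-position normalization, without the learnable gain and bias a full implementation would include) might look like this:

import numpy as np

def layer_normalization(z, eps=1e-6):
    # Normalize each position's d_model features to zero mean and unit variance
    mean = z.mean(axis=-1, keepdims=True)
    std = z.std(axis=-1, keepdims=True)
    return (z - mean) / (std + eps)

seq_len, d_model = 10, 512
input_embeddings = np.random.rand(seq_len, d_model)   # sublayer input
outputs = np.random.rand(seq_len, d_model)            # attention sublayer output

# Residual connection followed by layer normalization
layer_output = layer_normalization(outputs + input_embeddings)
print(layer_output.shape)  # (10, 512)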

Summary of the Multi-Head Attention Sublayer

The multi-head attention sublayer is pivotal in enabling the Transformer to handle different representations of the data simultaneously, making it highly effective for NLP tasks. By examining multiple components of the sequence in parallel, the model can capture a richer understanding of the context, which is enhanced by the post-layer normalization ensuring stable and effective learning.

Step 1: Simplified Representation of Inputs in a Multi-Head Attention Mechanism

To thoroughly understand the multi-head attention sublayer of the Transformer model, we will begin by exploring its functionality using simplified Python code. This approach will help us grasp the intricate details of the attention mechanism. Here, we reduce the complexity by scaling down the input dimensions to make the computations and their outcomes easier to visualize and understand.

Setting Up the Environment

1. Prepare Your Workspace:

  • Ensure you have a Google account and access to Google Drive.
  • Save the notebook titled MHA_Sub_Layer.ipynb to your Google Drive. This notebook is available in the GitHub repository linked with this chapter.

2. Open the notebook in Google Colaboratory to start experimenting with the code.

3. Initialize Your Notebook:

  • import numpy as np
  • from scipy.special import softmax

Simplifying the Input Dimensions

In the traditional Transformer model, the dimension d_model is typically 512. For educational purposes, we'll reduce this to d_model = 4. This simplification allows us to focus on the underlying mechanics without getting overwhelmed by large vector sizes.

Example Input:

We will consider a sequence of three inputs, each with four dimensions.

Begin by importing the necessary Python libraries and setting up the initial parameters for our model demonstration.

print("Step 1: Input Representation: 3 inputs, each with d_model=4 dimensions")
x = np.array([[1.0, 0.0, 1.0, 0.0], # Input vector 1
[0.0, 2.0, 0.0, 2.0], # Input vector 2
[1.0, 1.0, 1.0, 1.0]]) # Input vector 3
print(x)

Output and Explanation

The console output confirms the structure and dimensions of our input matrix:

Step 1: Input Representation: 3 inputs, each with d_model=4 dimensions
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]

Each row in the matrix x corresponds to an input vector to the Transformer’s multi-head attention sublayer. In this case, the dimensionality reduction to four features per input helps in tracing the computation steps more transparently.

Next Steps: Incorporating Weight Matrices

With the inputs defined, the next step involves applying learned weight matrices to these inputs to project them into queries, keys, and values for the attention calculations. This process will be detailed in the subsequent sections, where we will:

· Introduce the weight matrices Q_w, K_w, and V_w.

· Multiply these matrices by the input vectors to generate queries, keys, and values.

· Utilize these components to demonstrate how the scaled dot-product attention combines them to form the output vectors.

This simplified setup not only makes the Transformer’s complex mechanisms more approachable but also lays a strong foundation for understanding more intricate configurations and optimizations used in real-world applications. The visualization and step-by-step explanation help demystify the initial processing that inputs undergo in a multi-head attention mechanism, setting the stage for a deeper exploration of the model’s capabilities.

Step 2: Initializing Weight Matrices for Queries, Keys, and Values

To proceed with the multi-head attention mechanism of the Transformer model, we need to initialize and apply three distinct weight matrices: one each for queries (Q), keys (K), and values (V). These matrices are crucial for transforming the input vectors into formats suitable for the subsequent attention calculations.

Designing the Weight Matrices

In standard implementations of the Transformer, as described by Vaswani et al. (2017), the dimensionality d_k of these projections is typically 64. However, for clarity and educational purposes, we'll reduce these dimensions to simplify the visualizations and calculations.

Specification: Dimensionality Reduction: Here, d_k = 3, allowing us to use smaller matrices that are easier to handle computationally and conceptually in our examples.

Matrix Dimensions: Each weight matrix has dimensions 4×3 (d_model × d_k), so multiplying the 3×4 input matrix x by a weight matrix yields a 3×3 matrix of queries, keys, or values.

Initializing the Weight Matrices

Let's initialize these matrices with specific values to see how they transform the input vectors x through multiplication. These matrices are crucial for defining how each input vector is interpreted by the model in terms of its query, key, and value representations.

  1. Query Weight Matrix (Q_w):

print("Step 2: Initializing weight matrices for 3 dimensions x d_model=4")
print("Query Weight Matrix (Q_w):")
w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])
print(w_query)

Output:

Query Weight Matrix (Q_w):
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]

This matrix transforms each input vector into a query vector, which represents the vector's request for what to focus on among the other vectors (the keys).

2. Key Weight Matrix (Kw):

print("Key Weight Matrix (K_w):")
w_key = np.array([[0, 0, 1],
[1, 1, 0],
[0, 1, 0],
[1, 1, 0]])
print(w_key)

Output:

Key Weight Matrix (K_w):
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]

The key weight matrix converts input vectors into keys which are used by queries to fetch relevant information.

3. Value Weight Matrix (Vw):

print("Value Weight Matrix (V_w):")
w_value = np.array([[0, 2, 0],
[0, 3, 0],
[1, 0, 3],
[1, 1, 0]])
print(w_value)

Output:

Value Weight Matrix (V_w):
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]

This matrix helps transform input vectors into values, which are aggregated based on the attention weights to form the output of the attention mechanism.

Applying the Weight Matrices

With these matrices initialized, the next step involves their application to the input vectors x. This process will generate the respective sets of query, key, and value vectors, which are crucial for attention.

# Multiplying the input vectors by the weight matrices to obtain Q, K, V
Q = np.dot(x, w_query)
K = np.dot(x, w_key)
V = np.dot(x, w_value)

This multiplication projects each input vector through each weight matrix, producing the query, key, and value vectors needed to compute the attention scores.

Summary of Step 1 and Step 2

The initialization and application of these weight matrices are fundamental steps in configuring the Transformer’s multi-head attention mechanism. They allow the model to project the input data into a space where the attention-driven relationships between different elements of the data can be effectively computed and utilized. The subsequent steps will build on these transformed vectors to execute the attention operations, ultimately leading to a powerful and flexible model capable of handling complex sequential tasks.

Step 3: Generating Query, Key, and Value Vectors Through Matrix Multiplication

With the weight matrices initialized, the next step in configuring the multi-head attention mechanism is to apply these matrices to the input vectors. This application transforms the input vectors into the query (Q), key (K), and value (V) vectors necessary for the attention calculations. Each vector serves a unique purpose in determining the output of the attention mechanism.

Matrix Multiplication Process

Calculating Queries, Keys, and Values:

Query Vectors (Q)

The query vectors are obtained by multiplying the input vectors x by the query weight matrix w_query. Queries are used to probe the keys.

print("Step 3: Matrix Multiplication to Obtain Q, K, V")
print("Query Vectors (Q): x * w_query")
Q = np.matmul(x, w_query)
print(Q)

Output:

Query Vectors (Q): x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]

Each row in Q represents the query vector corresponding to each input vector, reflecting how each input will interact with the keys.

Key Vectors (K)

Similarly, the key vectors are produced by multiplying the input vectors x by the key weight matrix w_key. Keys are matched against the queries to compute attention scores.

print("Key Vectors (K): x * w_key")
K = np.matmul(x, w_key)
print(K)

Output

Key Vectors (K): x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]

The key vectors K facilitate the retrieval of relevant information from the values based on the compatibility (similarity) with the queries.

Value Vectors (V)

The value vectors are calculated by multiplying the input vectors x by the value weight matrix w_value. Values are the actual content that will be retrieved and used to construct the output of the attention step.

print("Value Vectors (V): x * w_value")
V = np.matmul(x, w_value)
print(V)

Output

Value Vectors (V): x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]

Values V hold the data that will be compiled by the attention mechanism to form the final output based on the processed queries and keys.

Explanation and Significance

  • Queries (Q): Represent intentions or focus points of each input vector, seeking relevant data from the keys.
  • Keys (K): Provide a lookup functionality, matching themselves with queries to fetch corresponding values.
  • Values (V): Contain the actual content that will be aggregated based on the attention scores derived from the query-key interactions.

These components (Q, K, V) are foundational for the subsequent attention calculations, which determine the output based on both the content (the values) and the relevance between positions (computed from the queries and keys).

Summary of Step 3

The third step of our simplified Transformer model setup is crucial as it prepares the structured data (Q, K, V) necessary for executing the attention mechanism. This preparation involves transforming the input vectors into a format that aligns with the functional requirements of the multi-head attention sublayer, setting the stage for dynamic and context-aware processing in neural network models. With these transformations complete, the next stage involves utilizing these vectors to calculate the attention scores that dictate the output of the layer.

Step 4: Computing Scaled Attention Scores

Continuing from the initialization and application of the weight matrices, the next phase in our multi-head attention mechanism involves calculating the attention scores. These scores determine how much each element of the sequence should attend to every other element. The computation of these scores is critical as it directly influences the effectiveness of the attention mechanism in capturing relevant information.

Calculating Attention Scores

The attention scores are calculated using the query (Q) and key (K) vectors. The original Transformer model utilizes the scaled dot-product attention, which is a method of obtaining these scores by scaling the dot products of queries and keys. This method is efficient and scales well with larger dimensions.

Scaled Dot-Product Attention

Equation: The attention scores are computed as follows:

Attention Scores = QKᵀ / √d_k

Where:

QKᵀ is the dot product of the query matrix and the transpose of the key matrix.

√d_k is the scaling factor used to reduce the variance introduced by high dimensionality; in this simplified example it is approximated and rounded down for simplicity.

print("Step 4: Scaled Attention Scores Calculation")
k_d = 1 # square root of d_k=3 rounded down to 1 for simplicity in this example
attention_scores = (Q @ K.transpose()) / k_d
print(attention_scores)

Output

Scaled Attention Score Calculation
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]

These scores reflect the relative attention each element of the sequence should pay to every other element, with higher scores indicating greater attention.

Step 5: Applying Softmax to Normalize Attention Scores

After calculating the raw attention scores, the next step is to normalize these scores across each row (for each input vector) using the softmax function. This normalization allows the model to convert the scores into a probability distribution, which is crucial for weighting the values appropriately.

Softmax Normalization

The softmax function is applied to each row of the attention scores matrix. This step ensures that the attention scores sum to 1, making them interpretable as probabilities where higher values correspond to higher attention levels.

Applying Softmax Function:

print("Step 5: Applying Softmax to Normalize Attention Scores")
attention_scores[0] = softmax(attention_scores[0])
attention_scores[1] = softmax(attention_scores[1])
attention_scores[2] = softmax(attention_scores[2])
print("Softmax Attention Scores for Input #1:", attention_scores[0])
print("Softmax Attention Scores for Input #2:", attention_scores[1])
print("Softmax Attention Scores for Input #3:", attention_scores[2])

Output

Applying Softmax to Normalize Attention Scores
Softmax Attention Scores for Input #1: [0.06337894 0.46831053 0.46831053]
Softmax Attention Scores for Input #2: [6.03366485e-06 9.82007865e-01 1.79861014e-02]
Softmax Attention Scores for Input #3: [2.95387223e-04 8.80536902e-01 1.19167711e-01]

Each vector of scores now represents a probability distribution, indicating the proportion of attention that should be paid to each element based on the softmax-normalized scores.
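As a quick sanity check (continuing with the attention_scores matrix from the listing above), each row should now sum to approximately 1:

# Each row is a probability distribution, so the row sums are (close to) 1
print(attention_scores.sum(axis=1))                    # -> [1. 1. 1.]
print(np.allclose(attention_scores.sum(axis=1), 1.0))  # -> True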

Summary of Step 4 and Step 5

The transition from raw attention scores to normalized softmax scores is a pivotal step in the Transformer’s attention mechanism. It refines the model’s focus, emphasizing relevant parts of the input data based on the computed probabilities. This attention mechanism allows the Transformer to dynamically prioritize different parts of the input data, a fundamental aspect of its superior performance in various tasks.

Moving forward, these normalized scores will be used to weight the value vectors, leading to the final output of the multi-head attention sublayer which aggregates relevant information across the sequence. This output will then proceed to further layers or sublayers depending on the specific architecture of the Transformer model being employed.

Step 6: Computing Final Attention Representations

After normalizing the attention scores using softmax, the next step in the Transformer’s multi-head attention mechanism is to compute the weighted sum of value vectors (V), which forms the final output for each input vector. This process leverages the attention scores to highlight relevant information from the input data.

Detailing the Calculation Process

Combining Attention Scores and Value Vectors: To synthesize the final attention output for each input vector, we multiply each normalized attention score by the corresponding value vector. This operation scales the value vectors based on how much each vector should be attended to according to the attention scores.

Calculating Weighted Values

Attention Weighting for the First Input Vector: Multiply each element of the attention score vector for the first input by each corresponding value vector. This highlights the parts of the value vectors that are most relevant according to the attention mechanism.

print("Step 6: Finalizing Attention by Weighting Value Vectors with Scores")
print("Value Vectors (V):")
print(V[0])
print(V[1])
print(V[2])

print("Normalized Attention Scores for Input #1:")
print(attention_scores[0])

print("Weighted Attention Contributions for Input #1:")
attention_contributions = np.array([attention_scores[0][i] * V[i] for i in range(len(V))])
print("Attention Component 1:", attention_contributions[0])
print("Attention Component 2:", attention_contributions[1])
print("Attention Component 3:", attention_contributions[2])

Output

Value Vectors (V):
[1. 2. 3.]
[2. 8. 0.]
[2. 6. 3.]
Normalized Attention Scores for Input #1:
[0.06337894 0.46831053 0.46831053]
Weighted Attention Contributions for Input #1:
Attention Component 1: [0.06337894 0.12675788 0.19013681]
Attention Component 2: [0.93662106 3.74648425 0.        ]
Attention Component 3: [0.93662106 2.80986319 1.40493159]

Each “Attention Component” vector represents the scaled values for each value vector based on the input’s respective attention score.

These contributions are what the model considers when synthesizing the final output from the multi-head attention sublayer.

Summing the Weighted Contributions

After calculating the weighted contributions from each value vector, the final step for this input is to sum these contributions. This summation synthesizes the individual weighted elements into a single vector that represents the output of the attention mechanism for the first input vector.

final_attention_output = np.sum(attention_contributions, axis=0)
print("Final Attention Output for Input #1:", final_attention_output)

Output

Final Attention Output for Input #1: [1.93662106 6.68310531 1.59506841]

This final output vector is a comprehensive representation that combines relevant features from across the input sequence, weighted according to the calculated attention scores.

Extending the Calculation to Other Inputs

The process described above for the first input vector (x1) should be repeated identically for the other input vectors in the sequence (x2, x3, etc.). This ensures that each piece of input data is processed with the same level of attentional detail, allowing the Transformer to effectively handle sequences of varying lengths and complexities.

Summary of Step 6

Step 6 is critical as it translates the normalized attention scores into a practical output that significantly impacts the subsequent layers of the Transformer. By effectively synthesizing information from across the input sequence, the model can maintain a high level of contextual awareness, essential for tasks such as translation, text summarization, and more sophisticated natural language understanding challenges. This step concludes with a set of output vectors that are ready to be processed further, either in additional attention layers or other types of neural network layers, depending on the architecture of the specific Transformer model being utilized.

Step 7: Summing Up the Weighted Attention Contributions

In the multi-head attention mechanism, after obtaining the weighted contributions of the value vectors based on the attention scores, the next step is to sum these contributions to produce a consolidated output vector for each input. This summed vector effectively synthesizes all relevant information as dictated by the attention mechanism.

Calculating the Final Output for Each Input

The weighted value vectors for each input are summed to yield the final attention output for that input. This process aggregates the influence of all parts of the input sequence into a single vector that represents the combined contextual insights gained through the attention mechanism.

Implementation for Input #1

Summing the Attention Contributions: Each contribution vector (previously calculated as the product of attention scores and value vectors) is summed to form the final output vector for the first input vector.

print("Step 7: Summing the Weighted Attention Contributions for Input #1")
attention_input1 = attention_contributions[0] + attention_contributions[1] + attention_contributions[2]
print("Final Output Vector for Input #1:", attention_input1)

Output

Final Output Vector for Input #1: [1.93662106 6.68310531 1.59506841]

This vector [1.93662106 6.68310531 1.59506841] is the first line of the output matrix for the attention mechanism, representing a synthesis of all relevant value information as directed by the input’s attention distribution.

Extending the Process to Other Inputs

Following the same procedure, we compute the final outputs for the remaining inputs in the sequence. This systematic approach ensures that each input vector is processed uniformly, maintaining consistency across the computation.

Generalization to Multiple Inputs

The steps executed for input #1 are replicated for each subsequent input vector (input #2, input #3, etc.), employing the same weight matrices and softmax-normalized attention scores to generate each input’s respective output vector.

Example Calculation for All Inputs:

# Assuming attention_scores and V are defined for all inputs similarly
print("Step 8: Computing Outputs for All Inputs")
all_attention_outputs = []
for i in range(len(V)):  # V holds one value vector per input
    attention_contributions = np.array([attention_scores[i][j] * V[j] for j in range(len(V))])
    final_output = np.sum(attention_contributions, axis=0)
    all_attention_outputs.append(final_output)
    print("Final Output Vector for Input #{}:".format(i + 1), final_output)

Output

Final Output Vector for Input #1: [1.93662106 6.68310531 1.59506841]

Final Output Vector for Input #2: [values]

Final Output Vector for Input #3: [values]

Each output vector is a distillation of the entire sequence’s relevant contextual data as applicable to each specific input.
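For reference, the per-input loop above can be collapsed into a single matrix product, which is exactly what the scaled_dot_product_attention function at the start of this article does; continuing with the notebook's attention_scores and V:

# Equivalent vectorized form: row i of the result is the final output for input #i
attention_outputs = attention_scores @ V   # (3, 3) @ (3, 3) -> (3, 3)
print(attention_outputs)
# The first row matches the step-by-step result for Input #1:
# [1.93662106 6.68310531 1.59506841]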

Step 9: Aggregation of Multi-Head Attention Outputs

After computing the outputs from one attention head, the model aggregates outputs from all heads. This aggregation enhances the robustness and depth of the contextual information captured.

Simulating Outputs from Multiple Heads:

# Example of generating outputs for multiple heads
print("Step 9: Generating Outputs for Multiple Attention Heads")
heads_output = np.random.random((3, 8, 64))  # Simulated outputs for 3 inputs across 8 heads of 64 dimensions
# Concatenate the 8 head outputs for each input, restoring d_model = 8 * 64 = 512
final_multi_head_output = np.concatenate([heads_output[:, h, :] for h in range(heads_output.shape[1])], axis=-1)
print("Aggregated Multi-Head Output Shape:", final_multi_head_output.shape)

Output

Aggregated Multi-Head Output Shape: (3, 512)

The shape (3, 512) indicates that the outputs for all three inputs have been concatenated across eight heads, restoring each input's dimensionality to d_model = 512, consistent with the Transformer architecture specifications.

Summary of Steps 7 to 9

Steps 6 and 7 crucially form the core of the Transformer’s attention mechanism, efficiently summarizing the sequence data into usable outputs that feed into subsequent layers. Step 9’s aggregation of multiple heads’ outputs ensures a comprehensive contextual understanding, pivotal for tasks requiring nuanced comprehension of complex input sequences. This process, culminating in the reconstitution of the full dimensionality of the inputs, sets the stage for the final transformations in the Transformer model’s architecture.

Summary

This article explored the architecture and functionality of the multi-head attention sublayer in the Transformer model, highlighting its role in enhancing the model's ability to process sequential data through parallel attention mechanisms. The step-by-step walkthrough, from simplified input representations and weight matrices through scaled dot-product attention, softmax normalization, and the aggregation of multiple heads, showed how the sublayer builds a context-aware representation of the sequence, with post-layer normalization and residual connections supporting stable and effective learning.

Next

In the intricate architecture of the Transformer, Post-Layer Normalization (Post-LN) plays a pivotal role in stabilizing the learning process and ensuring the model’s robust performance across different linguistic tasks. This component is critical in refining the outputs from both the attention sublayers and the feedforward sublayers within the encoder and decoder modules of the Transformer.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

Rothman, D. (2024). Transformers for Natural Language Processing and Computer Vision. Packt Publishing.

