Self-attention

Saba Hesaraki
2 min read · Oct 18, 2023

What does Self-attention mean in computer vision?

Self-attention, most often implemented as scaled dot-product attention, is a fundamental concept in deep learning and natural language processing (NLP) that has also found applications in computer vision. It is a mechanism that lets a model weigh and combine different parts of its input when making predictions, rather than relying on fixed, handcrafted patterns or filters.

In the context of computer vision, self-attention is often used in models like transformers to process image data. Here’s how it works:

  1. Input Data: In computer vision, the input is typically a grid of features, where each feature vector corresponds to a specific location (e.g., a pixel or an image patch).
  2. Query, Key, and Value: Self-attention computes three sets of vectors, called Query, Key, and Value, each derived from the input data by a learned projection.
  • Query: Vectors that represent positions or elements in the input. They are used to determine which parts of the input are relevant for a given position.
  • Key: Vectors, also derived from the input, that are compared against Queries to measure the compatibility or similarity between positions.
  • Value: Vectors that contain the actual information to be attended to and aggregated.
  3. Attention Scores: For each position, the mechanism compares that position's Query vector with the Key vectors of all positions, typically via a scaled dot product. The resulting attention scores indicate how much attention each position should pay to every other position.
  4. Weighted Sum: The attention scores (after a softmax) are used as weights in a sum of the Value vectors; this weighted sum is the output for each position (see the sketch after this list).
  5. Multiple Heads: In many models, self-attention is applied with several independent sets of Query, Key, and Value projections, known as attention heads. This allows the model to capture different relationships and patterns in the data.
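
To make these steps concrete, here is a minimal sketch of multi-head scaled dot-product self-attention in PyTorch. The class name SelfAttention, the tensor shapes, and the embed_dim/num_heads values are illustrative assumptions, not code from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal multi-head scaled dot-product self-attention (illustrative sketch)."""
    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Learned projections that produce Query, Key, and Value from the input.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, embed_dim), e.g. one vector per image patch.
        b, n, d = x.shape
        # 1) Derive Query, Key, Value and split them into attention heads.
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # 2) Attention scores: scaled dot product between each Query and all Keys.
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (b, heads, n, n)
        weights = F.softmax(scores, dim=-1)
        # 3) Weighted sum of the Values, giving one output vector per position.
        out = weights @ v                                           # (b, heads, n, head_dim)
        # 4) Merge the heads back together and project to the output dimension.
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)

# Example: 196 patch tokens (a 14x14 grid), each a 64-dimensional feature vector.
tokens = torch.randn(2, 196, 64)
attended = SelfAttention(embed_dim=64, num_heads=4)(tokens)
print(attended.shape)  # torch.Size([2, 196, 64])
```

The division by the square root of the head dimension is the "scaled" part of scaled dot-product attention; it keeps the dot products in a range where the softmax does not saturate.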

The output of the self-attention mechanism is then used as input for subsequent layers in the neural network, allowing the model to learn to focus on different parts of the input data based on the context of the task.
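As a rough illustration of that composition, the block below wraps the SelfAttention sketch from above in a simplified transformer encoder layer, with residual connections and a small feed-forward network; the layer sizes and names are again assumptions for illustration, not a specific published architecture.

```python
class TransformerBlock(nn.Module):
    """Simplified encoder block: self-attention followed by an MLP, with residuals."""
    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = SelfAttention(embed_dim, num_heads)   # sketch from the previous example
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each position first attends over all others, then is transformed independently.
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```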

In computer vision, self-attention has been employed in vision transformer models, such as the “Vision Transformer” (ViT), which have shown impressive results in tasks like image classification, object detection, and segmentation. By considering interactions between different positions in the input data, self-attention helps these models capture long-range dependencies and relationships in images, making them powerful tools for a wide range of computer vision tasks.
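
For a sense of how an image becomes the grid of positions that self-attention operates on, a ViT-style model typically splits the image into fixed-size patches and embeds each patch as one token, for example with a strided convolution. The sketch below continues the earlier examples; the 16x16 patch size and 64-dimensional embedding are assumptions, and a real ViT also adds positional embeddings and a class token.

```python
# ViT-style patch embedding (sizes are illustrative assumptions).
patch_embed = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=16, stride=16)

image = torch.randn(2, 3, 224, 224)          # batch of RGB images
patches = patch_embed(image)                 # (2, 64, 14, 14): one 64-d vector per 16x16 patch
tokens = patches.flatten(2).transpose(1, 2)  # (2, 196, 64): sequence of patch tokens
encoded = TransformerBlock(embed_dim=64, num_heads=4)(tokens)  # attention over all patches
```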
