Vision Transformers (Part 2): Entering the Depth of “Image is Text”
Before you continue reading, make sure to read the brief introduction in Vision Transformer Part 1 here.
Now that you have a basic understanding of the Vision Transformer architecture, let's dig deeper into how it works.
Here is a small refresher of the architecture from the paper.
Each image is split into multiple "patches", small pieces of the image. Each patch is linearly projected into the model as an embedding: a vector that encodes the content of that part of the image (the Patch Embedding), to which is added a vector that encodes its position within the image (the Position Embedding). The model therefore receives the image as a sequence of these vectors, much like a sentence of word embeddings.
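The patching and embedding step above can be sketched in a few lines of NumPy. The sizes here (a 32x32 RGB image, 8x8 patches, 64-dimensional embeddings) are toy assumptions for illustration, not the paper's configuration, and the projection and position matrices are random stand-ins for what would be learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): a 32x32 RGB image, 8x8 patches -> 16 patches.
image = rng.standard_normal((32, 32, 3))
patch_size = 8
embed_dim = 64

# 1. Split the image into non-overlapping patches and flatten each one.
patches = []
for i in range(0, 32, patch_size):
    for j in range(0, 32, patch_size):
        patches.append(image[i:i + patch_size, j:j + patch_size].reshape(-1))
patches = np.stack(patches)                      # (16, 8*8*3) = (16, 192)

# 2. Linear projection into the embedding space (learned in a real model).
W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
patch_embeddings = patches @ W_proj              # (16, 64)

# 3. Add position embeddings so the model knows where each patch came from.
position_embeddings = rng.standard_normal((16, embed_dim)) * 0.02
tokens = patch_embeddings + position_embeddings  # the input sequence
```

The result is a sequence of 16 token vectors, which is exactly what the Transformer encoder expects as input.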
Inside the Encoder, the "Attention" mechanism finds the "context" of each patch: for every patch, an attention weight is computed from the association between its Query vector and the Key vectors of all patches, and these weights are then multiplied by the Value vectors to obtain the attention output. In Multi-Head Attention, several such attention "heads" run in parallel over all patches at once, and their outputs are concatenated into a single representation. This yields a general representation of the image that captures the information in each segment. Do note that the Encoder is essentially the same as the original Transformer encoder from NLP: unlike the CNN architectures traditionally used in vision, it contains no convolutional layers at all.
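The Query/Key/Value mechanics described above can be sketched as scaled dot-product attention with multiple heads. Again, all sizes and weight matrices here are illustrative assumptions (16 tokens, 64 dimensions, 4 heads), not the actual ViT configuration, and the learned per-head projections are replaced with random matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention weights: similarity of each query with every key,
    # scaled by sqrt(d_k) and normalized so each row sums to 1.
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V  # weighted sum of value vectors

rng = np.random.default_rng(0)
n_tokens, embed_dim, n_heads = 16, 64, 4
head_dim = embed_dim // n_heads
x = rng.standard_normal((n_tokens, embed_dim))  # token embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own learned Q/K/V projections (random stand-ins here).
    Wq = rng.standard_normal((embed_dim, head_dim)) * 0.02
    Wk = rng.standard_normal((embed_dim, head_dim)) * 0.02
    Wv = rng.standard_normal((embed_dim, head_dim)) * 0.02
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

# Concatenate the heads back to the model dimension.
out = np.concatenate(heads, axis=-1)  # (16, 64)
```

Because every head attends over all tokens independently, the heads can be computed in parallel, which is one of the main efficiency advantages over sequential models.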
For object classification, however, the only embedding that matters is the Class Embedding. Once training is finished, it contains the information about what the object actually is. The Encoder outputs one embedding per input token (every patch plus the class token), but we take only the very first embedding, the class token, and feed it to a classification head to make predictions. This head is usually an MLP, but other classifiers work too; even a simple linear layer followed by a SoftMax is applicable.
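The classification step can be sketched as follows. The encoder output here is random and the "MLP Head" is reduced to a single linear layer for brevity; the token count (1 class token + 16 patch tokens) and the 10 output classes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
embed_dim, n_classes = 64, 10

# Hypothetical encoder output: 1 class token + 16 patch tokens = 17 embeddings.
encoder_output = rng.standard_normal((17, embed_dim))

# Only the first embedding (the class token) is used for classification;
# the patch embeddings are simply discarded at this stage.
class_embedding = encoder_output[0]

# A single linear layer as a stand-in for the MLP Head, then SoftMax.
W_head = rng.standard_normal((embed_dim, n_classes)) * 0.02
probs = softmax(class_embedding @ W_head)  # class probabilities, sum to 1

predicted_class = int(np.argmax(probs))
```

In a trained model the class token has aggregated information from every patch through the attention layers, which is why this single vector is enough to classify the whole image.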
More in-depth explanation of:
Attention: "Attention Is All You Need" (https://arxiv.org/abs/1706.03762)
Vision Transformer: https://www.v7labs.com/blog/vision-transformer-guide