What is Query, Key, and Value (QKV) in the Transformer Architecture and Why Are They Used?
An analysis of the intuition behind the notion of Query, Key, and Value in the Transformer architecture and why it is used.
TL;DR
QKV is used to mimic a “search” procedure that finds pairwise similarity measures between tokens. These similarities then act as weights in a weighted average of the tokens’ meanings, making each token’s representation contextualized.
Query acts as the part of a token’s meaning that is searched with, for similarity. Key acts as the part of all other tokens’ meanings that the Query is compared against. This comparison is done by the dot product of their vectors, which yields the pairwise similarity measures; these are then turned into pairwise weights by normalizing (i.e. Softmax). Value is the part of a token’s meaning that is combined in the end using the found weights.
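To make the dot-product-then-softmax-then-weighted-average pipeline concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes, the random toy inputs, and the helper names (`softmax`, `attention`) are my own illustrative choices, not from the original paper’s code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Pairwise similarity: dot product of every query with every key,
    # scaled by sqrt(d_k) as in the Transformer paper.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # shape: (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Weighted average of the value vectors: contextualized representations.
    return weights @ V

# Toy example: 4 tokens, 8-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Note that each output row is just a convex combination of the rows of `V`, with the mixing proportions decided entirely by how well that token’s Query matched every Key.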
Lastly, you might have noticed me saying each of QKV is “part of a token’s meaning”. By that, I mean each of Q, K, and V is obtained by a learned transformation of the token’s initial embedding. Hence, each can learn to extract a specific, different meaning from the embedding. And by employing multi-head attention, we allow multiple different…