Transformers QK, OV, induction circuit

Stephen Jonany
3 min read · Feb 11, 2024

What. Some notes that helped me make sense of the terms QK circuit, OV circuit, and induction circuit from Induction heads — illustrated, which relates to the Anthropic papers A Mathematical Framework for Transformer Circuits and In-context Learning and Induction Heads.

TL;DR

  • QK circuit answers: which state indices should I attend to given that I have this content?
  • OV circuit answers: given these relevant state contexts for the current position, how should I update the current position’s State?
  • Induction circuit: composed of two heads across different layers. The first head writes the new field: “the token before me is <x>”. The second head writes the new field: “the token after me is <x>” by relying on the first head’s field from a previous token position.

Programming setup

I’ll opt for a programming-style way of understanding. Following the previous post, we treat the residual stream as an array State[N_TOKENS]. The syntax that I’ll use is pseudo-python.

Simplifications. We will make some simplifications so we can follow along with Induction heads — illustrated: (1) treat State as a python dictionary, (2) assume QK and OV each work on 1 State field at a time, (3) assume attention returns the State of exactly 1 token position, as opposed to a mixture of States.
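
To make the later snippets concrete, here is a minimal sketch of the State representation I’m assuming. The post treats State as a python dictionary of named fields, but since the snippets below use attribute access (state.cur_token), a small dataclass with optional fields is the closest runnable equivalent. This is my own glue code, not something from the original posts.

from dataclasses import dataclass
from typing import Optional

# My own sketch, not from the original posts: one State per residual-stream position,
# holding the fields that the layers below read and write.
@dataclass
class State:
    cur_token: str
    cur_pos: int
    prev_token: Optional[str] = None  # written by layer 0 below
    next_token: Optional[str] = None  # written by layer 1 below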

QK circuit answers: which state index should I attend to given that I have this content? If it’s a python method, its signature would look like so

# len(residual_stream) == cur_pos + 1; the return value is the state_to_attend_to
def qk(cur_pos: int, residual_stream: list[State]) -> State: ...

OV circuit answers: given the most relevant state context for the current position, how should I update the current position’s State? If it’s a python method, its signature would look like so

# the return value is the updated_cur_pos_state
def ov(state_to_attend_to: State, cur_pos_state: State) -> State: ...

Induction head as an example

Let’s apply the above definitions to Induction heads — illustrated. There are two layers, each with 1 head, and each head has its own QK and OV implementation. Let’s implement them! Note: in that post’s diagrams, each blob [I am “D”, at posn=0] is a single State, and each colored row corresponds to a single State field.

Embedding layer outputs: State[], where State[0] = {cur_token="D", cur_pos=0}
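
A minimal sketch of that embedding step, assuming the State dataclass above (my own glue code; a real embedding layer produces vectors, not labeled fields):

# My own sketch: "embed" each token into a State that carries its identity and position.
def embed(tokens: list[str]) -> list[State]:
    return [State(cur_token=t, cur_pos=i) for i, t in enumerate(tokens)]

# e.g. embed(["D", "U", "R", "D"])[0] is State(cur_token="D", cur_pos=0, prev_token=None, next_token=None)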

Layer 0, aka the “prev token layer”, saves the previous token’s cur_token as a new field called prev_token.

# qk returns State[cur_pos-1], but we write it so we can see the decomposed q and k logic
def qk(cur_pos: int, residual_stream: list[State]) -> State:  # returns state_to_attend_to
    q = residual_stream[cur_pos].cur_pos
    ks = [residual_stream[j].cur_pos for j in range(len(residual_stream))]
    # In the model, this "which k is right before me?" check is done as a matrix multiplication
    best_j = next(j for j in range(len(ks)) if ks[j] == q - 1)
    return residual_stream[best_j]

# ov writes a new field: prev_token, based on the attended state = the previous token's state
def ov(state_to_attend_to: State, cur_pos_state: State) -> State:  # returns updated_cur_pos_state
    updated_cur_pos_state = cur_pos_state
    updated_cur_pos_state.prev_token = state_to_attend_to.cur_token
    return updated_cur_pos_state
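
To apply a head at every position we also need a small driver loop. This is my own glue code, not from the original posts: it assumes the qk/ov sketches in this post, and it simply skips positions where qk finds no match, whereas a real head would spread softmax attention over all earlier positions.

# My own glue code: apply one head at every position, passing only the causal prefix
# residual_stream[:cur_pos + 1] so that a position can never attend to the future.
def run_head(residual_stream: list[State], qk, ov) -> list[State]:
    for cur_pos in range(len(residual_stream)):
        try:
            state_to_attend_to = qk(cur_pos, residual_stream[:cur_pos + 1])
        except StopIteration:
            continue  # no earlier position matched this head's attention pattern
        residual_stream[cur_pos] = ov(state_to_attend_to, residual_stream[cur_pos])
    return residual_stream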

Layer 1, aka the “induction layer”, finds an earlier token whose <prev, next> pattern has prev matching my current content, then writes matched_pattern.next as a new field called next_token.

# qk returns State[j] whose State[j].prev_token == cur_token.
# Aka: the State that might know the next token that comes after me.
def qk(cur_pos: int, residual_stream: list[State]) -> State:  # returns state_to_attend_to
    q = residual_stream[cur_pos].cur_token
    # This is called K-composition, because K relies on prev_token, which is
    # a field written by a previous layer
    ks = [residual_stream[j].prev_token for j in range(len(residual_stream))]
    best_j = next(j for j in range(len(ks)) if ks[j] == q)
    return residual_stream[best_j]

# ov writes a new field: next_token, based on the attended state = the matching pattern's state
def ov(state_to_attend_to: State, cur_pos_state: State) -> State:  # returns updated_cur_pos_state
    updated_cur_pos_state = cur_pos_state
    updated_cur_pos_state.next_token = state_to_attend_to.cur_token
    return updated_cur_pos_state
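
Putting the two layers together on a toy prompt (again my own glue code: it assumes the embed and run_head helpers sketched earlier, and that the layer 0 and layer 1 functions above are renamed qk_layer0/ov_layer0 and qk_layer1/ov_layer1 so both can coexist):

# My own end-to-end sketch of the induction circuit on a prompt with a repeated token.
tokens = ["D", "U", "R", "D"]
stream = embed(tokens)
stream = run_head(stream, qk_layer0, ov_layer0)  # layer 0 fills in prev_token at each position
stream = run_head(stream, qk_layer1, ov_layer1)  # layer 1 fills in next_token where a pattern matches
print(stream[-1].next_token)  # "U": the last "D" recalls that "D" was previously followed by "U"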

Misc lingo

K-composition, Q-composition. This is about whether the q or the ks of the final layer rely on a field written by layer 0. In K-composition, it is the K of layer 1 that depends on layer 0’s new field “prev_token”, while the Q of layer 1 doesn’t compose: it just relies on the base field “cur_token”. In Q-composition, on the other hand, it is the Q of layer 1 that depends on a field written by layer 0 (in the illustrated post’s diagrams, the purple field “I was also at posn=0”).
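
A minimal side-by-side sketch of the distinction, reusing the names from the snippets above (field_from_layer_0 is a placeholder name of mine, not from the posts):

# K-composition (the induction layer above): the keys read a field written by layer 0.
#   q  = residual_stream[cur_pos].cur_token        # base field
#   ks = [s.prev_token for s in residual_stream]   # composed field, written by layer 0
#
# Q-composition (the alternative): the query is what reads a field written by layer 0.
#   q  = residual_stream[cur_pos].field_from_layer_0  # composed field (placeholder name)
#   ks = [... base fields only ...]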
