Self attention — a clever compromise
KION KIM

In ‘Self Attention’, in the ‘Sentence Representation’ graph, if α_{it} is the weight for token ‘t’ in sentence ‘i’, shouldn’t the ‘green’ attention layer say

α_{i1} … α_{it} … α_{iT}

instead of

α_{t1} … α_{1t} … α_{tT}?
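For what it’s worth, a minimal sketch (not from the original post; the scoring vector `w` and all shapes are illustrative assumptions) of softmax attention over each sentence’s token states shows why the first indexing reads naturally: row i of the weight matrix holds (α_{i1}, …, α_{iT}), the weights over the T tokens of sentence i.

```python
import numpy as np

# Illustrative shapes: i indexes sentences, t indexes tokens (assumed values)
num_sentences, T, d = 2, 5, 8
H = np.random.randn(num_sentences, T, d)  # hidden state h_{it} for each token
w = np.random.randn(d)                    # hypothetical learned scoring vector

scores = H @ w                            # shape (num_sentences, T)
# Softmax over the token axis t, so each sentence's weights sum to 1
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# alpha[i] = (α_{i1}, ..., α_{iT}): the weights for tokens t = 1..T of sentence i
sentence_repr = (alpha[..., None] * H).sum(axis=1)  # weighted sum over tokens
print(alpha.shape, sentence_repr.shape)   # (2, 5) (2, 8)
```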