Invariant Point Attention in AlphaFold 2

Jude Wells
Dec 8, 2021


In the supplementary materials for the recently published AlphaFold 2 paper, the DeepMind team outline their algorithm for invariant point attention (IPA). This blog post explores the concept and tries to understand how it differs from the standard attention mechanism in transformer networks. It is written as a work in progress towards understanding the mechanism and should not be considered a guide.

The goal of creating invariant representations in protein structure prediction is to build in the useful inductive bias that any two protein structures related by a rigid Euclidean transform (a rotation and a translation) are essentially the same. Put another way: if we rotate and translate an entire predicted structure, it remains equally valid. When representing aspects of the protein structure, we would like those representations to remain unchanged if a global rigid transformation were applied to all of the elements that make up the spatial encoding.

Pseudo-code for invariant point attention taken from the supplementary materials.

The implementation has a lot in common with the typical attention mechanism: it creates queries, keys, and values for each token, where each of these vectors is the result of applying a linear transform to sᵢ (algorithm line 1).
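Schematically, line 1 produces the standard attention inputs per head h with learned linear maps of sᵢ (this is my paraphrase of the supplement's notation; the linear layers there carry no bias term):

```latex
q_i^{h},\; k_i^{h},\; v_i^{h} = \mathrm{LinearNoBias}(s_i)
```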

The vector sᵢ is referred to as the “single sequence representation”; it is a 384-dimensional vector created from the first row of the multiple sequence alignment (MSA), the first row being the sequence of the target protein.

Looking at the algorithm above, we note that the only input representations containing spatial information that would change under a global rigid transformation are the Tᵢ: these are 4x4 matrices that encode a rotation and a translation for each residue. The other inputs, sᵢ and zᵢⱼ, are the sequence representation and the pair representation respectively. (The pair representation encodes information about the distances between pairs of residues but not their absolute spatial positions, so it is invariant to global transforms.)
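To make the role of the Tᵢ concrete, here is a minimal sketch, in the 4x4 homogeneous-matrix framing used in this post rather than AlphaFold's actual code, of packing a rotation and a translation into a single matrix and applying it to a 3D point:

```python
import numpy as np

def rigid_transform(R, t):
    """Pack a 3x3 rotation R and a 3-vector translation t into a 4x4 matrix
    that acts on homogeneous coordinates [x, y, z, 1]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply(T, x):
    """Apply the transform to a 3D point: T o x = R x + t."""
    return T[:3, :3] @ x + T[:3, 3]

# A toy residue frame: rotate 30 degrees about z, then translate.
theta = np.pi / 6
R_i = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
T_i = rigid_transform(R_i, np.array([1.0, -2.0, 0.5]))

x_local = np.array([0.3, 0.1, 2.0])  # a point in the residue's local frame
print(apply(T_i, x_local))           # the same point expressed in the global frame

# Composing with a global transform is just matrix multiplication: T_global @ T_i.
```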

We want the encoded vectors to contain information from the structural representations Tᵢ, and we need a mechanism by which the attention weights can be differentially influenced by Tᵢ and Tⱼ.
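This is done by producing an additional set of 3D “point” queries, keys, and values from sᵢ (line 2 of the algorithm, again in my paraphrased notation), which the Tᵢ can then place into the global frame:

```latex
\vec{q}_i^{\,hp},\; \vec{k}_i^{\,hp},\; \vec{v}_i^{\,hp} = \mathrm{LinearNoBias}(s_i),
\qquad \vec{q}_i^{\,hp},\, \vec{k}_i^{\,hp},\, \vec{v}_i^{\,hp} \in \mathbb{R}^{3}
```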

The attention weights are computed in line 7 of the algorithm:
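Paraphrasing the supplement (and glossing over the exact values of the scalar weights), the weight has roughly this form:

```latex
a_{ij}^{h} = \operatorname*{softmax}_{j}\left(
  w_L \left(
    \frac{1}{\sqrt{c}}\, q_i^{h\,\top} k_j^{h}
    + b_{ij}^{h}
    - \frac{\gamma^{h} w_C}{2}
      \sum_{p} \left\lVert T_i \circ \vec{q}_i^{\,hp} - T_j \circ \vec{k}_j^{\,hp} \right\rVert^{2}
  \right)
\right)
```

Here w_L and w_C are fixed normalising constants and γʰ is a learned per-head weight that controls how strongly the point-distance term influences the attention.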

Much of the equation is that of a typical self-attention mechanism: the weight is proportional to the dot product of a query vector q and a key vector k. Information from the pair representation affects the weight via the addition of b, a scalar computed from the pairwise representation of residues i and j. As we said before, careful consideration must be given to how information from the matrices T can be incorporated into the attention weights while still preserving invariance: the attention weights should remain unchanged if all of the matrices T are subject to the same rigid transformation. The way this is achieved can be seen in the right portion of the equation above.

We generate additional query and key vectors for each residue, which are multiplied by their respective T matrices. Taking the Euclidean distance between these two transformed vectors ensures that the value is invariant to global rigid transformations, as shown in the proof below:
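In essence (my own rendering of the argument), writing the global transform as T_global ∘ x = R_global x + t_global:

```latex
\begin{aligned}
\left\lVert T_{\mathrm{global}} \circ \left( T_i \circ \vec{q}_i \right) - T_{\mathrm{global}} \circ \left( T_j \circ \vec{k}_j \right) \right\rVert
&= \left\lVert R_{\mathrm{global}} \left( T_i \circ \vec{q}_i \right) + \vec{t}_{\mathrm{global}}
   - R_{\mathrm{global}} \left( T_j \circ \vec{k}_j \right) - \vec{t}_{\mathrm{global}} \right\rVert \\
&= \left\lVert R_{\mathrm{global}} \left( T_i \circ \vec{q}_i - T_j \circ \vec{k}_j \right) \right\rVert \\
&= \left\lVert T_i \circ \vec{q}_i - T_j \circ \vec{k}_j \right\rVert
\end{aligned}
```

The translation cancels in the difference, and the rotation drops out because rotations preserve Euclidean norms.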

So far we have established that the attention weights are invariant. It remains to be shown that the final representation is also invariant:

In line 10 of the algorithm, we see that the final representation has a component that is dependent on the Euclidean transforms Tᵢ.
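Schematically, that step looks like the following (my paraphrase of line 10, with one 3D output point per head h and point index p):

```latex
\vec{o}_i^{\,hp} = T_i^{-1} \circ \sum_j a_{ij}^{h} \left( T_j \circ \vec{v}_j^{\,hp} \right)
```

Each value point is placed into the global frame by Tⱼ, the attention-weighted average is taken there, and the result is mapped back into the local frame of residue i by Tᵢ⁻¹.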

The Tᵢ are 4x4 matrices that encode a rotation and a translation, which means their values would be affected by global rotations and translations.

In the supplementary materials, a proof of this invariance is provided.

The proof shows that applying a global transformation T_global results in no change to the representation.
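As a sanity check, here is a small numerical sketch (my own code, simplified to a single head and a single point per residue, not AlphaFold's implementation) that builds random rigid frames, computes a line-10 style weighted point output, and confirms that applying the same global transform to every frame leaves the result unchanged:

```python
import numpy as np

def random_rigid(rng):
    """A random 4x4 rigid transform: a proper rotation (via QR) plus a translation."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # flip one axis so det(Q) = +1
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = Q, rng.normal(size=3)
    return T

def apply(T, x):
    """Apply a 4x4 rigid transform to a 3D point: T o x = R x + t."""
    return T[:3, :3] @ x + T[:3, 3]

def output_point(T_i, frames, attn, value_points):
    """Line-10 style output: place each value point into the global frame,
    take the attention-weighted average there, then map the result back
    into residue i's local frame with the inverse transform."""
    placed = np.stack([apply(T_j, v_j) for T_j, v_j in zip(frames, value_points)])
    weighted = (attn[:, None] * placed).sum(axis=0)
    return apply(np.linalg.inv(T_i), weighted)

rng = np.random.default_rng(0)
n_res = 5
frames = [random_rigid(rng) for _ in range(n_res)]
value_points = rng.normal(size=(n_res, 3))
# Softmax weights sum to 1; this matters, because the global translation only
# cancels when the weighted sum is an affine combination of the points.
attn = np.full(n_res, 1.0 / n_res)

T_global = random_rigid(rng)
out = output_point(frames[0], frames, attn, value_points)
out_shifted = output_point(T_global @ frames[0],
                           [T_global @ T for T in frames],
                           attn, value_points)
print(np.allclose(out, out_shifted))  # True: the output is invariant
```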

