Context Specific Positional Encoding (CoPE) for Protein Language Models

Krishna Yerramsetty
Jun 10, 2024

It took me a while to grok the concept of positional encodings/embeddings in transformer attention modules. In a nutshell, positional encodings retain information about the positions of the two tokens being compared in the attention step (typically represented as the query and the key). Without this information, the transformer has no way to tell how one token in the context differs from an identical token elsewhere in the same context. For example, if abxcdexf is the context, where each letter is a token, the model has no way to distinguish the first x from the second x. For a good summary of the different kinds of positional encodings, please see this excellent review. In general, positional embeddings capture absolute or relative positions, and can be parametric (trainable parameters learned along with the other model parameters) or functional (not trainable). A key feature of traditional positional encodings is that the inner product between any two positions decays as the distance between them increases. See the figure below from the original RoFormer paper by Su et al.
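To make that decay concrete, here is a small toy sketch (my own code, not the RoFormer implementation): it applies rotary position embeddings (RoPE) to a fixed query and key vector and prints their inner product as the relative distance grows, which roughly reproduces the decaying envelope shown in the RoFormer figure.

```python
# Toy sketch of RoPE's long-range decay: the query-key inner product
# shrinks (on average) as the relative distance between positions grows.
import torch

def rope_rotate(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate a single vector x of shape (head_dim,) to position `pos` with RoPE."""
    d = x.shape[-1]
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

head_dim = 64
q = torch.ones(head_dim)  # fixed "content" so only position matters
k = torch.ones(head_dim)

# Inner product between a query at position 0 and keys at increasing distance.
for dist in [1, 2, 4, 8, 16, 32, 64, 128]:
    score = torch.dot(rope_rotate(q, 0), rope_rotate(k, dist))
    print(f"relative distance {dist:4d} -> q.k = {score:.2f}")
```

The printed scores oscillate a bit, but their magnitude trends downward with distance; that decaying envelope is the property I question for proteins below.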

A recent paper, Contextual Position Encoding: Learning to Count What’s Important, proposes a slightly different take on positional encodings: include the context when estimating the positional embeddings. Borrowing an example from the paper:

Assume this context: yyxyyxyy, where each letter is again a token. From the paper: “If we assume x tokens have the same context representation (i.e. the same key vectors), their attention difference will only depend on their positions i and j”. And “we can see that y will have larger attention than x when i > ∆/δ, thus the model cannot attend to the last x if it is too far away. This gives us an intuition why independent position and context addressing might fail on very simple tasks.” Please read the paper for the mathematical derivation of the difference in context-specific attention, ∆, and the difference in position-specific attention, δ. In essence, the paper argues that any positional encoding that does not take the context into account can fail at certain tasks, like counting.
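Here is a minimal, single-head sketch of how I understand the CoPE mechanism: gate each previous token with a sigmoid of its query-key score, cumulatively sum the gates to get a fractional, context-dependent position, then interpolate between learned position embeddings to produce a bias on the attention logits. The class name, shapes, and initialization below are my own simplifications and may differ from both the authors’ reference code and my repo.

```python
# Minimal single-head CoPE sketch (my own simplification of the paper).
import torch
import torch.nn as nn

class CoPE(nn.Module):
    def __init__(self, max_pos: int, head_dim: int):
        super().__init__()
        self.max_pos = max_pos
        # One learnable embedding per integer contextual position
        # (initialized to zero here purely for simplicity).
        self.pos_emb = nn.Parameter(torch.zeros(max_pos + 1, head_dim))

    def forward(self, q: torch.Tensor, attn_logits: torch.Tensor) -> torch.Tensor:
        # q:           (batch, seq, head_dim)
        # attn_logits: (batch, seq, seq) raw causal q.k scores (-inf where masked)
        # 1. Gates decide which previous tokens "count" toward the position.
        gates = torch.sigmoid(attn_logits)          # masked entries give 0
        # 2. Contextual position of key j relative to query i = sum of gates over j..i.
        pos = gates.flip(-1).cumsum(dim=-1).flip(-1)
        pos = pos.clamp(max=self.max_pos)
        # 3. Positions are fractional, so interpolate between the two
        #    nearest integer position embeddings.
        pos_floor = pos.floor().long()
        pos_ceil = pos.ceil().long()
        w = pos - pos_floor.float()                 # interpolation weight
        logits_int = q @ self.pos_emb.t()           # (batch, seq, max_pos + 1)
        bias_floor = logits_int.gather(-1, pos_floor)
        bias_ceil = logits_int.gather(-1, pos_ceil)
        bias = (1 - w) * bias_floor + w * bias_ceil
        # 4. Add the contextual position bias to the attention logits.
        return attn_logits + bias
```

In my experiments, each attention head within a transformer block gets its own set of CoPE position embeddings, which is where the extra learnable parameters I mention further down come from.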

I believe this approach to positional encodings could be immediately useful for protein language models. Protein sequences differ in some interesting ways from languages like English. For example, amino acids hundreds of residues away from each other in sequence space can be very close to each other in 3-dimensional structure space. Another example is the disulfide bonds formed between cysteine residues that are sometimes hundreds of residues apart in sequence space. Much weaker hydrogen and ionic bonds also form between the side chains of amino acids that are close in 3-dimensional space, even when significantly separated in sequence space. When I write sequence space, I just mean how the amino acids are represented as text, which is also the primary structure of the protein. For example, MKSIYFVAGL… represents the first 10 amino acids of the GLP-1 protein, where each amino acid shares a peptide bond with its neighboring amino acids. To complicate things further, not all amino acids have the same propensity to form hydrogen or ionic bonds with other amino acids or with water in the environment. This leads to some amino acids interacting with (or paying more “attention” to) other amino acids depending on their side-chain chemistry and not just the distance between them in sequence space. For a good introduction to the different types of interactions between the amino acids of a protein, please see this reference. In short, distance in sequence space for proteins does not behave like distance between words in languages like English.

With that detour about proteins out of the way, let’s get back to the idea of contextual position encoding. I hope I was able to convince you that traditional relative positional embeddings, whose inner products decay as the relative distance increases, may not be a good fit for protein language models. To quickly test this, I used the torchtitan repo from PyTorch and replaced the RoPE embeddings with CoPE embeddings in the llama-2–7b model. I used approximately 4000 E. coli protein sequences from UniProt (3000 for training and 1000 for validation, randomly split) for the pretraining task. You can find my repo here, with some more details in it.
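For orientation, here is roughly where the change lives. In a standard attention forward pass, RoPE rotates the query and key vectors before the dot product; with CoPE, the query and key stay unrotated and a contextual position bias is added to the attention logits instead. The snippet below is a simplified single-head illustration reusing the CoPE sketch above, not the actual torchtitan or llama code.

```python
# Simplified single-head attention step; `cope` is an instance of the CoPE
# sketch above, and q, k, v are (batch, seq, head_dim) tensors.
import torch

def attention_with_cope(q, k, v, cope):
    head_dim = q.shape[-1]
    seq = q.shape[1]
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    # Causal mask: token i may only attend to tokens j <= i.
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # With RoPE, q and k would have been rotated before the dot product;
    # with CoPE, we add a context-dependent position bias to the logits instead.
    scores = cope(q, scores)
    return torch.softmax(scores, dim=-1) @ v
```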

The following two plots show the mean cross-entropy loss for training and validation, respectively. What is interesting is that the time taken to train is reduced when using CoPE, and the validation loss is also much better. One obvious reason is that I implemented separate CoPE parameters for each head within a transformer block; these extra learnable parameters can help with the training process. Having said that, I am still surprised at how good these results are. Stay tuned as I play with this more over the next couple of weeks.

Cross-entropy loss for training data
Cross-entropy loss for validation data

