Transformers for small molecule property prediction

Falk Hoffmann
29 min read · Apr 29, 2024


Are you curious to know what this text holds for you?

  • What are transformers?
  • What is the architecture of a transformer, and how do all the parts of a transformer work?
  • Which kinds of encoder-only and decoder-only models exist, and what are they doing?
  • How can you use transformers on proteins, RNA and DNA?
  • Why are small molecules important for drug discovery?
  • Which kind of molecular transformers exist for molecular property prediction (MPP)?
  • How are small molecules represented?
  • Which tokenisation methods, vocabulary sizes and positional encodings are used in molecular transformers?
  • Which parameters have to be optimised in a transformer?

This text describes the basics of transformers and their applications in cheminformatics, with a focus on applications in drug discovery. If you are already familiar with transformer architecture, you can jump to the second part of this text, starting with the headline “Applications in bioinformatics.”

What is a transformer?

Transformers first appeared in the paper “Attention is All You Need” by Vaswani et al. in 2017. Since then, the approach has spread rapidly in Natural Language Processing (NLP). This is demonstrated by the number of citations the paper received within seven years of its publication (117,989 by April 26th, 2024). In comparison, a study in 2014 revealed that the most cited paper of all time until then had accumulated about 300,000 citations in the 63 years since its publication. While there are several reasons why papers receive more citations now than in the past (e.g. better digital access to papers and more open-access papers, more interdisciplinary research and international collaborations, longer reference lists and the general growth of scientific output), this paper will probably be one of the most influential papers of this decade.

What is the reason for its popularity? A transformer is a sequence-to-sequence model that translates an input sequence into an output sequence. It was originally developed to translate texts from one language to another automatically. Indeed, the original transformer paper demonstrates its translation capability on the newstest2014 standard English-to-German and English-to-French translation benchmarks. Automatic translation has existed for a long time. In its simplest form, one could map every word of a text in one language to the corresponding word in another language based on a dictionary. However, this would ignore the fact that different languages have different grammatical structures and semantic expressions. To include those effects, long-range connections between words have to be considered. Before we look at how transformers capture those long-range interactions, let’s see how AI models translated texts between languages before the appearance of transformers.

Recurrent models

If you want to translate a sentence from one language into another, the translation of a word depends on its context: each word in a text depends on the previous word(s). Recurrent models (like recurrent neural networks) exploit exactly this: they use the information of earlier words to guess the next word, learning patterns of which words follow which from the training data.

This works well when sentences are relatively short. However, if sentences are long or the context of a word depends on previous sentences, recurrent models need a lot of memory because all previous words have to be saved.

Attention

The most important word that determines the guess of a word in a sentence is often the word that immediately precedes the guessed word. Take the following sentence as an example:

“This text was written with a laptop on an afternoon of a cloudy day in spring 2024.”

With the context of the first three words (“This text was …”), a recurrent model could make several guesses about the next word. Among those guesses, words with a high probability are those in which the next word is a verb in past participle that describes something which can be done with the subject (“This text”) because the preceding word (“was”) could indicate that the sentence is written in a passive voice. Now let’s look at the German translation of this sentence:

“Dieser Text wurde mit einem Laptop an einem wolkenreichen Tag im Frühling 2024 geschrieben.”

In this case, the word “wurde” (“was” in German) at the beginning of the sentence also indicates that this is a sentence which could be in passive voice. However, the past participle of the corresponding verb follows at the end of the sentence (“geschrieben”). Guessing every word based on the previous word would lead to several false predictions because “geschrieben” relates to the beginning of the sentence rather than the words directly preceding it. This situation will worsen if more objects are located between a sentence's beginning and end.

But how does a human know that the word “geschrieben” has to follow? If a human reads “Dieser Text wurde …” followed by an object, they know that many other objects can follow, but there has to be a verb at the end of the sentence that describes something that depends on the subject at the beginning, but not on the objects which follow later. In other words, a human pays attention to the important parts of the text. And transformers do the same: They focus on the more important parts of the sentence by giving them higher weights than other parts. There are different attention mechanisms. Transformers use self-attention. This means that the attention mechanism is applied to a single sequence or, in other words, that the model is attending to different parts of the input sequence itself (e.g. relating sentences of the input English text to the same sentence) rather than interacting with a different sequence (e.g. relating an English sentence to its translation in German). Each element in the sequence uses itself to calculate the attention scores.

But how does it work in practice? How do we know which part of the input sentence attends to which other part? The attention mechanism uses queries Q, keys K and values V.

Description of the attention mechanism. The softmax function is not shown.

Every word is represented by a feature vector, which encodes the word’s meaning in the context of the many texts used to train the transformer. The query is a feature vector that describes what one is searching for, i.e. what we want to pay attention to. Every word in our sentence also has a key, again represented by a feature vector. We calculate the dot product between the query vector and the key vectors. One can also take other options to combine queries and keys, like additive attention; however, using a dot product is computationally faster. On the other hand, the dot product has to be scaled by the square root of the dimension of the query and key vectors, dₖ. This keeps the variance of the scaled dot product at roughly one (assuming query and key components with unit variance) and prevents the softmax from saturating. In the next step, the attention scores pass through a softmax layer, which converts them into probabilities. This step ensures the scores are normalised and can be turned into weights. The output of the softmax layer is used to generate a weighted sum of the value vectors, which have a dimension dᵥ. This weighted sum draws attention to the relevant parts of the text. Finally, the weighted sum is used to generate the output vector, a new representation of the input based on the relevant parts identified by the self-attention procedure. The following formula describes the attention mechanism:
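Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V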

While transformers can be used with a single attention layer, you get the full performance with multi-head attention. This means that the attention mechanism is split into h different heads. Each head runs its own attention mechanism on lower-dimensional sub-queries, sub-keys and sub-values obtained by splitting Q, K and V, and all heads run in parallel. At the end, the dᵥ-dimensional output vectors of all individual heads are concatenated. Using multi-head attention, information from different representation subspaces at different positions is combined, allowing the model to capture various aspects of the input sequence.

In summary, the attention mechanism focuses on the input sequence's essential parts. It allows parallel input processing, leading to a massive computational cost reduction. Compared to previously used recurrent neural networks (RNN) and convolutional neural networks (CNN), it enables the analysis of long input sequences with long-range dependencies.

The transformer architecture

Now that we know how the attention mechanism works, we can focus on the entire transformer architecture. The following figure shows the model architecture from the original paper.

Model architecture of the transformer. Figure extracted from Vaswani et al.

The transformer consists of an encoder (left) and a decoder (right). The transformer model’s encoder unit identifies the relationship of a token to its surrounding tokens, which is done via self-attention. The decoder is a generator and predicts the next token from the previous tokens. Before we look at both parts separately, we focus on their input.

Tokens and embeddings

The encoder's input is a tokenised sequence. Depending on the tokenisation method, tokens can be characters, words, and subwords in NLP, amino acids in proteins, monomers in polymers, or nucleic acids in DNA or RNA. Let’s look at an example using the tokeniser from the Mistral 7b-v0.1 model, which uses the byte fallback Byte-Pair-Encoding (BPE) tokeniser.

from transformers import AutoTokenizer

text = "This text was written with a laptop on an afternoon of a cloudy day in spring 2024."
model = "/kaggle/input/mistral/pytorch/7b-v0.1-hf/1"
tokenizer = AutoTokenizer.from_pretrained(model)
# Encode the text into token IDs, then decode every ID back into its token string
token_ids = tokenizer.encode(text, add_special_tokens=True)
print("Token IDs:", token_ids)
tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print("Tokens:", tokens)
Token IDs: [1, 851, 2245, 403, 4241, 395, 264, 19891, 356, 396, 8635, 302, 264, 6945, 28724, 1370, 297, 7474, 28705, 28750, 28734, 28750, 28781, 28723]
Tokens: ['<s>', 'This', 'text', 'was', 'written', 'with', 'a', 'laptop', 'on', 'an', 'afternoon', 'of', 'a', 'cloud', 'y', 'day', 'in', 'spring', '', '2', '0', '2', '4', '.']

Every token has its token ID. The choice of tokens is based on the training data (e.g. many texts), and dependencies are learned from this data. Tokens are usually generated starting from single characters. If common pairs of characters are found in the training data, tokens are merged, and new tokens with more characters are generated. This process is performed until the maximum number of tokens (vocabulary size) is reached. Every token of the final vocabulary gets a unique ID, e.g. ID 1 for the sentence starting token “<s>” and ID 395 for the word “with” in the example above. Words do not have to be represented by a single token. The word “cloudy” above is represented by the tokens “cloud” and “y”. All unique tokens of a model are the model’s vocabulary.
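This merging procedure can be illustrated with a minimal toy sketch (illustration only; it is not the exact algorithm behind the Mistral tokeniser and ignores details such as byte fallback): the most frequent adjacent pair of tokens in a small corpus is repeatedly merged into a new token until a target vocabulary size is reached.

from collections import Counter

def toy_bpe(corpus, target_vocab_size):
    # Start from single characters; "</w>" marks word ends (a common BPE convention)
    words = [list(w) + ["</w>"] for w in corpus.split()]
    vocab = {tok for word in words for tok in word}
    while len(vocab) < target_vocab_size:
        # Count all adjacent token pairs across the corpus
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]   # most frequent pair
        merged = a + b
        vocab.add(merged)
        # Apply the merge everywhere in the corpus
        new_words = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return vocab, words

vocab, tokenised = toy_bpe("this text was written with a laptop", target_vocab_size=30)
print(sorted(vocab))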

Tokens (or token IDs) do not contain information about common patterns with other tokens. This information is included in their embeddings. Embeddings are a tensor representation of tokens. An embedding model has to be chosen to get the embedding of a token ID. Here, we demonstrate this using the BERT model.

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
text = "This text was written with a laptop on an afternoon of a cloudy day in spring 2024."
input_ids = tokenizer.encode(text, add_special_tokens=True)
# The last hidden state contains the contextual embeddings of all tokens
with torch.no_grad():
    output = model(torch.tensor([input_ids]))
embeddings = output.last_hidden_state
print("Embeddings: ", embeddings)
Embeddings:  tensor([[[ 5.8428e-03, -3.1986e-01, -5.1822e-02,  ..., -5.0903e-01,
2.2683e-01, 6.5193e-01],
[-2.4332e-01, -2.9686e-01, -2.3736e-01, ..., -6.9435e-01,
7.9248e-01, 3.9062e-01],
[ 2.4810e-01, -4.3217e-01, 1.8182e-01, ..., -4.4352e-01,
3.1200e-01, 5.4439e-01],
...,
[-8.9808e-03, 2.9612e-04, 8.2847e-01, ..., -6.8853e-01,
5.7239e-01, 2.0459e-01],
[ 2.2761e-01, -2.0283e-01, -1.4832e-01, ..., 8.7972e-02,
-5.9535e-02, -2.4424e-01],
[ 4.1316e-01, -1.4976e-01, -2.1302e-01, ..., -1.3397e-01,
-3.0089e-01, -3.2930e-01]]])

The last hidden layer of the model contains the embeddings. They form a tensor of shape N×768 (for BERT) with N as the number of tokens, i.e. every token has 768 values (features), which characterise the semantics of this token based on the model.

The embedding captures the contextual meaning of the token in the model, which mainly comes from the model’s training data. Next, positional encodings are added to the embeddings. Positional encodings contain information about the position of the token in the sequence, e.g. the “<s>” token for the beginning of the sentence is at position 0 in the sequence of tokens above, and the word “was” is at position 3. The easiest way to implement positional encodings would be to fill a matrix of size N*768 with the integer position of each token. However, this has the disadvantage that long sequences lead to large encoding values. Normalisation to the interval [0, 1] would also create problems, as sequences of different lengths would be encoded differently. One possibility for encoding positions is trigonometric functions. In this case, the positional encoding P of the kth input token of a sequence of length L, with d as the embedding dimension, is
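P(k, 2i) = sin(k / n^(2i/d)),  P(k, 2i + 1) = cos(k / n^(2i/d)),  for i = 0, 1, …, d/2 - 1,

with n = 10000 in the original transformer paper.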

Trigonometric functions are periodic, so a single sine function would assign identical encodings to positions separated by exactly one period. To avoid this, an additional dimension index i is used: each pair of dimensions (2i, 2i + 1) of the positional encoding vector uses its own frequency. The original transformer paper uses the sine function for even and the cosine function for odd dimensions. In this way, every token gets a vector of positional encodings whose length is identical to that of the semantic embedding vector E (768 for the BERT example above) and whose values are generated from alternating trigonometric functions of decreasing frequency. Here is the application to the example above with n = 10000:

import numpy as np
import math

def positional_encoding(max_length, d_model, n):
    # One row per position k, one column per embedding dimension
    p = np.zeros(max_length * d_model).reshape(max_length, d_model)
    for k in np.arange(max_length):
        for i in np.arange(d_model // 2):
            theta = k / (n ** ((2 * i) / d_model))
            p[k, 2 * i] = math.sin(theta)      # even dimensions: sine
            p[k, 2 * i + 1] = math.cos(theta)  # odd dimensions: cosine
    return p

n = 10000
encodings = positional_encoding(embeddings.shape[1], embeddings.shape[2], n)
print("Positional encodings: ", encodings)
Positional encodings:  [[ 0.00000000e+00  1.00000000e+00  0.00000000e+00 ...  1.00000000e+00
0.00000000e+00 1.00000000e+00]
[ 8.41470985e-01 5.40302306e-01 8.28430762e-01 ... 9.99999994e-01
1.02427522e-04 9.99999995e-01]
[ 9.09297427e-01 -4.16146837e-01 9.27994032e-01 ... 9.99999978e-01
2.04855043e-04 9.99999979e-01]
...
[-7.50987247e-01 6.60316708e-01 -9.56906112e-01 ... 9.99998217e-01
1.84369435e-03 9.99998300e-01]
[ 1.49877210e-01 9.88704618e-01 -2.95380764e-01 ... 9.99998013e-01
1.94612169e-03 9.99998106e-01]
[ 9.12945251e-01 4.08082062e-01 6.26025610e-01 ... 9.99997799e-01
2.04854901e-03 9.99997902e-01]]

The embedding vector E and the positional encoding vector P are added to get the input embedding vector I = E + P of the transformer’s encoder.

input_embeddings = embeddings + encodings
print("Input embedding shape: ", input_embeddings.shape)
print("Input embedding: ", input_embeddings)
Input embedding shape:  torch.Size([1, 21, 768])
Input embedding: tensor([[[ 0.0058, 0.6801, -0.0518, ..., 0.4910, 0.2268, 1.6519],
[ 0.5982, 0.2434, 0.5911, ..., 0.3057, 0.7926, 1.3906],
[ 1.1574, -0.8483, 1.1098, ..., 0.5565, 0.3122, 1.5444],
...,
[-0.7600, 0.6606, -0.1284, ..., 0.3115, 0.5742, 1.2046],
[ 0.3775, 0.7859, -0.4437, ..., 1.0880, -0.0576, 0.7558],
[ 1.3261, 0.2583, 0.4130, ..., 0.8660, -0.2988, 0.6707]]],
dtype=torch.float64)

This input embedding vector contains information on the semantic relations of the token to other tokens learned by the model and on the token's position in the input sequence.

The encoder

The transformer’s encoder identifies the relationship between tokens and their surroundings in the input sequence. First, the (same) input embeddings are multiplied with (different) query, key and value weight matrices.

import torch.nn as nn
d_model = input_embeddings.shape[2]
Wq = nn.Linear(d_model, d_model).to(torch.float64)
Wk = nn.Linear(d_model, d_model).to(torch.float64)
Wv = nn.Linear(d_model, d_model).to(torch.float64)
print("Query weight matrix shape: ",Wq.state_dict()['weight'].shape)
print("Query weight matrix: ",Wq.state_dict()['weight'])
Q = Wq(input_embeddings)
K = Wk(input_embeddings)
V = Wv(input_embeddings)
print("Query matrix shape: ", Q.shape)
print("Query matrix: ", Q)
Query weight matrix shape:  torch.Size([768, 768])
Query weight matrix: tensor([[ 0.0003, 0.0281, -0.0285, ..., 0.0244, -0.0335, 0.0336],
[ 0.0094, 0.0297, 0.0200, ..., -0.0049, -0.0220, -0.0350],
[ 0.0215, -0.0153, -0.0069, ..., -0.0103, 0.0021, 0.0281],
...,
[ 0.0312, -0.0123, 0.0241, ..., 0.0177, 0.0034, -0.0118],
[-0.0250, 0.0118, 0.0161, ..., 0.0146, -0.0176, -0.0011],
[ 0.0186, 0.0179, -0.0020, ..., 0.0072, 0.0263, -0.0048]],
dtype=torch.float64)
Query matrix shape: torch.Size([1, 21, 768])
Query matrix: tensor([[[ 0.6177, -0.7542, 0.2440, ..., 0.8863, 1.0564, -0.0314],
[ 0.4458, -0.5819, 0.1107, ..., 0.6800, 0.5362, 0.1896],
[ 0.1733, -0.4517, 0.2923, ..., 0.2532, 0.4522, 0.1436],
...,
[-0.1428, -0.1520, 0.2920, ..., 0.3975, 0.6154, 0.3730],
[-0.4936, -0.6261, -0.0061, ..., 0.0336, 0.5100, 0.6463],
[-0.4831, -0.4134, -0.1049, ..., -0.0041, 0.3216, 0.2640]]],
dtype=torch.float64, grad_fn=<ViewBackward0>)

After this matrix multiplication, the multi-head attention, as described above, is performed on those matrices by splitting the key, query and value matrices into h = 8 individual heads.

n_heads = 8
d_key = d_model // n_heads
Qmh = Q.view(1, -1, n_heads, d_key).permute(0, 2, 1, 3)
Kmh = K.view(1, -1, n_heads, d_key).permute(0, 2, 1, 3)
Vmh = V.view(1, -1, n_heads, d_key).permute(0, 2, 1, 3)
print("Query multi-head matrix shape: ", Qmh.shape)
Query multi-head matrix shape:  torch.Size([1, 8, 21, 96])

Note that the shape of the query multi-head matrix is 1x8x21x96 for one sequence, eight heads, 21 tokens per sequence and 96 query features per head (the model dimension of 768 divided by the eight heads). Now, the dot product between the query matrix and the transpose of the key matrix is computed and scaled by the square root of the key dimension to get the attention filter. A softmax is then applied, and the resulting attention probability matrix is multiplied by the value matrix. This is done for all heads, and the results are then concatenated. Finally, the concatenated matrix is passed through a linear layer. Look how the shape of the matrix changes during the multi-head attention process.

attention_filter = torch.matmul(Qmh, Kmh.permute(0, 1, 3, 2)) / math.sqrt(d_key)
print("Attention filter shape: ", attention_filter.shape)
attention_probabilities = torch.softmax(attention_filter, dim=-1)
print("Attention probability shape: ", attention_probabilities.shape)
attention_mh = torch.matmul(attention_probabilities, Vmh)
print("Attention multithead shape: ", attention_mh.shape)
attention_con = attention_mh.permute(0, 2, 1, 3).contiguous()
print("Attention concatenated shape: ", attention_con.shape)
attention = attention_con.view(1, -1, n_heads*d_key)
print("Attention shape: ", attention.shape)
attention_output = nn.Linear(d_model, d_model).to(torch.float64)(attention)
print("Attention output shape: ", attention_output.shape)
Attention filter shape:  torch.Size([1, 8, 21, 21])
Attention probability shape: torch.Size([1, 8, 21, 21])
Attention multithead shape: torch.Size([1, 8, 21, 96])
Attention concatenated shape: torch.Size([1, 21, 8, 96])
Attention shape: torch.Size([1, 21, 768])
Attention output shape: torch.Size([1, 21, 768])

After multi-head attention, the data goes through an Add and Normalisation step. The input embeddings also bypass the multi-head attention layer via a skip connection and are added unchanged to the attention output. This procedure is called a residual connection and ensures that important information from the input does not get lost during the attention process.

Layer normalisation standardises the neurons' activations along the feature axis. In our example, every token output of the multi-head attention process has 768 features. Absolute feature values for different tokens are difficult to compare, so the values are normalised by subtracting the mean and dividing by the standard deviation of all features of that token.

attention_added = input_embeddings + attention_output
print("Added attention shape: ", attention_added.shape)
normalized_shape = attention_added.shape[2]
layer_normalization = nn.LayerNorm(normalized_shape).to(torch.float64)
attention_normalized = layer_normalization(attention_added)
print("Normalized attention shape: ", attention_normalized.shape)
print("Normalized attention matrix: ", attention_normalized)
Added attention shape:  torch.Size([1, 21, 768])
Normalized attention shape: torch.Size([1, 21, 768])
Normalized attention matrix: tensor([[[-0.3634, 0.2977, -0.3566, ..., -0.2027, -0.6990, 1.7048],
[ 0.3599, -0.3267, 0.4237, ..., -0.5140, -0.0320, 1.3988],
[ 1.1434, -1.8855, 1.1838, ..., -0.1778, -0.7283, 1.6637],
...,
[-1.1106, 0.3911, -0.3001, ..., -0.2658, -0.1055, 1.1629],
[ 0.2367, 0.5531, -0.6815, ..., 0.6958, -0.9394, 0.6684],
[ 1.4116, -0.0988, 0.3785, ..., 0.4224, -1.2299, 0.5689]]],
dtype=torch.float64, grad_fn=<NativeLayerNormBackward0>)
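As a quick sanity check (a minimal sketch, assuming the default eps of 1e-5 and the freshly initialised affine parameters of nn.LayerNorm, i.e. weight 1 and bias 0), the same result can be computed manually by subtracting the mean and dividing by the standard deviation along the feature axis:

eps = 1e-5
mean = attention_added.mean(dim=-1, keepdim=True)
var = attention_added.var(dim=-1, unbiased=False, keepdim=True)
manual_normalized = (attention_added - mean) / torch.sqrt(var + eps)
# Matches nn.LayerNorm with its default (untrained) weight of ones and bias of zeros
print(torch.allclose(manual_normalized, attention_normalized, atol=1e-6))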

The output of the Add and Norm layer is the input of a position-wise feed-forward network (FFN). The FFN consists of fully connected dense layers. In its original implementation, the FFN has two such layers. The dimensionality of the hidden layer, which is the number of neurons in this layer, is usually set to four times the dimension of the model. A ReLU activation function is applied between the two layers; its role is to introduce non-linearity into the otherwise linear operations of the FFN. Here is the full FFN applied to the output of the Add and Norm layer from our example. Note the different dimensions of the matrix.

d_ffn = d_model * 4
w1 = nn.Linear(d_model, d_ffn).to(torch.float64)
ffn1 = w1(attention_normalized).relu()
print("Shape after first layer: ", ffn1.shape)
w2 = nn.Linear(d_ffn, d_model).to(torch.float64)
ffn2 = w2(ffn1)
print("Shape after second layer: ", ffn2.shape)
print("Matrix after FFN: ", ffn2)
Shape after first layer:  torch.Size([1, 21, 3072])
Shape after second layer: torch.Size([1, 21, 768])
Matrix after FFN: tensor([[[ 3.4022e-01, -2.3713e-01, -1.7562e-01, ..., -3.6415e-01,
3.3102e-01, 3.0839e-01],
[ 5.1068e-01, -1.8180e-01, -1.1031e-01, ..., -3.7001e-01,
4.9287e-01, 3.4755e-01],
[ 3.8706e-01, -1.5478e-01, -1.7109e-01, ..., -3.4323e-01,
5.4294e-01, 3.9710e-01],
...,
[ 3.3360e-01, -4.0859e-01, -3.0088e-01, ..., -3.1095e-01,
-2.0039e-01, 5.8830e-02],
[ 2.2552e-01, -1.5239e-01, -1.7305e-01, ..., -2.2446e-01,
5.1117e-04, 2.4696e-01],
[ 2.6823e-01, -3.2144e-01, -1.2353e-01, ..., -1.0087e-01,
7.8821e-02, 2.9985e-01]]], dtype=torch.float64,
grad_fn=<ViewBackward0>)

The output of the FFN goes through another Add and Norm layer. The embeddings added to the FFN output via the residual connection are the FFN's input or, in other words, the output of the first Add and Norm layer.

ffn_added = attention_normalized + ffn2
normalized_ffn_shape = ffn2.shape[2]
layer_normalization_ffn = nn.LayerNorm(normalized_ffn_shape).to(torch.float64)
ffn_normalized = layer_normalization_ffn(ffn_added)
print("Normalized shape after FFN: ", ffn_normalized.shape)
print("Normalized matrix after FFN: ", ffn_normalized)
Normalized shape after FFN:  torch.Size([1, 21, 768])
Normalized matrix after FFN: tensor([[[-0.0301, 0.0510, -0.5227, ..., -0.5561, -0.3637, 1.9406],
[ 0.8385, -0.4965, 0.2991, ..., -0.8599, 0.4419, 1.6862],
[ 1.4937, -2.0039, 0.9865, ..., -0.5158, -0.1870, 2.0132],
...,
[-0.7690, -0.0289, -0.5974, ..., -0.5738, -0.3099, 1.1787],
[ 0.4357, 0.3758, -0.8462, ..., 0.4446, -0.9283, 0.8768],
[ 1.6123, -0.4231, 0.2313, ..., 0.2958, -1.1314, 0.8262]]],
dtype=torch.float64, grad_fn=<NativeLayerNormBackward0>)

Now we have passed through one encoder layer: we took the input embeddings, passed them through the first sublayer (the multi-head attention layer), followed by an Add and Norm layer with a residual connection from the input embeddings, and then passed the output through a second sublayer (the feed-forward network), followed by a second Add and Norm layer. Notably, dropout is applied in the sublayers of most common transformers. Dropout is a regularisation technique that randomly drops neurons of a neural network with a given probability during training.

The encoder layer is repeated N times in a transformer. The original transformer paper uses N = 6 identical encoder layers. The set of all encoder layers is the encoder stack, and the output of one encoder layer is the input of the next. The output of the full encoder stack contains information about the relationships between the tokens of the input sequence. This information can be used for downstream tasks. In the full transformer, the output is used as an input to a multi-head attention layer of the decoder (see below).
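Putting the pieces together, a single encoder layer and a small encoder stack can be sketched as follows. This is a minimal sketch, not the exact implementation of the original paper: it uses PyTorch's built-in nn.MultiheadAttention instead of the manual matrices above, while the dropout rate of 0.1 and N = 6 follow the original transformer paper.

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_ffn=3072, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention with residual connection and layer norm
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise feed-forward network with residual connection and layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# Encoder stack: the output of one layer is the input of the next
encoder_stack = nn.ModuleList([EncoderLayer() for _ in range(6)])
x = input_embeddings.to(torch.float32)
for layer in encoder_stack:
    x = layer(x)
print("Encoder stack output shape: ", x.shape)  # torch.Size([1, 21, 768])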

Encoders are used for discriminative tasks. Encoder-only models are typically used in a pre-training and fine-tuning setup. In this case, the encoder is pre-trained on large amounts of unlabelled data and learns patterns between the tokens of the training data. This information is contained in the output of the pre-trained encoder stack. The pre-trained model is then fine-tuned on a labelled downstream dataset, which is usually significantly smaller than the unlabelled training dataset.

Two encoder-only models are BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly optimised BERT Approach). BERT was trained for natural language understanding with two objectives: Masked Language Modelling (MLM), in which a random subset of input tokens is masked and the model predicts the correct token at those positions, and Next Sentence Prediction (NSP), in which the model predicts whether two given parts of a longer input are consecutive or not. The first task in particular is used intensively for drug discovery (see below and the following articles). RoBERTa is BERT with optimised hyperparameters. It is trained on more data and with larger batch sizes. It also uses dynamic masking, in which the masked positions for MLM are chosen anew each time a sequence is fed to the model during training.
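A minimal sketch of dynamic masking with Hugging Face's DataCollatorForLanguageModeling, reusing the BERT tokeniser and input_ids from the example above (the collator implements BERT's scheme of selecting 15% of the tokens, most of which are replaced by the [MASK] token; the exact masked positions change every time the collator is called):

from transformers import DataCollatorForLanguageModeling

# A fresh random 15% of the tokens is selected for masking on every call
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([{"input_ids": input_ids}])
print("Masked input IDs: ", batch["input_ids"][0])
print("MLM labels (-100 marks unmasked positions): ", batch["labels"][0])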

But now let’s turn to the other half of the transformer: the decoder.

The decoder

While the encoder is used for discriminative tasks, the decoder is a generator of new tokens. Specifically, it predicts the next token based on only preceding tokens. To achieve this, the decoder uses masked self-attention and cross-attention between the encoder embeddings and the generated tokens of the decoder. But let’s go step by step.

Like the encoder, the decoder is a stack of several decoder layers (N = 6 in the original transformer paper), and the output of one decoder layer is the input of the next. The decoder has two inputs: its own previously generated output sequence, which feeds the first sublayer of the decoder stack, and the encoder output, which enters the multi-head attention layer that forms the second sublayer of every decoder layer.

The decoder initially has no output, so what does the first decoder layer get as input? As mentioned above, the task of the decoder is to predict new tokens from previous tokens. The first token is always the start token (“<s>” with token ID 1 in our example). We generate the contextual embeddings for this token and convert them to input embeddings for the decoder using positional encodings as described above. Those positional embeddings enter the first sublayer of the first decoder layer.

The first sublayer of the decoder is a masked multi-head attention layer. It works like the multi-head attention layer of the encoder but with two crucial differences. First, while it tries to predict the token at position i (position one at the beginning, when the start token is at position 0), it only uses the information of the previous tokens at positions 0, …, i - 1; in other words, it only attends to the left. Second, for the prediction of the token at position i, the sequence is masked at this and all following positions, so the model cannot peek at future tokens and has to predict the next token from the vocabulary using only the tokens already available. This is repeated token by token from the start until the end token is reached and is represented in the figure of the transformer architecture above by “shifted right”. The decoder predicts a token for the masked position (see below how this is done in detail). During transformer training, the generated token is compared to the true token with a cross-entropy loss, and the true token (rather than the generated one) is fed back as input, a procedure known as teacher forcing.
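A minimal sketch of how "attending only to the left" is enforced in practice (toy dimensions, random scores for illustration): all positions to the right of the current token are set to minus infinity in the attention filter before the softmax, so their attention probabilities become zero.

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # toy attention scores for a short sequence
# Upper-triangular mask marks all future positions (column index > row index)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
probabilities = torch.softmax(scores, dim=-1)
print(probabilities)  # the upper triangle is zero: token i only attends to positions 0, ..., i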

Similar to all other sublayers, the output of the decoder’s masked multi-head attention layer goes through an Add and Norm layer using the sublayer’s input via the residual connection. This layer's output enters the decoder's second sublayer: the multi-head cross-attention layer. It works like the encoder’s multi-head attention layer, with the only difference being its input. While the query, key and value matrices entering the encoder’s multi-head attention layer are all derived from the same input (the positional embeddings in the first encoder layer or the output of the previous encoder layer in every subsequent layer), the inputs of the decoder’s cross-attention layer come from two sources: the key and value matrices are computed from the output of the encoder stack, while the query matrix comes from the output of the decoder’s masked multi-head attention layer. This attention type is called cross-attention. It relates the decoder’s generated tokens to the patterns the encoder learned from the input tokens. The output of the cross-attention layer goes through an Add and Norm layer with a residual connection to the output of the preceding masked multi-head attention sublayer. This output then passes an FFN and another Add and Norm layer, which work the same way as in the encoder, and the result serves as input for the next decoder layer.
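A minimal sketch of cross-attention with a single head and toy dimensions (encoder_output and decoder_hidden are random placeholders for the encoder stack output and the masked self-attention output):

d_model, enc_len, dec_len = 768, 21, 4
encoder_output = torch.randn(1, enc_len, d_model)  # placeholder for the encoder stack output
decoder_hidden = torch.randn(1, dec_len, d_model)  # placeholder for the masked self-attention output

Wq_cross = nn.Linear(d_model, d_model)
Wk_cross = nn.Linear(d_model, d_model)
Wv_cross = nn.Linear(d_model, d_model)

Q_cross = Wq_cross(decoder_hidden)   # queries come from the decoder
K_cross = Wk_cross(encoder_output)   # keys come from the encoder
V_cross = Wv_cross(encoder_output)   # values come from the encoder
cross_probabilities = torch.softmax(Q_cross @ K_cross.transpose(-2, -1) / math.sqrt(d_model), dim=-1)
cross_attention = cross_probabilities @ V_cross
print("Cross-attention output shape: ", cross_attention.shape)  # torch.Size([1, 4, 768])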

The previously described three sublayers of the decoder (masked multi-head attention layer, cross-attention layer, FFN layer) form one decoder layer, and the decoder layer is repeated multiple times (N = 6 in the original paper). The output of the decoder stack is a matrix containing a feature vector for every token in the sequence generated so far. How does the decoder use this to predict the probability of the next token? It uses a single linear layer whose weight matrix has one row per token in the entire vocabulary: the feature vector of the last position is multiplied with this matrix, which yields one score per vocabulary token. Every vocabulary token is in principle a valid next token, although we are only interested in the most probable ones based on the learned context.

Finally, those scores, called logits, pass a softmax layer. Logits are raw, non-normalised scores representing the likelihood of each possible output token being the next token in the sequence. The softmax function converts these logits into probabilities, so the output of the softmax layer is a probability distribution over all vocabulary tokens. The token with the highest probability is often selected as the next token and appended to the generated sequence. (Exception: during training, the correct token instead of the generated token is appended; see the explanation of teacher forcing above.) Positional embeddings for this new sequence are generated and fed into a new pass of the decoder stack to generate the next token. The decoder is run repeatedly until an end token is generated, which marks the end of the generation process.
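A minimal sketch of this final step with toy numbers (vocab_size and decoder_output are placeholders): a single linear layer projects the decoder features of the last position onto vocabulary-sized logits, and the softmax turns them into a probability distribution over all tokens, from which the next token is picked greedily.

vocab_size = 30522                       # vocabulary size of bert-base-uncased, used here as an example
decoder_output = torch.randn(1, 4, 768)  # placeholder for the decoder stack output
output_projection = nn.Linear(768, vocab_size)

logits = output_projection(decoder_output[:, -1, :])  # one raw score (logit) per vocabulary token
probabilities = torch.softmax(logits, dim=-1)         # probability distribution over the vocabulary
next_token_id = torch.argmax(probabilities, dim=-1)   # greedy choice of the next token
print("Next token ID: ", next_token_id.item())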

Decoders are used for generative tasks and can also be used separately. A typical decoder-only model is XLNet. XLNet generates sequences and uses permutation language modelling (PLM). During PLM training, the tokens of a sequence are predicted in a randomly permuted order, so each token is predicted from a random subset of the other tokens rather than only from its left context.

Applications in bioinformatics

Relation to proteins, RNA and DNA

Transformers were introduced for translation tasks. As they are sequence-to-sequence models, they can be used for any task involving sequences, either to find patterns in the sequence data (encoder-only models, possibly with a downstream task on labelled data) or to generate new sequences (decoder-only or full transformer models).

Many drug discovery tasks map naturally onto sequences and can therefore be tackled with a transformer. For example, proteins are chains of amino acids. There are 20 standard amino acids that, together with some additional tokens (like the start and end token, the mask token and tokens for unnatural amino acids), have been used to generate a protein vocabulary for use in language models. In addition, proteins with high sequence similarity tend to have similar properties, as evolution via sequence mutation generates families of proteins with similar features. For example, proteins with similar sequences tend to adopt similar structures, which is exploited in MSA-based protein structure prediction tools like AlphaFold. Many protein language models (pLMs) have been developed to extract those features, and the knowledge acquired by pLMs during pre-training can be used to predict protein properties from the embeddings in the hidden states of the pLMs.

Similarly, RNA and DNA consist of four nucleobases each: adenine, thymine, guanine and cytosine in DNA, with thymine replaced by uracil in RNA. In a double-stranded helix, adenine pairs with thymine (DNA) or uracil (RNA), and guanine pairs with cytosine in the inner part of the helix. These pairing rules are exploited for RNA/DNA structure prediction and feature extraction, and DNA and RNA language models have been developed based on this vocabulary.

Protein and DNA/RNA language models are a big topic and will be discussed in a separate article. Here, I would like to focus on available molecular transformer models.

Small molecules

Small molecules are organic compounds with a low molecular weight (typically below 900 Da). Because they are small, they can easily diffuse through cell membranes and reach intracellular targets, which makes them relevant to many medical and biological applications. They can interact with many biological targets like proteins, DNA and RNA, which allows them to modify various processes and pathways. In these interactions, they can act as inhibitors, agonists, antagonists or modulators of their binding partners and thereby change the kinetics of enzymatic reactions. Small molecules have been successfully used to treat diseases like cancer, cardiovascular diseases and neurological disorders.

Furthermore, small molecules can be synthesised in a laboratory setting. High-throughput screening of large libraries is possible, which enables the optimisation of their drug properties. This also makes them less expensive in comparison to bigger drugs like antibodies. Their size and chemical properties also enable oral administration to patients, which is highly preferred in drug development.

The advantages of small molecules for drug discovery have led to many applications and demand a systematic prediction of small molecule properties. This task is termed molecular property prediction (MPP). Many properties of small molecules have to be predicted, and most of these properties are also essential for designing other drug modalities like peptide binders, antibodies or protein mutations. They include on-target activity, physicochemical properties and toxicity. Please also read my previous articles on active learning and prescreening via graph matching for ultra-large libraries and on direct preference optimisation for optimising multiple small molecule properties. Here, we focus on existing transformers that have been used to predict the properties of small molecules and on the datasets used to train them on various small molecule properties.

Language of small molecule transformers

The first question is which vocabulary to use for a small molecule predictor. The most widely used vocabulary for small molecules is based on the Simplified Molecular Input Line Entry System (SMILES) language. In this notation, chemical compounds are described by short ASCII strings. Those strings contain information about all atoms and bonds of a small molecule and can easily be converted to two- or three-dimensional representations in molecular editors. Furthermore, SMILES strings can be converted into vector representations for use in machine-learning approaches. While SMILES strings are the most used language, they also have some problems (more on this in a separate article). This led to the invention of other molecular languages like Self-Referencing Embedded Strings (SELFIES). Circular fingerprints or lists of atoms can also be used. If implemented correctly and trained long enough, the performance of a transformer on an MPP task should be largely independent of the choice of language. Indeed, comparisons between the SMILES and SELFIES languages show that, even though a slight difference is observed on downstream tasks, this difference is very small.
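A minimal sketch of the two representations (assuming the rdkit and selfies Python packages are installed; aspirin is used as an example molecule):

from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin as a SMILES string
mol = Chem.MolFromSmiles(smiles)             # molecular graph reconstructed from the string
print("Canonical SMILES: ", Chem.MolToSmiles(mol))

selfies_string = sf.encoder(smiles)          # the same molecule as a SELFIES string
print("SELFIES: ", selfies_string)
print("SELFIES tokens: ", list(sf.split_selfies(selfies_string)))
print("Back to SMILES: ", sf.decoder(selfies_string))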

Tokeniser and vocabulary size

The transformer acts on the tokens from the model. The number of all available tokens in the model depends on the used tokeniser (see above in the transformer description). The granularity of tokens affects how the Transformer finds patterns in sequences. Character-level tokenisers, in which every character is a single token, are highly flexible but not the most efficient. Word (in NLP) or functional group tokens are efficient but might not be flexible enough to find all the necessary details between connected tokens from the training data. Something in between, like Byte-Pair Encoding (BPE), could be a good compromise.

Some tokenisers applied in molecular transformers are:

  1. Atom tokeniser: In this tokeniser, the small molecule string is split into atoms, numbers and special characters.
  2. Regex tokeniser: The regular expression tokeniser also splits the small molecule into atoms, but with slight modifications, e.g. for square brackets and multi-character atoms (a minimal sketch of such a tokeniser follows after this list).
  3. Byte-pair encoding (BPE) tokeniser: BPE uses subword tokenisation. The tokenisation algorithm includes normalisation, pre-tokenisation, splitting words into characters, and applying merge rules. Note that RoBERTa uses BPE, which therefore also applies to all chemical transformers based on RoBERTa.
  4. SELFIES internal tokeniser: The developers of chemical languages usually create their own tokeniser specific to their language; in the case of SELFIES, this tokeniser is applied in the Regression Transformer.
  5. Circular fingerprint: Circular fingerprints or extended-connectivity fingerprints (ECFP) are also used to represent and compare structures of small molecules and to tokenise those structures.
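A minimal sketch of an atom-level regular-expression tokeniser for SMILES (the pattern follows the regex commonly used in the molecular transformer literature; treat it as illustrative rather than as the exact tokeniser of any specific model):

import re

SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    # Bracket atoms like [nH] and two-letter atoms like Br or Cl stay single tokens
    return SMILES_REGEX.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']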

A larger vocabulary can provide better language coverage, reducing the number of out-of-vocabulary (OOV) tokens. With fewer OOV tokens, the model is more likely to capture all the complexity in the training data. This is particularly important for languages with rich morphology, like those of small molecules. However, a larger vocabulary increases the size of the embedding matrix, which increases the memory requirements and training time. In addition, a vocabulary that is too large can lead to overfitting if the training data are insufficient. A well-balanced vocabulary size helps the model generalise to unseen data.

The vocabulary size of different molecular transformers varies significantly, ranging from only 42 tokens in MolBERT to 50,000 in RoBERTa. As pointed out above, the vocabulary size should influence the transformer’s performance on downstream tasks. However, there are not many comparisons that are purely based on the vocabulary size of molecular transformers, and the same holds for comparisons between different tokenisation methods. As both are critical for the performance of the transformer, future studies should focus on such comparisons.

Positional encoding

Positional embeddings provide information about tokens' relative or absolute positions within the sequence. Different positional encoding methods are used in a transformer:

  1. Fixed absolute positional embeddings: Those encodings are unchangeable and calculated once at the beginning. An example is the trigonometric functions of increasing dimension from the original transformer paper (see above).
  2. Trainable absolute positional embedding: The embeddings at each position in the sequence are learned during the transformer training. The length of the positional encodings is constrained. Those trainable absolute positional encodings are used in BERT and RoBERTa.
  3. Relative positional embedding: Relative positional encodings adjust for sequences of different lengths, which are very common in the small molecule space, as the generation of small molecules also changes the number of tokens needed to represent them. Here, the position of a token is learnt relative to its surrounding tokens. Examples are MolBERT and the Regression Transformer.
  4. Rotary positional embeddings: This is a combination of absolute and relative positional embeddings used in MolFormer.
  5. 2D or 3D embeddings: As chemical languages are one-dimensional representations of small molecules, tokens that are close in the string might not be close in 2D or 3D space. Representations that include the effect of 2D or even 3D distances could improve the performance of molecular transformers. One encoding that includes this information is used in the Molecule Attention Transformer (MAT).

Relative and rotary positional embeddings, as well as embeddings that include 2D or 3D information about the small molecules, are more flexible than absolute positional embeddings and can better describe token surroundings as they occur in the molecule itself. Only some studies have shown improvements with those flexible embeddings, and most existing molecular transformers still rely on BERT-style models, which use absolute positional embeddings. The next generation of molecular transformers will have to show whether flexible positional embeddings significantly increase performance on downstream tasks.

Model parameters and fine-tuning

Adjusting the hyperparameters of a model is one of the most important steps in building a high-performing transformer. There are many parameters to tune, e.g.

  1. The depth of the model: That is the number of layers of the encoder and decoder, which are, in general, different.
  2. Hidden Size: The dimensionality of the hidden layers and embeddings.
  3. The number of attention heads.
  4. Feed-Forward Network Dimension: The dimension of the inner layer of the FFN.
  5. Vocabulary size: Number of unique tokens
  6. Sequence length: The maximum number of input tokens per sequence the model can process.
  7. Model initialisation: The method used to initialise the weights (not captured in detail in this article)
  8. Batch size: Number of training examples used per training iteration.
  9. Learning rate: How strongly the model updates its weights in response to the estimated error.
  10. Optimizer (e.g. ADAM, SGD, etc.)
  11. Learning rate scheduler: Method to adjust the learning rate over time (e.g. warm-up with decay, step decay, etc.)
  12. Number of training epochs: How often should the training go through the training data
  13. Regularisation, e.g. dropout, L2 regularisation, etc.
  14. Use of gradient clipping: Method to prevent exploding gradients
  15. Attention dropout rate
  16. Scaled dot-product attention: Scaling factor of the dot product of the query matrix and the transpose of the key matrix
  17. Positional encoding, e.g. absolute, relative, rotary or 2D/3D position encoding (see above)
  18. Tokenisation method, e.g. BPE, Atom, etc. (see above)
  19. Embedding normalisation, e.g. should layer normalisation be applied to embeddings
  20. Loss function, e.g. cross-entropy
  21. Evaluation metrics, e.g. accuracy, F1, AUC ROC, etc.
  22. Use of pre-trained models, e.g. BERT
  23. Pre-training objective, e.g. MLM, NSP, etc.

While points 8–14 are only needed during training, all other parameters characterise the final transformer. Some of them have been mentioned above. In general, models with more parameters give better performance. However, this does not mean that one should always use the molecular transformer with the maximum number of parameters, as the best choice depends on the interplay of all these parameters and on the specific task.

In addition, fine-tuning is very important. As mentioned above, pre-trained transformers can be used for downstream tasks on labelled data. This can be done with frozen weights, which means that the weights of the base transformer are unchanged during fine-tuning, or with updated weights, in which the weights of the transformer are adjusted on the downstream data. To reduce the computational demand of fine-tuning, only a small fraction of the transformer weights might be updated while the other weights are kept frozen. Fine-tuning the weights generally leads to better performance on downstream tasks than keeping them frozen. This is, for example, demonstrated by MolBERT and by SELFormer on the Estimated SOLubility (ESOL), FreeSolv and Lipophilicity datasets, which contain experimental data on aqueous solubility, hydration free energies and the octanol/water distribution coefficient logD at pH 7.4.
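A minimal sketch of the frozen-weights setup (assuming a pre-trained BERT-style encoder from the Hugging Face hub; the regression head, its dimensions and the learning rate are illustrative): the parameters of the base transformer are frozen, and only a small task head is trained on the labelled downstream data. Removing the freezing loop and passing all parameters to the optimiser turns this into full fine-tuning.

from transformers import AutoModel

base_model = AutoModel.from_pretrained("bert-base-uncased")  # pre-trained encoder

# Frozen weights: the pre-trained parameters are not updated during fine-tuning
for param in base_model.parameters():
    param.requires_grad = False

# Small trainable head, e.g. for a regression task such as aqueous solubility
regression_head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.AdamW(regression_head.parameters(), lr=1e-4)  # only the head is optimised

hidden_states = base_model(torch.tensor([input_ids])).last_hidden_state
prediction = regression_head(hidden_states[:, 0, :])  # use the embedding of the first ([CLS]) token
print("Predicted property: ", prediction.item())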

Finally, and maybe most importantly, is the choice and size of the pre-training data. Several large small molecule datasets are available. The most common ones, which are also open-access, are ZINC, ChEMBL, and PubChem. A future article will describe those datasets and their usage on small molecule tasks.

Summary

This article described the transformer architecture, the most influential model in NLP. While the transformer was initially developed for machine translation and text summarisation tasks, it is now used in many more areas like speech processing, computer vision, music generation, finance and weather forecasting. As a sequence-to-sequence model, it is also used intensively in bioinformatics and cheminformatics. While its applications to drug discovery in bioinformatics will be explained in future articles, here we focused on transformers applied to small molecules. We demonstrated how small molecules can be expressed in machine-readable languages like SMILES and SELFIES, how those languages are tokenised and which positional encodings are used in molecular transformers. There are already more than ten open-access molecular transformers that can be used for MPP tasks, and even better transformer models can be expected soon.

Main references:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[2] Sultan, Afnan, et al. “Transformers for molecular property prediction: Lessons learned from the past five years.” arXiv preprint arXiv:2404.03969 (2024).
