Preference optimisation of protein language models

Falk Hoffmann
6 min read · Mar 20, 2024


What do I find in this text?

  • Why is multi-objective optimisation important for drug discovery?
  • What is Reinforcement Learning from Human Feedback (RLHF)?
  • What is Direct Preference Optimization (DPO)?
  • How is DPO used for multi-objective optimisation of drug properties?

Drug design depends on multiple parameters.

Usually, the first step in drug design is to find a drug with a high binding affinity to the target protein. Alternatively, the target protein itself can be modified, e.g. via mutations in its sequence, to change its binding to a receptor; depending on the biological context, the binding can be made stronger or weaker. But binding affinity is only one of many criteria that have to be fulfilled for a successful drug. Other criteria are expressibility, synthesizability, stability, immunogenicity, solubility and bioavailability. If a drug fails on even one of these factors, its development is not successful. Usually, all these criteria are optimised independently in silico and then tested in vitro and in vivo. To reduce the computational cost, a multi-objective optimisation framework is desirable. Here, I briefly describe two techniques used for the multi-objective optimisation of different drug properties: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Both can be used with large language models (LLMs) to generate specific replies. Applied to protein language models (pLMs), they can be used to design drugs such as small molecules, peptides or antibodies with desired properties against a target of a given sequence.

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement learning (RL) is a machine learning paradigm where agents learn by interacting with an environment to maximise cumulative rewards. Through trial-and-error exploration, RL agents adjust their actions based on received feedback, aiming to achieve a defined goal. Techniques such as Q-learning and policy gradient methods enable agents to make decisions by estimating the expected rewards of different actions. Deep reinforcement learning (DRL) extends RL to handle high-dimensional input spaces using deep neural networks, achieving impressive results in game-playing and robotic control tasks.
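
As a concrete illustration of the trial-and-error loop described above, here is a minimal sketch of a tabular Q-learning update in Python. The state and action spaces and all hyperparameters are placeholders of my own choosing, not tied to any system discussed in this article.

```python
import numpy as np

# Tabular Q-learning sketch (illustrative only; the environment is a placeholder).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))      # Q[s, a]: estimated expected cumulative reward
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def choose_action(state: int) -> int:
    """Epsilon-greedy: mostly exploit the current estimate, sometimes explore."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """One temporal-difference update towards the bootstrapped target."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```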

Many tasks involve complex or poorly defined goals, which limits the possible impact of deep RL and machine learning. This leads to non-optimal behaviour of the RL approach and to misalignment between the desired outputs and the objectives of the RL system. One way to improve RL so that it gives the desired outputs is to communicate the objectives to its agents. This can be done via human feedback on trajectories of the agent, which is the basis of the original formulation of Reinforcement Learning from Human Feedback by Paul Christiano et al. The method has since been extended to Constitutional AI. Its basic concept is shown in the following figure.

The process consists of two steps. In the first step (top), an AI assistant generates responses to harmful or toxic inputs. The model critiques its own response according to a principle in the constitution and then revises the original response in light of that critique. In this way, the response is repeatedly revised at each step based on principles from the constitution. Afterwards, supervised learning is used to fine-tune a pre-trained language model on the final revised responses.

In the second step (bottom), the model from the first step is used to generate a pair of responses to each input in a dataset of harmful inputs. The AI assistant then judges which of the two responses better follows a constitutional principle. This AI feedback produces an AI-generated preference dataset for harmlessness, which is mixed with the human-feedback dataset. A preference (reward) model is trained on the combined dataset, and finally the supervised model from the first step is fine-tuned via RL against this new model.

The parameterised reward model provides the feedback for optimising the language model. It is estimated by minimising the negative log-likelihood loss on the preference data and is then used to optimise the policy.
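
To make this concrete, here are the standard forms of these two objectives from the RLHF literature (the notation is mine, not from the article): the reward model r_φ is fitted on preference pairs, where y_w is preferred over y_l for prompt x, and the policy π_θ is then optimised against it with a KL penalty towards the reference policy π_ref:

```latex
% Reward-model fit: negative log-likelihood of the Bradley-Terry preference model
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% RL fine-tuning: maximise reward while staying close to the reference policy
\max_{\pi_\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\!\left[ r_\phi(x, y) \right]
  \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x) \right]
```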

Direct Preference Optimization (DPO)

DPO is very similar to RLHF. The main difference from RLHF is that in DPO, the reward function is directly expressed as a function of the optimal policy. Consequently, the optimal model is expressed in terms of the optimal and reference policies. This has the advantage that, in comparison to RLHF, no separate reward model has to be trained, and the DPO loss is directly the negative log-likelihood of the preference model.
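
Written out (again in my own notation, with β the same KL-penalty strength as above), the DPO loss for a policy π_θ against a reference policy π_ref is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
    \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
  \right) \right]
```

The implicit reward is β log(π_θ/π_ref), up to a prompt-dependent constant, which is why no explicit reward model needs to be fitted.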

Harmonic Discovery has used DPO to fine-tune a molecular language model to pass the filters of medicinal chemists. More specifically, they used a generative pre-trained transformer (GPT) architecture and an LSTM-based architecture from the small-molecule literature to sample training data for DPO. Fine-tuning with DPO was based on a binary filter, which only lets those molecules pass that do not contain undesired chemical substructures, lie in a specific molecular weight range and have fewer than two chiral centres and fewer than eight rings. In a second task, they fine-tuned their original baseline models on kinase inhibitors and then fine-tuned those new models with DPO on bioactivity data from IC50 measurements. In both experiments, they observed a significant increase in the filter pass rate upon fine-tuning with DPO. In the first task, this came with almost no decrease in the fraction of generated molecules with valid chemical structures, in the fraction of unique valid structures, or in the diversity of the molecules; in the second task, these metrics dropped noticeably. The main reason for this drop is the low number of positive samples in the training dataset.
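
As an illustration of how such a binary pass/fail filter could be implemented, here is a minimal RDKit-based sketch. The substructure alerts, molecular-weight window and thresholds below are placeholder assumptions of mine, not the filter actually used by Harmonic Discovery.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# Placeholder alert substructures (SMARTS); a real filter would use a curated list.
ALERT_SMARTS = [Chem.MolFromSmarts(s) for s in ["[N+](=O)[O-]", "C(=O)Cl"]]

def passes_filter(smiles: str,
                  mw_range=(250.0, 500.0),   # assumed molecular-weight window
                  max_chiral_centres=1,      # "fewer than two chiral centres"
                  max_rings=7) -> bool:      # "fewer than eight rings"
    """Binary medicinal-chemistry-style filter used to label molecules for DPO."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if any(mol.HasSubstructMatch(alert) for alert in ALERT_SMARTS):
        return False
    if not (mw_range[0] <= Descriptors.MolWt(mol) <= mw_range[1]):
        return False
    if len(Chem.FindMolChiralCenters(mol, includeUnassigned=True)) > max_chiral_centres:
        return False
    if rdMolDescriptors.CalcNumRings(mol) > max_rings:
        return False
    return True

# Molecules that pass become "chosen" samples and failing ones "rejected" samples
# when assembling DPO preference pairs.
```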

DPO of autoregressive protein language models

Aikium used DPO to design peptide binders against given receptors. They fine-tuned ProtGPT2, a GPT2-style autoregressive (decoder-only) transformer model, to generate binders against a receptor which the user specifies in a chat-like template. One advantage of ProtGPT2 is the size of its vocabulary: while other pLMs like ESM2 use a vocabulary of about 33 tokens to capture all standard and some non-standard amino acids plus, e.g., gaps, ProtGPT2 has a vocabulary of 50,257 tokens, which allows the expression of protein designs with more complex structures based on long oligomeric residue blocks. Fine-tuning was performed autoregressively on peptide-receptor pairs, so that the likelihood of a new pair is the product of the conditional probabilities of its tokens given all previous tokens.
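
In formula form (my notation): for a receptor sequence x and a binder y = (y_1, …, y_T) generated token by token, the autoregressive model assigns

```latex
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta\!\left( y_t \mid x,\, y_{<t} \right)
```

and fine-tuning maximises this likelihood over the peptide-receptor training pairs.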

After fine-tuning the pLM, DPO is used to generate binders with given physicochemical properties. In their paper, they specifically generated binders with enhanced target specificity and without a low isoelectric point (pI); a low pI corresponds to highly negatively charged peptides that are unlikely to bind to any target that is not heavily positively charged. They report 97.5% accuracy after DPO fine-tuning without destroying the model's perplexity. Moreover, the designed binders increase the median pI value by a factor of 1.2 to 1.5, depending on the sampling strategy used.
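
A minimal sketch of how such pI-based preference pairs could be assembled is shown below, using Biopython's ProtParam module to estimate the isoelectric point. The pI threshold, the pairing scheme and the toy sequences are assumptions of mine for illustration, not Aikium's actual pipeline.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

PI_THRESHOLD = 7.0  # assumed cutoff separating "preferred" from "dispreferred" binders

def isoelectric_point(peptide: str) -> float:
    """Estimate the isoelectric point of a peptide sequence."""
    return ProteinAnalysis(peptide).isoelectric_point()

def build_preference_pairs(receptor: str, sampled_binders: list[str]) -> list[dict]:
    """Pair high-pI binders (chosen) with low-pI binders (rejected) for DPO.

    Each example follows the usual (prompt, chosen, rejected) preference format,
    with the receptor sequence acting as the prompt of the chat-like template.
    """
    high = [b for b in sampled_binders if isoelectric_point(b) >= PI_THRESHOLD]
    low = [b for b in sampled_binders if isoelectric_point(b) < PI_THRESHOLD]
    return [{"prompt": receptor, "chosen": c, "rejected": r} for c, r in zip(high, low)]

# Example usage with toy sequences (not real binders):
pairs = build_preference_pairs(
    receptor="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    sampled_binders=["KRKRLLKKAG", "DEEDSDDAEG", "RKKLHIGGKR", "EEDDLEDSGE"],
)
```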

Summary

In summary, Direct Preference Optimization is a good alternative for multi-parameter optimisation. The examples above show that the decrease in accuracy after DPO, which is inevitable in comparison to a baseline model fine-tuned on a single property, is small, while the increase in reward, measured by the number of accepted samples on the other properties, is comparatively large. DPO on pLMs is not limited to small-molecule and peptide design but can also be used for the design of large proteins, including antibodies. While DPO should never replace the opinion of an expert in the drug discovery process, such as a medicinal chemist, it may be a way to express that expert's views numerically, in a data-driven fashion, provided enough positive and negative examples are present in the dataset.
