Large Language Models for Drug Discovery

Jer Hayes
Published in Labs Notebook · Jun 21, 2023

When it comes to using LLMs that deal with small molecules or proteins, there are essentially two families of pretrained models: BERT and GPT. Both model families are based on the transformer architecture, but they differ in their training objectives and the way they process input data.

BERT is a deep bidirectional model, originally created by Google, that has been designed to pre-train representations of language from unlabeled text. The model considers both left and right context in all layers to learn a better understanding of language. During pre-training, BERT uses a masked language modeling objective: some of the input tokens are masked and the model is trained to predict the original token from the surrounding context. This approach has proven successful, with BERT achieving impressive results in various NLP tasks such as named entity recognition and question answering. In the context of drug discovery, the masking objective is what makes BERT useful: we can take an example input, e.g., the FASTA sequence of a protein, mask parts of that sequence, and have the model infer the missing residues. This lets us generate new sequences by filling in the masked positions.

GPT, or Generative Pre-trained Transformer, is another transformer-based language model, developed by OpenAI. Unlike BERT, GPT is a unidirectional model and can only consider the context that precedes a given word. To train GPT, a large amount of text data is used for a language modeling task in which the model predicts the next word in a sequence based on the preceding words. In the context of drug discovery, the text these models are trained on is usually either very large chemical datasets in which molecules are represented as strings in the SMILES format, or FASTA strings for large molecules (proteins). We can use these models to quickly generate novel structures which could potentially be useful. However, we would like to score these new structures with respect to properties such as toxicity, binding affinity, ease of synthesis, and drug-likeness to know how useful a novel structure may be. Generating new structures, whether they are small molecules or proteins, is easy; finding ones that are actually useful is difficult.
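To make this concrete, the snippet below is a minimal sketch of sampling SMILES strings with the Hugging Face text-generation pipeline. The model id "your-org/smiles-gpt2" is a placeholder rather than a real checkpoint; any GPT-style model trained on SMILES strings would slot in, and the prompt and sampling parameters are purely illustrative.

from transformers import pipeline

# Placeholder model id: substitute a GPT-style model trained on SMILES strings
generator = pipeline('text-generation', model="your-org/smiles-gpt2")

# Sample a handful of candidate molecules (prompt and parameters are illustrative)
candidates = generator("C", max_length=64, do_sample=True, num_return_sequences=5)
for candidate in candidates:
    print(candidate['generated_text'])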

Let’s look at some examples from Hugging Face to illustrate how these models can be used directly as-is before looking at fine-tuning.

Out-of-the-box uses

ProtGPT2 is a language model for protein design that is trained on the protein space. By following the principles of natural proteins, ProtGPT2 generates de novo protein sequences that exhibit natural amino acid propensities. The architecture of ProtGPT2 is based on the GPT2 Transformer and comprises 36 layers with a model dimensionality of 1280, totaling 738 million parameters. [nferruz/ProtGPT2 · Hugging Face]

In a few short lines of code we can use ProtGPT2 to generate protein sequences as in the example below:

>>> from transformers import pipeline
>>> protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
# length is expressed in tokens, where each token has an average length of 4 amino acids.
>>> sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
>>> for seq in sequences:
...     print(seq)
{'generated_text': 'MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG\nDFDRLKQTLWNLLSNAVKFTPSGGTVELELGYNAEGMEVYVKDSGIGIDPAFLPYVFDRF\nRQSDAADSRNYGGLGLGLAIVKHLLDLHEGNVSAQSEGFGKGATFTVLLPLKPLKRELAA\nVNRHTAVQQSAPLNDNLAGMKILIVEDRPDTNEMVSYILEEAGAIVETAESGAAALTSLK\nSYSPDLVLSDIGMPMMDGYEMIEYIREWKTTKGG'}
{'generated_text': 'MQGDSSISSSNRMFT\nLCKPLTVANETSTLSTTRNSKSNKRVSKQRVNLAESPERNAPSPASIKTNETEEFSTIKT\nTNNEVLGYEPNYVSYDFVPMEKCNLCNENCSIELASLNEETFVKKTICCHECRKKAIENA\nENNNTKGSAVSNNSVTSSSGRKKIIVSGSQILRNLDSLTSSKSNISTLLNPNHLAKLAKN\nGNLSSLSSLQSSASSISKSSSTSSTPTTSPKVSSPTNSPSSSPINSPTP'}
{'generated_text': 'M\nSTHVSLENTLASLQATFFSLEARHTALETQLLSTRTELAATKQELVRVQAEISRADAQAQ\nDLKAQILTLKEKADQAEVEAAAATQRAEESQAALEAQTAELAQLRLEKQAPQHVAEEGDP\nQPAAPTTQAQSPVTSAAAAASSAASAEPSKPELTFPAYTKRKPPTITHAPKAPTKVALNP\nSTLSTSGSGGGAKADPTPTTPVPSSSAGLIPKALRLPPPVTPAASGAKPAPSARSKLRGP\nDAPLSPSTQS'}
{'generated_text': 'MVLLSTGPLPILFLGPSLAELNQKYQVVSDTLLRFTNTV\nTFNTLKFLGSDS\n'}
{'generated_text': 'M\nNNDEQPFIMSTSGYAGNTTSSMNSTSDFNTNNKSNTWSNRFSNFIAYFSGVGWFIGAISV\nIFFIIYVIVFLSRKTKPSGQKQYSRTERNNRDVDSIKRANYYG\n'}
{'generated_text': 'M\nEAVYSFTITETGTGTVEVTPLDRTISGADIVYPPDTACVPLTVQPVINANGTWTLGSGCT\nGHFSVDTTGHVNCLTGGFGAAGVHTVIYTVETPYSGNSFAVIDVNVTEPSGPGDGGNGNG\nDRGDGPDNGGGNNPGPDPDPSTPPPPGDCSSPLPVVCSDRDCADFDTQAQVQIYLDRYGG\nTCDLDGNHDGTPCENLPNNSGGQSSDSGNGGGNPGTGSTHQVVTGDCLWNIASRNNGQGG\nQAWPALLAANNESITNP'}
{'generated_text': 'M\nGLTTSGGARGFCSLAVLQELVPRPELLFVIDRAFHSGKHAVDMQVVDQEGLGDGVATLLY\nAHQGLYTCLLQAEARLLGREWAAVPALEPNFMESPLIALPRQLLEGLEQNILSAYGSEWS\nQDVAEPQGDTPAALLATALGLHEPQQVAQRRRQLFEAAEAALQAIRASA\n'}
{'generated_text': 'M\nGAAGYTGSLILAALKQNPDIAVYALNRNDEKLKDVCGQYSNLKGQVCDLSNESQVEALLS\nGPRKTVVNLVGPYSFYGSRVLNACIEANCHYIDLTGEVYWIPQMIKQYHHKAVQSGARIV\nPAVGFDSTPAELGSFFAYQQCREKLKKAHLKIKAYTGQSGGASGGTILTMIQHGIENGKI\nLREIRSMANPREPQSDFKHYKEKTFQDGSASFWGVPFVMKGINTPVVQRSASLLKKLYQP\nFDYKQCFSFSTLLNSLFSYIFNAI'}
{'generated_text': 'M\nKFPSLLLDSYLLVFFIFCSLGLYFSPKEFLSKSYTLLTFFGSLLFIVLVAFPYQSAISAS\nKYYYFPFPIQFFDIGLAENKSNFVTSTTILIFCFILFKRQKYISLLLLTVVLIPIISKGN\nYLFIILILNLAVYFFLFKKLYKKGFCISLFLVFSCIFIFIVSKIMYSSGIEGIYKELIFT\nGDNDGRFLIIKSFLEYWKDNLFFGLGPSSVNLFSGAVSGSFHNTYFFIFFQSGILGAFIF\nLLPFVYFFISFFKDNSSFMKLF'}
{'generated_text': 'M\nRRAVGNADLGMEAARYEPSGAYQASEGDGAHGKPHSLPFVALERWQQLGPEERTLAEAVR\nAVLASGQYLLGEAVRRFETAVAAWLGVPFALGVASGTAALTLALRAYGVGPGDEVIVPAI\nTFIATSNAITAAGARPVLVDIDPSTWNMSVASLAARLTPKTKAILAVHLWGQPVDMHPLL\nDIAAQANLAVIEDCAQALGASIAGTKVGTFGDAAAFSFYPTKNMTTGEGGMLVTNARDLA\nQAARMLRSHGQDPPTAYMHSQVGFN'}

Now let us look at a BERT-based model: ProtBert. As its name suggests, ProtBert is a BERT model trained on a large corpus of protein sequences in a self-supervised fashion. By pre-training on a massive dataset of protein sequences (ProtBert was pretrained on UniRef100, a dataset consisting of 217 million protein sequences), ProtBert has potentially acquired an understanding of the fundamental principles of protein structure. In this domain, function is driven by structure. [Rostlab/prot_bert · Hugging Face]

Below is an example of using ProtBert to impute masked values in protein sequences and the type of output it returns.

from transformers import BertForMaskedLM, BertTokenizer, pipeline

# Load the ProtBert tokenizer and masked-language model
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

# Build a fill-mask pipeline and predict the amino acid at the [MASK] position
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')

[{'score': 0.11088453233242035,
'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
'token': 5,
'token_str': 'L'},
{'score': 0.08402521163225174,
'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
'token': 10,
'token_str': 'S'},
{'score': 0.07328339666128159,
'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
'token': 8,
'token_str': 'V'},
{'score': 0.06921856850385666,
'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
'token': 12,
'token_str': 'K'},
{'score': 0.06382402777671814,
'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
'token': 11,
'token_str': 'I'}]

Fine-tuning

Fine-tuning is perhaps the most useful application of large language models (LLMs) for drug discovery-related tasks. Pretrained LLMs have the potential to become the baselines for AI / machine learning models of molecular structure. In the same way that pre-training on massive amounts of unlabeled text gives LLMs the ability to understand language at a deep level, training on massive datasets that cover large parts of the molecular space allows these models to gain a deep understanding of that space. In practice, fine-tuning allows the model to learn how to solve specific tasks based on a smaller amount of labelled data.

For example, seyonec/ChemBERTa-zinc-base-v1 · Hugging Face is a BERT-like transformer model for masked language modelling of chemical SMILES strings. SMILES strings represent small molecules. For example, aspirin, or rather its active ingredient acetylsalicylic acid, can be represented in SMILES as follows:

O=C(C)Oc1ccccc1C(=O)O

This string represents the atoms and the bonds between the atoms for this structure:

3D Conformer visual representation of Aspirin
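As a quick illustration of what the string encodes (using RDKit purely to inspect the molecule; RDKit is not part of the language models discussed here), the SMILES can be parsed back into an explicit molecular graph:

from rdkit import Chem

# Parse the aspirin SMILES string into a molecule object
mol = Chem.MolFromSmiles("O=C(C)Oc1ccccc1C(=O)O")

print(mol.GetNumAtoms())      # 13 heavy atoms (9 carbons, 4 oxygens)
print(mol.GetNumBonds())      # 13 bonds between those atoms
print(Chem.MolToSmiles(mol))  # canonical SMILES for the same structure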

In a similar fashion to the pretrained models for proteins, this LLM is trained over SMILES strings, and the representations of atoms and functional groups it has learned can be used to address challenges in drug discovery, such as toxicity, solubility, drug-likeness, and synthetic accessibility, even when working with smaller datasets. We can fine-tune BERT using the representations the model has learned.

Using the existing model to build other models:

from simpletransformers.classification import ClassificationModel
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

We can take a dataset like the Tox21 challenge, which was introduced in 2014 in an attempt to build models that predict compounds’ interference in biochemical pathways using only chemical structure data. Ultimately we can take ChemBERTa and build a classification model for toxicity that achieves results like the following (a sketch of the fine-tuning code follows the results):

{'mcc': 0.7465463085902863,
'tp': 63,
'tn': 331,
'fp': 8,
'fn': 26,
'auroc': 0.9578403102316795,
'auprc': 0.8825009755264663,
'acc': 0.9205607476635514,
'eval_loss': 0.2261218675929639}
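The snippet below is a rough sketch of how such a fine-tuning run can be set up with simpletransformers. The CSV file names, column names, and hyperparameters are illustrative assumptions, not the exact configuration used to produce the numbers above.

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Illustrative Tox21-derived CSVs: one SMILES string per row plus a 0/1 toxicity label
train_df = pd.read_csv("tox21_train.csv")[["smiles", "toxic"]]
train_df.columns = ["text", "labels"]  # simpletransformers expects these column names
eval_df = pd.read_csv("tox21_eval.csv")[["smiles", "toxic"]]
eval_df.columns = ["text", "labels"]

# ChemBERTa is a RoBERTa-style model, so it loads with model_type "roberta"
# (set use_cuda=False if no GPU is available)
model = ClassificationModel(
    "roberta",
    "seyonec/ChemBERTa-zinc-base-v1",
    num_labels=2,
    args={"num_train_epochs": 10, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # mcc, tp/tn/fp/fn, eval_loss, ...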

These models can be used for drug screening, e.g., they can help us to score new candidates that have been generated by other LLMs. We have seen how easy it is to create new molecular structures, but it is also relatively easy to create ML models that make predictions about these new structures so we can select the best candidates.
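Once trained, the same classifier object can score newly generated structures directly; a minimal sketch (the candidate SMILES strings here are placeholders):

# Score generated candidates with the fine-tuned toxicity classifier from above
candidates = ["O=C(C)Oc1ccccc1C(=O)O", "CCO"]  # placeholder SMILES strings
predictions, raw_outputs = model.predict(candidates)
print(predictions)  # e.g. [0, 0] if both are predicted non-toxic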

What is Accenture Labs doing?

Accenture Labs is using Generative AI across several domains including health and life sciences. One example is using LLMs for protein generation, and especially for the discovery of novel antimicrobial peptides. Biological drugs (biopharmaceuticals, biotherapeutics) are made from proteins, and in recent years protein-based biologics have become an important type of therapeutic molecule due to their many benefits. These benefits include high specificity, low toxicity and immunogenicity, and the ability to replace or supplement the body’s own proteins and hormones. GPT models that have been trained on the protein molecular space can generate valid protein sequences, as shown in the section on ProtGPT2. Such models can be fine-tuned to generate antimicrobial peptides. They can also be further fine-tuned to generate antimicrobial peptides with properties of interest (e.g., developability). For example, we may want to generate peptides that have an antibiotic effect, as the long-term use and abuse of traditional antibiotics have resulted in bacterial drug resistance. Discovering new antibiotics is becoming increasingly challenging, but the generation of novel antimicrobial peptides may give rise to useful alternatives and new treatments.

This image shows two FASTA sequences and the respective folds of two peptides. Both these peptides potentially have an antimicrobial effect.
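As a rough sketch of what such fine-tuning can look like with the Hugging Face Trainer API (the sequence file, pre-processing, and hyperparameters below are illustrative assumptions, not Accenture Labs’ actual pipeline):

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Illustrative input: a plain-text file with one antimicrobial peptide sequence per line
dataset = load_dataset("text", data_files={"train": "amp_sequences.txt"})

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal language modelling objective: predict the next token (mlm=False)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(output_dir="protgpt2-amp",
                         num_train_epochs=3,
                         per_device_train_batch_size=4)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()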

In addition, Accenture Labs has used LLM embeddings as feature representations of proteins and small molecules in autoencoder- and GAN-based models to improve the utility of those models. The representations produced by BERT- or GPT-based models can be viewed as an efficient compression of the original features.
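A minimal sketch of extracting such embeddings, here with ProtBert and the feature-extraction pipeline (mean pooling is one common way to obtain a fixed-length vector, not necessarily the exact setup used by Labs):

import numpy as np
from transformers import BertModel, BertTokenizer, pipeline

# Reuse ProtBert as a frozen encoder for downstream autoencoder / GAN models
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
embedder = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

# Amino acids are space-separated to match the ProtBert vocabulary
features = embedder("D L I P T S S K L V V L D T S L Q V K K A F F A L V T")
token_embeddings = np.array(features[0])            # (sequence_length, hidden_size)
protein_embedding = token_embeddings.mean(axis=0)   # one fixed-length vector per protein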

Discussion

In conclusion, LLMs such as BERT and GPT are useful for drug discovery-related tasks, especially when fine-tuned on smaller labelled datasets. Pretraining on large amounts of unlabeled data allows LLMs to gain a deep understanding of the molecular space, and the learned representations of atoms and functional groups can be used to address challenges in drug discovery. In this domain there are vast libraries of unlabelled data for both proteins and small molecules. Fine-tuned models can be used for drug screening and selecting the best candidates. Overall, the combination of LLMs and machine learning has great potential for molecular structure prediction and drug discovery.

Further reading

Payel Das from IBM has a very detailed video of their work on LLMs in relation to the molecular space: Payel Das — Design and Evaluation of Foundation Models and Generative AI in Molecular Space — YouTube

Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7

Lei J, Sun L, Huang S, et al. The antimicrobial peptides and their potential clinical applications. American Journal of Translational Research. 2019;11(7):3919–3931. PMID: 31396309; PMCID: PMC6684887.

The code examples shown are drawn from a small number of the public models available on the Hugging Face Hub: Models — Hugging Face
