Underrated Machine Learning Papers for Protein Design

Sean Aubin
6 min read · Mar 3, 2020


ProteinQure is a top 100 AI company, as determined by CB Insights

CB Insights (a leading tech and startup publication) has included ProteinQure as a top AI company in the Healthcare space for 2020. We thought we could celebrate by highlighting some of the methods that ProteinQure has used to help generate novel proteins for therapeutic purposes.

The field of Machine Learning (ML) is generating a continuous avalanche of papers, even when restricted to the domain of Computational Biology. At ProteinQure, we stay up to date with the newest academic developments with an eye for pragmatically applying them to our projects. Here are some papers we consider to be “hidden gems” with unique insights you may have missed.

Deep Learning Accelerates Molecular Dynamics Simulations

In most cases, Deep Learning requires huge amounts of training data. Acquiring that data for drug discovery is usually expensive, since each data point has to come from a wet-lab experiment. Even if a huge training dataset is assembled, a model trained on it is still limited by neural networks’ poor ability to extrapolate to new data points. This can be a real problem in drug discovery, since the number of candidate sequences grows exponentially with even minor increases in sequence length. It’s infeasible to curate a dataset that is representative of the complete potential input distribution.
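To make the exponential growth concrete, here is a small sketch. It assumes peptides built only from the 20 standard amino acids, so each extra position multiplies the number of candidates by 20. The function name is ours, chosen for illustration.

```python
# Number of distinct sequences of a given length,
# assuming the 20 standard amino acids are allowed at every position.
ALPHABET_SIZE = 20

def sequence_space_size(length: int) -> int:
    return ALPHABET_SIZE ** length

for length in (5, 10, 15):
    print(f"length {length:>2}: {sequence_space_size(length):.2e} possible sequences")
```

Going from 10 to 15 residues already takes the space from roughly 10^13 to roughly 3 x 10^19 sequences, far beyond what any wet-lab campaign can cover.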

Biophysics-based approaches like Molecular Dynamics (MD) simulations circumvent this issue by directly modeling protein dynamics and properties. Unfortunately, trajectories from MD simulations are often computationally expensive to obtain, especially for longer timescales.

To accelerate an MD simulation, a subset of possible trajectories is sampled as a way to “cheat” around simulating every time-step of a trajectory. Sampling is performed until an acceptable trajectory is found, which can take a long time.

In “Neural Networks based Variationally Enhanced Sampling” (NN-VES), a group from ETH Zurich proposes biasing this trajectory sampling using ML. The method assumes that a system can be described in terms of a simpler set of variables, called Collective Variables. The desired state is modeled as a probability distribution over this set of Collective Variables.

By optimizing over the target distribution, NN-VES is able to find an acceptable trajectory much faster than a vanilla MD simulation.
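The effect of a bias can be illustrated with a toy example. This is not NN-VES itself: NN-VES learns the bias with a neural network, whereas the sketch below assumes a one-dimensional double-well energy that we know analytically, uses the position x as the only Collective Variable, and writes down the ideal bias for a uniform target distribution by hand. It also uses simple Metropolis Monte Carlo moves rather than real MD. The barrier height, step size and function names are all assumptions made for illustration.

```python
import math
import random

BARRIER = 15.0  # barrier height in units of kT (assumed value for illustration)

def potential(x: float) -> float:
    """Double-well energy (in kT) with minima at x = -1 and x = +1."""
    return BARRIER * (x * x - 1.0) ** 2

def ideal_bias(x: float) -> float:
    """Bias that exactly cancels the potential, so the target is uniform."""
    return -potential(x)

def fraction_in_right_well(bias, steps: int = 20000, seed: int = 0) -> float:
    """Metropolis sampling of x in [-2, 2]; returns the fraction of steps with x > 0."""
    rng = random.Random(seed)
    x = -1.0  # start in the left well
    right = 0
    for _ in range(steps):
        trial = x + rng.uniform(-0.3, 0.3)
        if abs(trial) <= 2.0:  # moves outside the box are rejected
            delta = (potential(trial) + bias(trial)) - (potential(x) + bias(x))
            if delta <= 0 or rng.random() < math.exp(-delta):
                x = trial
        right += x > 0
    return right / steps

unbiased = fraction_in_right_well(lambda x: 0.0)
biased = fraction_in_right_well(ideal_bias)
print(f"fraction of time in right well, unbiased: {unbiased:.3f}")
print(f"fraction of time in right well, biased:   {biased:.3f}")
```

Without the bias, the walker stays trapped in the well it started in because crossing the barrier is extremely unlikely. With the bias, it moves freely between both wells in the same number of steps. In a real system the ideal bias is unknown, which is the part NN-VES tackles by learning it.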

Learning Protein representation beyond simple structural descriptions

ML models depend heavily on the format of the training data given. Each piece of information included in the representation of the data is known as a feature. We have to decide how to “describe” our potential therapeutic candidates. In drug discovery, the usual solution is to use a simple 1D description of the molecule (for example, its sequence) or a vector of physicochemical attributes chosen beforehand. But this can be either too simplistic (sequences) or arbitrary (which attributes?). Instead, we need a description that can represent the complexity of interactions in a molecule’s structure without requiring a human to design it. One solution to this problem is to use ML to discover a representation from the raw data. This approach is known as representation learning and often results in better performance than hand-designed representations. Best known for applications in natural language processing (such as GPT-2), it is now being explored for drug discovery.
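For reference, here is a minimal sketch of two common hand-designed descriptions of a peptide: a one-hot encoding of its sequence and a fixed-length amino acid composition vector. The example peptide and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def one_hot(sequence: str) -> list[list[int]]:
    """One row per residue, one column per amino acid type."""
    return [[int(residue == aa) for aa in AMINO_ACIDS] for residue in sequence]

def composition(sequence: str) -> list[float]:
    """Fixed-length vector: fraction of each amino acid type in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

peptide = "ACDKKL"  # hypothetical example sequence
encoded = one_hot(peptide)
print(len(encoded), "x", len(encoded[0]))
print([round(v, 2) for v in composition(peptide)])
```

The one-hot matrix keeps residue order but says nothing about how residues interact, and the composition vector throws the order away entirely. Neither encodes anything about structure, which is the gap a learned representation tries to fill.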

“Unified rational protein engineering with sequence-only deep representation learning” (UniRep) was developed by a team at Harvard and uses unlabelled protein sequence data to learn a broadly applicable representation of proteins. These representations are then used in downstream tasks, such as protein stability prediction and secondary structure prediction. Through extensive experiments, the paper shows UniRep outperforming other approaches on these downstream tasks. Although UniRep is limited by various factors (sampling biases in the sequence data, length of training, the size and coverage of sequence databases), it provides a new perspective on designing proteins directly from sequence.

These new representations allow you to train more accurate sequence-to-property models.

The figure below shows some questions ProteinQure tries to answer with these techniques.

Fig 1. These are the questions we can answer with representation learning

Evolutionary Patterns Hint at Structure

When proteins evolve from a common ancestor or origin, some structural and functional properties are preserved. Thus, there is an underlying pattern in how sequences mutate or evolve. This evolutionary pattern hints towards the presence of intra-sequence interactions across amino acids.

“Correlated Mutations and Residue Contacts in Proteins” presents a simple and intuitive way of understanding this mechanism by examining the correlations of mutations across sequences within a protein family. In other words, it calculates a score for each pair of positions that denotes how strongly mutations at the two positions are correlated. This correlation is then tied to possible residue contacts in the 3D protein structure: the authors speculate that positions with highly correlated mutations probably interact with each other in 3-dimensional space.

We can thus use pairs of amino acids that mutate together over evolution to help predict 3D structure.

This hypothesis was tested on protein families with known structures and achieved reasonable accuracy. This is particularly valuable since it reveals insights into the 3D structure of the protein, especially considering how hard it is to predict protein structures now, let alone when this paper was published in 1994!
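The idea of scoring pairs of positions can be sketched in a few lines. Note the swap: the paper defines its own correlation score, while the sketch below uses mutual information between alignment columns as a simpler stand-in for measuring covariation. The toy alignment is made up so that columns 1 and 3 (counting from 0) always mutate together.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment: columns 1 and 3 (0-based) mutate together.
ALIGNMENT = [
    "MKAEL",
    "MKAEL",
    "MRADL",
    "MRADL",
    "MKGEL",
    "MRGDL",
]

def mutual_information(col_a: int, col_b: int, alignment: list[str]) -> float:
    """Mutual information (in bits) between two alignment columns."""
    n = len(alignment)
    count_a = Counter(seq[col_a] for seq in alignment)
    count_b = Counter(seq[col_b] for seq in alignment)
    count_ab = Counter((seq[col_a], seq[col_b]) for seq in alignment)
    return sum(
        (c / n) * math.log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )

def rank_pairs(alignment: list[str]) -> list[tuple[float, int, int]]:
    """All column pairs, sorted from most to least covarying."""
    length = len(alignment[0])
    scores = [(mutual_information(i, j, alignment), i, j)
              for i, j in combinations(range(length), 2)]
    return sorted(scores, reverse=True)

for score, i, j in rank_pairs(ALIGNMENT)[:3]:
    print(f"columns {i} and {j}: {score:.2f} bits")
```

In this toy family, the pair (1, 3) ranks first, while fully conserved columns score zero because they never mutate. Under the paper's hypothesis, the top-ranked pairs would be the candidates for residue contacts.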

Novel Structured Representations for Small Molecule Generation

Designing and generating new molecules with specific chemical properties is a challenging problem. As mentioned above, we can represent a molecule from different perspectives, e.g. as a SMILES string or as a molecular graph. Although much work exists on generating molecules from linear SMILES strings, this representation cannot capture important molecular features. On the other hand, graph-based methods have shown that generating new molecules via incremental expansion at the atom level can improve the accuracy and validity of the generated molecules. However, choosing starting points for molecular graphs is difficult and can lead to chemically invalid molecules.
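As a small illustration of the two representations and of how atom-level expansion can go wrong, here is a sketch using ethanol. The validity check is a deliberately crude assumption: it only compares bond counts against the standard valences of neutral C, N, O and H, which is far less than a real cheminformatics toolkit checks. The function name and graph layout are ours.

```python
# Maximum number of bonds per atom (standard valences for neutral C, N, O, H).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def is_valid(atoms: list[str], bonds: list[tuple[int, int, int]]) -> bool:
    """bonds are (atom_index, atom_index, bond_order); hydrogens are implicit."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(u <= MAX_VALENCE[a] for a, u in zip(atoms, used))

# Ethanol as a SMILES string, and the same molecule as a graph.
ethanol_smiles = "CCO"
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]
print(is_valid(atoms, bonds))  # True

# Extending the graph by one atom: a carbon double-bonded to the oxygen
# gives the oxygen three bonds, which exceeds its valence of two.
atoms_ext = atoms + ["C"]
bonds_ext = bonds + [(2, 3, 2)]
print(is_valid(atoms_ext, bonds_ext))  # False
```

A generator that adds one atom at a time has to avoid steps like the second one at every stage of the expansion.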

Fig. from Jin et al. (2019)

“Junction Tree Variational Autoencoder for Molecular Graph Generation” proposes a novel molecular graph generative model using a deep neural encoder-decoder architecture. This approach first generates a junction tree, which is a special kind of graph, as the scaffold of a molecular graph. Then it combines various valid building blocks (like the vocabulary of a language) to generate a new graph. The paper evaluates the junction-tree idea on several datasets under various scenarios and shows that this approach generates new molecules with valid chemical properties.

These approaches allow us to generate novel molecules based on a given data set of valid chemical molecules.

Guiding Discovery over Multiple Experiments

Peptide optimization for specific biochemical functions is a critical step in drug discovery and development pipelines. Conventionally, this has been done primarily through blind or unguided screening techniques, such as phage display and random mutagenesis. These approaches are costly and inefficient. “Discovering de novo peptide substrates for enzymes using machine learning” demonstrates how ML can be leveraged to guide peptide optimization.

By using machine learning to suggest which peptides to test, you can use experiments more efficiently to arrive at potent hits.

In contrast with previous ML approaches, where only the candidates with the best predicted performance are experimentally evaluated for confirmation, the work presented in this paper uses an iterative approach to peptide optimization. The optimization starts with an initial set of peptides. Next, the ML component recommends the next set of peptides to be synthesized and experimentally examined for hits. These evaluated peptides are then fed back to the ML module, and the process repeats until an optimized set of peptides is obtained. Note that the molecules suggested by the ML module can be either attempts at the end objective or data points with high informational content for improving the model.
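The loop described above can be sketched as follows. Everything in it is an assumption made for illustration rather than the paper's method: the `assay` function is a made-up stand-in for a wet-lab measurement, the model is a crude additive average per position and residue, and "informative" peptides are approximated by random picks from the candidate pool.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LENGTH = 6
rng = random.Random(0)

def assay(peptide: str) -> float:
    """Hypothetical stand-in for a wet-lab measurement (higher is better)."""
    target = "KRWLAF"  # made-up optimum, unknown to the model
    return sum(a == b for a, b in zip(peptide, target)) + rng.gauss(0, 0.1)

def random_peptide() -> str:
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(LENGTH))

def fit(measured: dict[str, float]) -> dict[tuple[int, str], float]:
    """Additive surrogate: mean measured score per (position, residue)."""
    totals, counts = {}, {}
    for peptide, score in measured.items():
        for key in enumerate(peptide):
            totals[key] = totals.get(key, 0.0) + score
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

def predict(model, peptide: str, default: float) -> float:
    """Sum of per-position estimates; unseen residues get the default."""
    return sum(model.get(key, default) for key in enumerate(peptide))

def propose(model, measured, default, n_exploit=8, n_explore=4) -> list[str]:
    """Next batch: top predicted candidates plus a few random ones."""
    candidates = dict.fromkeys(random_peptide() for _ in range(2000))
    pool = [p for p in candidates if p not in measured]
    ranked = sorted(pool, key=lambda p: predict(model, p, default), reverse=True)
    exploit = ranked[:n_exploit]                         # attempts at the objective
    explore = rng.sample(ranked[n_exploit:], n_explore)  # data for the model
    return exploit + explore

# Initial set of peptides, then five rounds of suggest, measure, feed back.
measured = {p: assay(p) for p in (random_peptide() for _ in range(12))}
best_per_round = [max(measured.values())]
for _ in range(5):
    model = fit(measured)
    default = sum(measured.values()) / len(measured)
    for peptide in propose(model, measured, default):
        measured[peptide] = assay(peptide)
    best_per_round.append(max(measured.values()))
print([round(b, 2) for b in best_per_round])
```

Each round spends most of its experimental budget on the model's best guesses and a small share on peptides that mainly serve to improve the model, mirroring the two kinds of suggestions mentioned above.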

Special thanks to Sedigh, Sid and Hamid for submitting article summaries and working with me to make them understandable.


