Predicting protein structure from coevolutionary information

This research perspective was sponsored by Entrepreneur First (EF), Europe’s leading pre-seed investment fund for technical founders in deep tech. It runs a unique 6-month entrepreneurship programme that helps its members build high-growth, defensible technology companies from scratch, typically before they have an idea or a team.

EF engages active researchers in the programme to help them maximise the impact of their research. If you are a PhD student or postdoc in a technical subject, especially Machine Learning, Artificial Intelligence, IoT, Wearable Technology, NLP, Bioinformatics, Big Data or similar, you can apply to join the next cohort or ask Chloe, EF’s talent associate, for more information.

Research perspective written by Tomasz Kosciolek
Bioinformatics Group, Department of Computer Science
University College London, London, UK

How can I explain the importance of this research to the general public?

A protein’s three-dimensional structure is the key to understanding its function, its relationships to other proteins, and aspects of its evolution. That structure is determined by the protein’s primary (amino acid) sequence. So, instead of relying only on costly and laborious X-ray crystallography or nuclear magnetic resonance (NMR) experiments, researchers have worked for years to develop computational methods that predict the structure of a protein from its amino acid sequence alone.

Protein structure prediction was previously limited either to close evolutionary neighbours of proteins with known experimental structures (via a process called homology modelling), or to proteins that showed structural similarities to known proteins (via a process called fold recognition). De novo modelling (i.e. using only the protein sequence and no other experimental information) is necessary to predict proteins with only remote structural similarity to known proteins, or with entirely new folds. Until now it was limited to a handful of success stories and could not be used routinely.

MetaPSICOV is a meta-predictor: it uses machine learning to combine different statistical approaches and physicochemical properties, making its predictions significantly better than those of any input method alone. It is able to predict contacts for many de novo modelling cases, a stepping stone towards making many more protein structures accessible to prediction.

This study shows that, with the protein sequence information now available, we can build computational models that give accurate structural information for many more proteins. With the help of MetaPSICOV we are now closer than ever not only to discovering new protein folds and understanding molecular evolution better, but also to solving many biomedical challenges.

Why is this important for researchers in fields other than computational biology?

This study shows that we are now able to accurately predict contacts for protein families of around 200 non-redundant sequences and, on average, achieve 60% higher precision than any previous method. For research in structural biology and structural bioinformatics, this means that state-of-the-art de novo protein structure prediction algorithms, such as FRAGFOLD or Rosetta, can use these contacts to predict with high accuracy the structures of proteins previously addressed by the Protein Structure Initiative (PSI). More generally, structural biology can now benefit directly from the vast sequencing efforts of recent years and the resulting accumulation of sequence data.
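For readers outside the field, it may help to see what "precision" of a contact prediction means in practice. The following is a minimal sketch, not the evaluation code used in the study: the function name `top_k_precision`, the residue pairs and the scores are all hypothetical, chosen only to illustrate the idea of checking the highest-scoring predicted contacts against contacts observed in a known structure.

```python
def top_k_precision(predicted, true_contacts, k):
    """Fraction of the k highest-scoring predicted pairs that are true contacts.

    predicted: list of ((i, j), score) tuples, one per residue pair.
    true_contacts: set of (i, j) pairs observed in a reference structure.
    """
    # Keep only the k most confident predictions.
    ranked = sorted(predicted, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _score in ranked if pair in true_contacts)
    return hits / len(ranked)


# Hypothetical toy data: four scored residue pairs and three real contacts.
predicted = [((1, 20), 0.95), ((3, 30), 0.80), ((5, 12), 0.60), ((2, 40), 0.40)]
true_contacts = {(1, 20), (5, 12), (7, 33)}

# Two of the three top-scoring pairs are real contacts, so precision is 2/3.
precision_top3 = top_k_precision(predicted, true_contacts, k=3)
print(f"Precision of top 3 predictions: {precision_top3:.2f}")
```

In this toy case two of the three most confident predictions are correct. A method with higher precision gives a structure prediction program fewer misleading restraints to work with.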

Why is this important for researchers in the same field?

This study builds upon recent developments in statistical methods for predicting residue-residue contacts. It introduces a meta-predictor that combines several methods for inferring covariation signals from multiple sequence alignments, methods that draw on largely orthogonal sources of information. The advent of accurate residue-residue contact prediction showed promise for advancing protein structure prediction, because predicted contacts can be applied in de novo structure prediction protocols. However, individual covariation methods reach high accuracy only with large and diverse multiple sequence alignments, which usually correspond to families whose experimental 3D structures are already known. By leveraging the orthogonality of the different classes of statistical methods used for contact prediction, MetaPSICOV is able to predict intra-protein contacts accurately for protein families three times smaller than individual methods require.
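To make the idea of a covariation signal concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information. This is an assumption made for illustration only: mutual information is one of the simplest covariation measures and is not presented here as the method used by MetaPSICOV or its input predictors, which apply corrections (for example for indirect couplings and sampling bias) that this sketch omits. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(alignment, i, j):
    """Mutual information (in nats) between columns i and j of an alignment."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Compare the observed pair frequency with the frequency expected
        # if the two columns varied independently.
        mi += p_ab * math.log(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


def rank_column_pairs(alignment):
    """Return all column pairs sorted from highest to lowest mutual information."""
    length = len(alignment[0])
    scores = {
        (i, j): column_mutual_information(alignment, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: columns 0 and 3 change together (A with D,
# K with E), column 1 is fully conserved, column 2 varies independently.
toy_alignment = ["AGCD", "AGTD", "KGCE", "KGTE"]
ranked = rank_column_pairs(toy_alignment)
best_pair, best_score = ranked[0]
print(best_pair, round(best_score, 3))
```

Only the pair of columns that mutate together receives a non-zero score here. With just four sequences such a signal would be unreliable in practice, which is why the size and diversity of the alignment matter so much for these methods.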

Original research

MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins
David T. Jones, Tanya Singh, Tomasz Kosciolek, Stuart Tetchner
Bioinformatics, published online 26 November 2014


This work was supported by the Biotechnology and Biological Sciences Research Council UK and the Wellcome Trust. The original research was published in Bioinformatics.

Originally published at