The Protein Folding Problem

Recent advancements on the ultimate problem in biology

The key to curing diseases such as Alzheimer’s and Parkinson’s may lie in a fundamental biomolecule: the protein. Biologists and physicists alike have been trying to solve the protein folding problem for over half a century, with little real progress until recently. This article gives insight into how proteins fold, why it’s so difficult to predict how they fold, and what could be built on top of an accurate protein folding algorithm, along with a few related topics that may be of interest.

What are proteins?

Proteins are complex macromolecules made of chains of amino acids, often hundreds or thousands of them. They carry out nearly every biological function in your body and are essential to every living organism, from fighting disease to providing structure for cells. Our DNA contains the information for creating all these proteins, written in the four nucleotides A, C, G, and T. First, DNA is transcribed into mRNA, an intermediary molecule in the process of protein creation; during transcription, the T’s are replaced with U’s. Finally, the mRNA is translated, three nucleotides at a time, into a chain built from the 20 different amino acids that make up proteins.
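To make the DNA to mRNA to protein pipeline concrete, here is a minimal Python sketch. It assumes the input is the coding strand of the DNA (so transcription is just swapping T for U) and uses the standard genetic code with one-letter amino acid names. The function names `transcribe` and `translate` are made up for this illustration and are not from any particular library.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```

Real cells are messier than this (only one strand is read, and the resulting chain is often edited afterwards), but the core lookup from three-letter codon to amino acid really is this simple.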

Protein Structure

Proteins start off as a long chain of amino acids. In this state the protein is unstable, as it’s not at its lowest energy state. To reach that state, the protein folds into a complex 3D shape, which is determined by the sequence of amino acids it started off as, as well as the environment it’s in.

This structure is really important because the way a protein is folded largely defines how it functions in your body. For example, if the protein has a Y-shaped structure, it is likely an antibody that binds to viruses and bacteria to help protect the body. If the structure of the protein is globular, it may be used for transporting other small molecules throughout your body.

By understanding this folding code, we could make real progress against neurological diseases such as Alzheimer’s and Parkinson’s, which are linked to proteins misfolding in the brain and forming clumps that disrupt brain activity. The protein folding problem can be broken into three parts, outlined well by the following quote.

The protein folding problem is the most important unsolved problem in structural biochemistry. The problem consists of three related puzzles: i) what is the physical folding code? ii) what is the folding mechanism? and iii) can we predict the 3D structure from the amino acid sequences of proteins?

(Jankovic and Polovic 2017)

Knowing this code could change the way we treat many different diseases and opens up new potential in targeted drug discovery. It could also let us design biomaterials for highly accurate prosthetics, or for really anything that requires compatibility with a living organism. Advances in enzymes that break down pollutants such as plastic or oil could help us reduce their effect on the environment.

However, predicting the 3D structure of a protein by brute force is nigh on impossible because of the sheer number of structures a protein could take on. According to Levinthal’s paradox, it would take longer than the age of the universe to iterate through every possible conformation of a typical protein.
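A quick back-of-the-envelope calculation shows the scale involved. The sketch below assumes each amino acid can take on only 3 conformations (a common simplification) and that we could somehow test 10^13 conformations per second; both numbers, and the function name, are illustrative assumptions rather than measured values. It works in powers of ten so that long chains don't overflow.

```python
import math

SECONDS_PER_YEAR = 3.156e7      # roughly 365.25 days
AGE_OF_UNIVERSE_YEARS = 1.38e10  # about 13.8 billion years


def log10_brute_force_years(n_residues: int,
                            states_per_residue: int = 3,
                            samples_per_second: float = 1e13) -> float:
    """Return log10 of the years needed to visit every conformation once.

    All default values are assumptions chosen for illustration.
    """
    log10_states = n_residues * math.log10(states_per_residue)
    return log10_states - math.log10(samples_per_second) - math.log10(SECONDS_PER_YEAR)


exponent = log10_brute_force_years(100)
print(f"100 residues: about 10^{exponent:.0f} years")
print(f"age of the universe: about 10^{math.log10(AGE_OF_UNIVERSE_YEARS):.0f} years")
```

Even for a small protein of 100 amino acids, the search would take around 10^27 years under these assumptions, compared with roughly 10^10 years since the Big Bang. Real proteins fold in milliseconds to seconds, which is exactly why this is called a paradox.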

The way a protein folds is dependent on the amino acids it’s made of and the environment that it’s in. Folding happens as amino acids from distant parts of the chain are drawn together by attraction, and it is driven by a process called energy minimization.

Energy Minimization

Picture going on a hiking expedition across fields of rolling hills. The tops of these hills represent high-energy folds that are really unlikely to occur, and the valleys represent low-energy states that proteins are drawn towards. If each amino acid could take on just 3 conformations, a protein of x amino acids would have 3^x possible states, and when x ranges from hundreds to thousands, this soon becomes far too many combinations for modern computers to even attempt to enumerate. This is why we can't just try every combination and see which has the least energy.
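The hiking picture can be turned into a toy program. The sketch below is not a protein simulation: it uses a made-up one-dimensional "landscape" (a bowl with ripples) and a simple simulated-annealing search, where the hiker usually walks downhill but sometimes accepts an uphill step so they don't get stuck in the first shallow valley. The energy function, step size, cooling schedule, and function names are all assumptions chosen for illustration.

```python
import math
import random
from typing import Tuple


def energy(x: float) -> float:
    """Toy energy landscape: a broad bowl with ripples (hills and valleys)."""
    return 0.1 * x * x + math.sin(3.0 * x)


def anneal(start: float, steps: int = 5000, seed: int = 0) -> Tuple[float, float]:
    """Search for a low-energy position, returning (best_x, best_energy)."""
    rng = random.Random(seed)
    x = best_x = start
    e = best_e = energy(start)
    for step in range(steps):
        # Temperature falls linearly, so uphill moves become rarer over time.
        temperature = max(1e-3, 1.0 - step / steps)
        candidate = x + rng.uniform(-0.5, 0.5)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with a
        # probability that shrinks as the climb gets steeper.
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = anneal(start=4.0)
print(f"started at energy {energy(4.0):.3f}, best found {best_e:.3f} at x = {best_x:.3f}")
```

A real protein has thousands of coordinates instead of one, and a far more rugged landscape, which is why this kind of search alone hasn't been enough to solve the problem.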

With recent surges in the amount of genetic data we have access to as a result of cheaper genome sequencing, many biologists are turning to a data-driven solution to the protein folding problem. The cost of sequencing a genome has dropped from billions of dollars to a price that is far more practical.

The Data-Driven Approach

Established methods for determining protein structures experimentally include things with long names such as X-ray crystallography, cryo-electron microscopy, and nuclear magnetic resonance. These approaches are expensive and can take months, so more and more researchers are moving towards using machine learning to predict protein structures. CASP (Critical Assessment of protein Structure Prediction) is a biennial experiment where research groups test their prediction methods in competition with dozens of the world’s leading teams.

CASP saw a lot of stagnation in protein structure prediction in the early 2000s. This has changed recently, with each round bringing significant improvement in the quality of the results.

In the most recent CASP, Google’s DeepMind took everyone by surprise by winning first place by a huge margin. Their algorithm, named AlphaFold, was designed for de novo prediction, which means modelling target proteins from scratch, a daunting task.


AlphaFold’s algorithm consists of a dual residual network architecture, where one network predicts the distances between pairs of amino acids and the other predicts the angles at the chemical bonds that connect each amino acid. On top of this, they used a generative neural network to create new protein fragments, which were used to continually improve the proposed structure. Their code and research paper have yet to be released; however, there have been some attempts to replicate their concept, such as MiniFold.
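To get a feel for what "predicting distances between pairs of amino acids" means, here is a small sketch of the kind of target such a network learns to output: a table of distances between every pair of positions in the chain. This is not AlphaFold’s code. The coordinates are made up, the function names are hypothetical, and the 8 Å contact cutoff is a common convention that I am assuming here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """Distance between every pair of residues, given one 3D point per residue."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(distances: List[List[float]], threshold: float = 8.0) -> List[List[bool]]:
    """Mark residue pairs closer than the threshold (in angstroms) as 'in contact'."""
    return [[d < threshold for d in row] for row in distances]


# Three made-up residue positions, in angstroms.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
distances = distance_matrix(toy_coords)
for row in distances:
    print(row)
print(contact_map(distances))
```

Going from a known 3D structure to this table is easy, as shown. The hard part, and the part the network is trained for, is going the other way: guessing the table from the amino acid sequence alone and then finding a 3D shape that agrees with it.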

Using this, they were able to create predictions for many different proteins and set the state of the art in this field. The main limitation currently holding AlphaFold and other machine learning algorithms back is the lack of available data. Currently, there are about 145,000 protein structures catalogued in the Protein Data Bank, while the total number of different proteins found in nature is thought to be around 10¹². The space of possible proteins is larger still: a chain of 200 amino acids can be written in 20²⁰⁰ different sequences, leaving lots of room for scientific discovery.
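These numbers are hard to picture, so here is the arithmetic spelled out in Python, which handles arbitrarily large integers exactly. It simply counts the ways to pick one of 20 amino acids at each of 200 positions and compares the article’s figures for catalogued and naturally occurring proteins; the variable names are mine.

```python
amino_acid_types = 20
chain_length = 200

# One of 20 choices at each of 200 positions.
possible_chains = amino_acid_types ** chain_length

catalogued_structures = 145_000        # Protein Data Bank figure quoted above
estimated_natural_proteins = 10 ** 12  # estimate quoted above

digits = len(str(possible_chains))
catalogued_fraction = catalogued_structures / estimated_natural_proteins

print(f"20^200 is a number with {digits} digits")
print(f"fraction of natural proteins catalogued: {catalogued_fraction:.2e}")
```

So the catalogued structures cover well under a millionth of the proteins thought to exist in nature, and nature itself has explored only a vanishing sliver of a space with 261 digits in its size.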

It is very likely that integrating technologies and algorithms from deep learning, quantum computing and bioinformatics will lead to an algorithm that can successfully solve this problem. Using techniques such as deep Q-learning, quantum computing, and capsule networks, we hope to see a future in which the solution of the protein folding problem positively changes the lives of billions of people.

Written by

machine learning developer, software engineer
