DeepMind’s protein folding solution — what just happened?

Machine learning has unlocked one of the biggest scientific breakthroughs of this century

Eli Dourado
Dec 1, 2020
Illustration by Christoph Burgstedt

On Monday, DeepMind rocked the world of molecular biology by announcing that it had essentially solved the protein folding problem. In the just-concluded 14th biennial Critical Assessment of protein Structure Prediction (CASP), a friendly contest pitting research teams against each other, the DeepMind team finished with an average error of 0.16 nanometers, a distance less than the width of many atoms. The discrepancy between DeepMind’s predictions and experimentally derived structures is so small that it is impossible to tell whether the error is a fault in the predictions or in the empirical measurements they are being compared against. In other words, excluding a few outlier cases, no further improvement can be expected: DeepMind has solved the problem.

Chart by DeepMind
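
To make the idea of an "average error" between a predicted and a measured structure concrete, here is a minimal sketch. It assumes the error is a root-mean-square deviation (RMSD) over matching atoms and that the two structures have already been superimposed; the article does not spell out the metric, and the coordinates below are made up for illustration.

```python
import math

def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(predicted) != len(experimental):
        raise ValueError("structures must have the same number of atoms")
    squared_distances = [
        (px - ex) ** 2 + (py - ey) ** 2 + (pz - ez) ** 2
        for (px, py, pz), (ex, ey, ez) in zip(predicted, experimental)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))

# Hypothetical atom positions in nanometers (not real protein data).
experimental = [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.76, 0.0, 0.0)]
# Each predicted atom is displaced by 0.1 nm from its measured position.
predicted = [(0.0, 0.1, 0.0), (0.38, 0.0, 0.1), (0.76, 0.1, 0.0)]

error_nm = rmsd(predicted, experimental)
print(f"RMSD: {error_nm:.2f} nm")
```

In this toy case every atom is off by 0.1 nm, so the RMSD is also 0.1 nm. A real comparison would first align the two structures and would cover thousands of atoms.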

What is the protein folding problem?

If you’re not a biologist, you probably think about protein most in the context of nutrition. Bodily structures, including your muscles and a cow’s steak-producing regions, are primarily made of protein. Dietary protein is a source of amino acids to build bodily structures — so that you can put on muscle mass after a workout, say.

But proteins do more than serve as the building blocks of the body. The ones that serve as mostly static structural components are actually a special case. More generally, proteins are self-assembling nanomachines that do almost everything in the body. Your cellular processes, everything that can be said to make you alive, are tasks carried out by proteins.

Proteins are defined linearly. They are coded by strings of nucleotides in your DNA and RNA. They are formed by chains of amino acids reacting with each other. But despite this simple linear identity, proteins act in time and space. Once a protein is produced, atomic forces cause it to self-assemble into a messy 3D structure that determines its function. In 1972, Christian Anfinsen postulated in his Nobel lecture that it should be possible to determine the 3D structure of a protein from its linear amino acid sequence.
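
The "linear" part is easy to show in code. The sketch below reads a made-up DNA string three nucleotides at a time and maps each triplet (codon) to a one-letter amino acid code. The table holds only six of the 64 standard codons, and the `translate` helper is a hypothetical name used for illustration.

```python
# A small subset of the standard genetic code (DNA codon -> amino acid letter).
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TTT": "F",  # phenylalanine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}

def translate(dna):
    """Translate a DNA coding sequence into a string of amino acid letters."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical gene fragment, chosen only to exercise the table above.
sequence = translate("ATGGGCAAATTTTGGTAA")
print(sequence)
```

Getting from DNA to the amino acid string is a simple lookup. The hard part, and the subject of this article, is getting from that string to a 3D shape.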

Easier said than done. The problem is so complex that it could never be solved with computational brute force: the number of shapes a chain could take grows exponentially with its length. For nearly five decades, scientists made only halting progress. When DeepMind came on the scene in 2018 with its first-generation AlphaFold algorithm, it was a clear conceptual leap forward. This year’s rearchitected and improved algorithm left all contenders in the dust.
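
A back-of-the-envelope calculation shows why brute force is hopeless. The numbers here are illustrative assumptions, not figures from DeepMind or CASP: a 100-residue protein, three possible conformations per residue, and a very generous 10 trillion conformations checked per second.

```python
# Illustrative assumptions only; real proteins and hardware differ.
RESIDUES = 100                    # length of a smallish protein
CONFORMATIONS_PER_RESIDUE = 3     # crude guess at local shapes per residue
CHECKS_PER_SECOND = 1e13          # hypothetical search speed
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Every residue multiplies the number of whole-chain shapes.
total_conformations = CONFORMATIONS_PER_RESIDUE ** RESIDUES

# Time to enumerate every shape once at the assumed speed.
years_to_enumerate = total_conformations / CHECKS_PER_SECOND / SECONDS_PER_YEAR

print(f"Conformations to check: {total_conformations:.2e}")
print(f"Years to enumerate them all: {years_to_enumerate:.2e}")
```

Under these assumptions the search would take on the order of 10^27 years, vastly longer than the age of the universe. Real proteins fold in a fraction of a second, so nature is clearly not searching exhaustively, and neither can we.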

Why does protein folding matter?

Since proteins are self-assembling molecular nanomachines that carry out the vital processes of living organisms, they are the key to biological life. They are fundamental to pharmaceutical research, where scientists are often trying to find a molecule that will activate or inactivate a particular protein. Because we know the structures of only around a quarter of the proteins in the human body, this has often been a trial-and-error effort. By using AlphaFold 2 or its successors to create a catalog of the structures of every protein humans can produce, scientists will be able to reason about which molecules could be good candidate drugs, dramatically reducing the error rate. This, in turn, could turbocharge drug development and enable the discovery of cures for almost every disease. We may even discover that already-approved drugs can treat conditions we have not yet tried them on.

Understanding the structures of proteins can also be helpful in the fight against infections. We are fortunate with the current coronavirus pandemic that researchers already had experience with the similar SARS and MERS viruses. With that experience, we knew that by targeting the coronavirus’s spike protein, we could disable it, and our current vaccine candidates are based on that understanding.

But imagine a pandemic where we didn’t have prior experience with a similar virus. With the ability to map the structure of its proteins, we could determine what kind of molecule would be needed to inactivate it. Then we could use our existing catalog of drugs to see if any were a match. Instead of blindly experimenting with random antimalarials, we could reason about which existing drugs could be a first-wave therapeutic. This could save countless lives.

Most generally, understanding protein structures will help us understand cellular processes at a finer level of granularity. More fundamental understanding could help us design enzymes that enable bacteria to pull carbon dioxide out of the atmosphere. It could help us understand biological aging at a cellular level and enable a cure. It is, in other words, one of the most promising scientific breakthroughs of our age.

What does this mean for machine learning?

DeepMind, a subsidiary of Alphabet, is an artificial intelligence laboratory best known for its AlphaZero program, which is better than any human at games like chess and Go. While many of us enjoyed watching AlphaZero play chess (it is astonishingly creative), beating humans at games is of little social significance. From AlphaZero to OpenAI’s GPT-3 to self-driving algorithms, machine learning has not until now unlocked superhuman capabilities in areas of immense social importance.

AlphaFold 2 puts to rest any lingering doubts that machine learning can be a game-changer for human progress. Protein folding might never have been solved without it. DeepMind’s progress in 2018’s CASP inspired other participants to use machine learning in their 2020 entries, leading to progress across the entire field. With AlphaFold 2 showing such head-and-shoulders improvement over 2018’s AlphaFold 1, more and more scientists are going to have to learn ML techniques.

With AlphaFold 2 demonstrating the ability of machine learning to solve such meaningful problems, it is time now to apply it more broadly. We need dozens of DeepMind-style teams working on comparable and follow-on problems both in medicine and outside of it.

Aren’t there any caveats?

There are always caveats. First, protein nanomachinery is dynamic, but AlphaFold only predicts fixed protein structures. This limitation is a consequence of the fact that our existing techniques for empirically determining the structure of a protein — X-ray crystallography and cryo-electron microscopy — capture a static structure only. This static picture is the ground truth against which AlphaFold was trained. While AlphaFold has essentially solved the static structure prediction problem, there is a further rabbit hole of dynamic behavior to understand.

The second caveat is that scientific breakthroughs need operational follow-through before their effects are felt by society. Singular technical accomplishments must be translated into day-to-day practices. SpaceX first landed a rocket five years ago. That achievement lowered the cost of launch, but unless you were in the space industry, you probably wouldn’t have noticed. It is only through initiatives like the impossibly large Starlink satellite constellation that SpaceX’s launch cost advance is making itself felt by wider society. Back in the world of biology, CRISPR gene editing, discovered in 2012, has revolutionized science in the lab, but it is not yet being applied in routine therapy.

To fully realize the promise of deep, structural understanding of protein behaviors, much more work will be needed. Still, we would never get there without the ingenuity of heroes like the team at DeepMind. It is now time for us to honor their achievement by executing on the societal transformation it makes possible.

The Benchmark

A publication by The Center for Growth and Opportunity at…