Published in The Startup

Reshaping the Future, Protein by Protein; Bit by Bit

Understanding Arguably the Greatest Breakthrough in Artificial Intelligence of the Past Decade.

On November 30, 2020, DeepMind Technologies, a British AI company that was acquired by Google (now part of Alphabet) in 2014, reshaped molecular biology, bioengineering, and the life sciences with its breakthrough on the protein folding problem. This is a truly disruptive technology that has the potential to tackle some of the biggest obstacles humanity faces. Let’s break it down.

COVID-19 has shone a new light on the healthcare and pharmaceutical industries, highlighting the diversity and importance of these fields. Three viable vaccines were approved for use within a year, with many more on their way, an incredible achievement. Two of the three were based on recreating the spike protein of the coronavirus.

What exactly are proteins? Proteins are the building blocks of life; a single cell can contain thousands of proteins, all of which have different features and functions. Proteins are composed of one or more linear chains built from twenty kinds of amino acids, which are analogous to the atoms of life. Each amino acid consists of a central alpha (α) carbon atom bonded to an amine group (NH₂), a carboxyl group (COOH), a hydrogen atom (H), and an R group (the side chain) that determines the identity of the amino acid. The R group is the passport of an amino acid: it differs from one amino acid to the next in chemical makeup, mass, polarity, and net charge, which gives each amino acid its own properties.

A molecular overview of a simple amino acid. Image Credit: OpenStax Biology

The proteins in a cell are made up of one or more polypeptide chains. These are linear chains of typically at least 50 amino acids in a specific order (amino acids linked within a protein chain are referred to as residues). The chemical properties of the amino acids, which are determined by their side chains, together with their order, are vital in determining the structure and function of the polypeptide and of the corresponding protein.

More specifically, the length and specific ordering of amino acids in a sequence are major factors in determining how the protein folds (i.e. its 3-dimensional structure, known as the protein tertiary structure). To a first approximation, an amino acid sequence maps to a single 3D structure. Carrying on with our previous analogy, the shape of a protein is its passport. It is the shape, the exact folding, of a protein that determines its function and properties. It is such an important factor that the misfolding of a protein often results in disease. Protein folding is nature’s origami, providing the folds and creases that become the bedrock of life.

A protein depicted before and after it is folded.

Proteins are the driving force behind almost every biological process necessary to sustain life. There are structural proteins (in tendons, cartilage, and hair), transport proteins (hemoglobin), hormonal proteins (insulin, growth hormone), and so many more. They are the machinery that underpins nearly all biological processes: from providing structure and support for cells to catalyzing biochemical reactions; from carrying chemical messages to creating antibodies to defend against infection. All this possibility stems from a string of amino acids.

If we know how a specific amino acid sequence folds, we can uncover what the protein does and start to understand and anticipate its properties and function. The issue, however, is the sheer number of possible folds: a common back-of-the-envelope estimate gives a chain of just 100 residues on the order of 3¹⁹⁸ (roughly 10⁹⁴) possible conformations, more than the ~10⁸⁰ atoms in the observable universe. Levinthal’s paradox revolves around the fact that if a protein were to sample every possible tertiary structure sequentially, it would take longer than the age of the universe to arrive at its correct conformation, even if a new conformation were tried every picosecond. The paradox lies in the fact that most proteins fold spontaneously on a milli- or even microsecond time scale.
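To make the scale of Levinthal’s argument concrete, here is a rough back-of-the-envelope calculation. The specific numbers are assumptions chosen for illustration: a 100-residue chain, two backbone angles per peptide bond, three stable states per angle, and one conformation sampled per picosecond.

```python
import math

# Illustrative assumptions, not measured values:
RESIDUES = 100                      # length of the hypothetical chain
ANGLES = 2 * (RESIDUES - 1)         # two backbone angles per peptide bond -> 198
STATES_PER_ANGLE = 3                # assumed stable states per angle

conformations = STATES_PER_ANGLE ** ANGLES               # exact integer 3**198
log10_conformations = ANGLES * math.log10(STATES_PER_ANGLE)

SAMPLES_PER_SECOND = 1e12           # one conformation tried every picosecond
SECONDS_PER_YEAR = 3.156e7          # approximate
AGE_OF_UNIVERSE_YEARS = 1.38e10     # approximate

# Work in log10 space so the numbers stay readable.
log10_years = log10_conformations - math.log10(SAMPLES_PER_SECOND * SECONDS_PER_YEAR)
log10_universe_ages = log10_years - math.log10(AGE_OF_UNIVERSE_YEARS)

print(f"conformations      ~ 10^{log10_conformations:.1f}")
print(f"exhaustive search  ~ 10^{log10_years:.1f} years")
print(f"ages of universe   ~ 10^{log10_universe_ages:.1f}")
```

Even under these simplified assumptions, a brute-force search would outlast the universe by dozens of orders of magnitude, which is why real proteins cannot be folding by random sampling.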

There are an estimated 200 million distinct proteins, of which we only know the 3D structure of about 170,000. The current methods of deciphering the tertiary structure are expensive, lengthy, and require a significant amount of trial and error. The most notable one is X-ray crystallography, which fires X-rays at a crystallized protein and measures the angles and intensities of the diffracted beams; solving a structure this way can take a year and cost ~120,000 USD. Other methods include nuclear magnetic resonance and cryo-electron microscopy. These methods are too costly and carry too much inherent uncertainty to scale. AlphaFold changes everything, fundamentally solving the protein folding problem: predicting how protein chains will fold given an arbitrary sequence of amino acids. The protein folding problem has been a grand challenge in biology for 50 years. And it’s been solved decades before many researchers anticipated.

Credit: xkcd

This problem has plagued scientists and researchers for decades, and progress has come from the marriage between biology and engineering. There has been a myriad of developments, from IBM’s Blue Gene spearheading supercomputing efforts to citizen-science projects, most notably Folding@home and Foldit. The key to the solution, however, lies in the field of deep learning.

Biologists are turning to AI methods as an alternative. The ability to computationally analyze an amino acid sequence and predict the structure of the resulting protein accelerates research and scales far beyond what laboratory methods allow. Thanks to the immense amount of data available, AI has blossomed in a non-traditional area.

Progress on the protein folding problem is measured at a biennial global competition, CASP (Critical Assessment of protein Structure Prediction). CASP is the gold standard for assessing solutions, with everyone from academics to billion-dollar companies submitting predictions. This year (CASP14), DeepMind’s AlphaFold placed first, with a median global distance test (GDT) score of 92.4 out of 100. GDT measures how similar the predicted structure is to the actual structure, and this score corresponds to an average error of roughly 0.1 nanometers, about the width of an atom. For comparison, a score of around 90 is considered competitive with experimental methods. AlphaFold, equipped with algorithms and data, produced one of the most extraordinary results in structural biology and genomics.
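For intuition about what a GDT score measures, here is a simplified sketch. It assumes the predicted and experimental alpha-carbon coordinates are already superimposed and uses the commonly quoted cutoffs of 1, 2, 4, and 8 ångströms; the real CASP calculation also searches over superpositions, which is omitted here. The function name `gdt_ts` and the toy coordinates are illustrative, not taken from any CASP tool.

```python
import math

def gdt_ts(predicted, actual, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score in [0, 100].

    predicted, actual: equal-length lists of (x, y, z) alpha-carbon
    coordinates in angstroms, assumed to be already superimposed.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Per-residue distance between prediction and experiment.
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    # Fraction of residues within each cutoff, then averaged over cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: four residues, with the prediction off by 0.5, 1.5, 3.0, and 9.0 angstroms.
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
score = gdt_ts(predicted, actual)
print(score)  # 56.25
```

A perfect prediction scores 100, and a single badly placed residue lowers the score only slightly, which is one reason GDT is preferred over a plain average error for comparing whole structures.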

AlphaFold’s predictions for two proteins, juxtaposed with the actual structure. Credit: DeepMind

AlphaFold uses a deep neural network to predict the distances between pairs of amino acids and the angles of the chemical bonds that connect them. These predictions are then combined into a single score that estimates how close a proposed structure is to its real-life counterpart. Gradient descent is then used to adjust the proposed structure step by step so that this score improves, reducing the error between the predicted protein and the real one.
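The idea of refining a structure by gradient descent can be illustrated with a toy example. This is not DeepMind’s code: the "predicted" distances here are simply computed from a made-up helix, the structure is represented directly as 3D coordinates, and the score is a plain sum of squared distance errors. All function names and parameter values are assumptions for illustration.

```python
import math
import random

def helix_coords(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized helix-like trace of n points (angstroms); purely illustrative."""
    return [[radius * math.cos(math.radians(turn_deg * i)),
             radius * math.sin(math.radians(turn_deg * i)),
             rise * i] for i in range(n)]

def distance_loss(coords, target):
    """Sum of squared errors between current and target pairwise distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def descend(coords, target, learning_rate=0.02, steps=2000):
    """Plain gradient descent on the coordinates to match the target distances."""
    coords = [list(c) for c in coords]
    n = len(coords)
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j])
                if d < 1e-9:
                    continue  # skip coincident points to avoid dividing by zero
                # d/dx_i of (d - t)^2 is 2 * (d - t) * (x_i - x_j) / d
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(3):
                    grads[i][k] += scale * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(3):
                coords[i][k] -= learning_rate * grads[i][k]
    return coords

true_coords = helix_coords(6)
# Stand-in for the network's output: the true pairwise distances.
target = [[math.dist(a, b) for b in true_coords] for a in true_coords]

rng = random.Random(0)
start = [[c + rng.uniform(-0.5, 0.5) for c in point] for point in true_coords]
initial_loss = distance_loss(start, target)
final_loss = distance_loss(descend(start, target), target)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Each step nudges every point in the direction that brings its distances closer to the targets, so the loss falls as the shape settles into one consistent with the predicted distances.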

The process of the neural network. Credit: DeepMind

Neural networks are information processing systems made up of nodes (neurons) arranged in layers. Input data, known as features, is passed into the input layer. It is then passed on to the hidden layers, where the primary processing happens, and finally to the output layer, where the network produces its prediction. This sequence of formulating a prediction is known as forward propagation. One layer is connected to the next via weighted edges, and the value a neuron holds is the weighted sum of the values of the neurons feeding into it, typically passed through a nonlinear activation function, as shown in the image below.
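Forward propagation can be sketched in a few lines. This is a minimal illustration, assuming a sigmoid activation and hand-picked weights; the layer sizes and values are arbitrary and have nothing to do with AlphaFold’s actual network.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: weights[j][i] is the edge from input i to neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass the features through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# A 2-input, 2-hidden-neuron, 1-output network with arbitrary weights.
network = [
    ([[0.5, -0.3], [0.8, 0.2]], [0.0, -0.1]),   # hidden layer
    ([[1.0, -1.0]], [0.05]),                    # output layer
]
prediction = forward([1.0, 2.0], network)
print(prediction)
```

Each neuron does nothing more than multiply, add, and squash; the network’s expressive power comes from stacking many of these simple units.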

The prediction is compared to the expected result (known as the label) using a loss function, and gradient descent is then employed to minimize this loss. Backpropagation works out how much each weight contributed to the error, and the weights on all the edges are updated to reduce the loss on subsequent predictions. This is the learning aspect of a neural network: it generates a prediction, compares it to the desired result, and then adjusts its weights to make more accurate predictions across the dataset.
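The predict, compare, adjust loop can be shown on a tiny problem. The sketch below trains a one-hidden-layer network on XOR using a squared-error loss and full-batch gradient descent. The task, layer sizes, learning rate, and function name are all assumptions chosen to keep the example small.

```python
import math
import random

def train_xor(hidden=4, learning_rate=0.1, epochs=3000, seed=0):
    """Train a tiny tanh network on XOR; returns the mean loss per epoch."""
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        g_w1 = [[0.0, 0.0] for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        total = 0.0
        for x, y in data:
            # Forward propagation: features -> hidden layer -> prediction.
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(w1, b1)]
            pred = sum(w * hj for w, hj in zip(w2, h)) + b2
            err = pred - y                      # derivative of 0.5 * err**2
            total += 0.5 * err * err
            # Backpropagation: push the error back through each layer.
            g_b2 += err
            for j in range(hidden):
                g_w2[j] += err * h[j]
                back = err * w2[j] * (1.0 - h[j] ** 2)   # tanh'(z) = 1 - tanh(z)^2
                g_b1[j] += back
                for i in range(2):
                    g_w1[j][i] += back * x[i]
        n = len(data)
        history.append(total / n)               # loss before this epoch's update
        # Gradient descent: step every weight against its average gradient.
        b2 -= learning_rate * g_b2 / n
        for j in range(hidden):
            w2[j] -= learning_rate * g_w2[j] / n
            b1[j] -= learning_rate * g_b1[j] / n
            for i in range(2):
                w1[j][i] -= learning_rate * g_w1[j][i] / n
    return history

history = train_xor()
print(f"loss at start: {history[0]:.4f}, at end: {history[-1]:.4f}")
```

The loss recorded in `history` should shrink as training proceeds, which is the learning behaviour described above in miniature.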

A visualization of the prediction of the structure of a protein improving via gradient descent. Credit: DeepMind

This is the core idea of a neural network: it is a universal function approximator. The crux of the field of deep learning, and AI as a whole, is using past data to generate accurate predictions for future scenarios. A neural network is nothing more than a mathematical function (albeit a very complicated one) that uses algorithms to computationally achieve the desired result. The structure and function of neural networks are inspired by their biological counterparts and the way those learn, so it is rather poetic that they have produced one of the largest breakthroughs in biology.

Proposing a viable solution to arguably the greatest challenge in microbiology and genomics, as well as the whole of health sciences, has massive implications. Crucially, it will significantly accelerate attempts to understand the building blocks of all cells and a large portion of biological processes. This, in turn, would allow for quicker and more effective drug discovery, healthier crops, and even the creation of enzymes that capture carbon from the atmosphere. Most importantly, it lets us understand life itself better, uncovering the mysteries at the heart of biology.

“This will change medicine. It will change research. It will change bioengineering. It will change everything.” — Andrei Lupas, evolutionary biologist at the Max Planck Institute for Developmental Biology

DeepMind’s AlphaFold is arguably the greatest breakthrough in artificial intelligence and structural biology of the past few decades, with Lex Fridman anticipating that it will be the first machine learning model to earn a Nobel Prize. This is a breakthrough that has not gotten the attention it deserves, but one that changes the course of biology and bioengineering as we know it.

DeepMind’s AlphaFold has the capacity to provide the solution to some of the greatest obstacles we face, the obstacles of today and tomorrow. It showcases the necessity of AI and the immense potential it bears in all fields. Protein by protein, bit by bit, DeepMind has ushered in the dawn of a new era.


