Implications and Speculations on DeepMind’s AlphaFold

Dominic Suciu
12 min read · Feb 11, 2019

--

Last month, Google DeepMind’s AlphaFold placed first in the ab initio portion of the CASP13 protein folding competition with their A7D entry. The very short paper can be downloaded here. To understand how remarkable this feat is, one needs a little background in protein folding and in the history of the competition.

First of all, for the noobs: Proteins are very complex molecules that do all the work in a cell. Here’s a simple analogy: if a genome is an operating system, then a protein is a program that is copied from the drive, loaded into memory, and run. Each genome has thousands of genes, and each gene typically codes for a single protein. Usually, the more complex the organism, the more genes it has in its genome. Bacteria typically have about 3,000 genes, while humans have roughly 20,000. While each position in DNA holds one of just 4 bases, each position in a protein holds one of 20 amino acid residues, and these display considerably greater chemical and structural diversity. Proteins are made as chains of amino acids that average from about 100 residues in simple organisms to about 500 in complex ones. In order to understand how a protein works, one needs to know what structure it adopts when it folds into its final conformation; however, due to the vast conformational space available to any given protein chain, a computational solution to the Protein Folding Problem has been one of the greatest challenges in Molecular Biology.
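
As a concrete illustration of that coding relationship, here is a minimal sketch using Biopython (my choice of library, not anything from the AlphaFold work): three DNA bases form a codon, and each codon specifies one amino acid.

```python
# Minimal gene -> protein illustration with Biopython (assumed installed).
# Three DNA bases (a codon) specify one of the 20 amino acids.
from Bio.Seq import Seq

gene = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = gene.translate(to_stop=True)  # translate until the first stop codon
print(protein)  # MAIVMGR -> a 7-residue peptide
```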

X-ray Crystallography: The first structural solutions were not computational. They came in 1958, when x-ray crystallography, a technique originally developed to understand chemical crystal structures, was applied to Myoglobin. Since then, thousands of proteins have had their structures ‘solved’ by this process. One first purifies a protein, concentrates it, and then tries hundreds of different solutions of salts and various chemicals, looking for conditions that generate a crystal. Once those conditions are discovered, the structure can be solved by subjecting the crystal to x-rays. Since the structures themselves are smaller than the wavelength of visible light, only the diffraction pattern generated by bombardment with x-rays can reveal them. Exactly how one goes from a diffraction pattern to a 3-D structure . . . defies easy explanation (start by googling ‘von Laue’). Before really fast computers, this step would take about a year. These days it takes a few weeks, after the hit-or-miss process of growing crystals. Almost all the solved structures (currently numbering 46,167) are freely available in the Protein Data Bank.
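
If you want to poke at one of those solved structures yourself, here is a small sketch using Biopython’s PDB module; ‘1MBN’ is the classic sperm whale myoglobin entry, and the local directory is a placeholder.

```python
# Fetch a solved structure from the Protein Data Bank with Biopython.
from Bio.PDB import PDBList

pdbl = PDBList()
path = pdbl.retrieve_pdb_file("1MBN", pdir=".", file_format="pdb")
print(path)  # local path to the downloaded coordinate file
```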

Rosetta: The most successful computational approach to the protein folding problem is Molecular Modeling. This was pioneered by David Baker at the University of Washington. Baker’s lab developed two very popular applications that you may have heard about: a grid computing platform called Rosetta@home (built on BOINC, the infrastructure that grew out of seti@home) and a gamification of the process called Foldit, which allows users to hand-fold virtual proteins and compete against each other to solve their structures. What Baker’s lab is most famous for is the Rosetta protein folding suite of programs. Rosetta is able to compute the structure of a protein from its sequence, using various clues to generate its initial guesses. Rosetta starts with these guesses at the likely structure and then, for each one, computes all the residue-residue attraction/repulsion energies of the entire molecule and determines a global energetic state for that conformation. Most of the development has focused on perfecting the energy model Rosetta uses to score the different conformations it evaluates. There are at least 6 distinct energetic relationships that can be computed for each interaction; which of these applies depends on the residues themselves, their context, and the distance between them. On each computation cycle, the structure’s global energetic state is computed by summing these values across all the residues, the resulting score is recorded, and the structure is stored. Regions with the highest (most unfavorable) energetic state are adjusted, and a new structure is generated and scored in the next cycle, guided by a form of gradient descent. This proceeds in parallel, and after many such simulation cycles one can generate an energetic landscape from which the most favorable, lowest-energy conformation is chosen. This is, of course, an oversimplification (for more depth, there is a great lecture series by Jeffrey Gray on youtube).
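
To make the loop concrete, here is a toy sketch of that sample-score-perturb cycle in Python. The pairwise term below is a generic Lennard-Jones-style stand-in, not Rosetta’s actual energy function, and real runs move torsion angles rather than raw coordinates; everything here is illustrative.

```python
import numpy as np

def pairwise_energy(coords):
    """Toy Lennard-Jones-like score summed over all residue pairs.
    This is a stand-in, NOT one of Rosetta's real energy terms."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # ignore self-interactions
    return ((1.0 / dist) ** 12 - (1.0 / dist) ** 6).sum() / 2

def fold(coords, n_cycles=10000, step=0.05, temperature=0.1):
    """Metropolis Monte Carlo: perturb, rescore, accept or reject."""
    energy = pairwise_energy(coords)
    for _ in range(n_cycles):
        trial = coords + np.random.normal(0, step, coords.shape)
        trial_energy = pairwise_energy(trial)
        # Always accept improvements; sometimes accept uphill moves.
        if trial_energy < energy or np.random.rand() < np.exp(
                (energy - trial_energy) / temperature):
            coords, energy = trial, trial_energy
    return coords, energy

# 50 "residues" placed at random, then relaxed toward a low-energy state.
coords, final_energy = fold(np.random.rand(50, 3) * 10)
print(final_energy)
```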

Energetic Landscape of a Protein Folding Simulation (source).

Rosetta is by far the most successful solution developed to date, but it is important to note that the computation focuses only on the molecule itself, and that the Rosetta energy function, by which inter-residue interaction energies are evaluated, has been learned over many years and continually perfected through competition. Every 2 years since 1994, the best models are pitted against each other in the CASP competition, and Rosetta itself is most often the backbone on which various improvements are implemented. Over the years, the competition became a way for all the groups across the globe that were working on the problem to focus on various improvements to the model.

Biological Diversity: There is another take on the protein folding problem, and this one has long been seen as a really obvious approach. It is, simply, to ask Mother Nature instead of your computer. I mean, of course you are using your computer to ask Mother Nature, etc. etc., but you get the idea. Life on earth is very diverse. NCBI’s Taxonomy browser lets you see all the sequences that have been collected for every organism known to man. Go there, click ‘Bacteria’, and then hit the ‘Nucleotide’ check button. That’s how many curated, annotated sequences have been collected for bacteria. At the time of this writing, that’s over 50 million sequences. There are over 100,000 sequenced genomes for Escherichia coli alone. If you don’t like bacteria, have a look at the Influenza database. For the protein Neuraminidase (one protein! . . . It’s important!), there are 700,000 distinct isolates, annotated by time and location, going back to the 1918 Spanish Flu.
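
If you would rather reproduce that bacterial count programmatically, here is a minimal sketch using Biopython’s Entrez module (the email address is a placeholder; NCBI requires a contact address with each query).

```python
# Query NCBI for the number of nucleotide records filed under Bacteria.
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI requires an email
handle = Entrez.esearch(db="nucleotide", term="Bacteria[Organism]")
result = Entrez.read(handle)
print(result["Count"])  # total curated, annotated bacterial sequences
```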

Craig Venter on Sorcerer II (Author’s Collage)

In 2000, Craig Venter, after completing the sequencing of the human genome and cribbing from Darwin, famously went sailing on a boat designed to be a modern-day ‘Beagle’. His idea was to collect the bacteria living in the oceans in order to build rich collections of bacterial pan-genomes. The sailboat was equipped with a water-intake device that allowed him to filter the oceans for bacterial populations. These were frozen in liquid nitrogen and sequenced when he got home. I remember seeing an ad in Nature that year for a Molecular Biologist who also had sailing skills. How I would have liked to be on that boat.

Ocean bacteria are actually amazing: not only do they do most of the carbon capture on the planet, but they also have the ability to clean up oil spills, and they can even be induced to produce diesel-like compounds themselves. However, not everyone was convinced that this was a useful approach. At the time, many people from the protein folding camps criticized it as a form of ‘stamp collecting’. Their view was that only the in silico modeled molecule could ever give you enough insight to determine its structure with any useful precision. For many years, this was a point of contention and mutual disdain between the two camps: the biophysicist protein folders and their ‘stamp collecting’ genomic botanists.
Now, to be fair, it is true that a great many metagenomic barques (real and figurative) sailed forth in those heady years to conquer that fair mermaid named Biological Diversity, and it is possible that we did not need quite so many. I remember meeting a guy at an environmental metagenomics conference in Seattle. I was signed up for one poster session, but I had snuck my poster up for another, right next to his, and he proceeded to berate me for daring to have a commercial poster at an academic conference. He was studying (I am not making this up!) the bacterial metagenome of a one-square-meter plot of land near a lake in his native Finland. ‘Yes, but the deeper you go, the metagenome completely changes! It’s completely different!’
So yes, stamp collecting, indeed! I get the patent!

A Multiple Sequence Alignment (Miguel Andrade at English Wikipedia)

Multiple Sequence Alignment (MSA): The idea of looking to Mother Nature has its merits. The intuition is simple and can be summarized like so: if you want to solve the structure of a protein, as a first step, find all the versions of that protein in the tree of life, align all the sequences, and then see which residues are conserved. This is a very obvious, first-principles approach that, on its surface, is expected to work very well. But, OK, so now you know which residues are important; how do you get a structure from that? There are other insights you can glean by looking at the evolution and functional instances of a protein across the tree of life, but they are not always obvious. Here’s one notable example: in 2003, a really clever paper came out from Rama Ranganathan’s lab at UTSW Medical Center. By looking at really rich protein multiple sequence alignments for a very well-studied protein (Chymotrypsin), he was able to find residues with complementary relationships (i.e., positively/negatively charged) which, over the course of evolution across the tree of life, had essentially switched positions. From these pairs, he was able to guess that they were binding partners in the structure, and this let him infer a primitive, though correct, adjacency matrix for the structure. I remember reading this paper and being blown away at how clever the approach was. Protein alignments had seemed like a dead end, and this guy had come in and pulled a rabbit out of the hat. And not just any rabbit, but a correct rabbit! This is only one of many inferences that can be gleaned from multiple sequence alignments, but few would have predicted how much could be learned from purely statistical approaches like these. So it is telling that AlphaFold took what I will call, for the sake of closure, the stamp-collecting route.
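
To give a flavor of this kind of statistical inference, here is a toy sketch that scores co-varying alignment columns with raw mutual information. This is a much cruder cousin of the coupling analysis in the actual paper, which corrects for phylogeny and indirect couplings; the tiny alignment below is made up.

```python
import numpy as np
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information between two alignment columns (tuples of chars)."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * np.log(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

def coevolving_pairs(msa, top_k=10, min_separation=5):
    """msa: list of equal-length aligned sequences. Returns the residue
    pairs whose columns co-vary the most, skipping trivial near-neighbors."""
    cols = list(zip(*msa))  # transpose: one tuple per alignment column
    scores = []
    for i in range(len(cols)):
        for j in range(i + min_separation, len(cols)):
            scores.append((mutual_information(cols[i], cols[j]), i, j))
    return sorted(scores, reverse=True)[:top_k]

# Toy alignment: columns 1 and 7 swap K/E together across sequences,
# hinting at a charge-compensating contact pair.
msa = ["MKGLSDCEAW",
       "MKGISDCEAW",
       "MEGLSDCKAW",
       "MEGISDCKAW"]
print(coevolving_pairs(msa, top_k=3))  # pair (1, 7) scores highest
```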

AlphaFold Learning Process (from the A7D entry at CASP13).

AlphaFold: Going by the very sparse paper they have published so far, I can provide some of the details. For each protein they are trying to solve, they have gone out, gotten all of its homologs (close relatives from the tree of life), and aligned them; from this alignment, their neural network is able to predict not just the adjacency matrix [which would be great in and of itself] but the actual distances between the residues themselves, as well as their torsion angles. From these, they easily compute the 3-D structure. That is remarkable!
The training method is important: in order to do this, you need to featurize the sequence alignment for each gene family in your training set and link it to the available 3-D structures. Each 3-D structure is converted to a distance (adjacency) matrix, and the two are used to train a residual convolutional neural network. It is still a mystery (to me) how exactly this works; there aren’t many details in the paper. ResNets were initially developed for computer vision, where they are used to find things like cats in pictures. A more likely candidate would have been graph convolutional networks, since these networks are fed by adjacency matrices. This matters because the alternative approach of feeding a 3-D structure into such a network would be coordinate-frame dependent, meaning that the same molecule, fed in at different rotations, would look and train completely differently (Koes talk). Some of this can be fixed by data augmentation, where the structure is fed in multiple times at different rotations, but this requires a much larger memory footprint. Graph convolutional networks, by working with transformed adjacency matrices, resolve to a rotation-invariant representation. In fact, DeepMind had recently released a paper on them (here’s a Medium summary), but there’s no mention of them in the AlphaFold paper. It’s possible that the ‘residual’ part is simply being applied to a graph convolutional net, since that modification allows networks to be made much deeper, and the convolutions themselves could be transformations specific to adjacency matrices, as they are in graph convolutional networks; but I’m really speculating. I eagerly await the full paper to know for sure. [And . . . maybe we don’t have to wait quite so long. See the 2016 and 2018 Jinbo Xu papers linked below.]
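
For concreteness, here is a hypothetical PyTorch sketch of the kind of pipeline the paper gestures at: convert known structures to distance-matrix targets, then map pairwise MSA-derived features to predicted distance distributions with a stack of dilated residual convolution blocks. None of the layer sizes, feature counts, or names below come from DeepMind; they are placeholders of my own choosing.

```python
import torch
import torch.nn as nn

def distance_matrix(coords):
    """Convert an L x 3 tensor of residue coordinates (e.g. from a PDB
    file) into the L x L distance matrix used as a training target."""
    return torch.cdist(coords, coords)

class ResidualBlock(nn.Module):
    """2-D convolutional block with a skip connection, operating on
    pairwise (L x L) feature maps derived from the MSA."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        )
    def forward(self, x):
        return torch.relu(x + self.net(x))  # residual: add the input back

class DistogramNet(nn.Module):
    """Stack of dilated residual blocks mapping MSA-derived pair features
    to a distribution over distance bins for every residue pair.
    All sizes here are illustrative placeholders."""
    def __init__(self, in_features=441, channels=64, n_blocks=8, n_bins=64):
        super().__init__()
        self.project = nn.Conv2d(in_features, channels, 1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels, 2 ** (i % 4))
                                      for i in range(n_blocks)])
        self.head = nn.Conv2d(channels, n_bins, 1)
    def forward(self, pair_features):  # (batch, in_features, L, L)
        x = self.blocks(self.project(pair_features))
        return self.head(x)            # (batch, n_bins, L, L) logits

# Toy usage: a protein of length 100 with made-up pair features.
logits = DistogramNet()(torch.randn(1, 441, 100, 100))
```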

Cat Pictures: It would be more than a little ironic if networks that were originally developed for image processing were used here. In 2010, as I was exiting a biotech startup that was going down in flames, desperate to find a new job as a bioinformaticist, I happened upon a job description from Amazon. I didn’t do it on purpose! I found it by searching for the term ‘bioinformatics’, and lo and behold, they were actually serious. It actually made a lot of sense, since many Machine Learning techniques have been used in Bioinformatics and vice versa. Amazon and many others reasoned correctly that desperate, underpaid bioinformatics post-docs would jump at the chance to work on (slightly) less interesting though far better paid topics. Many chose that route and (at least for me) it has worked out pretty well. Many of the post-docs left behind in academia could not help but sniff at their cat-picture-detecting former colleagues who went off to places like Google to enjoy free massages, free food, foosball, water slides, and whatever other manner of perk was provided for them. I myself have endured some of this sniffing whilst toiling away in far less exotic data-workers’ paradises, so I can imagine it must be much worse for the Googlers.

Now, before you start saying how awful the Amazons and Googles of the world are for plundering the bioinformatics talent pool, let me relate just one more story. I was at an Xconomy Seattle biotech meeting a few years ago, and there was a guy on one of the panels, from one of the local venture funds, who was saying how much he admired the dedication of the workers at a famous biotech they were backing, because these guys had worked all night and did not ‘ride bikes to work’. He actually pointed that out: the non-bike-riding. He was saying something like, ‘Look how dedicated they are. These guys don’t ride their bikes to work.’ I found it a little funny, since those bioinformaticists may well have been very dedicated to his venture, but I wonder just how dedicated the venture capitalist was to them? This same venture fund was also a backer of a totally ridiculous internet startup called IhazCheeseburger. I am willing to bet that the pay scales at those two companies, being what they are in this market, would show that the venture fund had a distinctly pronounced excess of ‘dedication’ to the machine-learning cat-picture-detectors over the non-bike-riding, cancer-fighting sleuths.

Conclusion: The Deep Learning techniques pioneered by those cat-picture detectors were developed furthest in places with a vast supply of data, essentially limitless computational resources, and founders who recognized the importance of broad basic research programs. AlphaFold’s startling success is a testament to how powerfully these deep neural networks can learn. And now this technique is set to have a profound impact on Biology and Medicine. So I will conclude in the most self-serving way possible, by pointing out that if you were to start a machine-learning, protein-folding, cancer-fighting biotech, and you wanted to build a team of researchers who know how to use these new methods, you would do very well to treat your bioinformaticists at least as well as Google does.

Want to learn more?

Medium’s article.

Even Siraj Raval has a post on this.

Press: The Guardian, VentureBeat, TechCrunch.

Here’s an excellent blog post by Mohammed AlQuraishi titled “What Just Happened?” He was at CASP13, and he relates many of the discussions that went on there.
Here’s a quote:

“. . . this has long been a pet peeve of mine. While companies like Alphabet, Facebook, Microsoft, Intel, and IBM have real research groups with billions of dollars spent on fundamental R&D that has led to Nobel or Turing-grade research, pharmaceuticals engage in “research” so narrowly defined that it rarely contributes to our understanding of basic biology.”

For more reading: The idea of relying more on MSAs has been growing in the community for the last few years. A very similar approach using ResNets was used in this paper by Jinbo Xu of the Toyota Technological Institute at Chicago, posted on Dec 21, 2018. He also has a paper from 2016 in which he was able to correctly infer an adjacency matrix using deep ResNets, though without distances or torsion angles. The 2016 paper shows the deep network architecture that was used; AlphaFold probably followed the path laid out in that paper. Jinbo Xu has answered a few questions on this topic on Quora.
