How Deep Learning Solved the Mystery of Biology
This is the story of how deep learning tools were developed to help people predict the three-dimensional structure of proteins.
Since the middle of the 20th century, biology has faced a hard problem: predicting the 3-D structure of a protein from its sequence alone (that is, from its primary structure). The problem is very complicated because so many factors affect the result. Limited data made it harder still: in the early 90s there were only around 700 records in the PDB (Protein Data Bank), far too few for any serious analysis. Together with many other factors, this made the puzzle of computational biology practically unsolvable.
The biological concept
So what are these proteins and their structures?
It’s no secret that each cell consists of many different molecules, and if we remove all the water from a cell, about 50% of what remains is protein. Proteins do almost everything, thanks to their extraordinary plasticity and incredible functionality. Pick any active process taking place in a cell, and some protein will surely take part in it. For example, cell movement comes down to the continuous contraction of protein filaments, and the process of engulfing large particles or even small microorganisms (bacteria or viruses) is the capture and enveloping of the particle, driven by proteins that bend the cell membrane around it.
However, all this remarkable functionality is available only thanks to the three-dimensional structure of polypeptides. The process by which a polypeptide chain folds into a biologically active protein with its native three-dimensional structure is called protein folding. It is a very complex process: the protein may be transported to completely different parts of the cell and modified by other proteins along the way, and yet across different environments and conditions it reliably folds into the same structure. Folding is still not fully understood, and science will be racking its brains over it for some time.
Proteins have four levels of structure: primary, secondary, tertiary and quaternary. The primary structure of a protein, its linear amino acid sequence, determines its native conformation. The formation of secondary structure is the first step in the folding process. Its characteristic features are alpha helices and beta sheets, which form rapidly. Hydrophobic amino acid residues then pack together away from water, and this drives the formation of the tertiary structure. Finally, some folded proteins (already in their tertiary structure) can assemble with each other into a larger complex, the quaternary structure.
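To make the idea of primary structure concrete, here is a minimal Python sketch of what the input to the prediction problem looks like: a plain string of one-letter amino acid codes. The example sequence is made up for illustration, the function names are hypothetical, and the set of residues treated as hydrophobic is an assumption, since different hydrophobicity scales classify some residues differently.

```python
# Residues treated as hydrophobic here (an assumption; scales differ).
HYDROPHOBIC = set("AVILMFW")


def hydrophobic_mask(sequence):
    """Mark each residue as 'H' (hydrophobic) or '.' (anything else)."""
    return "".join("H" if aa in HYDROPHOBIC else "." for aa in sequence.upper())


def hydrophobic_fraction(sequence):
    """Return the share of residues in the sequence that are hydrophobic."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return hydrophobic_mask(sequence).count("H") / len(sequence)


if __name__ == "__main__":
    # A made-up peptide, not a real protein.
    primary_structure = "MKVLAAGIW"
    print(primary_structure)
    print(hydrophobic_mask(primary_structure))
    print(round(hydrophobic_fraction(primary_structure), 2))
```

Everything a structure predictor has to work with is contained in a string like this; the hydrophobic positions are the ones that tend to end up buried inside the folded protein.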
The computational solution
Machine Learning implementation
At the beginning of the 21st century, some smart people wondered: why not try to solve this problem with computers? Over the following decade or so the idea developed so well that by 2015 the first attempts to apply machine learning models to the analysis and prediction of protein structure had begun. But the real turning point came in 2020.
DeepMind, owned by Alphabet (Google's parent company), began building a deep neural network (DNN) for protein structure prediction called AlphaFold. In 2018, the model placed first in the overall rankings of the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition, and in 2020 its successor, AlphaFold2, took first place again at the 14th CASP. It was a real revolution, because in 2020 DeepMind significantly outperformed everyone, scoring above 90 for about two-thirds of the proteins on CASP's global distance test (GDT).
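The CASP score mentioned above, the global distance test, is commonly reported as GDT_TS: the average, over distance cutoffs of 1, 2, 4 and 8 angstroms, of the percentage of residues whose predicted position lies within that cutoff of the experimental one. The Python sketch below is a simplified illustration and not the official CASP implementation: it assumes the two structures are already superimposed and compares one point per residue, whereas the real procedure searches over many superpositions. The function name and the toy coordinates are hypothetical.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two pre-aligned lists of (x, y, z) coordinates.

    Assumes one coordinate per residue and that both structures are
    already superimposed; no superposition search is performed.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("structures must be non-empty and of equal length")
    # Distance between each predicted residue and its reference position.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Fraction of residues within each cutoff.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    # Average over cutoffs, expressed on a 0-100 scale.
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Toy data: four residues, displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
    reference = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
    predicted = [(0.5, 0.0, 0.0), (5.5, 0.0, 0.0), (11.0, 0.0, 0.0), (22.0, 0.0, 0.0)]
    print(gdt_ts(predicted, reference))
```

A perfect prediction scores 100 on this scale, so a score above 90 means that nearly every residue sits within a couple of angstroms of where experiment says it should be.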
On July 15, 2021, the AlphaFold2 paper was published in Nature as an advance access publication, along with open source software and a searchable database of species proteomes. Today the main protein databases (UniProt, Swiss-Prot, PDB) offer structures predicted by AlphaFold2 alongside protein structures obtained by classical experimental methods. It is worth noting that over the past 5 years many other deep neural networks have appeared that perform the same task, which simply shows that this concept definitely works!
Finally
I want to say that if you are interested in this topic, you can read my other articles related to machine learning in bioinformatics.
This 50-year-old grand challenge was brilliantly solved with the help of technology and science. This is a great example of how technology and ideas should work in our world.
Be human, do science 🕊