The Possibility Scale of Characterizing Proteins + AlphaFold’s Journey

Noorish Rizvi
Published in Visionary Hub
Oct 31, 2021
[Image: Protein Molecule]

Well, we are finally here. By here, I mean we’ve figured out how to predict the shape and structure of a chain of amino acids folded up together. WOW! These folded chains are called proteins, and scientists have learned more about their structure and formation over the years in many ways, including some pretty unexpected ones that this article will cover. For now, let’s see what we already know!

Until recently, what we knew was that proteins are built from amino acids, and that proteins are the reason our cells are able to produce energy and carry out our daily tasks. More recently, attention shifted to a more complicated aspect of proteins: how they actually fold into shape. Researchers debated for a long time how we would ever work out how each protein folds, because of the astronomical number of possible combinations. With recent advancements we have arrived at this point, but why exactly is this good, and how did we learn how proteins fold?

Proteins

Proteins are large and complex molecules that are key to keeping our bodies healthy. To understand proteins on a deeper level, let’s look at a car. A car is a machine with many different parts that are all needed for it to function. Cells work similarly in that they rely on organelles to help them function. Proteins are built by cellular machines called ribosomes. Similar to how cars use gas or electricity to run, the proteins made by these ribosomes do the work of the cell and allow it to carry out tasks.

There are also numerous types of proteins, as proteins are species-dependent as well as organ-specific! The amino acids that build all these different proteins fall under three broader categories: essential, non-essential, and conditionally essential. Essential amino acids cannot be made by the body, so we must get them from foods such as meats, dairy, and eggs. Non-essential amino acids are produced naturally by our body. Finally, conditionally essential amino acids are normally produced by the body, and only need to come from food when you are unwell or under stress.

Proteins are made up of amino acids, and only 20 standard ones are used in total, so a little more simple, right? Not exactly… It helps to visualize a pearl necklace when we are talking about proteins. The shiny white pearls are the amino acids, and the string is the peptide bonds that hold them together. Now, one necklace laid out flat is not yet a working protein; it is simply a chain of amino acids. A protein is a long chain, usually dozens to thousands of amino acids, that has folded in on itself many times. The same chain could in principle fold in an enormous number of ways, and the shape it actually takes is determined by the order of its amino acids and how they interact. So, although we only have 20 amino acids, proteins are all unique because of the order of these amino acids and how the chain folds.
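
To make the necklace picture concrete, here is a minimal Python sketch that treats a chain as a string of one-letter amino acid codes. The helper names (`is_valid_chain`, `count_possible_chains`) and the example chain are made up for illustration; only the 20 standard one-letter codes are real.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_chain(chain: str) -> bool:
    """Return True if every 'pearl' in the chain is one of the 20 standard amino acids."""
    return len(chain) > 0 and all(letter in AMINO_ACIDS for letter in chain)


def count_possible_chains(length: int) -> int:
    """Number of different chains of a given length: 20 choices per position."""
    return len(AMINO_ACIDS) ** length


example_chain = "MKTAYIAK"  # hypothetical 8-residue chain, not a real protein
print(is_valid_chain(example_chain))
print(count_possible_chains(len(example_chain)))
```

Even a chain of only 8 amino acids has 20^8 (about 25.6 billion) possible orderings, and that is before any folding happens.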

Protein Folding

What we described above is known as protein folding, and it is extremely complex for many reasons. According to Levinthal’s paradox, because each link in the chain is free to bend in several ways, there is an astronomically large number of ways in which a protein could fold. We’re talking about a number as big as 10¹⁴³! Another issue is that a misfold in a certain type of protein can lead to disease. Let’s take collagen as an example. Collagen is a group of structural proteins, the amazing stuff that builds our bones, skin, and cartilage and gives them structural integrity. If this protein is malformed, the effects can be devastating, such as brittle bone disease. So, although the folding of these amino acids might seem “random,” it is actually quite specific.
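
The 10¹⁴³ figure can be reproduced with a little arithmetic. The sketch below assumes a chain with 300 rotatable backbone angles (roughly 150 amino acids with two angles each) and 3 stable positions per angle. These numbers, and the sampling rate of 10¹³ conformations per second, are illustrative assumptions rather than measurements.

```python
import math


def log10_conformations(num_angles: int = 300, states_per_angle: int = 3) -> float:
    """log10 of states_per_angle ** num_angles, the number of possible conformations."""
    return num_angles * math.log10(states_per_angle)


def log10_years_to_try_all(log10_count: float, log10_rate_per_second: float = 13.0) -> float:
    """log10 of the years needed to try every conformation at the assumed rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return log10_count - log10_rate_per_second - math.log10(seconds_per_year)


exponent = log10_conformations()
print(f"about 10^{exponent:.0f} conformations")  # about 10^143 conformations
print(f"about 10^{log10_years_to_try_all(exponent):.0f} years to try them all")
```

Under these assumptions, trying every option one by one would take vastly longer than the age of the universe, yet real proteins fold far faster than that. That mismatch is the heart of the paradox.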

Protein Effects

Because proteins control so many of our bodily functions, many diseases or disorders can often be traced back to these proteins.

The folded structure of the amino acid chain is essential to understanding a protein and its function.

By learning how proteins specifically fold for different tasks, scientists can develop better ways to correct errors in protein formation. Having the structure would be game-changing, as we would be able to understand proteins like never before. Spatially and chemically, we would see why collagen proteins give our bones the much-needed structural integrity that allows us to walk and not get injured easily. But how do we do this? If the amino acid chain could fold into an astronomical number of different shapes for each protein molecule, how can we find out which one is correct for collagen?

For years, this was a question no one could answer quickly because, with the experimental methods we had, it could take months or even years to work out the exact structure of a single protein, but Google DeepMind has changed this!

Proteins — Serving a Greater Purpose

We know that each protein is different depending on where it comes from, as this determines how the amino acids arrange themselves to form the protein molecule. It is mind-blowing to see that with recent advancements we are now able to predict the structure of proteins with high accuracy.

This advancement is absolutely huge for a number of reasons! First off, learning how the amino acids are arranged in the protein helps us to understand proteins spatially as well as chemically. Learning how proteins form can revolutionize healthcare, as we could understand why and how they work for us. However, as we discussed above, proteins are extremely varied and come in many shapes, so this would have remained extremely difficult if Google DeepMind had not made its breakthrough!

This isn’t to say it was simple, though. It took quite a bit of time, but DeepMind ended up tackling biology’s 50-year-old protein folding challenge with an AI called AlphaFold, which can predict the 3D structure of a protein from its amino acid sequence!

Prior to the launch of AlphaFold 2 in 2020, an earlier generation called AlphaFold 1 laid the foundation for the later version.

AlphaFold 1 was well built and was able to predict protein structures fairly accurately. It combined a neural network, which learned from proteins whose structures were already known, with a separate optimization step. So, here are the steps!

  1. A convolutional neural network, a type of neural network originally designed to find patterns in images, takes features of the amino acid residue sequence as its input.
  2. There are also other inputs, such as evolutionary information from alignments of related amino acid sequences. With these inputs, the network can learn which amino acids tend to change together, allowing it to make predictions for new proteins.
  3. The output of the neural network is a distance matrix: a prediction of how far apart each pair of amino acids is in the folded 3D structure. Knowing these distances is valuable because the way the chain overlaps and folds is much easier to work out once the distance between any 2 amino acids is known; it simplifies the structure prediction.
  4. Next comes a gradient descent step that is not itself learned. Gradient descent is an optimization algorithm that repeatedly makes small adjustments in whichever direction reduces an error score. In our case, it adjusts a candidate 3D structure of the protein until the distances in that structure agree as closely as possible with the predicted distance matrix, which is a quick and computationally cheap way to work out the fold.
  5. Finally, the two pieces work together: the distance matrix says how far apart the amino acids should be, and gradient descent finds a folded structure that matches those distances.
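
The last three steps can be illustrated with a toy example. This is a minimal sketch, not DeepMind’s code: the 6-point helix, the function names, the learning rate, and the number of steps are all made up for illustration, and the "predicted" distances are simply computed from the known toy structure instead of coming from a neural network.

```python
import math
import random


def distance(a, b):
    """Straight-line distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(coords):
    """Distance matrix: entry [i][j] is the distance between points i and j."""
    return [[distance(a, b) for b in coords] for a in coords]


def mismatch(coords, target):
    """Sum of squared differences between current and target distances."""
    current = pairwise_distances(coords)
    n = len(coords)
    return sum((current[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def gradient_step(coords, target, learning_rate):
    """Move every point a small step in the direction that reduces the mismatch."""
    new_coords = []
    for i, point in enumerate(coords):
        grad = [0.0, 0.0, 0.0]
        for j, other in enumerate(coords):
            d = distance(point, other)
            if i == j or d < 1e-9:
                continue
            # Derivative of (d - target)^2 with respect to this point's position.
            scale = 2.0 * (d - target[i][j]) / d
            for axis in range(3):
                grad[axis] += scale * (point[axis] - other[axis])
        new_coords.append([point[axis] - learning_rate * grad[axis] for axis in range(3)])
    return new_coords


# Hypothetical "true" structure: 6 points along a small helix.
true_coords = [[math.cos(i), math.sin(i), 0.5 * i] for i in range(6)]
target = pairwise_distances(true_coords)  # stands in for the network's predicted distances

rng = random.Random(0)  # fixed seed so the run is repeatable
coords = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
initial_mismatch = mismatch(coords, target)
for _ in range(500):
    coords = gradient_step(coords, target, learning_rate=0.01)
final_mismatch = mismatch(coords, target)
print(f"mismatch before: {initial_mismatch:.3f}, after: {final_mismatch:.3f}")
```

Starting from random positions, each step nudges the points so that their distances look more like the target matrix. The real system optimized a more elaborate score on a real protein chain, so treat this only as an illustration of the idea.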

Recently, DeepMind released the research paper on AlphaFold 2, which won the CASP14 (14th Critical Assessment of Structure Prediction) competition in 2020. AlphaFold 2 is far more accurate and efficient and can give details down to the atomic level.

AlphaFold 2 Network
  1. First, the system takes the amino acid sequence of the target protein and searches databases for similar sequences from other proteins. Lining these sequences up against each other makes what is called a multiple sequence alignment (MSA).
  2. The multiple sequence alignment lets the network compare the input with the sequences of existing proteins and identify similarities and patterns among them.
  3. Using this, the AlphaFold 2 network builds a first rough representation of the protein, called the “pair representation,” which describes which amino acids are likely to be closest to and farthest from each other.
  4. After this, the system feeds the multiple sequence alignment and the pair representation into a transformer. The transformer decides which pieces of data are most relevant to each other. As a result, the MSA and pair representation become more detailed and accurate as they pass through the network, making the pair interactions of amino acids, as well as the protein's structure, clearer.
  5. After this, the refined pair representation and MSA are used to build a 3D model of the protein. Here we can see how much more streamlined AlphaFold 2 is than AlphaFold 1, as it no longer needs a separate gradient descent step. It produces a highly accurate 3D structure directly.
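
The "deciding what is most relevant" part of a transformer is called attention, and its core can be sketched in a few lines. This is a minimal, assumed illustration: the feature vectors are invented, and a real transformer (including the one in AlphaFold 2) adds learned weights, many attention heads, and many layers on top of this basic weighting step.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with queries = keys = values = features."""
    dim = len(features[0])
    all_weights, outputs = [], []
    for query in features:
        # Similarity of this item to every item, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: weighted average of all items' features.
        outputs.append([sum(w * value[c] for w, value in zip(weights, features))
                        for c in range(dim)])
    return outputs, all_weights


# Hypothetical 2-number feature vectors for 4 amino acids.
residue_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
updated, weights = self_attention(residue_features)
print([round(w, 3) for w in weights[0]])
```

For the first amino acid, the largest weights go to the entries whose features are most similar to its own, so those entries contribute most to its updated representation.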

Why This is Important + TL;DR

Being able to predict 3D protein structures was a huge breakthrough for a reason!

  • Proteins are the basis of life. They are built by ribosomes and then go to work throughout the cell, including in the mitochondria, where they help produce energy for our body.
  • We learned that the specific way in which the amino acids are folded in the protein determines its function, so if we learn the correct way they are supposed to be folded, we can also recognize when a fold is incorrect! The root cause of many diseases can be investigated this way.
  • Seeing how specific proteins are, the hope is that in the future we could potentially design proteins to become disease killers for sicknesses such as cancer. Chemotherapy attacks not only the bad cells but healthy cells as well, whereas designed proteins could potentially be told to attack only specific cells. Learning the structures has brought us one step closer to this!


Hi there! My name is Noorish, and I am a 15-year-old biotechnology enthusiast. I am incredibly interested in biomedical innovation in developing countries.