Optical Random Features versus SARS-CoV-2 Glycoprotein Trajectory: Round#2
As of today, the scientific community is still fighting tooth and nail against the COVID-19 disease. As part of this effort, new Molecular Dynamics (MD) trajectories have been published by D.E. Shaw Research . So we are back today to explore how optical random features can be used to analyse such trajectories!
What Do You Mean, Back?!
Earlier this year, we explained how the No-prior-knowledge Exponentially Weighted Moving Average (NEWMA) algorithm  can be used jointly with optical random features to accelerate SARS-CoV-2 MD studies. If you missed the post, here is a quick recap of what we discussed.
The goal of MD studies is to simulate the behavior of proteins with given initial conditions. To do so, trajectories of atoms are followed over a given period of time. Typically, we wish to sample all possible shapes — or, conformations — in which a molecule may exist. The catch is that certain states are separated by high-energy barriers, making transitions from one to another extremely difficult. Therefore, it could take years or even centuries for conventional MD simulations to explore the whole conformational space! I don’t know about you, but personally I get impatient when my simulations take more than a few minutes.
Fortunately, computational chemists feel the same, and developed methods to make sampling more efficient. The basic idea behind the majority of these strategies is to flatten the energy landscape so that the system cannot go back to states it has already visited, and does not get stuck for too long in a given shape.
To picture this, let me introduce you to Rémy the Rat [no identification with actual persons (living or deceased) is intended or should be inferred]. Rémy walks in a land filled with holes, and we would like to be able to map those holes and know their depth (that is, we want to know all possible shapes and associated energies). Now, imagine that Rémy’s land is in a tank, and we can only see it from a bird’s-eye view. That makes it very difficult to see the holes or make any height measurements.
Our solution is to attach a GPS tracker to Rémy’s back — see Figure 1. Why would we do that, you may be wondering? Well, as Rémy walks around, we can track his position on the x-axis. Then, we simply say that if his position doesn’t vary much for a given time, that means that he fell in a hole. We are going to assume that Rémy is a strong swimmer. To set him free, we pour water at his location, until we see that he’s able to move freely once again. By keeping track of the amount of water we’ve poured, we can know how deep the hole was. After a while, Rémy will be able to move across his entire land: we have explored all the conformations of our molecule.
Now, the strategy we developed in our previous blog post was to use optical random features as a tool to make that process even more efficient. Saying that Rémy is stuck because he hasn’t moved that much lately is well and good but:
- What “moving that much” means? In practice, that means defining a threshold. Do we really want our mapping to be dependent on a user-defined quantity?
- This method implies that we are monitoring Rémy’s position at all times. W̶h̶a̶t̶ ̶i̶f̶ ̶w̶e̶ ̶n̶e̶e̶d̶ ̶a̶ ̶b̶a̶t̶h̶r̶o̶o̶m̶ ̶b̶r̶e̶a̶k̶?̶ Using a GPS is also extremely expensive. Just imagine: it requires launching satellite constellations, making sure they are carrying stable atomic clocks, synchronizing them, computing triangulation on the ground, etc. Correspondingly, sampling methods in MD require lots of computational resources. Not only trajectories have to be computed, but then they need to be analyzed online to determine if the molecule is stuck in a conformation or not.
Do we really have to do all those things? No! We assumed that we had a GPS tracker available for Rémy, so let’s also assume that we have an accelerometer. Accelerometers don’t use satellites as GPS trackers do: they are much simpler and cheaper. Using this tool, we can detect if Rémy is suddenly falling, and because we are smart we know that this means he is falling into a hole. We can save some time and some energy! We don’t have to track Rémy at all times, only to put an alarm when the accelerometer says that Rémy fell. And this is essentially what using the No-prior-knowledge Exponentially Weighted Moving Average (NEWMA) algorithm  does for our MD simulations. By computing online a clever statistics, relying on optical random features, NEWMA detects conformational changes.
Is It THAT Easy to Detect Conformational Changes?
As explained in our previous blog post, NEWMA relies on computing the distance between two Exponentially Weighted Moving Average (EWMA) statistics with different forgetting factors, as well as an adaptive threshold. Time frames for which the former is larger than the latter are flagged as change-points. Figure 2 outlines the mathematics behind the algorithm.
We use random projections as the embeddings Ѱ to update the two EWMA statistics. LightOn’s Optical Processing Unit (OPU) shines in this pipeline as it is faster and has a lower memory footprint than CPU algorithms. This is particularly important as we study molecules or proteins with a larger and larger number of atoms.
What Does it Have To Do with SARS-CoV-2?
SARS-CoV-2 is the virus responsible for the COVID-19 worldwide pandemic. In our previous blog post, we explored trajectories of SARS-CoV-2 in complex with a drug, lopinavir, published by Cespugli, Durmaz, Steinkellner, and Gruber  using NEWMA on OPU. It was a success: we were able to detect changes in the shape of the complex studied.
Since March 27th, D.E. Shaw Research has been releasing MD trajectories related to SARS-CoV-2 using their Anton 2 supercomputers . Moreover, videos visualizing the simulated behaviors are also available! In particular, two 10μs simulations of the trimeric SARS-CoV-2 spike glycoprotein caught my eye. They go by the charming names of DESRES-ANTON-10897136 and -10897850. The first trajectory — (a) — was initiated in the closed state (PDB entry 6VXX), and remained stable. The second trajectory — (b) — was initiated in a partially opened state (PDB entry 6VYB) and showed a part of the protein detaching a bit and then detaching even more. This is exciting as it gives us a comparison point for NEWMA.
The singular aspect of these computations is their sheer size: the glycoprotein counts no less than 46,851 atoms, or 140,553 features as each atom has three coordinates! With such scales, memory does become a problem if we consider using an algorithm such as FastFood (NEWMA FF on CPU)  to compute Random Projections (RP). The memory issue goes away when we use optical random projections (NEWMA RP on OPU). But enough chitchat, let’s dive in the results we obtained!
We start by observing trajectory (a) — initiated in the closed state. While it is mostly stable, some ever-so-slight variations in the structure can be observed in Video (a). Those are difficult to perceive with the naked eyes, but, fear not! NEWMA is here. The careful observer will notice that, indeed, NEWMA RP on OPU detected a small change in the structure. If you are not a careful observer, you may want to focus on the blue part in the upper-right blob of the glycoprotein.
Next, we have trajectory (b) — initiated in a partially opened state. In Video (b), the observations of NEWMA RP on OPU match strikingly with manifest conformational changes! Indeed, we “see” through the observation of NEWMA statistics the glycoprotein splitting, and then, its right part detaching even further.
For the sake of performance comparison, we performed the same calculation with NEWMA with the Fourier Feature algorithm implemented on a CPU. We added this new point to the performance plot shown in the first blogpost. For that many atoms, using optical random features is notably better than CPU-computed random features: the computation time was ten times faster on an OPU than on a CPU. More importantly, it still took less than a minute on an OPU. By comparison, Anton 2, the supercomputers used to produce trajectory (a) and (b), can simulate around 80μs/day for a system of this size . This translates to approximately 3 hours to produce one of such trajectories, which we can analyze… in around 40s. Pretty neat, huh?
Go Big or Go Home
Ten times faster for 140,000 features is great but it’s not earth-shattering. What about performing computations in a regime where it becomes extremely prohibitive to do so on either a CPU or a GPU ? D.E. Shaw Research also released all-inclusive versions of trajectories (a) and (b), that include water and ion atoms, in addition to the glycoprotein. All in all, this humongous computation features more than 715,439 atoms resulting in more than 2,146,317 features! Yes, you read right, more than 2 million! Note that current computations in High Performance Computing (HPC) for MD regularly goes to those numbers and beyond.
At these scales, just reading the full duration of the trajectory becomes somewhat challenging — see Figure 4.
Performing an analysis on a CPU or a GPU is out of the question: it would take too long and most importantly will require a very large memory footprint, so overall would require too much energy. NEWMA RPs implemented on the OPU can still be used to study the conformational changes of these all-inclusive versions, and do so in about two minutes. For perspective, it took me about two hours to download the 62 GB trajectory on LightOn Cloud servers, and about five minutes to read the trajectory files. So, yep. Two minutes is pretty good! Figure 5 shows the results we obtained for the first microseconds of the all-inclusive version of (b): it matches the behavior observed on the glycoprotein-atoms-only trajectory (b)!
Keep in mind that the strength of the NEWMA algorithm is that it can be used online, detecting conformational changes just as they occur in the simulated trajectory. In that case, we do not even need to store the entire trajectory: just update the NEWMA statistics using RP on an OPU on the fly. Using optical RPs allow the use of NEWMA on high to very high dimensional space seamlessly.
The takeaway message of this post is that the use of optical RPs makes it possible to analyze MD trajectories with powerful tools such as NEWMA, no matter how large the number of atoms. In the context of MD simulations, using optical RPs with the OPU allows the full HPC infrastructure to solely focus on the actual computation of trajectories.
Want to try our optical random features ? The code used to generate the results of this blog post is publicly available here, and you can reproduce our results or make your own calculations with the latest-generation Aurora OPUs on LightOn Cloud.
LightOn is a hardware company that develops new optical processors that considerably speed up Machine Learning computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud! 🌈
 D. E. Shaw Research, “Molecular Dynamics Simulations Related to SARS-CoV-2,” D. E. Shaw Research Technical Data, 2020.
 Nicolas Keriven, Damien Garreau, and Iacopo Poli. “NEWMA: a new method for scalable model-free online change-point detection” (2020). IEEE Transactions on Signal Processing. PP. 1–1. DOI: 10.1109/TSP.2020.2990597
 Marco Cespugli, Vedat Durmaz, Georg Steinkellner, and Christian C. Gruber. “6 molecular dynamics simulations of coronavirus 2019-nCoV protease model in complex with different conformations of lopinavir” (2020). DOI: 10.6084/m9.figshare.11764158.v2
 Q. V. Le, T. Sarlós, and A. J. Smola. “Fastfood — Approximating Kernel Expansions in Loglinear Time.” In: International Conference on Machine Learning (ICML), volume 28, 2013.
 D. E. Shaw et al. “Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer” (2014). SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, pp. 41–53.