Accelerating SARS-COv2 Molecular Dynamics Studies with Optical Random Features

Published in

Quantonation, Quantum Investors

11 min readMar 30, 2020

Figure 1: Visualising the spread of the Coronavirus disease 2019 (COVID19). Picture courtesy of A. Chatelain.

Today, about 60% of high-performance computing (HPC) workload performs computational chemistry and material sciences tasks. These undertakings open the door to the discovery of new materials, enhance our understanding of biological mechanisms and potentially allow for the discovery of new drugs.

One popular and straightforward technique for performing such computations is Molecular Dynamics (MD). MD computations follow trajectories of atoms over an extended period of time thereby providing detailed physical information as well as quantitative data on the chemical system at hand. Developed in the early 1950s, the technique has since increased in systems complexity with the increase in computational power afforded by Moore’s law. Larger and larger molecular trajectories can be achieved over longer periods of time.

HPC and computational chemistry developments have been intertwined over the years. On a generic hardware level, HPC architectures have evolved tremendously over the past 60 years as an answer to larger system computations. In parallel, the discovery of new algorithms has permitted ever faster implementations of these simulations (eg. Fast Multipole Method). Eventually, specific elements of the computation pipeline have been required to speed up some aspects of these simulations (e.g. Gravity Pipe (GRAPE) or more recently Anton). While the bulk of the computation is focused on obtaining trajectories, a real insight can only come through the analysis of said trajectories which in and of itself may require large resources.

In this post, we use LightOn’s Optical Processing Unit (OPU) to analyze MD trajectories computed by Cespugli, Durmaz, Steinkellner, and Gruber [1] in the particular case of the coronavirus responsible for the COVID-19 disease.

Molecular Dynamics 101

“…if we were to name the most powerful assumption of all, which leads one on and on in an attempt to understand life, it is that all things are made of atoms, and that everything that living things do can be understood in terms of the jigglings and wigglings of atoms.”
Richard P. Feynman, The Feynman Lectures on Physics

Biomolecular systems such as viruses are indeed made of atoms constantly jiggling. They can take different shapes called conformations. Identifying transitions from one conformation to another one is of biological interest. Indeed conformational changes link the structure of a biomolecule to its function within an organism. Thus, knowing these changes are key in the design of new drugs.

As the number of atoms studied has grown bigger, larger data streams are being produced at every time step of these simulations. As a result, conformational changes can be tricky to identify, especially for large molecular systems, as changes can sometimes be very local. Processing and memory limitations are also in the way of understanding these changes as they require full trajectories to be post-processed. As a consequence, detecting these changes is expensive and is performed by batch thereby requiring many computational resources. This situation calls for a more agile online process.

In the next sections, we examine the detection of SARS-CoV-2 (responsible for COVID19) conformational changes in an online fashion using LightOn’s OPU. Our approach not only speeds up the detection of conformational changes compared to methods using CPUs, but it does so with a lower memory footprint.

Sampling the Free Energy Landscape: What is It Good For?

MD is used to explore the conformational space of large molecules i.e all its possible structures. The challenge is sampling the free energy landscape of our molecule that varies with changes in conformations. MD simulations solve Newton’s laws of motion for a group of many atoms. Since they all move around constantly, their bonds constantly fluctuate and some can do so at high frequency. Hence, the time step Δt used in integration solvers is limited by the highest frequency vibration of the system. In most cases, the Carbon-Hydrogen bond stretching puts an upper bound in that frequency thereby requiring integration steps Δt’s to be of the order of the femtosecond: that’s 10⁻¹⁵ s!

In the lower frequency range, the different conformations of a protein — or metastable states — are separated by high-energy barriers. As a result, transitions between such structures are rare events that occur on a timescale of the microsecond and sometimes up to the millisecond. In order to see a single change in the free energy landscape, we potentially need to perform a sequential integration of Newton’s equations more than a billion times!

To counter this issue, enhanced sampling methods, such as metadynamics [2], have been developed by theoretical chemists. The idea behind this strategy is to modify the potential energy of a molecule by adding an extra term called bias. This bias will decrease the energy barrier between conformations, forcing the system out of the metastable state where it is located. Such algorithms make use of collective variables (CVs) that are one-dimensional coordinates assumed to describe the system in its current state. These CVs enable the “guiding” of the simulation as shown in the procedure featured in Figure 2. By adding a bias, we can push a system out of its “prison” through the addition of an extra force.

Figure 2: Schematic representation of metadynamics. The free energy of the system (solid blue line) is represented as a function of one CV. At first (a), the system (red bullet) is trapped in a metastable state. Hence, the simulation is only able to sample the region around the local minimum. Hence, the bias (yellow line) builds up in that region (b). It grows larger and larger until the system can get out of the conformational state ( c). A smaller bias is then added to this new region. Finally, the bias is large enough, so that the simulation can sample the entire energy landscape (d). Free energy can be recovered by inverting the bias.

Diffusion Maps to Identify Collective Variables

You may be wondering at this stage: what are exactly those CVs, how do we find them? Typically, they can be physical coordinates such as the angle between certain bonds in molecules. In that case, they can be chosen using intuition. Unfortunately, it is not always that simple, especially in very large systems! Several methods have been developed to automatically detect such coordinates.

In particular, Trstanova, Leimkuhler, and Lelièvre [3] proposed a method based on a dimensionality reduction algorithm called diffusion maps (DMaps). This algorithm, first introduced by Coifman and Lafon [4], relies on the computation of the eigenvalues and eigenvectors of a diffusion operator on the data. As an output, we obtain a family of embeddings of our data set into Euclidean space, called diffusion coordinates. It has been argued that those diffusion coordinates directly correlate to CVs [3]. Pretty neat!

The strategy proposed in Ref. [3] goes as follows.

1. Treat batches of m = 2000 time frames as data points. As we are dealing with molecules in three dimensions, their number of features will be N = 3 x n_{atoms}, where n_{atoms} is the number of atoms of the structure.
2. Apply the DMaps algorithm to these points.
3. If the eigenvalues of the diffusion operators have converged, it means the system has reached a metastable state.
4. Using the diffusion coordinates, identify CVs and use them to enhance the sampling (for example, following the procedure illustrated in Figure 1).
5. Detect conformational changes as changes in the spectrum.

However, using the DMaps spectrum to detect conformational changes does not really seem optimal. How do we choose the hyperparameters of the DMaps algorithm? How do we characterize the changes in the spectrum? Do we really have to expand expensive computational power to extract eigenvalues of the DMaps matrix at every m time step — each representing an O(N³) cost?

The answer is: probably not. Indeed, the authors of Ref. [5] propose a method called no-prior-knowledge exponentially weighted moving average (NEWMA) that can solve most of these intricacies and expensive computational issues.

In the next section, we propose a new strategy to enhance sampling in MD simulations by relying on the NEWMA algorithm for the detection of conformational changes. To showcase our method, we study the case of the virus responsible for the Coronavirus outbreak.

NEWMA to Detect Conformational Changes in Molecular Dynamics

Figure 3: Structures of the SARS-CoV-2 in complex with lopinavir before simulations [1].

On March 11, 2020, the World Health Organization (WHO) has declared the ongoing coronavirus disease COVID-19 a pandemic. This infection is caused by the virus “severe acute respiratory syndrome coronavirus 2” (SARS-CoV-2). To face this challenge, “the greatest since World War Two” according to Angela Merkel, a significant portion of the scientific community has been devoted to the research on the virus. In particular, simulations of the SARS-CoV-2 virus have been conducted to determine the efficiency of certain drugs [1].

We propose to explore the use of the NEWMA method for conformational change detection. This corresponds to replacing the step (5) of the strategy of Ref. [2] by NEWMA, while keeping the rest unchanged. Figure 4 outlines this technique.

Figure 4: Strategy to enhance sampling in molecular dynamics, using the NEWMA algorithm to detect conformational changes.

The idea behind NEWMA is to compute a detection statistics as the difference between two moving averages. This quantity is then compared to an adaptive threshold. If the statistics are above-said threshold, the algorithm flags that point as a changepoint. This method has the advantage of requiring no prior knowledge about the changepoints and can be computed online. Furthermore, the hyperparameters of the algorithms are selected by a heuristic computed only on the window size — that is, the number of recent samples that are to be compared with older ones. More details about the algorithm as well as the heuristics for its hyperparameters search can be found in Ref. [5].

The detection statistics are calculated using random projections (RP). These RP can be computed on CPU using methods such as Random Fourier Features (NEWMA RFF CPU) [6] or FastFood (NEWMA FF CPU) [7]. They can be computed using LightOn’s OPU (NEWMA RP OPU) as it has the advantage of being mostly insensitive to the number of features — hence, to the number of atoms! — of our dataset. LightOn’s OPU has also a much lower memory footprint as the random matrix does not need to be stored.

Figure 5: NEWMA RP OPU detection statistic (solid blue line) and adaptive threshold (solid yellow line) as a function of time frames, applied to the trajectory of the virus SARS-CoV-2. When the detection statistic is larger than the threshold for a significant number of steps, we flag those steps as changes of conformation (red dots). The changes of conformation as found by the changes in the spectrum of the diffusion operator are indicated by the vertical dotted black lines.

MD simulations introduced by Ref. [1] of the virus SARS-CoV-2 do not include any ground truth. That is, we do not know when exactly conformational changes are happening. To circumvent that issue, we compare in Fig. 4 the transitions detected using the NEWMA algorithm to the transitions detected using changes in the spectrum of the diffusion operators (DMaps). As shown in Fig. 5, the results obtained by the two algorithms match.

By using the change points detected by NEWMA, we can reduce the computation of the spectrum of the diffusion operator by a factor 4 (see Fig. 4). This is substantial as the spectrum of the diffusion operator requires the computation of pairwise distances between all atoms in the simulation and also requires the diagonalization of that matrix. As trajectories typically consider a much larger number of time frames, this effect is amplified. By computing the diffusion coordinates only at change points detected by NEWMA, much computational time can be saved!

In summary, in the new MD calculation pipeline shown in Fig. 4 and proposed in this blog post, NEWMA based detection can be efficiently computed at every time step while the coarser-grained, low-resolution DMaps algorithm can be used to obtain the coordinate representation in the manifold dataset when it is needed. This approach leads the way to faster global sampling studies, such as metadynamics.

Where Does LightOn’s OPU Shine in this New Computational Pipeline?

Finally, we also compare the performances of NEWMA using the different methods presented above to compute random projections. In addition to the SARS-CoV-2 trajectories [1], we benchmarked NEWMA using further MD simulations of molecular systems of various sizes provided by authors of Ref. [8]. Results of this investigation are shown in Fig. 6. While NEWMA FF CPU is very effective for smaller molecular systems, at 4000 atoms, computations performed using LightOn’s OPUs are already 40% faster than CPU based methods! Larger systems of atoms should see larger gains.

Figure 6: Computation times of NEWMA RP OPU (blue dots and line), NEWMA FF CPU (orange crosses and line) and NEWMA RFF CPU (green stars and line) as a function of the number of atoms of the various systems. The standard deviations are shown by vertical bars.

In summary, using the NEWMA algorithm in tandem with LightOn’s OPU is a very effective way of detecting change of conformations for large to very large molecular systems. While the largest system considered in this blog post had 5335 atoms, some systems such as lipid bilayers may be composed of tens of thousands of atoms (or more!). For such a large system, memory will become an issue for CPU based computations such as NEWMA RFF and NEWMA FF, while NEWMA performed on LightOn’s OPU will remain manageable.

In the future, studying conformational changes for SARS-CoV-2 in complex with drugs may become crucial to the discovery of a cure.

Want to identify conformational changes fast in your own molecular dataset?

The code used to generate the results of this blog post is publicly available here.

You can reproduce our results or make your own calculations with an OPU through our LightOn Cloud, available shortly. You can subscribe here.

LightOn and LightOn Cloud support Research. Please apply to the LightOn Cloud for Research Program on LightOn Cloud website.

About Us

LightOn is a hardware company that develops new optical processors that considerably speed up Machine Learning computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud! 🌈
Please follow us on Twitter at @LightOnIO , subscribe to our newsletter and/or register to our workshop series. We live stream, so you can join from anywhere. 🌍

The author

Amélie Chatelain, Machine Learning Engineer at LightOn AI Research.

Acknowlegments

We thank Igor Carron, Victoire Louis and Iacopo Poli for reviewing this blog post.

References

[1] Marco Cespugli, Vedat Durmaz, Georg Steinkellner, and Christian C. Gruber. “6 molecular dynamics simulations of coronavirus 2019-nCoV protease model in complex with different conformations of lopinavir” (2020). DOI: 10.6084/m9.figshare.11764158.v2

[2] Alessandro Laio and Francesco L. Gervasio. “Metadynamics: A method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science”. In: Reports on Progress in Physics 71.12, 2008. ISSN:00344885. DOI: 10.1088/0034–4885/71/12/126601.

[3] Zofia Trstanova, Ben Leimkuhler, and Tony Lelièvre. “Local and Global Perspectives on Diffusion Maps in the Analysis of Molecular Systems”, 2019. arXiv: 1901.06936.

[4] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W Zucker. “Geometrics diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps”. In: PNAS.102(21):7426–7431, 2005. DOI: 10.1073/pnas.0500334102

[5] Nicolas Keriven, Damien Garreau, and Iacopo Poli. “NEWMA: a new method for scalable model-free online change-point detection”, 2018. arXiv: 1805.08061.

[6] A. Rahimi, and B. Recht. “Random Features for Large Scale Kernel Machines”. In Advances in Neural Information Processing Systems (NIPS), 2007.

[7] Q. V. Le, T. Sarlós, and A. J. Smola. “Fastfood — Approximating Kernel Expansions in Loglinear Time.” In: International Conference on Machine Learning (ICML), volume 28, 2013.

[8] Kresten Lindorff-Larsen, Stefano Piana, Ron O. Dror, and David E. Shaw. “How Fast-Folding Proteins Fold”. In: Science 28 Oct 2011, Vol. 334, Issue 6055, pp. 517–520. DOI: 10.1126/science.1208351