How to Vectorize Antiviral Structure for Machine Learning Use Against the Novel Coronavirus

Miles Agus
Noble.AI
May 30, 2020

This post is part of the “Immunity Initiative” series, describing datasets and AI tools that could provide the key to beating COVID-19. Noble.AI is making all data and AI in this series available for free to researchers and scientists working to defeat the SARS-CoV-2 virus. Learn more at www.immunity-initiative.org

Introduction

As the world starts to reopen, social distancing remains critical as researchers continue to look for ways to stop the novel coronavirus. These research efforts primarily focus on two categories: treatments and vaccines. Antivirals, small molecules that bind to the virus to prevent replication, are one promising type of treatment. Antivirals use many different mechanisms to inhibit viral replication. Antivirals are structurally different from one another, but they share some commonalities as carbon-based molecules that sometimes include metal ions. In order to accelerate the discovery and testing of new antivirals, we need to find ways to model small molecule structure for machine learning, given that most machine learning algorithms take inputs in the form of numerical arrays like images or vectors. The central question revolves around finding the best way to represent a complex 3D shape with charges unevenly distributed around it for use in a machine learning algorithm.

In this article, I explain how to transform the representation of any small molecule from a string to a numerical vector. I describe a variety of vectorization techniques and apply them to two different datasets. Next, I qualitatively assess the different techniques using T-SNE¹ on one dataset, and I use a neural network model to quantify the effectiveness of the vectorizations on another dataset. Finally, I find that a representation that takes into account the 3D structure of the small molecule in addition to its charge distribution proves to be the most effective.

Data

I wanted to compare different approaches to generating vectorizations in order to assess how well each method encodes the activity of a molecule against certain viruses. To do this, I gathered three datasets describing the binding of molecules to proteins in two different strains of coronavirus. Each of these datasets was provided by MIT AI Cures². All molecules in the datasets were expressed in Simplified Molecular-Input Line-Entry System (SMILES) format, a string representation of the structure of a molecule.³

The PL Pro (N=233,891; 697 hits) dataset consists of experimental data, obtained using yeast models, on the inhibition of the SARS-CoV PL Protease, a protease vital to the reproduction of the SARS-CoV virus. The SARS-CoV virus first appeared in 2002 and is considered a precursor to the SARS-CoV-2 virus.⁴ (https://github.com/yangkevin2/coronavirus_data/blob/master/data/PLpro.csv)

Figure 1. A list of the SMILES strings of compounds, each with a binary 1 or 0 representing whether or not that compound inhibited the SARS-CoV PL Protease. A 1 indicates that the compound inhibited the protease, and a 0 indicates that it did not.

The MPro XChem (N=880, 78 hits) dataset consists of experimental data corresponding to the inhibition of the SARS-CoV-2 3CL Protease, which is vital to the reproduction of the SARS-CoV-2 virus. To acquire the data, researchers used x-ray crystallography to determine whether the compounds bind to the protease and inhibit the operation of the 3CL protease. (https://github.com/yangkevin2/coronavirus_data/blob/master/data/mpro_xchem.csv)

Figure 2. A list of the SMILES strings of compounds that were screened for whether or not they inhibit the 3CL Protease of the SARS-CoV-2 virus. A 1 indicates that the compound inhibits the protease, and a 0 indicates that it fails to inhibit the protease.

The Coronavirus Literature Index (N=101) dataset consists of a list of the SMILES representation of all FDA-approved antivirals considered relevant to COVID-19 because they have been mentioned in generic coronavirus literature. (https://github.com/yangkevin2/coronavirus_data/blob/master/data/corona_literature_idex.csv)

Figure 3. A list of the SMILES strings of all FDA-approved antivirals relevant to COVID-19.

Vectorizations

Natural Language Processing

In my first test, I used Mol2Vec, a natural language processing algorithm that is based on Word2Vec. Instead of treating sentences as sequences of words, Mol2Vec treats each molecule, parsed from its SMILES string, as a sequence of substructures and generates an output vector of length 300 that encodes the molecule. I leveraged a pre-trained model to generate these encodings.

E-State Fingerprint

The E-State fingerprint is made up of two subarrays. The first subarray counts the atoms of each atom type in the molecule. The second subarray holds, for each atom type, the sum of the E-State values of all atoms of that type. The E-State value is an index related to the charge of an atom. The following publication explains the actual derivation: https://pubs.acs.org/doi/abs/10.1021/ci00028a014?journalCode=jcics1.

I combined the two subarrays in two ways. The first representation (length 158) is the concatenation of the two subarrays. I refer to this representation as the E-State Fingerprint. The second representation (length 79) is the average E-State value of each type of atom. I refer to this representation as the E-State Fingerprint Average.
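The two combinations can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for this post: it assumes the per-atom-type counts and E-State sums have already been computed (as far as I know, RDKit's E-State fingerprinter returns such a pair of arrays), the function names are hypothetical, and the choice to report 0 for atom types that are absent from the molecule is my own.

```python
import numpy as np

def estate_concat(counts, sums):
    """Concatenate atom-type counts and summed E-State values (length 2 * n_types)."""
    return np.concatenate([np.asarray(counts, dtype=float),
                           np.asarray(sums, dtype=float)])

def estate_average(counts, sums):
    """Average E-State value per atom type (length n_types).

    Assumption: atom types that do not occur in the molecule get 0.
    """
    counts = np.asarray(counts, dtype=float)
    sums = np.asarray(sums, dtype=float)
    avg = np.zeros_like(sums)
    present = counts > 0
    avg[present] = sums[present] / counts[present]
    return avg

# Toy input with 3 atom types instead of 79 (the values are made up).
counts = [2, 0, 1]
sums = [3.0, 0.0, -1.5]
concat_vec = estate_concat(counts, sums)   # length 6 here, 158 for 79 atom types
avg_vec = estate_average(counts, sums)     # length 3 here, 79 for 79 atom types
```

With 79 atom types, the same two functions would give vectors of length 158 and 79, matching the lengths quoted above.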

Coulomb Matrix and EigenValues

A Coulomb Matrix is a global descriptor in which each entry represents the electrostatic interaction between the nuclei of two atoms in a molecule. Since this matrix represents the intramolecular interactions between atoms, it can be used to understand how the charge of a particular atom influences bonding with other compounds. A further derivation of a Coulomb Matrix can be found in the DScribe 0.3.5 documentation for the Coulomb Matrix.

I used two representations of this matrix. The first representation involved simply flattening the matrix into a vector. The second representation leveraged the eigenvalues of the matrix. In the end, a 136x136 matrix with 18,496 floats was represented by a single vector of 136 eigenvalues that summarizes the charge of each atom and how strongly the atoms interact with one another.
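Both representations can be sketched with NumPy. This is a minimal sketch assuming the standard Coulomb Matrix definition (diagonal entries 0.5 * Z_i^2.4, off-diagonal entries Z_i * Z_j / |R_i - R_j|) with zero padding up to a fixed maximum number of atoms. The function names, the toy molecule, and the choice to sort eigenvalues by decreasing absolute value are assumptions for illustration, not necessarily what was used for this post.

```python
import numpy as np

def coulomb_matrix(charges, coords, size):
    """Zero-padded Coulomb matrix of shape (size, size).

    Diagonal: 0.5 * Z_i ** 2.4; off-diagonal: Z_i * Z_j / |R_i - R_j|.
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    n = len(z)
    m = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = 0.5 * z[i] ** 2.4
            else:
                m[i, j] = z[i] * z[j] / np.linalg.norm(r[i] - r[j])
    return m

def coulomb_eigenvalues(matrix):
    """Eigenvalues of the symmetric matrix, sorted by decreasing absolute value."""
    values = np.linalg.eigvalsh(matrix)
    return values[np.argsort(-np.abs(values))]

# Toy example: two hydrogen nuclei (Z = 1) placed 2.0 distance units apart,
# padded to a maximum of 4 atoms (this post uses 136).
cm = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], size=4)
flat = cm.flatten()             # first representation: length size * size
eig = coulomb_eigenvalues(cm)   # second representation: length size
```

The flattened vector grows quadratically with the maximum number of atoms, while the eigenvalue vector grows linearly, which is why a 136x136 matrix shrinks from 18,496 values to 136.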

Extended Connectivity Fingerprint (ECFP4)

The ECFP4 fingerprint is a circular topological fingerprint that characterizes each substructure of a molecule. I represent the ECFP4 as a binary vector of 0’s and 1’s of a fixed length. For this project, I used a length of 128 for the small fingerprint and 1024 for the large fingerprint. A further derivation can be found here: Extended Connectivity Fingerprint ECFP — Documentation
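The effect of the fingerprint length can be illustrated without a chemistry toolkit. The sketch below assumes that each substructure has already been hashed to an integer identifier and then folds those identifiers into a fixed-length bit vector by taking the remainder. The identifiers and the function name are hypothetical, and a real ECFP4 implementation also handles the substructure enumeration and hashing that are skipped here.

```python
def fold_to_bits(substructure_ids, n_bits):
    """Fold integer substructure identifiers into a binary vector of length n_bits."""
    bits = [0] * n_bits
    for identifier in substructure_ids:
        bits[identifier % n_bits] = 1
    return bits

# Hypothetical hashed substructure identifiers for one molecule.
ids = [5, 133, 1029, 2048]
small = fold_to_bits(ids, 128)    # small fingerprint
large = fold_to_bits(ids, 1024)   # large fingerprint
```

In this toy case the 128-bit vector sets fewer distinct bits than the 1024-bit one because more identifiers collide on the same position, which is the usual trade-off between a compact fingerprint and a more discriminating one.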

Custom Similarity Vector

In order to best represent how a proposed antiviral may bind to the virus that causes COVID-19, it may be valuable to see how that candidate compares to FDA-approved antivirals used against similar viruses, e.g., SARS-CoV. To do this, I used the Coronavirus Literature Index. This dataset provides 101 compounds that can serve as baseline comparisons for each proposed antiviral. To turn these antivirals into a vector for a small molecule, I used 4 fingerprint similarity types from RDKit: RDK Fingerprint, MACCS Keys, Atom Pair Fingerprint, and Morgan Fingerprint. With each of these, I could compute a float value for the similarity between any small molecule and an antiviral. For each small molecule, I computed the scalar similarity to each of the 101 COVID-19-related antivirals and concatenated the values into a 101-length array. A derivation of molecule fingerprint and molecule similarity can be found here: Fingerprints in the RDKit

I found that the MACCS Keys fingerprint produced the most relevant similarity vector. The resulting MACCS Keys vector has length 101, with each entry representing the similarity of a small molecule to one of the antivirals. I used this vector to represent how closely a molecule resembles antivirals that are known to work.
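The construction of the similarity vector can be sketched in plain Python. The sketch assumes that each fingerprint is given as the set of its "on" bit positions and that similarity is measured with the Tanimoto coefficient, which is a common choice but an assumption here, since this post does not name the metric. The function names and the toy fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_vector(candidate_bits, reference_fingerprints):
    """One similarity value per reference antiviral, in a fixed order."""
    return [tanimoto(candidate_bits, ref) for ref in reference_fingerprints]

# Toy data: 3 reference antivirals instead of 101 (the bit positions are made up).
references = [{1, 2, 3}, {2, 3, 4, 5}, {7, 8}]
candidate = {2, 3, 4}
vec = similarity_vector(candidate, references)
```

With the 101 compounds of the Coronavirus Literature Index as references, the same function would return a vector of length 101 for every candidate molecule.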

Summary

In this section, I discussed nine methods of generating vectorizations from SMILES strings. I applied these techniques to both the MPro XChem dataset and the PL Pro dataset. The resulting vectorizations were then exported to a .csv file, which will be made available through the Immunity Initiative platform.

Results

I used a T-SNE plot to analyze nine different vectorizations of the MPro XChem dataset. Due to the large size of the PL Pro dataset, I used a machine learning model to analyze the nine different vectorizations and 15 combinations of those embeddings.

T-SNE

In order to evaluate how the data aggregated with respect to its activity in datasets like the MPro XChem dataset, I used a T-SNE plot to group the data. For this T-SNE plot I used perplexity 27 and 5,000 iterations. A T-SNE plot is commonly used to reduce a high-dimensional embedding vector, like the ones I created, into 2D space. Visualizing this information in 2D enables grouping by similarity.

Figure 4. This T-SNE plot represents the grouping of the Mol2Vec (natural language processing) embeddings. The coloring refers to the activity with respect to the MPro XChem dataset. Red means the compound fails to inhibit the enzyme, and blue means the compound inhibits the enzyme. As you can see, the groupings don’t necessarily follow the ability to inhibit an enzyme.
Figure 5. This T-SNE plot represents the groupings assigned to Figure 4 for use in Table 1.

Figure 4 is a T-SNE plot of the embedding vectors of the compounds from the MPro XChem dataset. While there is evidence of grouping in the plot, the grouping doesn’t correlate with the activity in the MPro XChem dataset. Instead, it seems that the molecules are grouped based on structural similarity. To test this assumption, I calculated the molecular similarity using a Morgan Fingerprint, which compares the substructures of two molecules. This was one of the fingerprint similarity methods I used when creating my custom similarity vector. I then created Table 1, which reports the average molecule similarity (using the Morgan Fingerprint) between each of the groups labeled in Figure 5.

Table 1. The average Morgan Fingerprint similarity score between molecules in each pair of groups. Each off-diagonal entry was calculated by averaging the similarity over every pair of molecules drawn from the two groups. Each entry on the main diagonal (yellow) was calculated by averaging the similarity between molecules in the same group.
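The averaging behind such a table can be sketched as follows. This is a minimal sketch that takes any pairwise similarity function. The Tanimoto function on sets of on-bit indices, the toy groups, and the decision to exclude a molecule's similarity with itself from the within-group average are assumptions for illustration.

```python
from itertools import combinations, product

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_between(group_a, group_b, sim):
    """Average similarity over every pair with one molecule from each group."""
    pairs = list(product(group_a, group_b))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def mean_within(group, sim):
    """Average similarity over every pair of distinct molecules in one group."""
    pairs = list(combinations(group, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Toy groups of fingerprints (the bit positions are made up).
group_1 = [{1, 2, 3}, {1, 2, 4}]
group_2 = [{7, 8}, {7, 9}]
diagonal_entry = mean_within(group_1, tanimoto)
off_diagonal_entry = mean_between(group_1, group_2, tanimoto)
```

A diagonal entry that is clearly higher than the off-diagonal entries in its row is the pattern one would expect if the groups reflect structural similarity.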

In Table 1, the highlighted groups have a much higher Morgan Fingerprint similarity score than other groups, suggesting that the molecules are being grouped by structural similarity. Since this grouping did not correspond to the activity, I needed to find a different method for ensuring that each vectorization still retains the important information that is relevant in determining if the antiviral will correctly inhibit a specific protease. To test which vectorization carried this information, I built a very simple machine learning algorithm to attempt to predict the activity of molecules.

Machine Learning Validation

To test my vectorizations, I trained a fully-connected, feed-forward neural network with one hidden layer (n=64) for 250 epochs to predict whether or not a particular compound is active against a virus. I repeated five-fold cross validation five times on the PL Pro dataset. In order to correctly represent the data imbalance, I balanced the active and inactive molecules in the train set and didn’t balance them in the test set. For each test, I used an 80/20 train/test split and then added another 2,000 inactive compounds to the test set. The size of the training set was 1,120 (560 active, 560 inactive) and the size of the test set was 2,280 (140 active, 2,140 inactive) to better represent the class imbalance in the dataset. Using this, I was able to get the data in Table 2. The data show how well molecular information is encoded in each vectorization or combination of vectorizations. (In order to combine vectorizations, I simply concatenated the input arrays.)
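The split described above can be sketched in plain Python. This is a minimal sketch of a single split, not the repeated five-fold procedure, and the function name, the seed, and the placeholder molecule names are hypothetical. The 700 placeholder actives are chosen only so that the resulting sizes match the numbers quoted above.

```python
import random

def make_split(actives, inactives, test_fraction=0.2,
               extra_test_inactives=2000, seed=0):
    """Balanced train set, imbalanced test set; returns lists of (molecule, label)."""
    rng = random.Random(seed)
    actives, inactives = list(actives), list(inactives)
    rng.shuffle(actives)
    rng.shuffle(inactives)
    n_active = len(actives)
    n_test = round(n_active * test_fraction)
    # Balanced pool: as many inactives as actives; the rest are spare.
    balanced_inactives = inactives[:n_active]
    spare_inactives = inactives[n_active:]
    test = ([(m, 1) for m in actives[:n_test]]
            + [(m, 0) for m in balanced_inactives[:n_test]]
            + [(m, 0) for m in spare_inactives[:extra_test_inactives]])
    train = ([(m, 1) for m in actives[n_test:]]
             + [(m, 0) for m in balanced_inactives[n_test:]])
    return train, test

# Placeholder molecules; real inputs would be the vectorized PL Pro compounds.
actives = [f"active_{i}" for i in range(700)]
inactives = [f"inactive_{i}" for i in range(5000)]
train, test = make_split(actives, inactives)
```

Keeping the extra inactive compounds out of the training set means the model learns from balanced classes but is scored on something closer to the real class ratio.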

Table 2. This table shows the average accuracy and F1 score of each vectorization using the machine learning model, averaged across five repetitions of five-fold cross validation.
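Reporting the F1 score alongside accuracy matters on a test set this imbalanced. As a worked example using the test-set sizes quoted above (140 active, 2,140 inactive), the sketch below scores a trivial classifier that predicts "inactive" for every compound. The function name is hypothetical, and the convention of returning an F1 score of 0 when there are no true positives, false positives, or false negatives is my own choice.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

y_true = [1] * 140 + [0] * 2140
always_inactive = [0] * len(y_true)
acc, f1 = accuracy_and_f1(y_true, always_inactive)
# acc is 2140 / 2280 (about 0.939) while f1 is 0.0
```

A classifier that never finds a single active compound still reaches about 94% accuracy here, so the F1 score is the more informative column when comparing vectorizations.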

Table 2 suggests that a concatenation of the embeddings acquired using either of the Coulomb Matrix EigenValues combined with the ECFP4 Fingerprint is the best encoding for the antiviral compounds. This is because using this encoding for the neural network model results in the best performance on the test dataset, suggesting that this representation retains much of the information relevant to antiviral binding. I believe that this vectorization combination works best for two reasons. The ECFP4 Fingerprint vectorization reflects the structure of a molecule by recording which substructures it contains. The Coulomb Matrix EigenValues vectorization represents the charge of each atom and the 3D distances between the atoms. Together, they provide a robust representation of the molecule, and the simple artificial neural network I constructed manages to leverage this information well.

Conclusion

As an example of how to use machine learning in chemistry, I developed and discussed ways of transforming a string-based representation of a molecule into an embedding vector while retaining the features that are important in determining if a molecule can effectively bind to and inhibit an enzyme.

COVID-19 has immensely disrupted public life. At the same time, it has driven researchers to find ways to combat the ever-changing virus we face. Given the urgency, there is an opening for life-sciences researchers to use modern and novel machine learning techniques to accelerate their discovery process.

Applying existing antiviral datasets to data-driven, machine-learning research requires properly constructing numerical representations of the molecules that retain important information about each molecule. These techniques enable researchers to leverage data-driven approaches to perform in-silico screening of a very large number of antiviral candidates and filter that down to a much smaller list of candidates for further testing.

For example, the American Chemical Society has provided the CAS COVID-19 Antiviral Dataset, which consists of 50,000 potential antivirals encoded by their SMILES strings. We may be able to use machine learning to filter down this large list of possible antivirals to a much smaller list to be further researched in a lab. By accelerating this process and requiring, in theory, fewer tests before finding valid antivirals, we can shorten the timeline to new treatments against viruses like SARS-CoV-2.

Miles Agus is a sophomore at MIT studying Computer Science and Molecular Biology and an intern at Noble.AI.

About Noble.AI

Noble.AI is an industry leader in AI-powered software to accelerate science and help researchers make important discoveries more quickly. Founded in 2017 and based in San Francisco and Los Angeles, the company has raised more than $12 million in venture backing.
