Machine-Learning Driven Drug Repurposing for SARS-CoV-2

We developed artificial intelligence to identify antiviral compounds that merit further study as possible pharmaceutical treatments for COVID-19.

Semih Cantürk
Zetane

--

A screenshot of the convolutional neural network we developed for this study, shown in the Zetane AI Engine.

Contributing authors:

Semih Cantürk, Aman Singh, Patrick St-Amant, Jason Behrmann, PhD, and fellow colleagues of Zetane Systems

Contact the authors by email at info@zetane.com.

This is a short-form version of a research article you will find here on arXiv.

All results here are preliminary and have yet to undergo external peer-review. The findings in this article should not be used to guide clinical decision-making, nor do these findings identify a definitive treatment for COVID-19.

Our work uses artificial neural networks (ANNs) to discover the underlying associations between the amino acid sequences of viral proteins and the antiviral agents that are effective against them. We then use the patterns uncovered by our ANNs to identify potential antiviral agents that may be effective against comparable amino acid sequences found in SARS-CoV-2, the virus at the centre of the worldwide COVID-19 pandemic. We used public data sources to build a dataset that pairs amino acid sequences with the antivirals known to act against the viruses carrying those sequences. This dataset served to train long short-term memory networks (LSTM) and convolutional neural networks (CNN). Preliminary results from our AI models yield safe-in-human drug candidates for treating SARS-CoV-2 that merit further investigation. In particular, our preliminary results suggest Brincidofovir, Tilorone, Rapamycin, Artesunate, Cidofovir, Valacyclovir, Lopinavir and Ritonavir are of notable interest, given that some of these results complement recent findings from noteworthy clinical studies, such as “Triple combination of interferon beta-1b, lopinavir–ritonavir, and ribavirin in the treatment of patients admitted to hospital with COVID-19: an open-label, randomised, phase 2 trial”, recently published in The Lancet.

Background

Artificial intelligence (AI) technology is a recent addition to bioinformatics that shows much promise in streamlining the discovery of pharmacologically active compounds (Stephenson et al., 2019). Machine learning, a subdomain of AI, is particularly useful for identifying how drugs effective in one context might have utility in an unknown clinical context or against a novel pathology (Napolitano et al., 2013). The technology works by finding patterns in how a pharmaceutical molecule exerts its activity by binding to defined regions of a biomolecule, such as a segment of a protein.

Past research provides a sizable bank of information concerning drug–biomolecule interactions. Training machine learning models on these findings can uncover patterns, which then serve to make inferences and predict future outcomes. Using drug repurposing as an example, we can train machine learning algorithms to identify patterns in how antiviral compounds bind to proteins from diverse virus species. We aim to train an AI model so that, when presented with the proteome of a novel virus, it identifies protein segments similar to those characterized in past studies. The final output from the AI model is a best-fit prediction of which known antivirals are likely to associate with those familiar protein segments.

The application of AI in biomedical research provides new means to conduct in-silico exploratory studies and high-throughput analyses using information already available. In addition to deriving more value from past research, researchers can develop AI technology in relatively short periods of time.

These benefits are of particular interest for the current COVID-19 health crisis. The novelty of the SARS-CoV-2 virus requires that we base health interventions on past observations. As we grapple with an unforeseen pandemic with no known treatments or vaccines, and with every passing day marked by the loss of thousands of lives, time is a precious commodity in short supply.

The potential for rapid innovation from AI technology is thus of utmost significance. The ability to conduct many complex analyses with AI lets us generate insights quickly that can steer future studies toward fruitful results. Predictions made by AI can also provide complementary evidence when paired with less-robust studies that are faster and more practical to complete. This offers greater support for sound decision-making as we wait to complete lengthy, though necessary and rigorous, clinical trials for therapeutics and vaccines. As a company specializing in AI, we present here our attempts to develop AI models that can guide efforts to repurpose current antiviral drugs as therapeutics against COVID-19.

Data

Sourcing & Preparation

We used two main data sources for this investigation. The first is DrugVirus (Andersen et al., 2020), a database of broad-spectrum antiviral agents (BSAAs) and the viruses they inhibit. It covers 83 virus species and 126 antiviral compounds, providing the status of each compound–virus pairing. These statuses fall into eight categories representing the progression from experimental to clinical drug trial phases: Cell cultures/co-cultures; Primary cells/organoids; Animal model; Clinical trial Phase I; Phase II; Phase III; Phase IV; and Approved.

Figure 1: A section of the DrugVirus database, displayed as a pivot table that correlates tentative and known antivirals with the viruses they inhibit. Source: https://drugvirus.info/.

The second is the National Center for Biotechnology Information (NCBI) Virus Portal; as of April 2020, this database provides approximately 2.8 million protein (amino acid) sequences and 2 million nucleotide sequences from viruses with humans as hosts. Each row of this database contains an amino acid sequence specimen from a study, along with metadata that includes the associated virus species.

By merging these two databases, we paired each amino acid sequence with the list of antivirals identified as effective against the species from which the sequence originates. We considered a drug–virus pair “effective” only if it had reached Phase II or later drug trials, signifying some success with human subjects.
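To make the merge concrete, here is a minimal pandas sketch of the idea. The file names, column names (Species, Sequence, Virus, Drug, Status) and the phase labels are illustrative assumptions, not the exact schemas of the two databases.

```python
import pandas as pd

# Hypothetical inputs (column names are assumptions):
# ncbi_df: one row per protein specimen -> Accession, Species, Sequence
# drugvirus_df: one row per drug-virus pairing -> Drug, Virus, Status
ncbi_df = pd.read_csv("ncbi_virus_proteins.csv")
drugvirus_df = pd.read_csv("drugvirus_pairs.csv")

# Keep only pairings that reached Phase II or later trials.
effective = drugvirus_df[drugvirus_df["Status"].isin(
    ["Phase II", "Phase III", "Phase IV", "Approved"])]

# Pivot into a virus-by-drug binary matrix: the length-126 label vector.
labels = (effective.assign(hit=1)
          .pivot_table(index="Virus", columns="Drug", values="hit", fill_value=0))

# Attach the label vector of each species to every one of its sequences.
dataset = ncbi_df.merge(labels, left_on="Species", right_index=True, how="inner")
```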

Table 1: A section of the merged database used in model training. Each sequence is associated with a virus species and paired with a length-126 binary vector over the antiviral drugs (only the first six elements of the vector are visible), where a “1” denotes that the drug is considered effective against the virus from which the given amino acid sequence originates.

Preprocessing

Upon inspection of the data, we found many duplicate or near-identical viral amino acid sequences. This is expected, and our models would readily exploit these similarities between sequences from the same virus species. To reduce this exploitability and pose a more challenging problem, we removed duplicate sequences that belonged to the same species and had the exact same length. This reduced the size of the dataset by approximately 98% and brought the number of SARS-CoV-2 amino acid sequences from 481 down to 98.

Our main database also contained a class imbalance in the number of times each virus species appeared. We oversampled rare viruses (e.g., West Nile virus: 175 sequences), excluded very rare species that make up less than 0.5% of the unique samples in the dataset (e.g., Andes virus: 4 sequences), and undersampled common viruses (e.g., Hepatitis C: 16,040 sequences). This produced a more modest dataset of 30,479 amino acid sequences, with each virus contributing between 400 and 900 samples. We kept the dataset small both to enable easier model training and validation in early iterations and to handle data imbalance more smoothly.
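A minimal sketch of the deduplication and resampling described above, continuing the hypothetical `dataset` frame from the previous sketch; the thresholds follow the text, while the implementation details are assumptions.

```python
# Treat sequences of the same species and the same length as duplicates.
dataset["SeqLen"] = dataset["Sequence"].str.len()
unique = dataset.drop_duplicates(subset=["Species", "SeqLen"])

# Exclude very rare species (< 0.5% of the unique samples)...
counts = unique["Species"].value_counts()
frequent = counts[counts >= 0.005 * len(unique)].index
unique = unique[unique["Species"].isin(frequent)]

# ...then oversample rare species and undersample common ones into a 400-900 range.
def resample(group, low=400, high=900):
    if len(group) < low:
        return group.sample(low, replace=True, random_state=0)   # oversample
    if len(group) > high:
        return group.sample(high, random_state=0)                # undersample
    return group

balanced = unique.groupby("Species", group_keys=False).apply(resample)
```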

The class imbalance problem also presented itself in the antiviral compounds. Even with balanced virus classes, the number of times each drug occurred in the dataset varied greatly, because some drugs are broad-spectrum and thus apply to more viruses than others. To alleviate this, we computed class weights for each drug as a dictionary, which we then provided to the models during training. This enabled a fairer assessment and a more varied distribution of drugs in the predicted outputs.
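The class weights can be derived roughly as follows; this uses a simple inverse-frequency weighting, which is one plausible choice rather than our exact formula.

```python
drug_columns = list(labels.columns)            # the 126 antiviral names

# Rare drugs receive larger weights so the loss does not ignore them.
positives = balanced[drug_columns].sum()
class_weight = {i: len(balanced) / (2.0 * max(int(p), 1))
                for i, p in enumerate(positives)}
# Later passed to model.fit(..., class_weight=class_weight).
```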

The final step of data processing involved generating the training and validation sets. We set up the splits in two different ways, resulting in two experimental setups. Experiment I is based on a standard randomized 80% training/20% validation split of the main dataset.

For a more “challenging” setup, which we refer to as Experiment II, we split the data by virus species; in this case, we forced our models to predict drugs for species they were not trained on, so they had to identify familiar substructures in the protein sequences in order to suggest drugs. In this latter setup, we also guaranteed that the SARS-CoV-2 sequences were always in the test set, along with three other viruses randomly picked from the dataset.
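A sketch of the Experiment II split under the same assumed column names; SARS-CoV-2 is pinned to the test set and three additional species are drawn at random.

```python
import numpy as np

rng = np.random.default_rng(42)
species = balanced["Species"].unique()
candidates = [s for s in species if s != "SARS-CoV-2"]
test_species = set(rng.choice(candidates, size=3, replace=False)) | {"SARS-CoV-2"}

test_df = balanced[balanced["Species"].isin(test_species)]
train_df = balanced[~balanced["Species"].isin(test_species)]
```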

Models

A growing number of studies demonstrate the success of artificial neural networks (ANNs) in evaluating biological sequences for drug repositioning and repurposing (see Donner, Kazmierczak, and Fortney, 2018; Zeng et al., 2019). Previous work on training neural networks on nucleotide or amino acid sequences has been successful with recurrent models such as gated recurrent units (GRU), long short-term memory networks (LSTM) and bidirectional LSTMs (biLSTM), as well as with 1D convolutions and 2D convolutional neural networks (CNN) (Lee and Nguyen, 2016; Hou, Adhikari, and Cheng, 2018). We focus on these architectures here: we conducted our experiments with an LSTM that incorporates 1D convolutions and bidirectional layers, as well as two CNNs that differ in their embedding layers. The CNN results reported below were obtained from the CNN without the embedding layer, which performed better.

LSTM & 1D Convolutions

A tokenizer (keras.preprocessing.text.Tokenizer) was used to encode the FASTA sequences (which are sequences of characters) into vectors consumable by the network. The sequences were then padded with zeros or truncated to a fixed length of 500 to maintain a fixed input size.
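A minimal sketch of this encoding step; `train_df` and the column names carry over from the hypothetical frames in the Data section.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500

# char_level=True so that every amino acid letter becomes its own token.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(train_df["Sequence"])

X_train = pad_sequences(tokenizer.texts_to_sequences(train_df["Sequence"]),
                        maxlen=MAX_LEN, padding="post", truncating="post")
y_train = train_df[drug_columns].values
```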

Once trained, we save the model and send it to the Zetane environment, where we can inspect the model architecture and filters in an interactive setting (Figure 2 provides a video and image summary of the LSTM model):

Figure 2: Visualizations of MaxPool, Conv1D and ReLU filters of the LSTM model in Zetane.
Figure 3: The LSTM model trained in Keras, and the code snippet used to send the model and its inputs to Zetane.
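Figure 3 shows the actual Keras code; purely as an illustration of a model combining the layer types named above (a hypothetical embedding, 1D convolution, max-pooling and a bidirectional LSTM), a sketch could look like the following, with layer sizes chosen arbitrarily:

```python
from tensorflow.keras import layers, models

def build_lstm_model(vocab_size, n_drugs=126, max_len=500):
    # Layer sizes are placeholders; the real architecture is shown in Figure 3.
    return models.Sequential([
        layers.Embedding(vocab_size + 1, 64, input_length=max_len),
        layers.Conv1D(64, 7, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(n_drugs, activation="sigmoid"),  # one probability per drug
    ])
```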

Convolutional Neural Network

For the CNN, the input features were one-hot encoded using a method similar to the tokenizer approach, in which each amino acid character is assigned an integer, except that the integers are predetermined by the order of the FASTA alphabet/charset: this assists interpretability when examining the 2D input arrays as images. The inputs are again fixed at a length of 500 amino acid characters, resulting in 500x28 images, where 28 is the number of elements in the FASTA charset.
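A sketch of this fixed-alphabet one-hot encoding. The 28-character alphabet below (20 standard amino acids, six ambiguity/rare codes, plus gap and stop symbols) is an assumption made for illustration.

```python
import numpy as np

FASTA_ALPHABET = "ACDEFGHIKLMNPQRSTVWYBJOUXZ-*"   # assumed 28-character charset
CHAR_TO_IDX = {c: i for i, c in enumerate(FASTA_ALPHABET)}
MAX_LEN = 500

def one_hot_encode(seq, max_len=MAX_LEN):
    """Encode one amino acid sequence as a max_len x 28 binary 'image'."""
    img = np.zeros((max_len, len(FASTA_ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(seq[:max_len]):
        img[i, CHAR_TO_IDX.get(ch, CHAR_TO_IDX["X"])] = 1.0   # unknowns -> X
    return img

X_cnn = np.stack([one_hot_encode(s) for s in train_df["Sequence"]])
```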

We again save the trained model and send it to the Zetane environment, where we can inspect the model architecture and filters in an interactive setting, as demonstrated in Figure 4 (a video of the model is available here):

Figure 4: Inspection of 2D filters for Conv2D and Elu layers in Zetane.
Figure 5: The CNN model trained in Keras, and the code snippet used to send the model and its inputs to Zetane.
Figure 6: Intermediate layer visualizations for multiple amino acid sequences shown in Zetane. The uniform coloured sections in the Conv2D & Elu layers correspond to the 0-padded sections of the sequences when they are shorter than 500 characters.
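Again, Figure 5 shows the real model; a hypothetical Conv2D/ELU stack over the 500x28 one-hot inputs might look like the sketch below, with filter counts and kernel sizes chosen arbitrarily:

```python
from tensorflow.keras import layers, models

def build_cnn_model(n_drugs=126, max_len=500, n_chars=28):
    # Layer sizes are placeholders; the real architecture is shown in Figure 5.
    return models.Sequential([
        layers.Reshape((max_len, n_chars, 1), input_shape=(max_len, n_chars)),
        layers.Conv2D(32, (7, 7), activation="elu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (5, 5), activation="elu", padding="same"),
        layers.GlobalMaxPooling2D(),
        layers.Dense(n_drugs, activation="sigmoid"),
    ])
```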

Experiment Setup

We conducted the experiments using Keras models and TensorFlow. We used binary cross-entropy (BCE) loss, the Adam optimizer, and precision and recall as metrics, since accuracy tends to be unreliable given the class imbalance and the sparse nature of our outputs. After training and validation, predictions were made on the validation set and the results were post-processed for interpretability. In post-processing, we applied a threshold to the sigmoid outputs of the neural network, which assign each drug a probability of being a potential antiviral for a given amino acid sequence. After experimenting with different values, we settled on a threshold of 0.2.
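Putting the pieces together, a sketch of the training and post-processing loop might look like the following; it reuses the hypothetical builders and arrays from the earlier sketches, and the epoch count and batch size are just one of the combinations listed below.

```python
import tensorflow as tf

model = build_cnn_model()
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

model.fit(X_cnn, y_train, validation_split=0.2,
          epochs=10, batch_size=64, class_weight=class_weight)

# Post-processing: a drug is "selected" for a sequence when its sigmoid
# probability exceeds the chosen threshold (0.2 after experimentation).
THRESHOLD = 0.2
probs = model.predict(X_cnn)     # in practice, the held-out validation inputs
predicted = probs > THRESHOLD
```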

Table 2: A section of sample outputs for amino acid sequences and their associated antivirals. Post-processing outputs a list of drugs that were selected along with the respective probabilities of the drugs being “effective” against the virus with the given amino acid sequence.

We also conducted some hyperparameter tuning in our experiments. A subset of the hyperparameter values tested is:

  • Learning rate: 1e-2, 1e-3, 1e-4
  • Batch size: 32, 64, 128
  • Threshold: 0.9, 0.7, 0.5, 0.2
  • Number of epochs: 5 to 20
  • Maximum length of sequences: 500

Note that for this first iteration of experiments, the scope of hyperparameter tuning, as well as the range of ANN architectures, is limited; there is room for improvement on this front.

Preliminary Results

Experiment I: 80%/20% Train-Test Split

In the regular setup, we performed an 80%/20% train-test split on our data of 30,479 amino acid sequences. We report our best results based on validation F1-score and plots of the relevant metrics, which appear in Figure 7.

Figure 7: Metrics table and plots for Experiment I on LSTM & CNN.

Our models handled the regular task successfully, achieving a 0.958 F1-score in a multi-label, multi-class problem setting. This means the models aptly recognize the virus species from the amino acid sequences of protein substructures and assign the inhibiting antiviral drugs with high accuracy. These satisfactory results led us to implement Experiment II.
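For reference, the F1-score is the harmonic mean of precision and recall; in a multi-label setting one plausible way to compute it (the exact averaging scheme is not detailed here) is micro-averaging over all drug labels:

```python
from sklearn.metrics import f1_score
import numpy as np

# Toy binary label matrices of shape (n_sequences, n_drugs), for illustration only.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(y_true, y_pred, average="micro"))   # = 2PR / (P + R) overall
```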

Experiment II: Predictions on Unseen Virus Species

In this more challenging setup, we asked the models to predict inhibiting drugs for virus species that are absent from the training dataset. This meant the models were unable to recommend drugs by “recognizing” the virus from the amino acid sequence, and therefore had to rely only on protein substructures in the sequences when assigning drugs. In the results summarized in Figure 8, the test set consists of SARS-CoV-2, Herpes simplex virus 1, Human Astrovirus and Ebola virus, whose sequences were removed from the training set.

Figure 8: Metrics table and plots for Experiment II on the CNN. For the table on the right, “Count” represents how many times each drug was flagged as potentially effective against Herpes simplex virus 1 sequences, and “Average %” denotes the average confidence over all instances of the drug.

We see here that the CNN had issues with convergence. The accuracies are clearly below their counterparts in the regular setup, though this is expected. We now turn to the actual predictions on the sequences and attempt to interpret them.

Examine, for instance, the drug predictions for Herpes simplex virus 1 (HSV-1). Here our CNN is quite successful: all drugs in the DrugVirus database that have reached Phase II or later trials for this virus are among the most frequently predicted by the CNN, which we consider very encouraging given that our model had not seen HSV-1 sequences before (see Table 3).

Table 3: All six drugs in the database that have reached Phase II or later trials for HSV-1 are predicted by our model. Three of the top five predictions are approved antivirals for HSV-1, and the only remaining approved antiviral is predicted 11th among 126 antivirals.
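Tables such as Table 3 and the one in Figure 8 can be reproduced, in principle, by aggregating the thresholded per-sequence predictions for one virus; a sketch, assuming the `probs`, `THRESHOLD` and `drug_columns` variables from the earlier sketches:

```python
import pandas as pd

# probs: sigmoid outputs for the sequences of a single virus, shape (n_seq, 126)
selected = probs > THRESHOLD
summary = pd.DataFrame({
    "Drug": drug_columns,
    "Count": selected.sum(axis=0),                  # times the drug was selected
    "Average %": [100 * probs[selected[:, j], j].mean() if selected[:, j].any() else 0.0
                  for j in range(probs.shape[1])],  # mean confidence when selected
}).sort_values("Count", ascending=False)
```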

Predictions for SARS-CoV-2

Table 4: Predictions for SARS-CoV-2 sequences by the CNN (Left) and LSTM (Right) models.

With some variation between the two, both the LSTM and the CNN converge on a number of drugs: ritonavir, lopinavir (both Phase III for MERS-CoV) and tilorone (Approved for MERS-CoV) are the top three candidates for both models, while brincidofovir, rapamycin, cidofovir, valacyclovir and ganciclovir rank high in both lists.

Most of the remaining drugs are present in both lists as well. The LSTM is more conservative in its predictions than the CNN, and the overall counts for SARS-CoV-2 are significantly lower than for Herpes simplex virus 1 in both models, pointing to a relative lack of confidence on the models’ part when predicting on SARS-CoV-2 sequences.

Discussion and Future Work

By merging common techniques in machine learning and biomedical sciences, the research community gains additional strategies to advance innovation in health at ever-growing speed. We provide but one example of the numerous and diverse studies underway across the globe in a concerted effort to mitigate a sudden public health crisis. Here we attest to the notable strengths of employing AI to identify tentative antivirals. With limited resources, our team at Zetane Systems, with expertise in machine learning, software development and biomedical sciences, was able to make use of public datasets to advance knowledge of possible treatments for a previously unknown virus, all within a month and a half of concerted effort. Such speed and resourcefulness demonstrate how small research groups can implement AI technologies during periods of rapid change and uncertainty.

We note that the current discussion of our results will be narrow; time constraints limit our ability to evaluate our findings against an exhaustive review of the scientific literature on antiviral treatments for coronaviruses. Instead, we compare our findings to a handful of recent publications in order to demonstrate the strengths and limitations of this study. Regardless, the preliminary results herein show promise and merit further investigation.

To begin, our AI models predict that some antivirals showing promise as treatments against MERS-CoV may also be effective against SARS-CoV-2. These include the broad-spectrum antiviral tilorone (Ekins et al., 2020) and the drug lopinavir (Yao et al., 2020), the latter of which is now in Phase 4 clinical trials to determine its efficacy against COVID-19 (Basha, 2020). This makes sense given that both are coronaviruses and thus share a high degree of similarity in their genetic sequences and protein structures. Such observations strongly suggest that our AI models can recognize reliable patterns between particular antivirals and species of viruses containing homologous amino acid sequences in their proteomes.

Additional observations that support our findings come from a study in The Lancet published a week prior to this article (Hung et al., 2020). This open-label, randomised, phase 2 trial observed that the combined administration of interferon beta-1b, lopinavir–ritonavir and ribavirin provides an effective treatment of COVID-19 in patients with mild to moderate symptoms. Both of our AI models flagged two of the drugs in that trial (note that interferon was not part of our datasets). In terms of number of occurrences (“Count”), the CNN model ranked ritonavir at 8th place (tied with letermovir) and ribavirin at 9th place; the LSTM model ranked ritonavir at 7th place (tied with lopinavir, cyclosporine and rapamycin) and ribavirin at 9th place (tied with 7 other compounds). Also high on our lists is the antimalarial drug artesunate, which is now in Phase 2 clinical trials for COVID-19. Such observations are encouraging. They suggest that AI models can have value in identifying potentially therapeutic compounds that merit priority for advanced clinical trials, and they add to a growing body of evidence supporting the use of AI technology to streamline drug discovery. From that perspective, our AI models suggest that the broad-spectrum antiviral brincidofovir, for instance, may be a top candidate for COVID-19 clinical trials in the near future.

The list of promising antivirals identified here has some notable discrepancies with emerging research findings. For instance, our AI models did not highlight the widely available anti-parasitic drug ivermectin. One research study, published while we were completing this manuscript, observed that ivermectin could inhibit the replication of SARS-CoV-2 in vitro (Caly et al., 2020). Further investigations will need to assess how in-silico studies using AI compare to other research methods; indeed, no single research method is the be-all and end-all.

Another study, a large-scale drug-repositioning survey for SARS-CoV-2 antivirals (Riva et al., 2020), demonstrates that predictions made by our AI models have notable limits. That study screened a library of nearly 12,000 drugs and identified six candidate antivirals for SARS-CoV-2 that merit further clinical evaluation: the PIKfyve kinase inhibitor Apilimod, the cysteine protease inhibitors MDL-28170, Z LVG CHN2, VBY-825 and ONO 5334, and the CCR1 antagonist MLN-3897. It comes as no surprise that our AI models did not identify these six compounds, because our datasets did not contain them. Here we observe a well-known fact concerning AI: the technology is only as good as the data used to build it. Future efforts to strengthen our AI models will thus require us to incorporate a growing bank of novel data from emerging research findings into our model-training protocols.

In terms of our AI models, better feature extraction could improve predictions drastically, enabling our models to detect finer protein subsequences that might associate well with antivirals. This step involves better data engineering and collaboration with domain experts in applied bioinformatics to better understand the nature of our data and improve our data-processing pipeline. Here are proposals for future work that could strengthen the performance of our AI models:

  • Including additional analyses by domain experts may lead to a better understanding of the antivirals and the protein sequences they target. This would simultaneously assist in understanding which peptide structures the drugs suggested by our models target and shed light on the models’ decision-making process.
  • Better handling of duplicates or “pseudo-duplicates” can improve the quality of the datasets. At the moment, duplicates are flagged only on virus species and amino acid sequence length, even though sequences of the same length belonging to the same species are not necessarily identical. We can improve the current approach by using string-similarity measures such as the Dice coefficient, cosine similarity or Levenshtein distance.
  • A possible approach is to use a vectorizer (e.g., a Tf-Idf vectorizer) to extract features as n-grams (short sequences of characters), which has achieved success in similar problems (Szalkai and Grolmusz, 2017); a minimal sketch of this idea appears after this list. Other unsupervised learning methods, such as singular value decomposition, have also been used to encode sequence information and may be applicable to our study (Wu et al., 1995).
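As an example of the last point, here is a minimal sketch of character n-gram feature extraction with scikit-learn's TfidfVectorizer; the n-gram range is an illustrative assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each amino acid sequence as a plain string and extract 3- to 5-gram features.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X_ngrams = vectorizer.fit_transform(train_df["Sequence"])
# X_ngrams is a sparse matrix that could replace or complement the tokenized
# and one-hot inputs fed to the current models.
```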

We welcome constructive feedback concerning this study and offer an open call to partner with organizations that would like to use our AI technology to advance their own research. Time is of the essence in our efforts to counter COVID-19. The rapid results obtained here, together with the knowledge that numerous investigations are underway, should provide some solace that treatments for the current pandemic are well within reach. Without question, our concerted efforts will ensure that it’s going to be okay, or as we say in our neck of the woods, ça va bien aller.

Declared conflicts of interest

The authors declare they will not obtain any direct financial benefit from investigating and reporting on any given pharmaceutical compound. This study was funded by the authors’ employer, Zetane Systems, which produces software for AI technologies implemented in industrial and enterprise contexts.

Authors’ contributions

Cantürk wrote the manuscript; Cantürk, Singh and St-Amant conducted all research and developed the AI technology; Behrmann completed an internal peer-review and brief literature review, edited and contributed to the writing of the final manuscript.

Acknowledgements

We would like to thank the administrators of DrugVirus and the NCBI Virus Portal for providing the datasets that are central to this study. We appreciate comments on preliminary drafts of this manuscript from Dr Tariq Daouda of Massachusetts General Hospital, the Broad Institute and Harvard Medical School.

References

[1] N. Stephenson et al., “Survey of Machine Learning Techniques in Drug Discovery,” Current Drug Metabolism, vol. 20, no. 3, pp. 185–193, 2019, doi: 10.2174/1389200219666180820112457

[2] F. Napolitano et al., “Drug repositioning: a machine-learning approach through data integration,” Journal of Cheminformatics, vol. 5, no. 30, June 2013, doi: 10.1186/1758-2946-5-30

[3] P. I. Andersen et al., “Discovery and development of safe-in-man broad-spectrum antiviral agents,” International Journal of Infectious Diseases, vol. 93, pp. 268–276, Feb. 2020, doi: 10.1016/j.ijid.2020.02.018

[4] Y. Donner, S. Kazmierczak, and K. Fortney, “Drug Repurposing Using Deep Embeddings of Gene Expression Profiles,” Mol. Pharmaceutics, vol. 15, no. 10, pp. 4314–4325, Oct. 2018, doi: 10.1021/acs.molpharmaceut.8b00284

[5] X. Zeng, S. Zhu, X. Liu, Y. Zhou, R. Nussinov, and F. Cheng, “deepDR: a network-based deep learning approach to in silico drug repositioning,” Bioinformatics, vol. 35, no. 24, pp. 5191–5198, Dec. 2019, doi: 10.1093/bioinformatics/btz418.

[6] T. K. Lee and T. Nguyen, “Protein Family Classification with Neural Networks”, 2016. URL: https://cs224d.stanford.edu/reports/LeeNguyen.pdf.

[7] J. Hou, B. Adhikari, and J. Cheng, “DeepSF: deep convolutional neural network for mapping protein sequences to folds,” Bioinformatics, vol. 34, no. 8, pp. 1295–1303, Apr. 2018, doi: 10.1093/bioinformatics/btx780

[8] S. Ekins, T. R. Lane, and P. B. Madrid, “Tilorone: a Broad-Spectrum Antiviral Invented in the USA and Commercialized in Russia and beyond,” Pharm Res, vol. 37, no. 4, March 2020, doi: 10.1007/s11095-020-02799-8

[9] TT. Yao, JD. Qian, WY. Zhu, Y. Wang, GQ. Wang, “A systematic review of lopinavir therapy for SARS coronavirus and MERS coronavirus — A possible reference for coronavirus disease‐19 treatment option,” Journal of Medical Virology, vol. 92, no. 6, pp. 556–563, February 2020, doi: 10.1002/jmv.25729

[10] SH. Basha, “Corona virus drugs: a brief overview of past, present and future,” Journal of PeerScientist, vol. 2, no. 2, April 2020, doi: 10.5281/zenodo.3747641

[11] I. F. N. Hung, K. C. Lung, E. Y. K. Tso, R. Liu, T. W. H. Chung, M. Y. Chu, et al., “Triple combination of interferon beta-1b, lopinavir–ritonavir, and ribavirin in the treatment of patients admitted to hospital with COVID-19: an open-label, randomised, phase 2 trial,” The Lancet, early release, May 2020, doi: 10.1016/S0140-6736(20)31042-4

[12] L. Caly, J. D. Druce, M. G. Catton, D. A. Jans, and K. M. Wagstaff, “The FDA-approved drug ivermectin inhibits the replication of SARS-CoV-2 in vitro,” Antiviral Research, vol. 178, p. 104787, June 2020, doi: 10.1016/j.antiviral.2020.104787

[13] L. Riva et al., “A Large-scale Drug Repositioning Survey for SARS-CoV-2 Antivirals,” bioRxiv, pre-print, April 2020, doi: 10.1101/2020.04.16.044016

[14] B. Szalkai and V. Grolmusz, “Near Perfect Protein Multi-Label Classification with Deep Neural Networks,” arXiv:1703.10663 [cs, q-bio, stat], Mar. 2017, Accessed: Mar. 30, 2020. [Online]. Available: http://arxiv.org/abs/1703.10663.

[15] C. Wu, M. Berry, S. Shivakumar, and J. McLarty, “Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition,” Mach Learn, vol. 21, no. 1–2, pp. 177–193, 1995, doi: 10.1007/BF00993384.
