Which animal will cause the next pandemic? AI gives some hints

An AI model has found more than 20,000 previously unknown associations between viruses and animal species

Receptor.AI Company
Published in Receptor.AI
4 min read · Jul 13, 2021

Image from: https://www.genengnews.com/wp-content/uploads/2019/09/Sep25_2019_GettyImages_918201600_VirusCloseup.jpg

Most human viruses, including the most dangerous ones, have an animal origin: HIV most likely originated in non-human primates, the most dangerous flu strains come from pigs and birds, and the SARS-CoV-2 coronavirus evolved from viruses commonly found in bats.

Our ability to react promptly to the threat of future pandemics depends on how well we understand the links between viruses and their animal hosts. Although the importance of these links has been known for decades, we still don't know the natural hosts of many thousands of viruses. Even mammals, our direct relatives on the tree of life, are insufficiently studied in this respect. According to the most reliable estimates, less than 1% of mammalian viral diversity has been discovered to date.

Many viruses are strictly species-specific and can infect only one particular species or a few closely related species of mammals. In contrast, other viruses are cosmopolitan and have a broad range of hosts. Such a broad host range is especially alarming: if a virus is not very picky about its host, it might as well infect humans after gaining just a few unfortunate mutations. Indeed, the SARS and MERS coronaviruses are believed to have originated in bats, but they can also infect other, unrelated mammals: palm civets, camels, cats, minks and eventually humans.

The obvious problem here is our anthropocentrism. We are mostly interested in humans, domesticated animals and pets, rather than in wild animals. We currently know of 274 human viruses, but only about 5-7 for any other primate species. When pangolins came under suspicion as reservoirs of the current devastating coronavirus, scientists immediately found three new virus species in these exotic animals. Before that, nobody had been interested in pangolin viruses, and no resources or funding were allocated to studying them.

Despite this bias, the number of known viruses and known virus-animal associations is large enough to infer previously unknown links from the available data. However, the volume and complexity of this data are overwhelming and exceed what human analysts can process unaided. Fortunately, this is not a problem for AI.

In a recent paper published in Nature Communications, the authors extracted all known virus-mammal associations from the ENHanCEd Infectious Diseases Database. This resource mines the annotations of genetic data in GenBank (more than 7 million entries) and the titles and abstracts of publications in PubMed (more than 8 million entries). Extensive filtering and post-processing of this data yielded 6,331 associations between 1,896 viruses and 1,436 terrestrial mammal species.

All these associations were represented as a sparse graph. The features for machine learning were chosen from three perspectives: the mammalian perspective (what is the probability of an association forming between this mammal and each of the virus species?), the virus perspective (what is the probability of an association between this virus and each mammal species?) and the network perspective (how similar are the topological features of a candidate link to those of already known associations?).

An example of the network motifs used to train the AI. Figure from https://www.nature.com/articles/s41467-021-24085-w#Sec2
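
To make the three feature perspectives described above more concrete, here is a minimal sketch in Python. It is not the authors' feature set: the virus and mammal names are made up, and the three features (host range, virus count, and a simple path count standing in for network motifs) are simplified assumptions chosen only to illustrate the idea.

```python
from collections import defaultdict

# Toy, made-up (virus, mammal) pairs for illustration; not data from the study.
ASSOCIATIONS = [
    ("virus_A", "bat"),
    ("virus_A", "civet"),
    ("virus_B", "bat"),
    ("virus_B", "camel"),
    ("virus_C", "pig"),
]


def build_graph(pairs):
    """Store the bipartite graph sparsely: only known links are kept."""
    hosts_of = defaultdict(set)    # virus  -> mammals known to host it
    viruses_of = defaultdict(set)  # mammal -> viruses known to infect it
    for virus, mammal in pairs:
        hosts_of[virus].add(mammal)
        viruses_of[mammal].add(virus)
    return hosts_of, viruses_of


def link_features(virus, mammal, hosts_of, viruses_of):
    """Compute three simple (hypothetical) features for one candidate link."""
    other_hosts = hosts_of.get(virus, set()) - {mammal}
    other_viruses = viruses_of.get(mammal, set()) - {virus}
    # Network view: count paths virus -> host -> second virus -> mammal.
    paths = sum(
        1
        for host in other_hosts
        for second_virus in viruses_of.get(host, set())
        if second_virus in other_viruses
    )
    return {
        "virus_host_range": len(other_hosts),      # virus perspective
        "mammal_virus_count": len(other_viruses),  # mammal perspective
        "shared_neighbour_paths": paths,           # network perspective
    }


hosts_of, viruses_of = build_graph(ASSOCIATIONS)
features = link_features("virus_A", "camel", hosts_of, viruses_of)
print(features)
```

In this toy graph, virus_A and the camel are not linked directly, but they are connected through the bat and virus_B, which is the kind of indirect evidence a link-prediction model can pick up on.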

The authors used a number of classification algorithms: Model Averaged Neural Network (avNNet), Stochastic Gradient Boosting (GBM), Random Forest, eXtreme Gradient Boosting (XGBoost), Support Vector Machines with radial basis kernel and class weights (SVM-RW), Linear SVM with Class Weights (SVM-LW), SVM with Polynomial Kernel (SVM-P), and Naive Bayes. All ML algorithms were trained and cross-validated carefully.
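
One common way to combine several classifiers is soft voting: average the probabilities they output and apply a cut-off. The sketch below illustrates that general technique only; it does not claim to reproduce how the paper combines its models. The three "models" are hypothetical hand-written scoring functions standing in for trained classifiers, and the feature names and the 0.5 threshold are assumptions.

```python
from statistics import mean


# Hypothetical stand-ins for trained classifiers: each maps a feature
# dictionary to a probability in [0, 1]. Real models would be learned from data.
def host_range_model(features):
    return min(1.0, features["virus_host_range"] / 10)


def virus_count_model(features):
    return min(1.0, features["mammal_virus_count"] / 10)


def network_model(features):
    return 1.0 if features["shared_neighbour_paths"] > 0 else 0.1


MODELS = [host_range_model, virus_count_model, network_model]


def soft_vote(models, features, threshold=0.5):
    """Average the model probabilities and compare the mean to a cut-off."""
    score = mean(model(features) for model in models)
    return score, score >= threshold


# Two made-up candidate links described by made-up feature values.
likely = {"virus_host_range": 8, "mammal_virus_count": 5, "shared_neighbour_paths": 2}
unlikely = {"virus_host_range": 1, "mammal_virus_count": 0, "shared_neighbour_paths": 0}

likely_score, likely_predicted = soft_vote(MODELS, likely)
unlikely_score, unlikely_predicted = soft_vote(MODELS, unlikely)
print(likely_predicted, unlikely_predicted)
```

Averaging tends to smooth out the quirks of any single model, which is one reason ensembles are popular when several algorithms perform comparably.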

This AI-based approach identified 20,832 previously unknown potential associations between mammals and known viruses. Quite predictably, evolutionary and ecological similarities between host animals are the most reliable predictors of sharing the same virus. Less obvious is that each virus infects 14.33 mammal species on average. For RNA viruses this number is even higher, reaching 21.65. Another important conclusion is that each species of wild mammal has, on average, 13.45 virus species that have never been observed in it experimentally.

Although this study has some limitations, it shows how AI and ML can help us identify the most dangerous animal and virus species, which could be the source of the next devastating pandemic. It is impossible to predict when and where an animal virus will next jump to humans, but it is possible to direct research efforts toward the most alarming species. This will improve our chances of meeting a new pathogen with better basic knowledge of its evolution and biology, and of minimizing its negative impact on our civilization.

This example shows the power of the Knowledge Graph approaches, which excel in finding correlations and complex causation patterns in heterogeneous data from different sources.

Receptor.AI also uses the Knowledge Graph approach as the technological basis for our drug discovery products and services. We are developing a comprehensive graph that includes heterogeneous data from dozens of databases and online resources, such as PubMed, Gene Ontology, Protein Ontology, Cell Ontology, Human Phenotype Ontology, Human Disease Ontology, chemical databases, Cellular Toxicogenomic Database, Reactome, UniProt, clinical trials databases and other sources.

To extract information from all these diverse sources, we use advanced Natural Language Processing techniques that can detect different types of entities, including genes, proteins, diseases, chemical compounds, drugs, metabolic pathways, cells and cell organelles. We use a combination of approaches to identify semantic relations between different biological entities and to determine how strong the connection between them is.
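
As a rough illustration of entity detection and connection strength, the sketch below uses the simplest possible stand-ins: dictionary matching to find entities and sentence-level co-occurrence counts as a measure of how strongly two entities are connected. This is an assumption made for illustration, not a description of the actual Receptor.AI pipeline; the lexicon and the example sentences are made up.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon: surface form -> entity type.
LEXICON = {
    "egfr": "gene",
    "gefitinib": "drug",
    "lung cancer": "disease",
}


def find_entities(sentence):
    """Return the lexicon terms that occur in the sentence as whole words."""
    text = sentence.lower()
    return {
        term
        for term in LEXICON
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    }


def cooccurrence_strength(sentences):
    """Count, for every pair of entities, the sentences that mention both."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(sorted(find_entities(sentence)), 2):
            counts[pair] += 1
    return counts


# Made-up example sentences, not taken from any real abstract.
SENTENCES = [
    "Gefitinib inhibits EGFR in lung cancer cells.",
    "EGFR mutations are common in lung cancer.",
    "Gefitinib was well tolerated.",
]

strength = cooccurrence_strength(SENTENCES)
for (first, second), count in strength.most_common():
    print(first, second, count)
```

Each counted pair could become a weighted edge in a knowledge graph. Production systems typically replace both steps with trained models, but the output has the same shape: typed entities and scored relations between them.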

This graph serves as a universal knowledge base shared by all our products, services and in-house software modules. It lets us integrate all pipeline components seamlessly and provides a unified flow of semantically relevant data between them.

Official account of RECEPTOR.AI company. We make the cell membranes druggable to provide new treatments for cancer and cardiovascular diseases.