Part VI: Exploring COVID-19 virus variants on GCP to predict whether a new vaccine might need to be developed in the future

Vadim Astakhov, Jignesh Mehta, Wei Hsia, Prasad Alle

The emergence of new COVID-19 variants shows how important it is to have easily reprogrammable vaccines. But as vaccine makers have repeatedly discovered, the multilayered human immune system is very tricky to control, and the healthcare system might have to move beyond relying on vaccines alone and develop therapies and cures that directly destroy viruses.

As some researchers believe, the long-range solution to the fight against viruses might be the use of CRISPR to guide a scissors-like enzyme to chop up the genetic material of a virus, without ever having to enlist the patient’s immune system.

In any case, existing and future therapies have to rely on knowledge of how the virus will mutate and whether those mutations might lead to the emergence of a new dangerous strain. (A striking example is a recently identified strain that can potentially target children.)

In this post, we will use Google Cloud Platform to set up a bioinformatics pipeline to simulate virus mutation. Google recently announced a massive Google-funded COVID database that tracks variants and immunity. This COVID-19 dataset contains detailed information on over five million anonymized cases from over 100 countries.

We will demonstrate how to use AI Platform Notebooks with some COVID-19 BigQuery datasets and run multiple simulations to explore various mutation scenarios. Finally, we will use the simulation results to generate peptide vaccine candidates for potential new virus strains using the AI/ML models developed in a previous post of this series.

This work is meant as educational material for a bioinformatics class, demonstrating the use of Google Cloud Platform for exploring virus evolution and performing computational experiments to explore new candidates for peptide vaccines.

Introduction

When severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, spread around the globe, experts wondered how the virus emerged and how it might be changing as it passed from person to person. Several recently published studies provide some clues about the evolutionary dynamics under selection that might have led to the emergence of SARS-CoV-2.

Like any other organism on the planet, the SARS-CoV-2 virus is prone to the natural genetic diversity that arises from mutations. Since the virus hit the world, there have been fears that mutations could lead to more infectious or even more lethal strains.

But the first results from laboratories across the world are encouraging. The coronavirus presents little variability (approximately seven mutations per sample).

Compared with HIV, SARS-CoV-2 is changing much more slowly as it spreads. Common influenza has a variability rate that is more than double.

Currently, there are six strains of coronavirus identified across the globe. The original one is labeled as the L strain, which appeared in Wuhan in December 2019. Its first mutation — the S strain — appeared at the beginning of 2020, and since January 2020, two other strains V and G were identified. Strain G mutated into strains GR and GH in February 2020. Strain G and its related strains GR and GH are by far the most widespread and most frequent across Europe and Italy while GH strain is most widespread in North America. Besides these six main coronavirus strains, researchers identified some infrequent mutations that, at the moment, are not worrying but should nevertheless be monitored.

‘A catalogue of coronavirus mutations’ (Nextstrain, an effort to analyse SARS-CoV-2 genomes in real time). Sources: L. Van Dorp et al. (http://go.nature.com/3GSRNH6); Refs 2, 11, 12; B. E. Young et al. Lancet 396, 603–611 (2020)

As an example, in 2020 one particular mutation drew strong attention from the research community. Virologists call it the D614G mutation. It is in the gene encoding the spike protein, which helps virus particles penetrate cells. Researchers reported that the mutation appeared again and again in samples from people with COVID-19 and might be increasing the spread of the virus in the population.

There are still more questions than answers about coronavirus mutations. Thus studying mutations in detail could be important for controlling the pandemic. It might also help to pre-empt the most worrying of mutations: those that could help the virus to evade immune systems, vaccines or antibody therapies.

Simulation of virus evolution

A cloud-based bioinformatics computation platform can give researchers a tool to simulate various scenarios of virus evolution under specified conditions. Such simulations can indicate whether a virus might become more virulent and/or deadly, and the results can be used to plan for new vaccine or antiviral drug candidates.

Virus evolution is driven by underlying biological processes such as replication, recombination, point mutations, insertions and deletions, and selection. A realistic simulation should explore them under various fitness models and population size dynamics. That requires researchers to run multiple parallel computational experiments with different sets of conditions.
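To make these processes concrete, here is a toy Python sketch of a forward-in-time simulation that covers only two of them: point mutation and selection by fitness-weighted resampling. It is not SANTA-SIM's algorithm. The function names, the interpretation of `transition_bias` (a transition is weighted that many times more than each transversion), and all parameter values are assumptions made for illustration.

```python
import random

# Transitions swap purines (A<->G) or pyrimidines (C<->T); everything else is a transversion.
TRANSITIONS = {"A": "G", "G": "A", "C": "T", "T": "C"}
BASES = "ACGT"


def mutate(seq, rate, transition_bias, rng):
    """Apply independent point mutations to one sequence."""
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)  # no mutation at this site
            continue
        transversions = [b for b in BASES if b != base and b != TRANSITIONS[base]]
        # Assumption: one transition weighted `transition_bias`, two transversions weighted 1 each.
        choices = [TRANSITIONS[base]] + transversions
        weights = [transition_bias, 1.0, 1.0]
        out.append(rng.choices(choices, weights=weights)[0])
    return "".join(out)


def next_generation(population, fitness, rate, transition_bias, rng):
    """Resample parents in proportion to fitness, then mutate the offspring."""
    weights = [fitness(seq) for seq in population]
    parents = rng.choices(population, weights=weights, k=len(population))
    return [mutate(parent, rate, transition_bias, rng) for parent in parents]


def neutral(seq):
    """Neutral fitness model: every sequence is equally fit."""
    return 1.0


rng = random.Random(42)  # fixed seed so reruns are reproducible
population = ["ATGTTTGTTTTTCTTGTT"] * 200  # constant population size of 200
for _ in range(50):
    population = next_generation(population, neutral, 1e-3, 2.0, rng)
print("distinct sequences after 50 generations:", len(set(population)))
```

Swapping `neutral` for a function that rewards or penalizes particular sequences gives a different fitness model, which is the kind of scenario variation that calls for many parallel runs.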

In this post, we demonstrate how Google Cloud can provide a scalable and cost-effective computational platform to run in silico virus evolution experiments and genomic analysis of virus mutations by leveraging services such as Workflows, the Cloud Life Sciences API and Cloud Run.

Virus Mutation Pipeline on GCP

There are several open source software packages to simulate virus evolution in silico and explore potential mutations. In this post we use SANTA-SIM, a software package that simulates the evolution of a population of gene sequences forwards through time.

We use a Google service, AI Platform Notebooks, to run the in silico virus evolution simulation.

The AI Platform Notebooks service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. It helps you create environments quickly, manage them easily, and save money by turning them off when you don’t need them. With less time and money spent on administration, you can focus on your jobs and your data.

You can deploy new JupyterLab instances with one click and start analyzing your data immediately. Each instance comes pre-configured with optimized versions of the most popular data science and machine learning libraries including TensorFlow, Keras, PyTorch, fast.ai, RAPIDS, NumPy, scikit-learn, pandas, and Matplotlib.

You’ll go from data to a deployed machine learning model without leaving AI Platform Notebooks. You can start small and scale up by adding CPUs, RAM, and GPUs.

Provision your first GCP AI Platform Notebook Instance

Before we can set up an AI Platform Notebook, we will have to set up an account and billing: go to GCP AI Platform and click ‘go to console’ and be sure to click ‘Enable API’ below to access notebooks.

Once the API is enabled, go to the instance page and select the hardware of the virtual machine to run on.

Then click on the ‘New Instance’ button and select ‘Python 3’ or ‘PyTorch 1.7’

This will open a menu where you select the region you’d like to use. Once you have selected a region, you can use ‘Customized instance’ or ‘ADVANCED OPTIONS’ and pick a machine with the pre-defined RAM and vCPU.

You can scale the hardware up and down by changing RAM and vCPU later if needed. An important pricing consideration is that the instance accrues compute charges only while it is running (attached disks are still billed), and it can be stopped at any time.

Deploy software

After the instance starts, open a Jupyter notebook and make sure you have Java and Ant installed. If not, download them and unpack the archives:

!sudo tar -xzvf jdk-14.0.1_linux-x64_bin.tar.gz

!sudo tar -xzvf apache-ant-1.9.15-bin.tar.gz

# Set `PATH` to include java and ant binary directory

PATH=%env PATH

%env ANT_HOME=/home/jupyter/AIHub/apache-ant-1.9.15/

%env JAVA_HOME=/home/jupyter/AIHub/jdk-14.0.1/

%env PATH={PATH}:/home/jupyter/AIHub/jdk-14.0.1/bin:/home/jupyter/AIHub/apache-ant-1.9.15/bin

# Read GCP project id from env.

shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null

GCP_PROJECT_ID=shell_output[0]

print("GCP project ID:" + GCP_PROJECT_ID)
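Before building anything, it can help to confirm that both tools actually resolve on the updated `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is made up for this example.

```python
import shutil


def missing_tools(tools=("java", "ant")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("java and ant are both available")
```

Because `%env` updates the environment of the notebook process, `shutil.which` sees the same `PATH` that the `!` shell commands use.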

Download the SANTA-SIM software, unzip santa-sim-master.zip and build it with the "ant" command. The steps can be found in the example Jupyter notebook published on GitHub.

Scaling Simulations with Google Cloud Run

There are many ways to deploy the simulations at scale. One repeatable approach is to embed the simulation in a container and use a serverless deployment such as Google Cloud Run.

Google Cloud Run will scale to and from zero resources as needed. The containers will listen for a request and execute the appropriate application as requested. The fully managed serverless architecture will scale up to meet demand and only charge for resources consumed.

You can choose to create your own application, but for the purposes of this demonstration we took the naive approach and embedded the software in the container with no modifications.

We chose Python as the language for the container.

The Google Click to Deploy base image is a lightweight Debian OS container from Google Cloud Marketplace.

In the container, we set the commands for the necessary libraries and the application code. Finally, we set the entry point for the container.

(Note: there are many optimizations you can make to the container. The sample below shows the steps needed but is not optimized for layers or best practices.)

# Debian Parent Image

FROM marketplace.gcr.io/google/debian9:latest

# Set the working directory.

WORKDIR /usr/src/app

# Copy the file from your host to your current location.

COPY santa.jar .

COPY cloudRunListenAndDeploy.py .

# Expose port

EXPOSE 8080

# Entry point

CMD ["python3","cloudRunListenAndDeploy.py"]

# Build Python and Java

RUN apt-get update && \

apt-get install -y apt-utils && \

apt-get install -y python3 && \

apt-get install -y python3-pip

# Install OpenJDK-8

RUN apt-get update && \

apt-get install -y openjdk-8-jdk && \

apt-get install -y ant && \

apt-get clean

# Fix certificate issues

RUN apt-get update && \

apt-get install -y ca-certificates-java && \

apt-get clean && \

update-ca-certificates -f

# pip install flask

RUN pip3 install flask && \

pip3 install google-cloud-storage

# Set up JAVA_HOME (useful for the docker command line)

ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

RUN export JAVA_HOME

We used Flask, a lightweight web application framework, to listen for requests to execute the application code on Google Cloud Run.

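import os

from flask import Flask, request

# santa_sim is a helper class defined elsewhere in the authors' code (not shown here).

app = Flask(__name__)


@app.route('/execSantaSim', methods=['POST'])
def execSantaSim():
    # Build a simulation job from the JSON request body and run its steps in order.
    current_sim = santa_sim(request.json)
    current_sim.copy_object_to_local()
    current_sim.exec_sim()
    current_sim.copy_objects_to_google_cloud()
    current_sim.clean_up_local()
    return "Done"


if __name__ == "__main__":
    # Cloud Run passes the port to listen on in the PORT environment variable.
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))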

As with the container itself, there are many ways to send requests to Google Cloud Run. We opted to use a Python script that generates the XML files in accordance with the specifications. From there, you can simply call the HTTP URL as many times as needed.
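As a rough illustration of calling the service many times, the sketch below fans out one POST request per simulation config using only the Python standard library. It is not the script we used. The payload field names (`bucket`, `config`), the bucket name, the file names and the service URL are all hypothetical, and the real service expects whatever JSON its `santa_sim` class reads.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_payloads(bucket, config_names):
    """One JSON payload per simulation config (field names are hypothetical)."""
    return [{"bucket": bucket, "config": name} for name in config_names]


def post_json(url, payload, timeout=3600):
    """POST one payload as JSON and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def run_all(url, payloads, send=post_json, max_workers=8):
    """Send every payload concurrently; results keep the order of `payloads`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda payload: send(url, payload), payloads))


# Hypothetical bucket and config names, for illustration only.
payloads = build_payloads("my-sim-bucket", [f"covid_{i}.xml" for i in range(3)])
# Uncomment and fill in your own service URL to send the requests:
# results = run_all("https://<your-service-url>/execSantaSim", payloads)
```

The `send` parameter exists so the fan-out logic can be tried with a stand-in function before any real request is made. A service that requires authentication would also need an identity token in the `Authorization` header.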

The architecture looks like this:

Run simulation to explore virus mutation scenarios

The SANTA-SIM package has an examples folder containing XML files for various mutation scenarios. You can use these files as a starting point to model COVID-19 virus evolution. You will have to edit the sequences tag to provide the virus RNA fragment that you want to explore.

<sequences> ATGTT…</sequences>

You will also have to specify the genome coordinates, population size and fitness model, as well as the mutation rate and transition bias.

<coordinates>1-3822</coordinates>

<populationSize>10000</populationSize>

<mutationRate>1.0E-4</mutationRate> <transitionBias>2.0</transitionBias>

An example can be found here.
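When many scenarios are needed, the parameter values can be filled in programmatically rather than by hand. The sketch below uses Python's standard `xml.etree.ElementTree` to produce config variants that differ only in mutation rate. The wrapping `<simulation>` fragment is a simplified stand-in and not the full SANTA-SIM schema; only the three inner element names come from the snippet above, and the output file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in fragment: only the inner element names come from the post.
TEMPLATE = """<simulation>
  <populationSize>10000</populationSize>
  <mutationRate>1.0E-4</mutationRate>
  <transitionBias>2.0</transitionBias>
</simulation>"""


def with_parameters(xml_text, **values):
    """Return a copy of `xml_text` with the text of the named elements replaced."""
    root = ET.fromstring(xml_text)
    for tag, value in values.items():
        matches = list(root.iter(tag))
        if not matches:
            raise KeyError(f"no <{tag}> element in the config")
        for element in matches:
            element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# One config per mutation rate, keyed by a hypothetical output file name.
variants = {
    f"covid_rate_{rate}.xml": with_parameters(TEMPLATE, mutationRate=rate)
    for rate in ("1.0E-5", "1.0E-4", "1.0E-3")
}
for name in variants:
    print(name)
```

Raising `KeyError` for an unknown tag is a deliberate choice here: a misspelled parameter name fails loudly instead of silently producing an unchanged config.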

Now you are ready to run the simulation experiment:

!java -jar /home/jupyter/AIHub/santa-sim-master/dist/santa.jar examples/covid.xml

You should see output similar to the following:

….

INFO: Replicate 1

Starting epoch: (unnamed)

Initial population: fitness = 1.0, distance = 0.0, max freq = 10000, genepool size = 1 (0 available)

Generation 100: fitness = 1.0, distance = 35.9001, max freq = 12, genepool size = 7319 (3299 available)

Generation 200: fitness = 1.0, distance = 71.7807, max freq = 10, genepool size = 7276 (3342 available)

….

The simulation results will be stored in multiple covid_…csv and covid_…nex files; the .nex files contain the mutated RNA sequences.
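The console log itself is also easy to turn into numbers for quick plots or sanity checks. The sketch below parses the "Generation" lines shown in the output above with a regular expression. The pattern is derived only from those two sample lines, so other SANTA-SIM versions or configurations may print a different format.

```python
import re

# Pattern written against the log lines shown in this post (an assumption about the format).
LOG_LINE = re.compile(
    r"Generation (?P<generation>\d+): fitness = (?P<fitness>[\d.eE+-]+), "
    r"distance = (?P<distance>[\d.eE+-]+), max freq = (?P<max_freq>\d+), "
    r"genepool size = (?P<genepool>\d+)"
)


def parse_log(text):
    """Extract one dict per 'Generation N: ...' line found in `text`."""
    rows = []
    for match in LOG_LINE.finditer(text):
        rows.append({
            "generation": int(match["generation"]),
            "fitness": float(match["fitness"]),
            "distance": float(match["distance"]),
            "max_freq": int(match["max_freq"]),
            "genepool": int(match["genepool"]),
        })
    return rows


SAMPLE = """Initial population: fitness = 1.0, distance = 0.0, max freq = 10000, genepool size = 1 (0 available)
Generation 100: fitness = 1.0, distance = 35.9001, max freq = 12, genepool size = 7319 (3299 available)
Generation 200: fitness = 1.0, distance = 71.7807, max freq = 10, genepool size = 7276 (3342 available)
"""

rows = parse_log(SAMPLE)
for row in rows:
    print(row["generation"], row["distance"])
```

The "Initial population" line does not start with "Generation" and is skipped by design.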

Finally, the container can be deployed automatically to Cloud Run with Google Cloud Workflows using this template.

Next Step: Variant Transformation

Google provides DeepVariant as well as a few other Life Sciences pipelines to identify variants in the genome sequences produced by the virus mutation simulation.

The next step of the workflow is to identify the parts of the RNA that encode the various proteins. There are a few open source packages which can be employed here:

DeepVacPred, code for designing and sieving the vaccine datasets, training the DNNs and predicting the vaccine subunits.

GalaxyWEB, a service to predict protein structure, which is an essential step in vaccine and antiviral drug development.

To identify virus proteins, we use a few customized scripts to perform RNA translation.
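For readers unfamiliar with the translation step, here is a minimal, self-contained sketch of it using the standard genetic code. It is not the customized scripts themselves; the function name `translate` and its `stop_at_stop` option are choices made for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, stop_at_stop=True):
    """Translate a nucleotide sequence (DNA or RNA letters) in reading frame 0."""
    seq = sequence.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unrecognized codon
        if amino_acid == "*" and stop_at_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


# Fifteen codons written out with spaces for readability.
fragment = "ATG TTT GTT TTT CTT GTT TTA TTG CCA CTA GTC TCT AGT CAG TGT".replace(" ", "")
print(translate(fragment))  # MFVFLVLLPLVSSQC
```

The example fragment was chosen so that its translation matches the first residues of the Spike Glycoprotein output shown later in this post.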

The original script was taken from open source libraries:

Explore proteins

Run those scripts to explore the proteins that emerge in mutated viruses.

!python ../COVID19/corona/corona-mutate-np.py

Nucleocapsid Phosphoprotein → MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTASWFTALTQHGKEDLKFPRGQGVPINTNSSPDDQIGYYRRATRRIRGGDGKMKDLSPRWYFYYLGTGPEAGLPYGANKDGIIWVATEGALNTPKDHIGTRNPANNAAIVLQLPQGTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGDAALALLLLDRLNQLESKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYNVTQAFGRRGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYTGAIKLDDKDPNFKDQVILLNKHIDAYKTFPPTEPKKDKKKKADETQALPQRQKKQQTVTLLPAADLDDFSKQLQQSMSSADSTQA

!python ../COVID19/corona/corona-mutate-mg.py

Membrane Glycoprotein → MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLWLLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRSMWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKDLPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSDNIALLVQ

!python ../COVID19/corona/corona-mutate-ep.py

Envelope Protein also known as small membrane → MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV

!python ../COVID19/corona/corona-mutate-spike.py

Spike Glycoprotein → MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT

Generate new peptide candidates

Those protein sequences can help us identify new epitopes as prospective peptide vaccine candidates. Run the script to generate new peptides:

!python ../COVID19/corona/corona-mutate-peptides.py

+++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

Spike Glycoprotein peptides MFVFLVLL

Spike Glycoprotein peptides FVFLVLLP

Spike Glycoprotein peptides VFLVLLPL

Spike Glycoprotein peptides FLVLLPLV

Spike Glycoprotein peptides LVLLPLVS

Spike Glycoprotein peptides VLLPLVSS

Spike Glycoprotein peptides LLPLVSSQ

Spike Glycoprotein peptides LPLVSSQC

Spike Glycoprotein peptides PLVSSQCV

Spike Glycoprotein peptides LVSSQCVN

……..

+++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++
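The output above is consistent with a sliding window of eight residues moved one position at a time. A minimal sketch of that idea follows, together with a helper that keeps only the peptides a mutated protein does not share with a reference. Both function names are made up for this example, the substituted residue is arbitrary, and the actual script may work differently.

```python
def sliding_peptides(protein, length=8):
    """All overlapping peptides of `length` residues, in order of position."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


def novel_peptides(reference, mutant, length=8):
    """Peptides of the mutant protein that do not occur in the reference protein."""
    known = set(sliding_peptides(reference, length))
    return [p for p in sliding_peptides(mutant, length) if p not in known]


reference = "MFVFLVLLPLVSSQCVN"          # first 17 residues of the spike output above
mutant = reference[:8] + "S" + reference[9:]  # arbitrary substitution at position 9

for peptide in sliding_peptides(reference)[:3]:
    print("Spike Glycoprotein peptides", peptide)
print(len(novel_peptides(reference, mutant)), "peptides are new in the mutant")
```

A single substitution changes every window that covers it, so it yields up to `length` new peptides. The window length is a parameter because the prediction query later in this post filters for peptides of length 9 or 10.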

These candidates can be added to the BigQuery warehouse as discussed in Part III of this series.

New peptide vaccine candidates

We will use the AutoML model created in a previous post of this series (Part III) to predict whether the new mutated epitopes might be candidates for a new vaccine:

SELECT

predicted_Qualitative_Measure, predicted_Qualitative_Measure_probs

FROM ML.PREDICT(MODEL `corona.Classification_model_P2`, (

SELECT Qualitative_Measure, Description, Allele_Name, Quantitative_measurement

FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`

WHERE length(Description) IN (9,10)

AND organism_name like '%coronavirus%'

AND rand() < 0.0009))

The AutoML model will return a predicted binding score.

The binding score can be used to classify whether a peptide is a good candidate for vaccine testing. The reasons were discussed in Parts I and II.
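One way to act on the prediction is to keep only peptides whose total probability across the positive classes clears a threshold. The sketch below assumes each result row has been fetched into a Python dict, that the probabilities column is a list of `label`/`prob` pairs, that a `peptide` field was added to the query output, and that positive class labels start with "Positive". The 0.8 threshold and the sample rows are made up for illustration and are not model output.

```python
def positive_probability(probs, positive_prefix="Positive"):
    """Sum the probabilities of every class whose label starts with the prefix."""
    return sum(item["prob"] for item in probs if item["label"].startswith(positive_prefix))


def select_candidates(rows, threshold=0.8):
    """Return (peptide, score) pairs at or above the threshold, best first."""
    selected = []
    for row in rows:
        score = positive_probability(row["predicted_Qualitative_Measure_probs"])
        if score >= threshold:
            selected.append((row["peptide"], score))
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Made-up rows for illustration only; these are not real predictions.
rows = [
    {"peptide": "MFVFLVLLP", "predicted_Qualitative_Measure_probs": [
        {"label": "Positive", "prob": 0.9}, {"label": "Negative", "prob": 0.1}]},
    {"peptide": "FVFLVLLPL", "predicted_Qualitative_Measure_probs": [
        {"label": "Positive", "prob": 0.3}, {"label": "Negative", "prob": 0.7}]},
    {"peptide": "VFLVLLPLV", "predicted_Qualitative_Measure_probs": [
        {"label": "Positive-High", "prob": 0.5}, {"label": "Positive", "prob": 0.35},
        {"label": "Negative", "prob": 0.15}]},
]

selected = select_candidates(rows)
for peptide, score in selected:
    print(peptide, round(score, 2))
```

Where to set the threshold is a judgment call that trades missed candidates against wasted follow-up testing.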

Conclusion and Future work

We demonstrated the use of Google Cloud Platform for exploring virus evolution and for setting up computational experiments to predict virus mutations and explore new candidates for peptide vaccines.

We are working on adding new components to complement this pipeline.

I don’t speak for my employer. This is not official Google work. Any errors that remain are mine, of course.
