Data analysis of Coronavirus and its Host, basic bioinformatics methods with code

Published in

Analytics Vidhya

5 min readFeb 6, 2020

I do bioinformatics — and finally decided to look what can I do with DNA sequencing data of new coronavirus. I got some interesting results using 1 day of my time and open source software. Will be happy to answer any of your questions and hope that it will be useful for all of us — maybe you will ‘blast’ something cool and interesting. I tried to put all needed commands so you can reproduce the results. Also you will be able to find files attached to that article

I know that there is coronavirus genome published and in modern science when you publish something you need to make raw data open, so I went to Sequence read archive — place, where you can find everything what was sequenced (DNA reading process) to check if there is something about coronavirus

It was easy to find latest data with DNA sequenced to “to find out the possible etiologic agents associated with the severe human respiratory disease in Wuhan, China” as study description says. The date is 2020–01–27

Studies : Browse : Sequence Read Archive

I downloaded those two .fastq files — its file format to handle short nucleotide sequences. Here is what’s inside :

one read — 4 lines: name, nucleotides, + , qualities

We can see that we have ~ 28 millions of reads of length ~ 142 nucleotides

There is high percentage of duplicates which is both bad (we could have more data) and good (maybe reads number is big enough to get full info from the sample)

So I went to the process called de novo sequences assembly, using SPAdes [http://cab.spbu.ru/software/spades/], great tool to de novo assembly genomes

It worked whole night on my desktop and finally I got the results. The most interesting results here are in file assembly_graphs_with_scaffolds.gfa, and contigs.fasta. They both contain assembled sequences, let’s look what we have here:

We see that the first contig with length ~ 30 000 (looks like a virus!) is covered 262 times which is really good. I like to look at the results of assembly using bandage [https://rrwick.github.io/Bandage/] program:

Here is how results look if ve filter good covered ( > 50x) contigs and see how do they look next to each other. The great functionality here is the possibility to select any contig and blast it using NCBI database — so you can understand that things you was able to assembly are similar to something seen before

Here is what I could find with an example:

I put node 1 (29900nt) to https://blast.ncbi.nlm.nih.gov/ and hit ‘BLAST’ here is what we see:

we have a version of coronavirus genome from someones lungs. What we can do next — look for protein coding genes in the genome to see if there is something interesting there. I decided to use https://github.com/hyattpd/Prodigal — tool

I t takes our sequence (30 000 nt) and runs through it using hidden-markov model, pre-trained to detect genes in genomes, here they are: amino acid sequences of coronavirus:

again loading them to ncbi blast (this time protein blast) https://blast.ncbi.nlm.nih.gov/Blast.cgi

We have 11 protein sequences, for each of them we see almost the same picture for taxonomy for all of them:

Than I tried to blast this small DNA ring:

as its length is ~16 000 nucleotides looks like its human mitochondrial DNA

funny that SRA says that ‘The SRA runs have been pre-filtered by NCBI to remove contaminating human sequence’ , but blasting the sequence we see that it is human mitochondrial DNA. I decided to check what is the haplogroup of this DNA sample. For this reason I mapped reads to rCRS [https://www.ncbi.nlm.nih.gov/nuccore/251831106] mitochondrial sequence and called variants.

here is the code to get mitochondrial variants

those variants were then loaded to HaploGrep — web service for mitochondrial DNA https://haplogrep.i-med.ac.at/ . It worked fine and gave me information that sample has D5b1b2 mitochondrial haplogroup

At the moment I decided to see how many reads can be mapped to human genome and got that 43 out of 60 million reads can be mapped to human genome. I called variants in that genome and then used my own tool to see where on the world map can be found people with similar genomes

than I went back to blasting and got several ideas of what to do next

better study blast results — mostly for that strange big scaffold, formed from different bacterial species
try to make AlphaFold / or open analogues like https://github.com/dellacortelab/prospr work and predict proteins 3d structure (so it will be possible even to do screen for some drugs)

I uploaded results of assembly and protein annotation to the cloud, here are the links:

Predicted proteins — cool for blasting: https://yadi.sk/d/tkDn9bfvzQZIDA

Scaffolds — better look in Bandage software: https://yadi.sk/d/ZsRcKLfxIjWO3w

Hope it was useful and you know now what can be done by 1 bioinformatician in 1 day

Data analysis of Coronavirus and its Host, basic bioinformatics methods with code

Studies : Browse : Sequence Read Archive

Written by Vladimir Naumov