Data analysis of Coronavirus and its Host, basic bioinformatics methods with code

Vladimir Naumov
Feb 6, 2020 · 5 min read

I do bioinformatics, and I finally decided to see what I could do with the DNA sequencing data of the new coronavirus. I got some interesting results using one day of my time and open source software. I will be happy to answer any questions, and I hope this will be useful for all of us; maybe you will ‘blast’ something cool and interesting. I tried to include all the commands needed so you can reproduce the results. You will also find files attached to this article.

I know that a coronavirus genome has been published, and in modern science, when you publish something, you need to make the raw data open. So I went to the Sequence Read Archive (SRA), the place where you can find everything that has been sequenced (sequencing being the process of reading DNA), to check whether there was something about the coronavirus.

It was easy to find the latest data, sequenced “to find out the possible etiologic agents associated with the severe human respiratory disease in Wuhan, China”, as the study description says. The date is 2020-01-27.

I downloaded those two .fastq files; FASTQ is a file format for storing short nucleotide sequences. Here is what’s inside:

One read is 4 lines: name, nucleotides, +, qualities

We can see that we have ~28 million reads of length ~142 nucleotides.
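To make the four-line record layout concrete, here is a minimal Python sketch that counts reads and computes their mean length. The function names (`read_fastq`, `summarize`) are my own illustration, and the sketch assumes a plain, unwrapped FASTQ file with exactly four lines per read.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:  # empty string means end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield name[1:], seq, qual  # drop the leading '@' from the name


def summarize(handle: TextIO) -> Tuple[int, float]:
    """Return (number of reads, mean read length)."""
    n_reads = 0
    total_bases = 0
    for _, seq, _ in read_fastq(handle):
        n_reads += 1
        total_bases += len(seq)
    mean_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, mean_length


# Tiny in-memory example with two made-up reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(summarize(demo))
```

For a real file you would pass `open("reads.fastq")` (a hypothetical path) instead of the in-memory example; for gzipped files, `gzip.open(path, "rt")` gives the same kind of text handle.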

There is a high percentage of duplicates, which is both bad (we could have had more data) and good (maybe the number of reads is big enough to get the full information from the sample).
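One simple way to think about the duplicate percentage is the share of reads whose exact sequence has already been seen. The sketch below is only an illustration of that definition (the name `duplication_rate` is mine), and quality-control tools may estimate the figure differently, for example from a subset of reads.

```python
from collections import Counter
from typing import Iterable


def duplication_rate(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once as "unique"; the rest are duplicates.
    return 1.0 - len(counts) / total


# Four made-up reads, one of which repeats an earlier one.
print(duplication_rate(["ACGT", "ACGT", "TTGA", "CCAG"]))
```

Keeping every sequence in memory is fine for a demo, but for tens of millions of reads you would probably want to store hashes or work on a sample.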

So I moved on to de novo sequence assembly, using SPAdes, a great tool for assembling genomes de novo.

It ran all night on my desktop, and finally I got the results. The most interesting output files are assembly_graph_with_scaffolds.gfa and contigs.fasta. They both contain assembled sequences; let’s look at what we have here:

We see that the first contig, with length ~30,000 (looks like a virus!), is covered 262 times, which is really good. I like to look at assembly results using the Bandage program:

Here is how the results look if we keep only well-covered (> 50x) contigs and see how they sit next to each other. A great feature here is the ability to select any contig and BLAST it against the NCBI database, so you can check whether what you managed to assemble is similar to something seen before.
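The same coverage filter can be sketched outside Bandage by reading the contig names in contigs.fasta. SPAdes names contigs like `NODE_1_length_29900_cov_262.5`, where the `cov` value is k-mer coverage rather than read coverage. The helper names below and the example headers are illustrative assumptions, with the numbers loosely shaped after the ones in this article.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches SPAdes-style FASTA headers such as ">NODE_1_length_29900_cov_262.5"
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


class Contig(NamedTuple):
    node: int
    length: int
    coverage: float


def parse_headers(lines: Iterable[str]) -> List[Contig]:
    """Collect node number, length and coverage from FASTA header lines."""
    contigs = []
    for line in lines:
        match = HEADER.match(line)
        if match:  # sequence lines and foreign headers are skipped
            contigs.append(
                Contig(int(match.group(1)), int(match.group(2)), float(match.group(3)))
            )
    return contigs


def well_covered(contigs: Iterable[Contig], min_cov: float = 50.0) -> List[Contig]:
    """Keep contigs whose coverage is strictly above the threshold."""
    return [c for c in contigs if c.coverage > min_cov]


# Made-up example headers; real ones come from contigs.fasta.
example = [
    ">NODE_1_length_29900_cov_262.5",
    "ATTAAAGGTTTATACC",
    ">NODE_2_length_16500_cov_80.1",
    ">NODE_3_length_900_cov_3.2",
]
for contig in well_covered(parse_headers(example)):
    print(contig)
```

With a real assembly you would pass `open("contigs.fasta")` instead of the example list.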

Here is what I could find with an example:

I pasted node 1 (29,900 nt) into NCBI BLAST and hit ‘BLAST’. Here is what we see:

We have a version of the coronavirus genome from someone’s lungs. What we can do next is look for protein-coding genes in the genome to see if there is something interesting there. I decided to use a gene prediction tool for this.

It takes our sequence (30,000 nt) and runs through it using a hidden Markov model pre-trained to detect genes in genomes. Here they are, the amino acid sequences of the coronavirus:
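To give a feeling for what gene finding does, here is a much simpler stand-in for the hidden Markov model approach: a naive open reading frame (ORF) scan that looks for ATG ... stop stretches in the three forward frames and translates them with the standard genetic code. This is not the tool used above, it will miss or mis-call genes that a trained model handles, and the function names and the `min_aa` threshold are my own assumptions.

```python
from typing import List, Tuple

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna: str, min_aa: int = 30) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for forward-strand ORFs of at least min_aa residues.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens an ORF
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_aa:
                    orfs.append((start, i + 3, protein))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Tiny made-up sequence with one short ORF: ATG AAA TTT TAG
print(find_orfs("CCATGAAATTTTAGGG", min_aa=1))
```

Scanning only the forward strand is a deliberate simplification; for a general-purpose finder you would also scan the reverse complement.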

Again, I loaded them into NCBI BLAST (this time protein BLAST).

We have 11 protein sequences, and for each of them we see almost the same taxonomy picture:

Then I tried to BLAST this small DNA ring:

As its length is ~16,000 nucleotides, it looks like human mitochondrial DNA.

Funny that SRA says ‘The SRA runs have been pre-filtered by NCBI to remove contaminating human sequence’, yet blasting the sequence shows that it is human mitochondrial DNA. I decided to check the haplogroup of this DNA sample. To do this, I mapped the reads to the rCRS (revised Cambridge Reference Sequence) mitochondrial sequence and called variants.

Here is the code to get the mitochondrial variants:
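As a toy illustration of what variant calling means (this is not the pipeline used for the results here, which relies on a read mapper and a variant caller), the Python sketch below piles up already-aligned reads on a reference and reports positions where most reads disagree with it. It assumes ungapped alignments given as (0-based start, read sequence) pairs; the function names and the `min_depth` and `min_fraction` thresholds are hypothetical choices.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Alignment = Tuple[int, str]  # (0-based start on the reference, read sequence)


def pileup(ref_len: int, alignments: Sequence[Alignment]) -> List[Counter]:
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in range(ref_len)]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < ref_len:  # ignore bases hanging off the reference
                columns[pos][base] += 1
    return columns


def call_snvs(
    reference: str,
    alignments: Sequence[Alignment],
    min_depth: int = 3,
    min_fraction: float = 0.8,
) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, alternate base) for simple SNVs."""
    variants = []
    for pos, counts in enumerate(pileup(len(reference), alignments)):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos + 1, reference[pos], base))
    return variants


# Made-up reference and reads: every read carries an A where the reference has T.
reference = "ACGTACGT"
reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (3, "AACG")]
print(call_snvs(reference, reads))
```

Positions are reported 1-based, which is how mitochondrial variants are usually written relative to rCRS. Real callers also handle indels, base qualities and mapping qualities, all of which this sketch ignores.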

Those variants were then loaded into HaploGrep, a web service for mitochondrial DNA haplogroup classification. It worked fine and told me that the sample has the D5b1b2 mitochondrial haplogroup.

At that point I decided to see how many reads could be mapped to the human genome, and found that 43 out of 60 million reads map to it. I called variants in that genome and then used my own tool to see where on the world map people with similar genomes can be found.

Then I went back to blasting and got several ideas of what to do next:

  • study the BLAST results more closely, mostly for that strange big scaffold formed from different bacterial species
  • try to get AlphaFold, or open analogues, to work and predict the proteins’ 3D structure (so it would even be possible to screen for some drugs)

I uploaded the assembly results and protein annotation to the cloud; here are the links:

Predicted proteins — cool for blasting:

Scaffolds (best viewed in Bandage):

I hope it was useful, and that you now know what can be done by 1 bioinformatician in 1 day.

Analytics Vidhya


Vladimir Naumov

Written by

genomic data scientist
