Studying the Genetics of COVID-19

This blog presents the necessary biological background, a brief literature review, and the types of data relevant to conducting large-scale research on this topic.

--

If you decide to research the genetics of the SARS-CoV-2 virus, a brief literature review will most likely lead you to the National Center for Biotechnology Information (NCBI) and its impressively large gene bank. In this blog, I will give you the basic biological background you need to gain intuition about the data format (it is called a FASTA file) that you (data scientists, engineers, bioinformaticians and so forth) might need in order to develop large-scale machine learning efforts on this topic.

It is more enjoyable to work on data when you understand its meaning; doing so makes the whole project “alive” and not just a bunch of numbers, graphs, tables, etc.

With no further intros, let’s jump into details…

Biological Background

From DNA to the generation of proteins

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are built from four basic molecules called nucleotides. Each nucleotide contains a phosphate, a sugar and one of four bases, denoted A, C, G, T in DNA or A, C, G, U in RNA (Fig. 1, left). DNA is usually a double helix (Fig. 1, left), while RNA is usually single-stranded. Expressing the genetic code means converting a nucleotide sequence into a protein. Since DNA and RNA are chemically and structurally similar (Fig. 1, left), DNA can act as a direct template for the synthesis of RNA by complementary base pairing (transcription). The RNA is then read three bases at a time (translation): each group of three bases, called a codon, encodes one amino acid, the basic unit of a protein. This gives 64 possible base triplets (4³), yet only 20 different amino acids are commonly found in proteins (Fig. 1, right), because several codons encode the same amino acid and three codons signal where translation stops. The entire process, from DNA to protein, is called gene expression.

Figure 1: The structure and bases of DNA and RNA (left), and the primary structure of a protein, an amino acid sequence, which in later phases folds into a three-dimensional (tertiary) structure and a complex quaternary structure (right)
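
To make the path from bases to amino acids concrete, here is a minimal Python sketch. It assumes we are handed the coding strand of the DNA, so transcription reduces to swapping T for U, and it uses the standard genetic code written as one letter per codon. The names `transcribe`, `translate` and `CODON_TABLE`, and the nine-base toy sequence, are made up for illustration.

```python
from itertools import product

RNA_BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build all 4^3 = 64 codons and pair each with its amino acid letter.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the RNA for a DNA coding strand (assumption: T is simply replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCTTAA")          # toy sequence, not taken from a real genome
protein = translate(rna)
print(len(CODON_TABLE))                # 64 possible codons
print(len(set(AMINO_ACIDS) - {"*"}))   # 20 distinct amino acids
print(rna, protein)                    # AUGGCUUAA MA
```

The table shows why 64 codons yield only 20 amino acids: most amino acids are reachable through more than one codon, and three codons mean "stop".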

SARS-CoV-2 structure

SARS-CoV-2 has a single-stranded RNA genome of about 30,000 nucleotides¹. Its genome encodes 27 proteins, including an RNA-dependent RNA polymerase and four structural proteins. The four structural proteins are the spike surface glycoprotein, the small envelope protein, the matrix protein and the nucleocapsid protein. Looking at the virus structure (Fig. 2), it is quite clear why the spike protein, due to its shape and location on the surface of the virus, mediates receptor binding. The other proteins are necessary for functions such as encasing the RNA and assembling the virus particle (budding, envelope formation, etc.).

Figure 2: Illustration of the SARS-CoV-2 structure (drawn by the author)

How do we get infected?

We humans, like animals and other organisms, are hosts. The virus becomes “alive” only inside a host, where it finds the conditions to reproduce itself and spread. Thanks to Kary B. Mullis, the inventor of the Polymerase Chain Reaction (PCR), for which he was awarded the Nobel Prize in 1993, we can also diagnose the presence of the virus from the genetic material in a sample taken from the host. Because the SARS-CoV-2 genome is RNA, it is first copied into DNA (reverse transcription). The PCR machine then takes the sample through repeated cycles of heating and cooling (“baking” the sequence), and each cycle copies a targeted region of the viral sequence again, so even a tiny amount of viral material multiplies until it can be detected.
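
As a rough illustration of why this multiplication makes detection possible, the sketch below assumes an idealized PCR in which every thermal cycle doubles the number of copies of the targeted region. Real reactions are less efficient, so the `efficiency` parameter, the function name and the starting count of 10 copies are all illustrative assumptions.

```python
def copies_after_cycles(initial_copies, cycles, efficiency=1.0):
    """Idealized PCR: each cycle multiplies the copy count by (1 + efficiency).

    efficiency=1.0 means perfect doubling; lower values model an imperfect reaction.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# Starting from a hypothetical 10 copies of the target region.
for cycles in (10, 20, 30):
    print(cycles, copies_after_cycles(10, cycles))
```

Under perfect doubling, 30 cycles turn 10 copies into roughly ten billion, which is why a few dozen cycles are enough to reveal a very small amount of starting material.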

Literature Review

I have gathered some of the topics that I feel best represent today's main questions, along with the data collected for each of them; together they give you “the big picture” of the virus and its related data. To follow them, the terminology should be clear, so please make sure the previous section is well understood.

  • Diagnosing COVID-19

The article “Diagnosing COVID-19: The Disease and Tools for Detection” discusses how a person is diagnosed with COVID-19 today, that is, which steps are taken to detect the virus in a person in genetic terms.

  • Evolution and invariants

The articles “On the origin and continuing evolution of SARS-CoV-2” and “Severe Acute Respiratory Syndrome Coronavirus as an Agent of Emerging and Reemerging Infection” discuss the genetic invariants between different viruses (SARS-CoV, RaTG13 and so forth) and pose the question: could scientists have predicted the SARS-CoV-2 virus based on the evolution of those viruses?

  • Who’s to blame? (the host gene receptor of SARS-CoV-2)

Among our genes, which one is the host receptor of this virus, and what are its expression and function in different populations? The article “Comparative genetic analysis of the novel coronavirus (2019-nCoV/SARS-CoV-2) receptor ACE2 in different populations” compares genetic variants between individuals from different populations and focuses on the ACE2 receptor to study whether different populations differ in susceptibility to this virus.

  • Methods

The article “Host and infectivity prediction of Wuhan 2019 novel coronavirus using deep learning algorithm” uses a neural network to predict the host of a virus. The authors took a database of tens of thousands of viruses for training and testing and even simulated viruses using a genetic virus generator (they made synthetic viruses). Their supplementary materials describe their experiments in detail and are worth reading in addition to the paper.
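
A neural network needs numbers rather than letters, so a sequence has to be encoded before training. One common choice is one-hot encoding, sketched below. This is an assumption made for illustration and not necessarily the encoding used by the authors; the function name and the toy input are hypothetical.

```python
BASES = "ACGT"


def one_hot_encode(sequence):
    """Map each base to a 4-element 0/1 vector.

    Symbols outside A, C, G, T (for example N for an unknown base)
    become an all-zero vector.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGTN")  # toy input, not a real viral sequence
for row in matrix:
    print(row)
```

Each sequence becomes a matrix with one row per base and one column per letter, a shape that convolutional or recurrent networks can take as input.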

  • SARS-CoV-2 structure

This section is the most important for our aim of understanding the FASTA format and content. The following articles describe the SARS-CoV-2 structure in detail, including proteins such as the spike, membrane and envelope. Fig. 2 is an illustration of the virus's basic structure.

  1. A new coronavirus associated with human respiratory disease in China
  2. Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

FASTA File

After reading all that, this section should be the easiest for you: you might find it not technical, but practical!

Assuming you have read at least one of the articles mentioned above, let’s take this article as an example: “A new coronavirus associated with human respiratory disease in China”. The following steps should help you access the FASTA file:

  1. Open GenBank
  2. Type the accession number mentioned in the paper (for instance I found the following: 1, 2, 3, 4) in the search box and press Enter; you will now see the content of the file organized in sections
  3. OK, now you have the treasure in hand: I marked some interesting sections such as the date, the title of the paper in which this sample is presented, the host, the country and so forth (Fig. 3, upper)
  4. At the end of the file you will find the bases (A, C, G, T) of the virus genome; although the genome is RNA, the record writes it with T in place of U (Fig. 3, lower left)
  5. As I mentioned in this blog, the virus consists of several proteins; I have marked some of them for you in Fig. 3, lower right
  6. Now most important: enjoy your FASTA, explore the data …

Figure 3: FASTA ingredients: Titles (upper), nucleotide sequence (lower left) and some proteins (lower right)
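
Once the sequence is saved in plain FASTA form, each record is a header line starting with ">" followed by one or more lines of bases. The sketch below reads that layout with the Python standard library only. The record is a made-up toy example held in memory, not a real accession, and `read_fasta` is a hypothetical helper name; for a downloaded file you would pass `open("your_file.fasta")` instead of the `StringIO` object.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Toy record (hypothetical), with the sequence split over two lines.
toy_fasta = StringIO(">toy_record hypothetical example\nATGGCTTTA\nGGCTAA\n")

records = list(read_fasta(toy_fasta))
header, sequence = records[0]
base_counts = Counter(sequence)
gc_content = (base_counts["G"] + base_counts["C"]) / len(sequence)

print(header)
print(len(sequence), dict(base_counts))
print(round(gc_content, 2))
```

Simple summaries such as sequence length, base counts and GC content are a natural first step before moving on to heavier analysis.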

Summary

Exploring the FASTA file content can lead us towards understanding the basic biological aspects, as questions arise naturally when we seek rational explanations for what we are experiencing in this period. We find ourselves with a sense of “responsibility” to explore, or at least to try to understand, this experience to the best of our knowledge.

I agree that our scientific knowledge is still incomplete and that there are a lot of open questions (we are still barely scratching the surface). But at least we are asking the questions, turning them upside down over and over again, doubting, starting from scratch and still standing on the shoulders of giants, letting our minds seek the answers for ourselves and for the sake of the world we live in (peacefully, I hope).

Discussion

In this blog, I have shown you what a FASTA file is, guided you through its content and suggested which questions can be asked based on it. I have tried to convince you that revealing the meaning of the content requires a basic understanding of the biological background. I have also explained some of the topics related to the genetic aspects of COVID-19 and which data has been analyzed in the related experiments.

Closure …

Dear readers, thanks for reading all that. Any thoughts would be greatly appreciated; you are more than welcome to contact me via LinkedIn or email (miritrope@gmail).
