Multi-Omics and the Language of Life
The language of life is written with four letters: A, T, C, and G. They are encoded in DNA, which makes up the genome. But it does not stop there. These four letters are only the alphabet the blueprint of life is written in. The complete language of life is far more complex, and understanding it requires the field of multi-omics.
I will run a series of posts that build on one another to decipher the complexity of biology in a way that a general audience can understand.
What are A, T, C, and G?
Human DNA consists of about 3.2 billion base pairs of nucleotides arranged in a double helix. Each base pair is an A-T or G-C combination, with one letter on each side of the bond. Read along one side, the sequence is a string of A, T, C, and G, such as ATCAGCTGATAGGCC.
If you have an A on one side of the double helix, it needs to be complemented with a T on the other. A G on one side needs to be complemented with a C on the other side of the double helix.
- A stands for Adenine
- T stands for Thymine
- C stands for Cytosine
- G stands for Guanine
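The pairing rule above is simple enough to sketch in a few lines of Python. This is only an illustration: the function name `complement` is my own, and it pairs the letters position by position without modelling anything else about the double helix.

```python
# Pairing rules from the text: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the letters found opposite each base of the given strand."""
    return "".join(PAIRS[base] for base in strand)


if __name__ == "__main__":
    one_side = "ATCAGCTGATAGGCC"  # the example sequence used earlier in this post
    print(one_side)
    print(complement(one_side))
```

Applying `complement` twice gives back the original sequence, which is why one side of the helix is enough to reconstruct the other.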
Coding and Non-Coding Regions
Among the 3.2 billion base pairs, only about 1.5% of the DNA (roughly 20,000 genes) codes for proteins. This is called the coding region of the genome. The remainder is called the non-coding region.
The non-coding region contains very cool instructions on how to build or regulate a protein from the coding region. These include cis-regulatory elements (or modules), which regulate the transcription of nearby genes and so influence the protein biosynthesis process.
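To get a feel for the scale, here is a back-of-the-envelope calculation using the rounded figures quoted above. The variable names are mine, and the "average per gene" is just the quotient of those rounded numbers, not a measured gene length; real genes vary widely.

```python
# Rounded figures from this post (assumed inputs, not precise measurements).
GENOME_BASE_PAIRS = 3_200_000_000
CODING_PER_THOUSAND = 15  # 1.5% expressed as parts per thousand, to stay in integers
GENE_COUNT = 20_000

coding_bp = GENOME_BASE_PAIRS * CODING_PER_THOUSAND // 1000
noncoding_bp = GENOME_BASE_PAIRS - coding_bp
avg_coding_bp_per_gene = coding_bp // GENE_COUNT

print(f"coding base pairs:      {coding_bp:,}")
print(f"non-coding base pairs:  {noncoding_bp:,}")
print(f"coding bp per gene:     {avg_coding_bp_per_gene:,}")
```

With these inputs, about 48 million base pairs are coding and more than 3.1 billion are not.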
What is the difference between Genetics and Genomics?
Genetics is the study of hereditary characteristics passed from parents to children (one generation to the next). It typically focuses on a limited number of genes.
Genomics, on the other hand, is the study of all the genes in an organism's genome and the interactions of those genes with each other and the environment.
What is the field of Multi-omics?
Roughly speaking, to create a protein, a gene on the DNA first needs to be “transcribed” into mRNA, and that mRNA then needs to be “translated” into a protein.
Given this high-level understanding:
- The study of genes on the DNA (blueprint) and their expression and regulation is called “Genomics.”
- The study of transcription to mRNA and of mRNA regulation is called “Transcriptomics.”
- The study of the translation of mRNA to protein and of different protein functions is called “Proteomics.”
- Also, the study of different small molecules, such as sugars, amino acids, fatty acids, etc., is called “Metabolomics.”
- Finally, the study of chemical modifications to DNA and its packaging proteins that change gene activity without changing the sequence, often in response to the environment, is called “Epigenomics.”
Collectively, you can call them “Omics Sciences” or “Multi-Omics.”
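The two steps described at the start of this section, DNA to mRNA and mRNA to protein, can be sketched as a toy program. Everything here is simplified on purpose: the gene sequence is made up, the codon table holds only the five codons the example needs (the real genetic code has 64), and the function names are my own.

```python
# Partial codon table: only the codons used below (an assumption for brevity).
CODON_TABLE = {
    "AUG": "Met",   # also the usual start codon
    "GCU": "Ala",
    "UUU": "Phe",
    "GGA": "Gly",
    "UAA": "STOP",
}


def transcribe(coding_strand: str) -> str:
    """DNA -> mRNA: the mRNA reads like the coding strand with T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> list[str]:
    """mRNA -> protein: read three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


if __name__ == "__main__":
    toy_gene = "ATGGCTTTTGGATAA"  # hypothetical sequence, for illustration only
    mrna = transcribe(toy_gene)
    print(mrna)
    print("-".join(translate(mrna)))
```

Genomics looks at the input of `transcribe`, transcriptomics at its output, and proteomics at the output of `translate`.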
Human Diseases
Most human diseases are multifactorial, caused by variations in many genes and environmental factors.
Understanding and predicting gene expression and protein synthesis is critical to staying ahead of diseases. This is a highly complex problem because of a combinatorial explosion of factors that play out “in vivo” (in the biological host) and cannot be fully reproduced “in vitro” (in the lab) or “in silico” (inside computers).
To understand the nature of the problem:
- There are about 30 trillion cells of roughly 200 cell types that make up a human.
- Each cell carries the same set of chromosomes. However, each cell’s gene expression (creating mRNA) differs based on the cell type, how much ATP it carries, chemical signals from other cells it receives, mechanical signals from the extracellular matrix, and nutrient levels.
- The expressed gene is transcribed into pre-mRNA, and cis-regulatory elements can enhance or repress that transcription.
- The pre-mRNA can be spliced to create different variants of mRNA.
- The mRNA needs to go through processing to be exported to the cytosol (outside the nucleus), and this step can also be regulated or disrupted.
- The finished mRNA needs to be translated into a primary protein structure, and translation can itself be regulated.
- The folding of the primary protein structure into secondary and tertiary structures can be altered.
- The tertiary functional protein can have post-translational modifications.
Imagine all of the above happening simultaneously and constantly across 30 trillion cells. (For brevity, the list above covers only about 5% of the combinatorial complexity.)
Can you imagine solving this by pipetting lab experiments, applying “statistical knowledge,” or building single-asset machine learning solutions for focused problems? You would call me insane. Nevertheless, this is how it is done today (due to limitations in technology).
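To make just one step from the list above concrete, take splicing. Here is a toy sketch of how a single pre-mRNA could yield several mRNA variants if some of its segments (exons) can be either kept or skipped. The exon names and the function `splice_variants` are hypothetical, and the model is deliberately crude: real splicing is regulated, and not every combination actually occurs in a cell.

```python
from itertools import combinations


def splice_variants(exons: list[str], optional: set[str]) -> list[list[str]]:
    """List every variant obtained by keeping or skipping each optional exon.

    Exon order is always preserved; exons not in `optional` are always kept.
    """
    skippable = [e for e in exons if e in optional]
    variants = []
    for how_many in range(len(skippable) + 1):
        for kept in combinations(skippable, how_many):
            variants.append([e for e in exons if e not in optional or e in kept])
    return variants


if __name__ == "__main__":
    # Hypothetical gene with four exons, two of which may be skipped.
    for variant in splice_variants(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
        print("-".join(variant))
```

In this toy model, each additional skippable exon doubles the number of possible variants, and this is only one of the steps listed above.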
Protein Biosynthesis (High Level)
The path from a DNA sequence to the proteome adds roughly 100x more complexity. To understand this, imagine that every gene can give rise to about 100 different protein forms. In other words, the 20,000 genes can produce about 2,000,000 protein forms (proteoforms).
Part of this complexity comes directly from regulation of the transcription of DNA to pre-mRNA, splicing of pre-mRNA into mRNA variants, and regulation of mRNA translation into primary protein structures.
Most of the remaining functional diversity of proteins comes from post-translational modifications. About 200 different modifications can occur on the protein forms, creating additional functional diversity.
Scientists estimate that there are anywhere between 2 million and a billion different protein forms! It is hard to study every one of them in the lab because we have no innovation for proteins comparable to what PCR does for DNA.
This means most of the work in protein identification, structural complexity, druggability, protein modifications, function prediction, and disease forecasting falls straight into the “in silico” bucket (computer models).
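The numbers in this section follow from simple multiplication, sketched below with the rounded figures from the text. The variable names are mine, and the last line only shows what the upper estimate would imply per gene on average; it is not a measured value.

```python
# Rounded figures quoted in this post (assumed inputs).
GENE_COUNT = 20_000
FORMS_PER_GENE = 100            # the "100x" diversity factor
UPPER_ESTIMATE = 1_000_000_000  # the high end of the quoted range

protein_forms = GENE_COUNT * FORMS_PER_GENE
implied_forms_per_gene_upper = UPPER_ESTIMATE // GENE_COUNT

print(f"protein forms at 100 per gene:      {protein_forms:,}")
print(f"forms per gene at the upper bound:  {implied_forms_per_gene_upper:,}")
```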