Genetics 101: What Exactly Is A Gene?
And How Do Programmers Read Them?
Genetics is primarily the study of DNA (deoxyribonucleic acid), the great instruction manual of living beings. Our DNA shapes and forms everything about us. On the microscopic level, it directs the activity of each of our trillions of cells. On the macroscopic level, this results in having hands attached to our arms and feet attached to our legs. Perhaps more importantly, our DNA contains the instructions for building brains that allow us to process life’s complexities. For better or worse, we pass our DNA on to our children hoping that they get our best traits and are spared from our worst.
From tiny bacteria like E. coli to Homo sapiens, this tiny molecule is the source of Earth’s incredible biological diversity. While the implications are complex, the structure of DNA itself is not very complicated at all. How is this possible? What is DNA, and how can it do so much?
The Basic Structure of Life
DNA is very small, but if you could zoom way in on the nucleus of a single cell in your body you’d see a twisting ladder of sorts, wound into the shape of a helix. The ladder is built rung by rung, by relatively simple molecules called nucleotides. There are four types of nucleotides in DNA: adenine, thymine, cytosine, and guanine; typically shortened to A, T, C, or G, respectively. Each nucleotide has a partner; the A nucleotide on one side of the ladder is always paired with a T on the other side of the ladder to make a stable rung. Similarly, a C nucleotide is always paired with G. In DNA, these bound partners are called “base pairs”.
Pair by pair, these little molecules combine to form DNA. A single continuous DNA molecule is called a chromosome. The collection of all the chromosomes of a single individual is called their genome. The number of chromosomes and the number of base pairs in a genome varies quite a lot from species to species, but remains consistent among individuals within a species.
Humans have roughly 3 billion base pairs in one set of 23 chromosomes, and most of our cells carry two sets for a total of 46 chromosomes. E. coli K-12 has about 4.6 million base pairs and a single circular chromosome. A human mitochondrion (that’s right, our mitochondria have their very own genomes!) has only about 16,500 base pairs, also in a single circular chromosome. There are other differences too. For example, humans get 2 copies (one from mom, one from dad) of 23 kinds of chromosomes, whereas bacteria reproduce asexually, so all of their chromosomes come from a single parent cell.
If you’re curious about different chromosome counts by species check out this list. If you’re interested in base pair count per species try this one.
An animation of a small section of DNA. The red and yellow parts are the DNA backbone — they are like the sides of the ladder. The parts you can see in purple and gray (between the two backbones) are base pairs binding with each other to make rungs.
In 1953 three scientists named Watson, Crick, and Franklin[1] discovered DNA’s “double helix” structure, wherein the two strands run alongside each other in opposite directions (they are “antiparallel”) and are joined in the center. The exact molecular form of each nucleotide isn’t terribly important to the rest of this article, but the chemically inclined and curious might enjoy this short video about nucleotides’ molecular structure.
When scientists perform genetic analysis, important genetic information (genes) can be found on either of the two strands. Interestingly, because of the biological machinery in our cells, the genetic information is always encoded in the same direction regardless of which strand it is on. Each strand of DNA has two ends: one is called the “three prime” end and the other is called the “five prime” end (commonly written as 3’ and 5’). The molecular machinery that copies DNA, such as DNA polymerase, always builds a new strand in the same direction, from its 5’ end to its 3’ end, and by convention genetic sequences are written and read in that same 5’ to 3’ direction.
Consider this simple example:
5’ ACTG 3’
3’ TGAC 5’
In this format, we’ve indicated which strand is in the 5’ to 3’ direction and vice versa by bookending the nucleotide sequence with the 5’ and 3’ symbols. The genetic data that can be used by our bodies in this example is either ACTG or CAGT — reading either strand from the 5’ end to the 3’ end. The sequence TGAC is not a part of this strand of DNA because reading that sequence would require traversing the DNA from the 3’ end to the 5’ end.
Here is a chemical diagram of the same sequence:
These short examples show the paired, antiparallel structure of DNA. The two strands are complements of each other: knowing one tells you exactly what the other must be. Because of this property, it is common to store only one strand in digital representations of genetic information in order to save space. The other strand can simply be reconstructed if necessary by swapping each base for its partner and reversing the order.
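To make that reconstruction concrete, here is a minimal Python sketch. The names `COMPLEMENT` and `reverse_complement` are my own choices for illustration, and the sketch assumes a well-behaved strand that contains only the letters A, T, C, and G.

```python
# Base-pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction.

    The input is assumed to be written 5' to 3'. Because the partner strand
    runs the opposite way, we walk the input backwards while swapping each
    base for its partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ACTG"))  # CAGT
```

Applied to the ACTG example above, this gives CAGT, which is the bottom strand read from its 5’ end to its 3’ end.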
There are, of course, exceptions to the rule. DNA is complex, and can exhibit irregularities such as mismatched base pairs or an extra (loner) nucleotide being inserted on one side. These irregularities can have functional consequences. For now, though, let’s set aside these awkward mutations and focus on how DNA encodes information when it’s behaving as expected.
Coding DNA, Non-Coding DNA, and Codons
Broadly speaking, scientists divide DNA into two groups known as coding and non-coding sections. Any section of DNA can be classified as one or the other, and a single chromosome (one continuous DNA molecule) can have both coding and non-coding sections. A single coding section of DNA is called a gene. Once again, the number varies quite a lot from species to species, but humans are estimated to have around 19,000 genes. Those genes make up only about 2% of the human genome; the rest of our DNA is non-coding.
Coding DNA is so called because it contains the “code” for the creation of proteins. At a molecular level, proteins do just about everything, and they are essential to the chemistry of life. From enzymes (a special type of protein) that break down chemical compounds to the complex structures that form muscles, bones, and other tissue, proteins make it possible. Genes control which proteins are created: the body can only create a protein if a sequence for that protein can be found in that organism’s DNA.
DNA also contains instructions for how many proteins to create, and when a cell should increase, decrease, start, or stop production of particular proteins. In a very real way, every individual is defined by the collection of proteins that its DNA knows how to create.
Non-coding DNA was once called “junk DNA”, but recent research suggests that it actually plays an important role in the development of life. For example, researchers from UC San Diego have found a significant link between non-coding DNA and autism. However, because of the importance of proteins and protein synthesis, a majority of genetic and genomic research (currently) focuses on coding DNA. Similarly, the rest of this article focuses on coding regions of DNA.
So, coding regions of DNA are called genes, and genes code for proteins. Proteins are made of molecules called amino acids, which are joined together in a particular order to make a long chain. That chain of amino acids is then twisted into a unique shape, and the resulting shaped amino acid chain is called a protein. This process is called protein synthesis, and the instructions for creating a protein are stored in our DNA. In coding sections of DNA, base pairs are grouped into triplets called codons. Each of the 20 amino acids that are used to create proteins can be represented by a sequence of 3 nucleotides.
There are 4*4*4 = 64 ways to combine our 4 nucleotides into a sequence of 3 (e.g. AAA, GAT, CTG, TCG … ), but there are only 20 amino acids. This means that some amino acids are represented by multiple codons. For example, arginine is represented by 6 different codons, but tryptophan is represented by just 1. If you’ve studied computer science, you might compare DNA’s redundant codons to a hash table where the keys are codons and the values are amino acids: with 64 keys and only 20 amino acids, many keys must map to the same value, which might loosely remind you of hash collisions.
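Here is a small sketch of that lookup table in Python. It assumes the standard genetic code written with DNA letters, lists the bases in the conventional T, C, A, G order, and uses `*` as a placeholder for the three codons that act as stop signals instead of coding for an amino acid. The names `CODON_TO_AMINO` and `codons_per_amino` are made up for this example.

```python
from collections import Counter
from itertools import product

# The standard genetic code, one letter per codon. The codons are generated
# in the order TTT, TTC, TTA, TTG, TCT, ... GGG (third base changes fastest),
# and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Keys are codons, values are one-letter amino acid codes.
CODON_TO_AMINO = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many different codons point at each amino acid?
codons_per_amino = Counter(CODON_TO_AMINO.values())

print(len(CODON_TO_AMINO))    # 64 codons in total
print(codons_per_amino["R"])  # 6 codons for arginine
print(codons_per_amino["W"])  # 1 codon for tryptophan
```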
There are also start and stop codons, which indicate the start and end of a single gene. In the standard genetic code, ATG is the usual start codon, and TAA, TAG, and TGA are the stop codons. Between a start-stop pair are the instructions for creating a single protein. Let’s consider an example: Trp Cage is one of the smallest known proteins, made up of only 20 amino acids, specifically:
Asparagine, leucine, tyrosine, isoleucine, glutamine, tryptophan, leucine, lysine, aspartate, glycine, glycine, proline, serine, serine, glycine, arginine, proline, proline, proline, serine.
Because writing the whole name of every amino acid is a bit tedious (and wasteful in terms of data size), each amino acid also has a one letter code. Translating the above sequence to the one letter codes we get:
NLYIQWLKDGGPSSGRPPPS
Because multiple codons map to the same amino acid, the number of genetic sequences that can represent Trp Cage is 73,383,542,784. This is one of the smallest known proteins, and yet there are more than 73 billion ways to write it. This is just one reason why bioinformatics is computationally intense. We can compute this number for any amino acid chain by multiplying together the number of codons for each amino acid. In the case of Trp Cage:
N L Y I Q W L K D G G P S S G R P P P S
2*6*2*3*2*1*6*2*2*4*4*4*6*6*4*6*4*4*4*6 = 73,383,542,784
(Attached to this article is a function that can compute how many different DNA sequences there are for any given amino acid sequence)
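As a rough sketch of how such a function might look (not necessarily the one attached to this article), the multiplication above can be written in a few lines of Python. It assumes the standard genetic code and Python 3.8 or newer for `math.prod`; the names `CODONS_PER_AMINO` and `count_encodings` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code; "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AMINO = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each one-letter amino acid code.
CODONS_PER_AMINO = Counter(CODON_TO_AMINO.values())


def count_encodings(protein: str) -> int:
    """Count the DNA sequences that encode a one-letter amino acid string.

    Start and stop codons are not included in the count. A letter that is
    not a known amino acid contributes a factor of 0.
    """
    return prod(CODONS_PER_AMINO[amino] for amino in protein)


print(count_encodings("NLYIQWLKDGGPSSGRPPPS"))  # 73383542784
```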
Here are 2 of those 73 billion nucleotide sequences; these sequences both encode the Trp Cage protein:
ATGAATTTATATATTCAATGGTTAAAAGATGGTGGTCCTTCTTCTGGTCGTCCTCCTCCTTCTTGA
ATGAACCTGTACATACAGTGGCTGAAGGACGGGGGGCCGAGCAGCGGGAGGCCGCCGCCGAGCTAG
Depending on the context, geneticists might prefer the nucleotide sequence or the amino acid sequence. For example, if you’re trying to compare the functional aspect of two genes, it makes more sense to process the genes into their amino acid sequences. Because there are 73 billion ways for DNA to say “Trp Cage”, comparing the nucleotides directly risks “missing the forest for the trees”. Notice that the previous 2 nucleotide sequences differ at many positions even though they describe the same protein. On the other hand, when tracking heredity or determining how closely related two species are, the nucleotide sequence matters, because when organisms reproduce it is the specific nucleotide sequence that is passed on.
According to Dr. Rebecca Mackelprang, a fellow at UC Berkeley’s CLEAR project (and full disclosure, my sister-in-law):
“When looking at relationships, evolution, or mutations within a species or between closely related species, it makes sense to look at variation among nucleotides. When comparing genes or proteins between species, it is more useful to look at the amino acid sequence. Many mutations in DNA actually don’t change the amino acid sequence, so there isn’t selective pressure on those nucleotides to stay the same. When comparing species, we don’t care about those differences between base pairs if the amino acid stays the same. So, we just look at the amino acid sequence instead of focusing on the sequence of nucleotides.” [hyperlink mine]
Dr. Sister-In-Law’s point about mutations stems from the fact that a single amino acid can be represented by multiple codons. If we care about gene expression in the form of protein synthesis, then we really only care about which amino acid each codon encodes, not the individual base pairs.
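A tiny Python sketch can make the difference between the two views visible. It assumes the standard genetic code, and the two 9-letter sequences are made-up examples that both spell the first three amino acids of Trp Cage (N, L, Y). The function names `translate` and `count_differences` are hypothetical.

```python
from itertools import product

# Standard genetic code; "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AMINO = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA (length must be a multiple of 3) to one-letter amino acids."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TO_AMINO[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_differences(first: str, second: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(first, second))


seq_a = "AATTTATAT"  # AAT TTA TAT
seq_b = "AACCTGTAC"  # AAC CTG TAC

print(count_differences(seq_a, seq_b))                        # 4 nucleotides differ
print(translate(seq_a), translate(seq_b))                     # NLY NLY
print(count_differences(translate(seq_a), translate(seq_b)))  # 0 amino acids differ
```

Nearly half of the nucleotides differ, yet the amino acid sequences are identical, which is exactly the kind of variation the amino acid view ignores.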
Genetics & Programming
Because genetic data sets can be enormous, computers have become an integral part of genetic research. Since I’m a computer programmer, I wanted to contextualize these foundational concepts by writing a few Python functions. For those who already know genetics but are just learning how to program, I hope this code can help you learn a little more about programming by bringing it to a realm you already understand.
Specifically, I wrote code that can do 5 things:
- Generate a random sequence of base pairs, generated codon by codon, which is capped on either end by a start/stop codon.
- Convert a sequence of base pairs to its corresponding amino acid sequence (assuming the sequence’s length is a multiple of 3).
- Convert a sequence of single letter amino acids, like “NLYIQWLKDGGPSSGRPPPS” for Trp Cage, into the full amino acid names. (Mostly so that I didn’t have to translate it by hand for Trp Cage mentioned earlier in the article)
- Given a single letter amino acid sequence, randomly generate a string of base pairs that maps to that amino acid sequence. And finally,
- Given a single letter amino acid sequence, report how many different combinations of codons can be used to encode that amino acid sequence.
In order to do those things, I had to create a few dictionaries by hand: one mapping amino acids to their codons (as nucleotide sequences), one mapping codons to amino acids, and one mapping the single letter amino acid representations to their full names.
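To give a feel for what such dictionaries make possible, here is a minimal sketch (not the exact code attached to this article). Rather than typing the codon table by hand, it generates the standard genetic code from a compact string, then inverts it. All names here (`AMINO_TO_CODONS`, `FULL_NAMES`, `random_encoding`, `translate`) are hypothetical, and choosing ATG as the start codon is an assumption based on the standard code.

```python
import random
from itertools import product

# Codon -> amino acid, standard genetic code; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AMINO = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Amino acid -> list of codons, built by inverting the table above.
AMINO_TO_CODONS = {}
for codon, amino in CODON_TO_AMINO.items():
    AMINO_TO_CODONS.setdefault(amino, []).append(codon)

# One-letter code -> full amino acid name.
FULL_NAMES = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartate",
    "C": "cysteine", "Q": "glutamine", "E": "glutamate", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def random_encoding(protein: str) -> str:
    """Pick a random codon for each amino acid, capped by start and stop codons."""
    body = "".join(random.choice(AMINO_TO_CODONS[amino]) for amino in protein)
    return "ATG" + body + random.choice(AMINO_TO_CODONS["*"])


def translate(dna: str) -> str:
    """Translate DNA (length must be a multiple of 3) to one-letter amino acids."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TO_AMINO[dna[i:i + 3]] for i in range(0, len(dna), 3))


trp_cage = "NLYIQWLKDGGPSSGRPPPS"
dna = random_encoding(trp_cage)
print(dna)                                            # different on every run
print(translate(dna))                                 # M + Trp Cage + *
print(", ".join(FULL_NAMES[a] for a in trp_cage[:3]))  # asparagine, leucine, tyrosine
```

Note that the start codon ATG also codes for methionine, so translating the generated sequence gives an M in front of the Trp Cage letters and a `*` for the stop codon at the end.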
These programs are not particularly fast or computationally efficient; nor would they be useful to a practicing bioinformaticist (there are better, faster, easier to use tools). Rather, they are a learning tool; by expressing the genetic concepts in the language of a computer program I hope you can learn more about both. I’ve tried to keep the code clean and readable, and have added many comments to explain what the code is doing. Even if you’re just getting started with programming I hope you’ll be able to learn something from this code.
If you’re looking for a challenge that can help you better understand the relationship between DNA, codons, amino acids, and proteins, then take this code and add a function that can programmatically generate a single nucleotide mutation that doesn’t change the amino acid sequence, or report definitively that such a mutation does not exist. The curious reader may also want to explore popular data formats for storing genetic data, such as the FASTQ format.
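For a first taste of FASTQ, here is a minimal sketch of a reader. It assumes the simplest common layout, where every record is exactly four lines (an `@` header, the nucleotide sequence, a `+` separator, and a quality string), and it assumes the widely used Phred+33 encoding, where each quality character’s ASCII code minus 33 is its score. The function name `parse_fastq` and the example record are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    # Drop trailing newlines and skip completely blank lines.
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a well-formed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality must have the same length")
        # Phred+33: the character "!" (ASCII 33) means a score of 0.
        scores = [ord(character) - 33 for character in quality]
        yield header[1:], sequence, scores


example = ["@read_1", "ACTG", "+", "II5!"]
for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)  # read_1 ACTG [40, 40, 20, 0]
```

Real FASTQ files are usually large and often compressed, so practicing bioinformaticians reach for established libraries; this sketch is only meant to show how little structure the format has.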
Here’s the code:
This article is part of Teb’s Lab. Check out the website for more details, sign up for my weekly newsletter, or join me on Patreon in my quest to never stop learning.
[1]: If you like historical drama, the story of Rosalind Franklin is an interesting one, and a classic example of women scientists not getting the credit they deserve. See also the story of Marilyn vos Savant, or the movie Hidden Figures.