Don’t Model Humans with SNPs
The first step in understanding something is to identify the right level of description for the problem. For example, imagine an engineer who is working on a prototype for a new car. The engineer needs to draw up a set of blueprints describing the design. Should she write down the molecular structure of the car to illustrate all of the bonds between its constituent molecules? Of course not; that would be the wrong level of description. Instead, she should write down the structure of the car in terms of its natural components like its wheels, engine, and windshield.
Now imagine a computational biologist who wants to build a model to predict the prognosis of a patient with Alzheimer’s disease. What variables should he choose? The patient’s DNA sequence, transcriptome, or proteome? Maybe it’s not individual genes that matter but biochemical pathways and cellular processes?
Unfortunately, despite all of our progress in biology we don’t know the right level of description for the field. And this ignorance is holding back progress in medicine. Before getting into the details, however, it may be helpful to step back and provide some background on molecular biology. The next section gives a small glimpse into the complexity of biological systems. Feel free to skip ahead if you don’t need an intro to molecular biology.
The most basic introduction to molecular biology
A person’s genetic code is a sequence of nucleotides (the chemicals adenine, thymine, cytosine, and guanine) that are strung together like letters in a document. A document probably contains multiple sentences; and each sentence should contain multiple words.
A genetic word is a sequence of three nucleotides called a codon. A genetic sentence is a sequence of codons called a gene. Each gene provides instructions on how to make a molecular machine called a protein. Humans have approximately 20,000 genes providing instructions on how to make approximately 20,000 proteins.
The process of manufacturing one of these molecular machines begins in the nucleus of each cell. The DNA in a cell is tightly wound around proteins called histones, folded up like a crumpled piece of paper. To read a sentence, the cell has to stretch out that portion of the DNA by unwinding it from the histones.
Once a stretch of DNA is unwound, the cell points to a particular sentence it wants to read. A protein called a transcription factor binds to the DNA at the start of this gene. This initiates a process called transcription where the cell copies the instructions in the gene into a new sequence called a messenger RNA.
The cell copies the message for two reasons. The first reason is that the assembly line that makes the molecular machines is located in a different part of the cell. Therefore, the sentence has to be copied and taken to the assembly line for manufacturing. The second reason is that each copy of the message will be used to produce a fixed number of copies of the desired machine. Therefore, the cell can control how many of each machine it makes by controlling how many copies it makes of each message.
Once the messenger RNA reaches the assembly line, an assembly line worker called a ribosome reads the instructions and creates a protein. A protein is a molecule constructed by chaining together building blocks called amino acids, of which there are 20 standard types. Each of the codons in the messenger RNA represents an amino acid (apart from a few codons that signal where the protein ends). The ribosome reads each of these words in the sentence and strings the corresponding amino acids together to create a protein.
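To make the word-by-word reading concrete, here is a minimal sketch in Python. It is only an illustration: the table below is a small, hand-picked subset of the standard 64-entry codon table, and the function name `translate` and the example message are my own assumptions, not anything a cell (or a standard library) defines.

```python
# Partial codon table using the RNA alphabet (A, U, G, C).
# The full standard table has 64 entries; this subset is for illustration only.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start signal
    "UUU": "Phe",  # phenylalanine
    "GGC": "Gly",  # glycine
    "GCU": "Ala",  # alanine
    "AAA": "Lys",  # lysine
    "UAA": None,   # stop signal: no amino acid is added
}

def translate(mrna):
    """Read an mRNA string three letters at a time and return the amino acid chain."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]  # raises KeyError for codons missing from this partial table
        if amino_acid is None:           # a stop codon ends the protein
            break
        protein.append(amino_acid)
    return protein

message = "AUGUUUGGCGCUAAAUAA"  # hypothetical message: six codons, the last one a stop
print(translate(message))       # ['Met', 'Phe', 'Gly', 'Ala', 'Lys']
```

A real ribosome does far more than a dictionary lookup, but the sketch captures the basic idea: three letters in, one amino acid out, until a stop word is reached.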
Proteins are the molecular machines of the cell and are responsible for most cellular activities. They transmit signals by changing shape when they bind to specific molecules; and they facilitate chemical reactions that produce materials the cell needs.
What do we measure?
Instead of being driven by what we should measure, most models in biology are driven by what we can measure. For example, some common types of genomic data include:
Single Nucleotide Polymorphisms (SNPs): If each genetic word is a sequence of three nucleotides that tells the ribosome which amino acid to use next when building a protein, then changing one of the letters in a word could make the ribosome use a different amino acid. Therefore, one level of description for biology could be to catalog all of these one letter changes in the entire DNA sequence. That would be a really long catalog, so biologists often restrict the analysis to single letter changes that occur in at least 1% of the population. These are called Single Nucleotide Polymorphisms (or SNPs).
It’s pretty easy to make a catalog of the SNPs in an individual person. SNP-based tests are now used to identify ancestry and predict the relative risk of some diseases. While there are statistical associations between some SNPs and diseases, it is very rare that a SNP is highly predictive.
Transcriptomics: The set of all messenger RNAs created by a collection of cells is called a transcriptome. Unlike a person’s DNA sequence, which is effectively the same throughout life, a person’s transcriptome is constantly changing in response to his or her environment. A cell controls the number of copies of its molecular machines by controlling the number of copies of each machine-encoding message. Therefore we may be able to learn a lot about a cell (or collection of cells) by counting all of these messages. Today transcriptomics experiments are often done by sequencing all of the messenger RNA molecules with a technology known as RNAseq.
Proteomics: Since proteins are the molecular machines of the cell, it may make sense to directly measure the amounts of each protein in a collection of cells. Experiments that measure the amounts of different proteins are collectively called proteomics; there are several different techniques for making these measurements. In many cases only a small subset of all of the 20,000 possible proteins can be measured in any given experiment.
Metabolomics: Although proteins are the molecular machines of the cell, they are usually busy creating other chemicals needed by the cell. The levels of these chemicals, called metabolites, can also be measured. These types of experiments are collectively known as metabolomics.
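Returning to the first item in the list, the 1% rule that defines a SNP can be sketched in a few lines of Python. This is a toy under loud assumptions: each person is represented by a single, already aligned string of letters, whereas real pipelines work with two copies of each chromosome, a reference genome, and sequencing errors. The function name `common_variant_sites` and the example sequences are made up for illustration.

```python
from collections import Counter

def common_variant_sites(sequences, min_frequency=0.01):
    """Return {position: (minor_letter, frequency)} for positions where the
    second most common letter appears in at least min_frequency of the sequences."""
    sites = {}
    for position in range(len(sequences[0])):
        counts = Counter(seq[position] for seq in sequences)
        if len(counts) < 2:
            continue  # everyone has the same letter here
        (_, _), (minor, minor_count) = counts.most_common(2)
        frequency = minor_count / len(sequences)
        if frequency >= min_frequency:
            sites[position] = (minor, frequency)
    return sites

# Hypothetical population of 200 people (one aligned sequence each).
population = ["ACGTAC"] * 196 + ["ACGTTC"] * 3 + ["GCGTAC"] * 1
print(common_variant_sites(population))
```

In this toy population the change at position 4 is carried by 3 of 200 people (1.5%) and is kept, while the change at position 0 is carried by only 1 of 200 (0.5%) and is dropped from the catalog.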
What should we measure?
Does one of the types of measurements listed above stick out to you as the right level of description for biology?
My opinion is that we don’t know, and in all likelihood none of these experiments captures the right level of description. SNP arrays, RNAseq, proteomics, and metabolomics are simply types of experiments that address what we can measure, not what we should measure. Why should nature be so simple that what we should measure also happens to be one of the things that is easy to measure?
Although the experiments we have today are certainly very valuable, they haven’t brought about the revolution in precision medicine that we were hoping for. To get there I think we need to pursue two parallel lines of research. In the short term, we need to develop machine learning based approaches that can integrate data from many types of experiments in a way that is robust to our ignorance. Below, I’ll argue that techniques from deep learning have the potential to solve this problem by automatically constructing features. In the long term, we need to invest in basic research that facilitates collaborations between experimental and theoretical biologists so that we can answer the question — “what is the right level of description for biology?”.
Discovering Abstractions with Machine Learning
Machine learning is the use of computer algorithms that learn to make predictions from data instead of following rules written by hand. Progress towards this goal has advanced rapidly in recent years due to a class of algorithms known as Deep Learning.
Teaching a computer to make accurate predictions is not that difficult if you know the right way to describe the problem. In machine learning, people say that you have to know the right features. In the past, human experts would draw up rules for combining raw data (e.g., gene expression data) into higher level features that reflected prior knowledge (e.g., the activation of a biological pathway). However, this process of feature engineering requires a lot of effort and only works if you have a lot of accurate prior knowledge.
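As a rough illustration of what hand-built features look like, the sketch below collapses gene-level measurements into one score per pathway by simple averaging. Everything named here is hypothetical: the gene names, the pathway memberships, and the choice of a plain mean are assumptions for the example, and a real analysis would take pathway definitions from a curated database and likely use a more careful scoring rule.

```python
from statistics import mean

# Hypothetical gene sets standing in for expert prior knowledge.
PATHWAYS = {
    "pathway_A": ["GENE1", "GENE2", "GENE3"],
    "pathway_B": ["GENE3", "GENE4"],
}

def pathway_features(expression, pathways=PATHWAYS):
    """Collapse per-gene expression values into one average score per pathway."""
    features = {}
    for name, genes in pathways.items():
        measured = [expression[g] for g in genes if g in expression]
        if measured:  # skip pathways with no measured genes
            features[name] = mean(measured)
    return features

# Hypothetical expression values for one sample.
sample = {"GENE1": 2.0, "GENE2": 4.0, "GENE3": 6.0, "GENE4": 1.0}
print(pathway_features(sample))  # {'pathway_A': 4.0, 'pathway_B': 3.5}
```

The weakness is easy to see in the code: the quality of the features depends entirely on the `PATHWAYS` dictionary, which is exactly the prior knowledge we often don’t have.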
Deep Learning algorithms provide a technological leap forward because they learn to construct the right features all by themselves. To do this, a Deep Learning algorithm takes in the raw data and transforms it. Then it takes the transformed data and transforms it again, and so on. This process of repeated transformations allows the model to discover the right level of abstraction for the data.
The figure above shows an example of this process for a dataset consisting of images of faces. The first layer of transformations identifies edges and blobs of color in the images. The second layer of transformations combines the edges and blobs into facial features like the eyes, nose, and mouth. Finally, a third layer of transformations combines the facial features into representations of faces.
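The idea of repeated transformations can be sketched without any deep learning library. The snippet below is only a skeleton under stated assumptions: the weights are random rather than learned, the layer sizes and input values are arbitrary, and the helper names (`make_layer`, `apply_layer`, `forward`) are my own. A real Deep Learning model would adjust the weights from data, which is where the useful features come from.

```python
import random

def make_layer(n_inputs, n_outputs, rng):
    """Create random weights for one layer; a trained model would learn these from data."""
    return [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]

def apply_layer(weights, inputs):
    """One transformation: weighted sums followed by a simple nonlinearity (ReLU)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(layers, raw_data):
    """Feed the output of each transformation into the next one."""
    representations = [raw_data]
    for weights in layers:
        representations.append(apply_layer(weights, representations[-1]))
    return representations

rng = random.Random(0)  # fixed seed so the sketch is repeatable
layers = [make_layer(8, 4, rng), make_layer(4, 2, rng), make_layer(2, 1, rng)]
raw = [0.5, 1.0, 0.0, 2.0, 1.5, 0.2, 0.7, 0.9]  # arbitrary stand-in for raw measurements

for depth, representation in enumerate(forward(layers, raw)):
    print(depth, len(representation))
```

Each pass through `apply_layer` produces a shorter, more abstract description of the same input: 8 raw numbers become 4, then 2, then 1.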
The same process can be used to discover the right level of description for problems in biology. However, there are barriers to the application of Deep Learning in biology that are less of an issue in areas (such as image analysis) where Deep Learning has been most useful. In particular, datasets in biology are much smaller than in other areas where machine learning is applied (for example, clinical trials often have only a few hundred people in them), and systematic errors make it difficult to generalize results across datasets. As a result, new approaches to Deep Learning that can mitigate these problems will be required before we will see widespread adoption in biology.
The Past and Future of Theory in Biology
Biology has a reputation as a reductionist observational science where scientists catalog all of the small differences they’ve observed during experiments without necessarily tying it all together. That is, biologists have trouble seeing the forest for the trees. This reputation is really quite unfair. In fact, there is a rich history of theory and top-down thinking in biology.
The idea of a gene, a discrete unit of heredity, is one of the most powerful theories in biology. The discovery of the gene is a remarkable story (well told in The Gene: An Intimate History by Siddhartha Mukherjee). In the mid-1800s, Gregor Mendel performed a vast series of experiments in which he crossed pea plants with different characteristics and observed how their traits were passed on. He discovered simple mathematical laws that suggested each parent carried particles of information (now called alleles) describing a physical trait. The alleles inherited from the two parents combined to determine the trait of their offspring. Nobody knew the mechanism (that genes are encoded in DNA) for nearly 100 years, but the theory of the gene and the mathematical rules on inheritance stood on their own.
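To show how simple these rules are, here is a small Python sketch of the textbook single-trait cross. It assumes the standard simplified picture: one trait, two alleles written "A" (dominant) and "a" (recessive), and each parent passing on one of its two alleles with equal chance. The function names are mine, and this is an illustration of the rule, not a reconstruction of Mendel’s own calculations.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Enumerate the equally likely allele pairs an offspring can inherit."""
    return Counter("".join(sorted(pair)) for pair in product(parent1, parent2))

def shows_dominant_trait(genotype, dominant="A"):
    """An offspring shows the dominant trait if it carries at least one dominant allele."""
    return dominant in genotype

offspring = cross("Aa", "Aa")  # both parents carry one dominant and one recessive allele
dominant = sum(n for g, n in offspring.items() if shows_dominant_trait(g))
print(dominant, "of", sum(offspring.values()))  # 3 of 4
```

Counting the four equally likely combinations gives one "AA", two "Aa", and one "aa", so three of every four offspring are expected to show the dominant trait. That 3:1 pattern could be predicted and tested long before anyone knew what a gene was made of.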
The search for unifying concepts, expressed in mathematical rules, should be a hallmark of all sciences. As with the discovery of the gene, the quest for unifying concepts requires a tight coupling between theory and experiment and usually proceeds from the discovery of macroscopic laws to the uncovering of microscopic mechanisms.
We need to change the incentives in biology to favor research that will answer these basic questions. We need to encourage collaborations between theorists and experimentalists that answer simple questions using well-designed experiments with large sample sizes. Journals will have to accept papers with equations in the main text. The focus should be on making predictions, not ad hoc analyses that quantify changes with p-values. Funding agencies will have to shift their priorities to basic science.
There are a lot of social barriers to the development of theoretical biology. But, if we don’t make the effort, we’ll drown in the details like an engineer trying to build a car out of individual atoms.