The battle for reading the proteome: Part 1
Accurately identifying and quantifying proteins has been a long-standing technical challenge. Startups are tackling it with audacious approaches.
What is the proteome?
Just as the genome is the collection of all genes present in an organism, the “proteome” is the collection of all proteins therein. The human proteome, then, is the complete set of proteins found in the human body.
Why do we need to read it?
Remember the central dogma of biology: DNA → RNA → Proteins. DNA is transcribed into RNA which in turn is translated into proteins.
Reading our genes through the now-ubiquitous technology of DNA sequencing lets us peek at the biological equivalent of source code. We can think of it as the instruction set of what was supposed to be. If it were really that simple, reading the source code alone would diagnose every condition that deviates from “normal” and serve as the singular test for detecting most diseases and abnormal states of health.
As you would guess, it is NOT that simple. Biology is beautifully complex. It turns out that proteins can interact with other proteins, RNA, and DNA to influence how the source code is interpreted. As an example, consider a well-known blind spot of genomics: a phenomenon called post-translational modification, or PTM. PTMs are chemical modifications made to a protein after it has been “made”, i.e. translated from RNA, and a protein’s activity can differ markedly with and without its PTMs. As a result, DNA-level information does not predict a protein’s final abundance, or even what its structure is going to be. Consequently, if reading DNA tells us what was supposed to be, reading proteins tells us what actually is.
In a way, it is a more accurate snapshot of the health of our organism.
What can we learn from the proteome?
Proteins are molecular machines involved in a staggering number of tasks in an organism.
Proteins act as catalysts in biochemical reactions.
Amylase in saliva breaks down starch into glucose, making it available for energy. Without amylase, the same reaction would proceed about 10¹⁰ times more slowly. Ain’t nobody got time for that.
Proteins provide structure.
The most common role of proteins is to provide elements of structure. At a cellular level, proteins like elastin and collagen provide scaffolds for cells to attach to surfaces and grow. At a macro level, proteins like keratin make up hair.
Proteins transport small molecules.
Hemoglobin carries oxygen from the lungs to the tissues and carbon dioxide from tissues back to the lungs.
Proteins are involved in signaling.
Signaling is a way for cells to communicate with each other. This is a gigantic field in itself. At the risk of tremendous simplification, suffice it to say that proteins are involved in all aspects of the signaling process:
a. the signal-generating complex that creates the molecular signal to be broadcast is a protein
b. the messenger molecule itself may be a small protein
c. the receiving sensor that detects the molecular signal and transmits it to internal cellular machinery is also a protein.
Proteins fight invaders.
Antibodies are proteins that recognize threats and stick to them, marking the threats out for clearance by the immune system.
As discussed above, proteins are heavily involved in nearly every function that keeps an organism running. It is no surprise, then, that abnormal levels of specific proteins can indicate that the organism is drifting from a normal, healthy state toward an unknown one, which may be disease.
When we monitor a protein’s abundance in a sample (typically blood) to understand and predict whether a patient is departing from a normal, healthy condition, that protein is called a “biomarker”. By measuring the biomarkers in our bodies, we can gain two crucial pieces of information:
1. Is something wrong?
By monitoring the levels of certain proteins, we can predict that
a. Someone has experienced an acute event (for example, cardiac troponin is used to detect acute myocardial infarction, a “heart attack”)
b. Someone has a propensity towards a disease (for example, C-reactive protein is elevated in coronary heart disease)
c. A disease is recurring (for example, thyroglobulin in metastatic thyroid cancer after thyroid removal)
2. How do we fix what is wrong?
Proteins are involved in many of the pathways that maintain the health of an organism. As a result, any aberration in the structure, activity, or abundance of a protein can be a therapeutic target. In fact, most drug discovery pipelines start by identifying a molecular mechanism through which we can set right a protein that is not “working” as normal.
For decades, biomarker discovery has been a domain of intense investigation in clinical research, driven largely by the promise of proteomics. A blood draw is a standard procedure in clinical investigations, and the concentrations of various plasma components are routinely used to diagnose disease and monitor recovery. In theory, if we knew which biomarker represented which clinical condition, we could diagnose and monitor thousands of diseases through a single blood draw.
As one would expect, hundreds of biomarkers have been proposed over the years by researchers around the globe. Yet only TWO clinically approved and accepted biomarkers emerge per year (ref), because most claimed biomarker discoveries never make it past the validation phase. Robustness and reproducibility have been the long-standing bane of the field: more often than not, entire biomarker panels (collections of numerous biomarkers) fail to clear statistical significance thresholds upon the slightest perturbation to the method, such as a change of instrumentation, a change in the personnel handling the samples, or a change of reagent manufacturer.
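To get an intuition for why so many candidate biomarkers evaporate at validation, here is a toy simulation of my own (not from any study cited here): screen 1,000 candidate proteins that in truth do NOT differ between cohorts, and count how many still look “significant” at p < 0.05. The cohort sizes, threshold, and distributions are all illustrative assumptions.

```python
import random
import statistics

random.seed(0)

def t_stat(a, b):
    # Welch's t statistic for two independent samples
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

n_candidates = 1000   # candidate biomarkers, none truly different
n_per_cohort = 20     # a typical small discovery cohort (assumed)
hits = 0
for _ in range(n_candidates):
    # cases and controls drawn from the SAME distribution: pure noise
    cases    = [random.gauss(0, 1) for _ in range(n_per_cohort)]
    controls = [random.gauss(0, 1) for _ in range(n_per_cohort)]
    # |t| > 2.02 roughly corresponds to p < 0.05 at ~38 degrees of freedom
    if abs(t_stat(cases, controls)) > 2.02:
        hits += 1

print(hits)  # dozens of "significant" biomarkers, all of them spurious
```

With no correction for multiple testing, roughly 5% of null candidates clear the bar: dozens of false discoveries per screen, none of which will survive an independent validation cohort.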
To understand why it is so difficult to design robust proteomic assays that make it past the validation phase, we need to understand where exactly the complexity lies.
Why is it difficult to identify proteins and measure their abundance?
Looking at the meteoric rise in the throughput and accuracy of DNA sequencing over the years, one might assume that proteomics has lagged simply because it has not received the same attention. The cost of sequencing DNA has dropped exponentially, from $1,000,000,000 to under $1,000 over the course of two decades (ref). Why haven’t we seen such technological leaps in proteomics? I would argue against this line of thinking and explain exactly how much more difficult it is to deal with proteins and their measurement.
In comparison to DNA, proteins as a molecular class have three critical features that make them extremely challenging to identify and measure:
- DNA is made up of 4 nucleotides; proteins are made up of 20 amino acids. On top of that, proteins exhibit multiple levels of structure-induced variation: a primary sequence plus secondary, tertiary, and even quaternary structure. It doesn’t end there. Proteins also come in variants called proteoforms: proteins derived from the same gene that differ slightly from each other due to a variety of factors at the genetic and translational levels.
- If we have too little of a particular DNA sequence, we can amplify it with high fidelity and bring its abundance up to a level where existing techniques can sequence it with high accuracy. There is no way to amplify proteins. If specific protein species are present at extremely low concentrations, too bad: we have to measure them at those concentrations as they are.
- It is estimated that the human proteome contains more than 10,000 protein species, whose concentrations span 10 orders of magnitude, from a few picograms per mL to milligrams per mL. I am not aware of any other field where we have had to build tools that deliver repeatable, accurate performance over such a large dynamic range.
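To make that last point concrete, here is a one-liner-style sketch of the dynamic range claim. The two concentration values are illustrative order-of-magnitude picks (a low-abundance cytokine at picograms per mL versus an abundant carrier protein at tens of milligrams per mL), not measured figures.

```python
import math

# Illustrative plasma protein concentration extremes, in g/mL
lowest  = 5e-12   # ~5 pg/mL, e.g. a scarce signaling protein (assumed value)
highest = 5e-2    # ~50 mg/mL, e.g. a dominant carrier protein (assumed value)

# Dynamic range in orders of magnitude
dynamic_range = math.log10(highest / lowest)
print(f"{dynamic_range:.0f} orders of magnitude")  # → 10 orders of magnitude
```

A detector that is accurate at 5e-2 g/mL must simultaneously resolve signals 10,000,000,000 times weaker; no amplification step exists to narrow that gap for proteins.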
What is the state of the art?
We will divide this discussion into two sections: the first deals with how a biomarker is discovered, and the second with translating it into a clinical context once the discovery has been validated.
A typical biomarker discovery effort investigates population-level differences in the proteome between a test cohort (people diagnosed with the condition under investigation) and a control cohort (healthy people). The most commonly used analytical platform for this is liquid chromatography–mass spectrometry (LC-MS). Mass spectrometry measures the mass-to-charge ratio (m/z) of one or more molecules present in a sample, and these measurements can often be used to calculate the exact molecular weight of the sample components. Typically, mass spectrometers are used to identify unknown compounds via molecular weight, to quantify known compounds, and to determine the structure and chemical properties of molecules.
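The step from an observed m/z back to a molecular weight is simple arithmetic. For a positively charged ion, the instrument reports the mass of the molecule plus its added protons, divided by the charge. The peptide values below are hypothetical, chosen only to show the calculation:

```python
PROTON_MASS = 1.00728  # Da, mass of one proton

def neutral_mass(mz, charge):
    # For a positive ion: m/z = (M + z * m_proton) / z
    # Rearranging gives the neutral molecular weight M:
    return charge * mz - charge * PROTON_MASS

# Hypothetical peptide ion observed at m/z 500.50 in charge state 2+
print(round(neutral_mass(500.50, 2), 3))  # → 998.985
```

The same molecule typically shows up in several charge states at once, which is one of the reasons a spectrum from a complex mixture of 10,000+ proteins is so hard to untangle.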
Now the problem is that we are never in a situation where we are analyzing a single known protein in a discovery experiment. The instrument needs to be able to deal with a host of proteins (remember, greater than 10,000 in number and spanning 10¹⁰ in concentration). And it can’t. So we resort to workarounds.
One commonly adopted technique is sample fractionation: a set of steps performed to selectively split the sample into groups that can be analyzed separately. For example, if we are looking for a protein whose expected concentration is 10¹⁰ times lower than that of the most abundant protein, we can’t expect to see it at all. There is simply too much of one protein compared to the candidate protein we are interested in.
But what happens if we can selectively remove the most abundant proteins from the sample? If we can split the sample into groups whose members have relative abundances within 1–100X of each other, all of a sudden, we can see them. This can be done in a number of ways, all of which require extensive knowledge both of the proteins you want to remove and of the ones you want to keep. Broadly, this is chromatography: the science of separating the components of a mixture based on some difference in their molecular properties, such as size, charge, shape, or activity.
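The grouping logic can be sketched numerically. This is only an analogy for what fractionation achieves; real fractionation is done physically via chromatography, and the protein abundances below are invented round numbers, not measured values:

```python
import math

# Hypothetical relative abundances (arbitrary units) spanning a huge range
abundances = {
    "albumin": 5e10, "IgG": 1e10, "transferrin": 2e9,
    "CRP": 3e5, "troponin": 8e1, "IL-6": 2e0,
}

# Group proteins into bins two decades wide: within a bin, relative
# abundances differ by at most ~100X, so each bin can be analyzed
# without one protein drowning out the others.
bins = {}
for name, abundance in abundances.items():
    key = int(math.log10(abundance) // 2)  # 2 orders of magnitude per bin
    bins.setdefault(key, []).append(name)

for key in sorted(bins, reverse=True):
    print(key, bins[key])
```

The catch, as the next paragraph explains, is that each physical "bin" demands its own carefully tuned separation chemistry, which is exactly what makes these workflows so cumbersome.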
The problem is that, given the large number of proteins in the proteome and, once again, the massive concentration range, these sample preparation workflows become extremely cumbersome, non-scalable, and in some cases bespoke. All of this has throttled the throughput of proteomics discovery workflows.
One other workaround is to perform “targeted” studies, where the researcher makes an active choice to interrogate only a particular class of proteins, grouped by mass range, chemical character, or known prior involvement in a particular condition. The problem here is bias: by not looking at the whole picture and by banking on prior knowledge, which carries biases of its own, the discovery set can become badly skewed and may not stand up to scrutiny.
In the research setting, the question is more of a tradeoff between depth of information and throughput. It is possible to identify about 4,500 protein species, but it comes at the cost of an extremely complicated sample preparation workflow with the undesirable traits of
a. multiple avenues for errors due to handling
b. the time required, up to weeks for completion
c. overall non-compatibility with automation
The problems described above have prevented LC-MS from becoming the analytical tool of choice for clinical-grade robustness. As a result, clinical biomarker practice rests on two more robust methods: assays of the enzymatic activity of certain plasma proteins, and antibody-based immunoassays.
Both of these methods are extremely targeted: one test interrogates only one particular protein. To get information about multiple proteins, you need to multiplex the setup. The state of the art, offered by companies like Ayoxxa, Luminex, and Meso Scale Diagnostics, can routinely identify and quantify on the order of tens of proteins.
That is tens out of more than ten thousand. Now, we begin to see the whitespace.
What do we need?
We need to develop workflows and technology platforms which are:
- Unbiased: all protein species present in a proteome should be captured by the analytical technique used. The sample preparation step should not bias discovery toward one class of proteins over another based on molecular differences.
- Sensitive across a wide range of concentration: all protein species irrespective of their abundance should be captured by the platform and quantified.
- High-throughput: the workflow should be amenable to a high degree of automation, so that throughput is not throttled by steps requiring extensive human intervention.
New kids on the block
This brings me to the most interesting part of this exploration. Many years of orthogonal progress across fields and technologies that were not necessarily related have converged at this moment. Some truly disruptive approaches to quantifying the proteome are being developed, commercialised, and brought to market.
Over the next parts of this topic, I will be diving deeper into a few companies that I believe are poised to disrupt the market.
Notably, all except Seer Bio are building ground-up technologies that do not rely on LC-MS at all. Seer Bio, on the other hand, is building the sample preparation workflow and the post-acquisition signal analysis toolbox to leverage the LC-MS instrumentation already present in most labs, which makes for a very different business model from its competitors’. Seer is employing some really nifty chemistry and automation to shrink sample preparation and reduce human involvement in making active experimental choices. Everyone else on the list is building physical instrumentation that works on different principles and negates the limitations of LC-MS by, essentially, bypassing it altogether. These competing technologies will have an additional barrier to surmount: adoption, and having to prove, beyond question, their merits over the reigning workhorse, LC-MS. A further concern is cost per run. LC-MS instrumentation can be expensive depending on system configuration, but its running costs are low. How will the others fare?
This is an interesting domain to explore, and I am looking forward to dissecting the technologies and thinking through the problems. I will be starting with Nautilus Bio in Part 2. Nautilus Bio is an extremely high-profile startup, having raised a total of $100M so far from marquee investors like Andreessen Horowitz, Vulcan, and Bezos Expeditions, among many more. I find their technology very non-intuitive, and it should make for some interesting study.
If you know of other companies innovating in this space that interest you, let me know. If you want to collaborate on this topic, please write to me.