My 10 favorite proteomics papers of 2019

Tanveer Singh Batth
16 min read · Jan 21, 2020


Hello and welcome to my first Medium post! 2019 has come and gone, and with it a great number of proteomics papers that may substantially push the field forward. Some of the themes I personally saw emerging this year are advanced search algorithms (specifically open search and de novo peptide sequencing), the use of machine learning to predict peptide fragment spectra, and, perhaps most exciting of all, first glimpses of novel protein sequencing technologies that could turn the field on its head (if successful). I'll try to gather my favorite papers here in no particular order.

TagGraph — A powerful tool for de novo sequencing and open search for identification of post-translational modifications (PTMs).

Proteomic searches are typically performed by matching peptide and peptide fragment masses (fingerprints) from mass spectrometry (henceforth simply "MS") analysis against a protein sequence database in order to infer protein identifications. In practice, most common search engines match thousands to hundreds of thousands of mass spectra against up to hundreds of thousands of in silico predicted peptide sequences (depending on the organism) to find the proper matches based on these masses and fingerprints. However, each potential side-chain modification (PTM) you add (e.g. oxidation, acetylation, phosphorylation) significantly increases the number of theoretical matches to search against. Imagine a common scenario: a peptide that can be phosphorylated at multiple positions, oxidized on a methionine, acetylated, and carry a missed cleavage from a protease such as trypsin. A single peptide can then be modified in many different combinatorial ways. Multiply this by the hundreds of thousands of predicted unmodified peptides (from a specified protease whose digestion is often semi-specific, leaves many missed cleavages, and varies in efficiency with sample preparation), and the search space explodes exponentially with each additional PTM. This is one of the great computational challenges of proteomics data analysis. Typically we limit search engines to a handful of common modifications to keep searches manageable; however, there are over 300 known protein modifications. TagGraph tackles this problem with a fast computational algorithm that matches peptides to a protein sequence database "without anticipating them a priori": no need to specify a protease, modifications, or other potential amino acid variations for a search. TagGraph accomplishes this not only faster than other algorithms but also comprehensively and at large scale.
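To make the combinatorial explosion concrete, here is a toy Python sketch (my own illustration, not TagGraph's algorithm) that counts the modified forms of a single peptide when each candidate site can independently carry its modification or not:

```python
def variant_count(sites_per_mod):
    """Number of distinct modified forms of one peptide, assuming each
    candidate site has two states (modified or not), independently."""
    total = 1
    for n_sites in sites_per_mod:
        total *= 2 ** n_sites
    return total

# Hypothetical peptide: 3 phosphorylatable residues, 1 methionine
# (oxidation), 1 lysine (acetylation) -> 2**5 = 32 variants to search,
# from a single peptide sequence.
print(variant_count([3, 1, 1]))  # 32
```

Scale that by hundreds of thousands of predicted peptides and hundreds of possible modification types, and it is clear why conventional engines restrict searches to a few modifications.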

Although this is an exciting development, it remains to be seen whether this strategy can be applied to the smaller day-to-day projects of most proteomics laboratories. The elephant in the room is quantitative analysis, as this will most likely add to the computational time. Still, the potential of a comprehensive search algorithm that can search for literally everything in a practical way is appealing. I am personally looking forward to seeing how far this strategy can go.

DeepMass and Prosit — Machine learning to develop better proteomics search software

These two publications pursue the same vision in slightly different ways: 1) both predict peptide fragmentation by training machine learning models on peptide mass spectra, 2) both use this information to increase identification rates in proteomics data, and 3) both deploy the strategy to build spectral libraries for data-independent acquisition (DIA). The two papers, however, use different strategies to reach these goals. Gessulat et al in the Prosit publication make use of the large ProteomeTools synthetic peptide library (which you can read more about here), which contained over 500,000 synthetic tryptic (more on this later) peptides from which over 20 million peptide MS/MS spectra were generated to train their machine learning model. Meanwhile, Tiwari et al from the Cox group utilized 25 public PRIDE proteomics datasets containing over 60 million MS/MS spectra. Although the neural network strategies can be read about in detail in the papers, the general conclusion from both emphasizes the peptide sequence as the key feature for predicting the intensities of the corresponding fragment ions in MS/MS spectra, which can ultimately help identify peptides with greater confidence from MS analysis.
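As a rough illustration of how such predictions are scored against real data, one can compare a predicted fragment intensity vector with the observed one. The papers use their own metrics (Prosit, for instance, reports a normalized spectral angle); plain cosine similarity is shown here as a simpler stand-in:

```python
import math

def cosine_similarity(predicted, observed):
    """Similarity between predicted and observed fragment ion
    intensities, aligned by fragment (e.g. b/y ions per charge state).
    Returns 1.0 for a perfect prediction, near 0 for unrelated spectra."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted))
    norm *= math.sqrt(sum(o * o for o in observed))
    return dot / norm if norm else 0.0

# A spectrum compared against itself scores (numerically) ~1.0
score = cosine_similarity([0.1, 0.9, 0.4], [0.1, 0.9, 0.4])
```

The better the model reproduces observed intensities across held-out peptides, the more discriminating power it adds when rescoring candidate identifications.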

A limitation of this pair of publications was the exclusive reliance on tryptic peptides (generated by the common protease trypsin, which digests proteins into small peptides so they can be sequenced by MS). Gessulat et al do demonstrate that the existing model can be applied to non-tryptic peptides, as the dataset provided enough information to learn "general characteristics of peptide fragmentation", as they put it. Still, this can be refined further; one limitation may be that there aren't enough high quality datasets for other proteases (compared to trypsin) to train the models on. Secondly, the models have not yet been demonstrated for PTMs, for which they could be highly beneficial. The logical assumption is that this is the next phase of development, but it would have been nice to see it included, as there are already enough datasets out there. Nonetheless, I look forward to seeing this extended to PTMs, where improvements in identification rates would be much welcomed, perhaps even more so than for unmodified peptides in proteome analysis. Lastly, both publications are biased towards Orbitrap-based mass spectrometers (Thermo Scientific). Although Orbitrap instruments do thoroughly dominate the market for high-end MS analysis, I hope to see the strategy extended to other vendors and MS technologies that are just starting to compete with Orbitrap-based MS.

Nanopore sequencing for proteomics

This paper was a late entry in 2019 (during the holidays) which could revolutionize the field of proteomics. Although I lack the background and expertise to confidently comment on how nanopore sequencing actually works, my limited understanding is this: molecules (in this case amino acids from peptides/proteins) are pushed through a protein (the "nanopore", a membrane transporter protein) embedded in a lipid bilayer while a potential is applied across the bilayer. Each time an amino acid passes through the nanopore it blocks the current. However, a "residual current" continues to flow even while the pore is blocked by the transiting amino acid. The authors observed that this residual current is uniquely different for 13 of the 20 amino acids, meaning those amino acids can be identified with high confidence by measuring the residual current as they pass through the pore. Among the remaining 7 amino acids, there were two groups with residual currents too similar to distinguish at the single amino acid level. The authors further investigated the cause of the current blockade and demonstrated that 2 of the remaining amino acids (methionine and tyrosine) can be uniquely identified after chemical modification. It should be reiterated that all of this was accomplished with a wildtype aerolysin protein serving as the nanopore; no doubt future engineering efforts can greatly improve the pore and thus provide an exciting future for alternative protein sequencing technologies. The obvious huge advantage of such a technology is that it can work from limited starting material and be extremely fast and robust with a very small footprint (some nanopore sequencing devices are as small as a USB stick), something proteomics as a field lacks at the moment with the current generation of mass spectrometers.

Of course one cannot go without discussing some of the challenges of this approach. Although I believe 13-15 amino acid specificity is likely sufficient to sequence the proteome (provided a reference sequence database is utilized), the sequencing itself still relies on adding a polycationic carrier to single amino acids so that they can be pushed through the nanopore. This polycationic carrier is a polypeptide comprising seven(!) chained arginines, which provide the charge required to drive the amino acid through the pore under the applied potential (perhaps the high charge is also required to cause the current blockade, but I would defer to somebody more knowledgeable). The authors propose that future iterations toward sequencing full-length peptides would require 1) first anchoring proteins or peptides at one terminus at or very close to the nanopore, 2) cleaving off the terminal amino acid, which I assume would be enzymatic (while anchoring the next amino acid at or near the nanopore), 3) enzymatically attaching the charged carrier peptide to the cleaved amino acid, and 4) sequencing it in the nanopore. This seems quite ambitious; although I think some of these steps could be simplified by enzyme engineering, I am uncertain how far that can really go. Furthermore, this does not even take into account the complexity of the steric interactions that might occur in complex peptide/protein mixtures, or the high dynamic range of the proteome itself. Of course this doesn't mean it cannot be done; there is at least one company aiming to do extreme protein engineering at a similar difficulty level, although without nanopores (which you can read about here). There is no doubt that if this technology succeeds, it will be this paper that is cited as the pioneering work that kick-started it all.

Mapping protein targets of small molecule fragments based on stereoselectivity

I thought this was a really neat publication. The basic premise is this: small molecules and chemical probes are lacking for a large number of proteins. These probes and molecules not only help us better understand and study different classes of proteins and their biological effects, but can also serve as starting compounds for drug development. Oftentimes a small molecule can have a "mirror" version produced during synthesis. Such pairs are called "enantiomers": identical in atomic composition but mirror images in three-dimensional structure. When small molecule drugs are produced, they often need to be enantiomerically pure, which is achieved during synthesis and purification and can be highly difficult in some cases (depending on the molecule). There are multiple cases where non-enantiopure compounds were given to patients, causing severe side effects and death. A good example is thalidomide, which caused thousands of infant deaths, partially attributed to the racemic mixture of enantiomers (some good reads on the topic can be found here, here, and here). Wang et al probe the stereoselectivity of such enantiomers by studying how stereochemistry affects the protein selectivity of drugs and small molecules. They accomplish this by attaching probes to enantiopure fragments and adding these "enantioprobe" pairs separately to living cells. The probes contain two groups. The first is a "photoreactive group" that, when excited with a certain wavelength of light, crosslinks to proximal protein side chains (aka amino acids) if the fragment binds selectively. The second is an "alkyne handle", used for click chemistry: it reacts selectively with azide groups, allowing the enantioprobes to be purified if they have attached to proteins.
So if a fragment is stereospecific for certain proteins, adding the enantioprobe pairs to cells and pulling down the probes lets this be read out by MS analysis. They tested 8 different enantioprobe pairs and found 176 proteins with "stereoselective interactions". Overall I thought this was a very nice demonstration of how proteomics can be utilized for chemical biology and drug design.

Accurately measuring the masses of large proteins and protein complexes using single-particle analysis on Orbitrap instruments.

Although currently a bioRxiv pre-print yet to be formally published in a journal, this study addresses a significant hurdle in measuring large biological "macromolecules" with commercial mass spectrometry instrumentation. Mass spectrometers measure the mass-to-charge ratio (m/z) of molecules, and this information is used to determine the mass of "molecular elephants" (large biomolecules). The problem with measuring molecular elephants in the hundreds-of-kilodaltons to megadalton-plus range (typically occupied by large proteins such as antibodies, protein complexes, and viruses) is that they carry a lot of charges (hundreds or more, depending on the size of the elephant). This makes it extremely difficult in many cases to determine not only the exact mass of the protein but also variants caused by, for example, modifications or truncations: the many charges produce a broad charge-state distribution in m/z that can mask these variants, to the point where you are roughly estimating the mass from this distribution and unable to accurately resolve the different populations in the sample. Even very high resolution instrumentation cannot always avoid this problem. One way the issue is typically addressed is single-particle detection coupled to a mass spectrometer (discussed in the linked manuscript). However, these technologies have remained academic niches (limited to a few groups in the world) and require in-house fabrication and deep expertise, as they are not commercially available. Wörner et al demonstrate a means to measure very large elephants on a commercial instrument, and show the power of the technique by measuring complex mixtures such as IgG oligomers, protein complexes, and viral assemblies with and without RNA.
To summarize, they inject single particles into the Orbitrap analyzer (which is sensitive enough that the molecules can be stably trapped for seconds at a time) and measure the molecules one at a time. With a single particle you only see one peak, i.e. no distribution. However, they observed that the intensity of this peak correlates with the charge of the molecule, so by making many repeated single-particle measurements, both the charge and the mass can be determined accurately.
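The arithmetic at the end is simple once the charge is known: a hedged sketch (my own illustrative numbers, not the paper's data) converting a single-particle m/z reading to a neutral mass:

```python
PROTON_MASS = 1.00728  # Da, mass of the proton charge carrier

def neutral_mass(mz, charge):
    """Neutral mass of a particle from one m/z measurement, once the
    charge has been inferred (in this approach, from peak intensity)."""
    return charge * (mz - PROTON_MASS)

# Hypothetical ~750 kDa particle seen twice at different charge states;
# averaging many single-particle readings tightens the estimate.
readings = [(30000.0, 25), (25000.2, 30)]
estimate = sum(neutral_mass(mz, z) for mz, z in readings) / len(readings)
```

The hard part, and the paper's contribution, is inferring the charge reliably from the single-particle signal in the first place.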

They were not the only group to report this phenomenon. Kafader et al (see link below) from the Kelleher group published similar findings around the same time (2 days before) on bioRxiv. Their approach trapped around 100 particles at a time instead of single particles, which they found sufficient for accurately determining mass. They even go a step further and determine 500 proteoforms directly from a HEK cell line lysate, which is really impressive. I look forward to seeing how much further this field evolves in the coming years, particularly with regard to the number of proteins and sensitivity. I imagine ion mobility will play a role at some point.

Link to the Kafader et al paper below:

https://www.biorxiv.org/content/10.1101/715425v1.full

Subcellular localization of over 10,000 proteins in human cell lines

https://www.cell.com/molecular-cell/fulltext/S1097-2765(18)31005-0

I really liked this paper, not just for the extensive depth of the analysis but also for the presentation of the data. The authors demonstrate with relatively high confidence where roughly 10,000 different proteins are located within the cell. As we know, the mammalian cell contains various critical structures such as the nucleus where DNA is stored, the ribosomes which synthesize proteins, or the mitochondria which generate the cell's power (via production of ATP), to name a few. Orre et al reach extensive depth at the protein level through fractionation at multiple levels. First they separate the cellular compartments (i.e. cytosol, nucleus, organelles, etc.) of 5 different cell lines via a centrifugation-based approach. This already decreases the protein complexity significantly for proteomics analysis; they then further fractionated the peptides from each isolated cellular compartment using their trademark peptide fractionation strategy based on isoelectric focusing (termed HiRIEF), leading to the identification of over 150,000 peptides and 10,000 proteins per cell line. They were able to compare their analysis against other localization datasets based on bioinformatic or experimental analysis. Although it's a lot to take in, one of the key discoveries in my opinion was that the majority (>90%) of proteins localized to only one cellular compartment at a time. This is in contrast to other studies such as the Cell Atlas work, which observed that roughly 50% of proteins can localize to more than one compartment. Another key point was the cell line dependent localization of certain proteins, i.e. proteins could localize to different cellular compartments depending on the cell line they are expressed in.

In my opinion, the true percentage of proteins localized to a single subcellular location most likely lies somewhere between the findings of this paper (90%) and the Cell Atlas publication (50%). As the Cell Atlas findings were based on immunofluorescence, it's not outside the realm of possibility that antibody (non)specificity led to an underestimation of localization specificity. The Orre et al findings, however, were limited by the subcellular separation protocol, which collected several different organelles into only 5 fractions, severely limiting the resolution of the subcellular fractionation; for reference, the Cell Atlas study classifies 30 subcellular compartments and 13 organelles. A limitation shared by both studies is the reliance on cell lines, so it would be interesting to see how the localization data look for cells isolated from physiological samples. That would also help clarify which of the two methods is better suited to address this question, so the jury is still out. Regardless, I recommend everybody check out the excellent web portal for Orre et al's work at http://www.subcellbarcode.org/ .

Illuminating the dark phosphoproteome

I absolutely loved this publication from Needham et al in Science Signaling earlier this year. Unlike the other papers in this list, this one is a review article. Although I am slightly biased as somebody who extensively studied intracellular signaling and phosphoproteomics during my PhD, this review is really thorough and beautifully presented. The authors do a good job of highlighting the current challenges and limitations of phosphoproteomics analysis, but more importantly they put into context the many "known unknowns", as Donald Rumsfeld once elegantly put it. What that translates into is that we are constantly finding unexpected and novel roles for even well-studied kinases (proteins which phosphorylate other proteins to activate/deactivate them, alter protein complexes, and/or change subcellular localization). There are over 400 human kinases, the majority of which are heavily understudied and whose biological context remains unknown. Kinases, as many of us know, are an important class of enzymes and lucrative drug targets, as they control many central cellular processes such as the cell cycle, proliferation, apoptosis, and intracellular signaling, and play a crucial part in sensing and responding to the extracellular environment, to name a few. The authors do a good job of exploring strategies which can be utilized to determine the biological functionality and context of different kinases. The extensive review (with over 200 references) is difficult to summarize here, so I highly recommend a read if possible. The review also contains movies, but they must be viewed separately (publishers really need to get up to date with integrating different forms of media). Overall it's an excellent resource for anybody who wants to gain insight into protein phosphorylation, the state of the field, and future opportunities.

Role of phosphorylation in maintaining the sleep-wake cycle

One of the challenges in studying protein phosphorylation is determining the role of specific phosphorylation events under physiological conditions. Although recent studies have started to elucidate the role of phosphorylation at the cellular and tissue level, many studies are still done in traditional cell line models which most likely do not accurately recapitulate physiological processes at the molecular level. Brüning et al use mouse models to study the role of protein phosphorylation in the circadian clock. Specifically, they kept mice under tightly controlled light and dark cycles, during which they collected "synaptosomes" (also referred to as "synaptoneurosomes") from the synapses of neurons, i.e. the ends of branched dendrites where electrical and chemical signals are passed between neurons. These synaptoneurosomes were collected from the mouse forebrains at regular timed intervals of the light and dark cycle, at which points mice were sacrificed (i.e. killed so that their brains could be studied). What they found was a torrent of phosphorylation activity correlating with the circadian cycles of the mice. More specifically, they observed the highest phosphorylation activity during the transition from light to dark, when mice start to become active. The second highest phosphorylation activity was observed at night, when the mice were preparing to sleep. On the whole this makes a lot of sense: it seems the brain is preparing for these transitions, while activity remains stable throughout the day and night as the neurons fulfill their roles in the circadian cycle. They additionally observed a large number of activated kinases concentrated at these synaptoneurosomes during the peaks of activity. They then tested what happens if you disrupt the circadian clock of these mice by sleep deprivation.
They found sleep deprivation abrogated the regular phosphorylation rhythms that correlated with the circadian cycles in their analysis, suggesting that protein phosphorylation plays an essential role in these circadian transitions. This is an important study which cannot be performed in humans, so whatever knowledge can be gained from it may be applied to obtain a better understanding of human physiology. How this translates to potential drugs targeting the brain remains to be seen; drugs for the brain typically target certain chemical receptors, oftentimes with many side effects. Kinases localized to these regions could potentially be attractive targets for regulating brain activity, though kinase drugs often come with their own set of side effects. A great study nonetheless, and I am very much looking forward to follow-up studies on this subject.

FlashPack — Fast and simple preparation of nanocolumns

https://www.mcponline.org/content/18/2/383

The last member of my top 10 list is a very technical and niche publication that was much appreciated by those who perform proteomics work on a daily basis. In summary, this paper demonstrates a technique to pack capillary columns extremely fast and with very high efficiency. Sensitive MS analysis is typically accomplished through separation of complex peptide mixtures by reversed-phase chromatography. This step is essential to reduce the complexity of the mixture so that peptides are sequenced in less complex blocks as they elute from a column over time and are analyzed by MS. Most laboratories pack their own nanocolumns, as this can be literally 100-1000 times cheaper than buying them commercially; it is also something that various vendors have not been able to manufacture reliably at large scale. Homemade nanocolumns (a few cents to a few dollars each) have always outperformed very expensive commercial capillary columns (1000+ dollars) in our hands. However, packing these columns takes patience, experience, and acceptance of high failure rates. Kovalchuk et al demonstrate a very fast and easy method to accomplish this on the order of seconds to minutes. For reference, packing some columns can take from an hour to 12+ hours (typically overnight); this method accomplishes it in minutes. Proteomic groups at our institute, including our own, have adopted this method and have been using it successfully.


Tanveer Singh Batth

Proteomics researcher in Copenhagen, Denmark with a variety of interests in science, technology, culture and society.