AI for Spatial Metabolomics I: The Datasets of Life

Sergey Nikolenko
Published in Neuromation, Mar 9, 2018
Image source

Here at Neuromation, we are starting an exciting (and rather sophisticated!) joint project with the Spatial Metabolomics group of Dr. Theodore Alexandrov at the European Molecular Biology Laboratory. In this mini-series of posts, I will explain how we plan to use the latest achievements in deep learning and invent new models to process imaging mass-spectrometry data, extracting metabolic profiles of individual cells to analyze the molecular trajectories that cells with different phenotypes follow…

Wait, I’ve surely lost you three times already. Let me start over.

Omics: the datasets that make you

Image source

The picture above shows the central dogma of molecular biology, the key insight of twentieth-century biology into how life on Earth works. It shows how genetic information flows from the DNA to the proteins that actually do the work in the cells:

  • DNA stores genetic information and can replicate it;
  • in the process known as transcription, parts of the genetic code in DNA are copied out to messenger RNA (mRNA), also a nucleic acid;
  • and finally, translation is the process of making proteins, “reading” the genetic code for them from RNA strings and implementing the blueprint in practice.

I’ve painted a very simplified picture, but this is truly the central, most important information flow of life. The central dogma, first stated by Francis Crick in 1958, says that genetic information flows only from nucleic acids (DNA and RNA) to proteins and never back: the sequence of a protein cannot be copied back into your DNA or RNA, or into another protein. What gets built is specified only by the nucleic acids.

Everybody knows that the genetic code, embodied in DNA, is very important. What is slightly less known is that each step along the central dogma pathway corresponds to its own “dataset”, its own characterization of an organism, each important and interesting in its own way. (A pathway is basically a sequence of reactions that transform molecules into one another; for example, DNA -> RNA -> protein is a pathway, and a very important one!)

Your set of genes, encoded in your DNA, is known as the genome. This is the main “dataset”, your primary blueprint: the genome is the stuff that says how you work in the most abstract way. As you probably know, the genome is a very long string of “letters” A, C, G, and T, which stand for the four nucleotides… don’t worry, we won’t go into too much detail about that stuff. The Human Genome Project successfully sequenced (“read out” letter by letter) a draft of the human genome in 2000 and an essentially complete human genome in 2003, all three billion letters of it. Since then, sequencing methods have improved a lot; moreover, all human genomes are, of course, very similar, so once you have one it is much easier to get the others. Your genome influences what diseases you are susceptible to and defines many of your characteristic traits.
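To get a feeling for the size of this first dataset, here is a back-of-the-envelope sketch. It assumes nothing beyond the four-letter alphabet and the three billion letters mentioned above; the particular 2-bit encoding and the function names are my own illustrative choices, not a standard file format.

```python
# Toy sketch: the genome as a string over a four-letter alphabet.
# Four letters need only 2 bits each, which gives a raw storage estimate.

BITS_PER_LETTER = 2  # log2(4) for the alphabet A, C, G, T
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # arbitrary illustrative encoding


def pack(sequence: str) -> int:
    """Pack a DNA string into a single integer, 2 bits per letter."""
    value = 0
    for letter in sequence:
        value = (value << BITS_PER_LETTER) | CODES[letter]
    return value


def raw_size_megabytes(n_letters: int) -> float:
    """Storage for n_letters at 2 bits per letter, in megabytes (10**6 bytes)."""
    return n_letters * BITS_PER_LETTER / 8 / 1e6


print(bin(pack("GATTACA")))               # 14 bits for 7 letters
print(raw_size_megabytes(3_000_000_000))  # 750.0
```

In other words, the raw letters of one human genome would fit in roughly 750 MB, less than an old-fashioned CD or two.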

The study of the human genome is far from over, but it is only the first part of the story. As we have seen above, genetic code from the DNA has to be read out into RNA. This is known as transcription, a complicated process whose details are irrelevant for our discussion right now: the point is, pieces of the genome are copied into RNA verbatim (formally speaking, T changes to U, a different nucleotide, but it’s still the exact same information):

Image source

Cells differ from one another in which parts of the genome get transcribed.
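The “verbatim copy with T changed to U” idea fits in a few lines of Python. This is a deliberately simplified sketch: it assumes we are handed the coding strand of the DNA and ignores everything else about how transcription actually works; the function name is just an illustrative choice.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same letters, with T replaced by U."""
    dna = coding_strand.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"not a DNA sequence, unexpected letters: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```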

The set of RNA sequences (both coding RNA that will later be used to make proteins and non-coding RNA, that is, the rest of it) in a cell is called the transcriptome. The transcriptome provides much more specific information about individual cells and tissues: for example, a cell in your liver has the exact same genome as a neuron in your brain, but a very different transcriptome! By studying the transcriptome, biologists can “increase the resolution” and see which genes are expressed in different tissues and how. For example, modern personalized medicine screens transcriptomes to diagnose cancer.

But this is still about the genetic code. The third dataset is even more detailed: it is the proteome, which consists of all proteins produced in a cell in the process known as translation, where RNA serves as a template, with every three letters encoding one amino acid of the protein:

Image source
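Here is a minimal sketch of translation as a lookup: read the RNA three letters at a time and map each triplet (a codon) to an amino acid. The table below is only a small excerpt of the standard genetic code, which has 64 codons in total, and the function name is an illustrative assumption.

```python
from typing import List

# A small excerpt of the standard genetic code (the full table has 64 codons).
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start codon
    "UUU": "Phe",
    "GCC": "Ala",
    "GGC": "Gly",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": None,   # the three stop codons encode no amino acid
    "UAG": None,
    "UGA": None,
}


def translate(mrna: str) -> List[str]:
    """Read the mRNA codon by codon until a stop codon or the end of the string."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    for i in range(0, usable_length, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this excerpt of the table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon: the protein is finished
            break
        protein.append(amino_acid)
    return protein


print(translate("AUGGCCUUUAAAUAA"))  # ['Met', 'Ala', 'Phe', 'Lys']
```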

This is already much closer to the actual objective: the proteins that a cell makes determine its interactions with other cells, and the proteome says a lot about what the cell is doing, what its function in the organism is, what effect it has on other cells, and so on. And the proteome, unlike the genome, is malleable: many drugs work exactly by suppressing or speeding up the production of specific proteins. Many antibiotics, for instance, fight bacteria by attacking their ribosomes, the RNA-based machinery that performs translation, suppressing protein synthesis and thus stopping or killing the cell.

Genomics, transcriptomics, and proteomics are subfields of molecular biology that study the genome, transcriptome, and proteome. They are collectively known as the “omics”. The central dogma has been known for a long time, but only very recently have biologists developed tools that actually let us peek into the transcriptome and the proteome.

And this has led to the big data “omics revolution” in molecular biology: with these tools, instead of theorizing we can now actually look into your proteome and find out what’s happening in your cells — and maybe help you personally, not just develop a drug that should work on most humans but somehow fails for you.

Metabolomics: beyond the dogma

Image source

Molecular biologists began to speak of “the omics revolution” in the context of genomics, transcriptomics, and proteomics, but the central dogma is still not the full picture. Translating proteins is only the beginning of the processes that occur in a cell; after that, these proteins actually interact with each other and other molecules in the cell. These reactions comprise the cell’s metabolism, and ultimately it is exactly the metabolism that we are interested in and that we might want to fix.

Modern biology is highly interested in processes that go beyond the central dogma and involve the so-called small molecules: lipids, sugars such as glucose, ATP, and so on. These small molecules are either synthesized inside the cells (in this case they are called metabolites, that is, products of the cell’s metabolism) or come from outside. For instance, vitamins are typical small molecules that cells need but cannot synthesize themselves, and drugs are exogenous small molecules that we design to tinker with a cell’s metabolism.

These synthesis processes are controlled by proteins and follow the so-called metabolic pathways, chains of reactions with a common biological function. The central dogma is one very important pathway, but in reality there are thousands. A recently developed model of human metabolism lists 5324 metabolites, 7785 reactions and 1675 associated genes, and this is definitely not the last version: modern estimates reach up to 19000 metabolites, so the pathways have not all been mapped out yet.

The metabolic profile of an organism is not fully determined by its genome, transcriptome, or even proteome: the metabolome (the set of metabolites) forms, in particular, under the influence of the environment that provides, e.g., vitamins. Metabolomics, which studies the composition of and interactions between metabolites in live organisms, lies at the intersection of biology, analytical chemistry, and bioinformatics, with growing applications to medicine (and that’s not the last of the omics, but metabolomics will suffice for us now).

Knowing the metabolome, we can better characterize and diagnose various diseases: they all have to leave a trace in the metabolome because if the metabolism has not changed, why is there a problem at all? By studying metabolic profiles of cells, biologists can discover new biomarkers for both diagnosis and therapy and find new drug targets. Metabolomics is the foundation for truly personalized medicine.

The ultimate dataset

Image source

So far, I’ve been basically explaining recent progress in molecular biology and medicine. But what do we plan to do in this project? We are not biologists, we are data scientists, AI researchers; what is our part in this?

Well, the metabolome is basically a huge dataset: every cell has its own metabolic profile (the set of molecules that appear in the cell). Differences in metabolic profiles determine different cell populations, changes in metabolic profiles over time correspond to patterns of cell development, and so on, and so forth. Moreover, in spatial metabolomics, the field we plan to collaborate on, this dataset comes in the form of special images: results of imaging mass-spectrometry applied at very high resolution. This, again, requires some explanation.

Mass-spectrometry is a tool that lets us find out the masses of everything contained in a sample. Apart from rare collisions, this is basically the same as finding out which specific molecules appear in the sample. For example, if you put a diamond in the mass-spectrometer you’ll see… no, not just a single carbon atom, you will probably see both 12C and 13C isotopes, and their composition will say a lot about the diamond’s properties.
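To see why a mass almost pins down a molecule, and what a “collision” looks like, here is a small sketch that computes monoisotopic masses from chemical formulas. It is a simplification under stated assumptions: only four elements are included, the formula parser handles only plain formulas without brackets, and it ignores the fact that real instruments measure ions rather than neutral molecules. The helper names and the three example molecules are my own choices.

```python
import re
from collections import defaultdict

# Masses (in daltons) of the most abundant isotope of each element.
MONOISOTOPIC_MASS = {"H": 1.00782503, "C": 12.0, "N": 14.00307401, "O": 15.99491462}


def parse_formula(formula: str) -> dict:
    """Turn a plain formula such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = defaultdict(int)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def monoisotopic_mass(formula: str) -> float:
    """Sum the isotope masses over all atoms in the formula."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


# Glucose and fructose share a formula, so mass alone cannot tell them apart.
molecules = {"glucose": "C6H12O6", "fructose": "C6H12O6", "alanine": "C3H7NO2"}

by_mass = defaultdict(list)
for name, formula in molecules.items():
    by_mass[round(monoisotopic_mass(formula), 4)].append(name)

for mass, names in sorted(by_mass.items()):
    print(mass, names)
```

Alanine gets a mass of its own, while glucose and fructose land on exactly the same value: that is one of the collisions mentioned above.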

Imaging mass-spectrometry is basically a picture where every pixel is a spectrum. You take a section of some tissue, put it into a mass-spectrometer and get a three-dimensional “data cube”: every pixel contains a list of molecules (metabolites) found at this part of the tissue. This process is shown in the picture above. I’d show some pictures here but it would be misleading: the point is that it’s not a single picture, it’s a lot of parallel pictures, one for every metabolite. Something like this (picture taken from here):

The quest for better imaging mass-spectrometry tools mostly aims to increase resolution, i.e., make the pixels smaller, and increase sensitivity, i.e., detect smaller amounts of metabolites. By now, imaging mass-spectrometry has come a long way: the resolution is so high that individual pixels in this picture can map to individual cells! This high-def mass-spectrometry, which is becoming known as single-cell mass-spectrometry, opens the door for single-cell metabolomics: you can now get the metabolic profiles of many cells at once, complete with their spatial location in the tissue.
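The “data cube” and its “parallel pictures” can be sketched with a toy example. Everything here is made up for illustration: a 2 x 3 grid of pixels, two metabolites, and invented intensities. A real dataset would store intensities over many mass values per pixel rather than a handful of named molecules, and the function names are hypothetical.

```python
# Toy "data cube": a 2 x 3 grid of pixels, each holding a spectrum,
# reduced here to {metabolite name: intensity}. All values are made up.
cube = [
    [{"glucose": 5.0, "ATP": 1.0}, {"glucose": 4.0}, {"ATP": 2.5}],
    [{"glucose": 0.5}, {}, {"ATP": 3.0, "glucose": 1.0}],
]


def ion_image(cube, metabolite):
    """One 'parallel picture': the intensity of a single metabolite at every pixel."""
    return [[pixel.get(metabolite, 0.0) for pixel in row] for row in cube]


def profile_at(cube, row, col):
    """The metabolic profile of one pixel (ideally, one cell) at a given location."""
    return dict(cube[row][col])


print(ion_image(cube, "ATP"))  # [[1.0, 0.0, 2.5], [0.0, 0.0, 3.0]]
print(profile_at(cube, 1, 2))  # {'ATP': 3.0, 'glucose': 1.0}
```

Slicing the cube one way gives a picture per metabolite; slicing it the other way gives a metabolic profile per location, which at single-cell resolution means a profile per cell.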

This is the ultimate dataset of life, the most in-depth account of actual tissues that exists right now. In the project, we plan to study this ultimate dataset. In the next installment of this mini-series, we will see how.

Sergey Nikolenko
Chief Research Officer, Neuromation
