Starschema Blog
Published in

Starschema Blog

Answering the big questions (this time, in chemistry)

When I was a kid, I drove my parents and teachers absolutely insane by my need to know why things are the way they are. This was particularly tough on my poor chemistry teacher, constantly inundated with questions. Fine, potassium cyanate is toxic, but why? Also, why are nitriles normally not particularly pungent, but isonitriles smell like some eldritch horror a chemically talented Lovecraft would have dreamed up? And can somebody please explain why methyl compounds of metals (methylcadmium, methylmercury, dimethyl zinc, you name it) are almost without exception horrible on multiple levels (explosivity, toxicity, stench, general unpleasantness to handle, broken glassware produced per gram synthesised, and so on)?

Much to the chagrin of my very, very patient wife, I haven’t changed a lot in this regard. Which is why I continue to be very interested in why things are the way they are. When we build a predictive model, it’s all well and good that we can advise our client as to which machine is going to break down next or where to drill for oil or which area is at a higher risk for malaria and therefore worth focusing treatment efforts on. Increasingly, however, clients are starting to join me in my attitude of wanting to know why, and we are just as keen on satisfying their curiosity as I was at asking interminable questions in Dr. Fuleki’s 10th grade chem class.

The following example will be quite unconventional. We’ll be building a model that tries to decide whether a small molecule is likely to be toxic. But that’s not all — we’ll also be interested in why it is toxic. What are the groups that are most likely to make it toxic, and what are the ones that suggest it might be innocuous? Unusually, we’ll be doing machine learning on something you probably haven’t seen it used a lot — molecules. Specifically, we’ll be looking at small molecules, meaning anything under 900 daltons. That includes a lot of modern pharmaceuticals as well as toxins and other unsavoury things, while also excluding large macromolecules such as proteins. There’ll be some chemistry talk, but this isn’t a cheminformatics blog post, so even if all you remember from high school chem is the smell and the occasional explosions, stay with me.

Our source data for this is Tox21, the toxicology programme of the National Institutes of Health. Tox21 has created a unique data set, pairing thousands chemical structures with their toxicities — their methodology basically relies on exposing cell cultures that are relatively representative of the human body to putative toxins, and see what happens. We’ll be using DeepChem, a fantastic package for pharmaceutical chemistry and drug discovery (among others), and RDKit for feature generation.

Feature generation

Like all supervised machine learning projects, we need to start this by deciding upon the features to learn, and generating them.

There are many ways of representing a molecule, some more suitable for our purposes than others. For instance, the summary formula (which tells you how many atoms of each element are present in one molecule) is useful, but when it comes to biology, the devil’s in the details, or rather, the structure. Famously, cocaine and hyoscine (scopolamine) share the same summary formula, but while one is a notorious stimulant, the other is a depressant-deliriant.

The structure formula and the SMILES notation of acetylsalicylic acid, more commonly known as aspirin. Note how side chains are enclosed in parentheses. Double bonds are signified by the = (equality) symbol. Hydrogen atoms are not explicitly displayed as they can be inferred.

So we need something that reflects structure, too. There are multiple such systems, such as InChI, which generates inordinately long and chaotic-looking strings, and SMILES, which we are going to use because it is slightly easier to handle. SMILES, somewhat analogously to chemical naming, looks for the longest backbone, breaks any cycles and then describes the branches. A worked example of acetylsalicylic acid, more commonly known as aspirin, is displayed to the left.

What we need at this point is a way to generate features from molecular structures represented by the SMILES strings — after all, what we’re after are relationships between structural motifs, such as the acyl group (the end part hanging off to the ‘top left’ of the benzene ring, comprising the atoms annotated 1, 2 and 3). For this, we first need to cleave apart our molecules into characteristic parts. In computational chemistry, this is referred to as molecular fingerprinting, and there is a large number of different approaches to this problem. In this case, we will be using a technique called Morgan or circular fingerprinting. Given a radius (defined as number of bonds), the Morgan fingerprint characterizes the atoms themselves (by element), their neighbourhood, their charge, whether they’re in a ring and various other chemical features.

Some Morgan features extracted from the Aspirin molecule, reproduced for reference as the top leftmost molecule.

These features give us a way to describe molecules in a way that we can not only correlate with toxicity, but which also allows us to determine what patterns of a molecule are likely to contribute to its toxicity. How? That’s the magic of LIME.

When life gives you LIME…

LIME (Local Interpretable Model Agnostic Explanations) is a model-agnostic way to peek into the inner workings of a classifier or regressor, first described by Ribeiro et al. (2016). The overriding idea is pretty trivial: to understand the black box between inputs and outputs, tweak inputs and see what happens to the outputs. These are referred to as perturbations, and the idea is to correlate particular perturbations to particular results.

Consider working out a model optimizing your steak: meat quality (from bottom-shelf steak to deep freeze Argentinian prime beef), temperature, marinade composition and days marinated, grill vs skillet, and so on. Theoretically, of course, you could try every single permutation of these parameters, but in practice, you’ll simply have steak every night for a year, and record the parameters. Eventually, you’ll have enough data points to build a model.

Now, assume you feed data about a completely new configuration of meat quality, temperature etc. into your model, curious if that will yield an enjoyable piece of prime beef or some indigestible shoe leather. You get a result… but you’re still curious — why? What parameter made the biggest difference in the model’s decision?

What LIME does is best described as creating perturbations — it runs a lot of relatively similar steak-making configurations through the model, and attempts to isolate which of the factors have the greatest impact on the expected steak quality score. Even better, it can break them down by effect: sure, this beef is quality stuff (suggesting a good outcome), but it’s been overcooked and the grill really isn’t the best for it (suggesting a worse outcome). You can see in quite precise detail what factors pointed which way in assessing your steak outcome.

Now let’s see how that’ll work for molecules.

Building the toxicity model

As mentioned, we’ll be using the Tox21 data set, and use DeepChem to build a fully-connected network (FCN) based multitask classifier (for more about multitask classifiers, look here). Our data contains twelve ‘tasks’ or assays — each of these are individual toxicity pathways, key biomolecules through which a molecule can express toxicity. These include, for instance, the estrogen receptor (ER) alpha signaling pathway, and chemicals that act as agonists (activators) of the ER alpha signaling pathway might disrupt the endocrine system. Similarly, an assay for the antagonism of the p53 cellular tumour antigen might suggest that the compound tested is likely to inhibit the process by which p53 protects cells from DNA damage upon insult (through inducing DNA repair, halting the cell cycle or triggering cell death, known as apoptosis). The Tox21 data set is publicly available, and contains data for approximately 10,000 chemicals. We will be looking at a reduced sample of 7,084 compounds, which have data for all twelve ‘tasks’, split into a training set (n=6264) and a test set (n=784).

Using DeepChem’s fcnet.MultitaskClassifier (saving us manually building a multitask classifier), we’re going to learn every ‘task’ toxicity, i.e. every assay, of each molecule, simultaneously. After merely 20 iterations, we get a mean ROC AUC of 0.757, with the ROC AUC being highest for the MMP ‘task’ (mitochondrial membrane potential disruptors) at 0.862 and the lowest for the estrogen receptor task, at only 0.650. This is an acceptable result, especially given that the task-wise ROC AUC helps us determine how reliable a particular judgment on the toxicity of a compound is depending on which assay suggested the toxicity.

Epigallocatechin gallate (EGCG) is a catechin present in green tea — and, apparently, a mitochondrial membrane potential disruptor. But why? Let’s go find out.

Let us randomly pick a molecule our model believes is toxic. Because the MMP task was the most accurate, let’s pick one where the model suggests an MMP disruption mechanism for toxicity. I randomly picked this beauty called epigallocatechin gallate, which is abundantly present in tea and drug candidate for hypertension, but can also be toxic in large doses — indeed, the EFSA warned in 2018 that large doses may cause liver injury. That, almost definitely, is the result of the mitochondrial toxicity the MMP assay suggested. But why? Which parts of this gorgeous-looking molecule cause havoc in hepatocytes?

Squeezing the LIME

What LIME thinks about the features in epigallocatechin gallate. The fp_ numbers each identify a Morgan feature or ‘fingerprint’. Fingerprints 578, 352 and 456 are strongly associated with mitochondrial toxicity, while the absence of fingerprints 420 and 175 are relatively weakly associated with no toxicity. Note that because this is a multi-task classification, combination sets are being examined here. Thus, the results suggest that much of EGCG’s toxicity is due to the presence of fingerprints 578 and 352 in the absence of fingerprint 456.

Fortunately, LIME can tell us just fine (and, slightly creepily, what parts of it make it less likely to be toxic). Let’s look at the five most significant fingerprints — that is, Morgan features — that are the most strongly associated with either toxicity or its absence according to LIME. Recall that LIME causes local perturbations, i.e. small changes in the input to see how that affects the outcome, thereby granting us a glimpse into the model itself. The model suggests that fingerprints 578 and 352 occurring in the absence of fingerprint 456 strongly suggests toxicity (96%), while the absence of fingerprints 420 and 175 suggest non-toxicity (or, to phrase it in the reverse, the molecule would be even more likely to be toxic if it incorporated those fingerprints).

So far, we’ve been talking about fingerprints in the abstract. Let’s see what is actually meant here. We can reverse the Morgan features’ (or fingerprints’) hashes to reveal what these fingerprints look like.

Fingerprints associated with MMP toxicity: fp587 and fp352, respectively, are correlated with toxicity where they occur in the absence of fp456 — as they indeed do in the case of EGCG.

Of course, what is really interesting is not so much what makes one compound (probably) toxic, but rather what fingerprints all molecules that exhibit toxicity, say, through the estrogen receptor (ER) pathway, have in common. If we didn’t know what the estrogen receptor looked like (which we do), and thus did not know where its binding sites are, but still had to figure out what a ligand (a molecule that interacts with a protein) would have to look like, we could look at what fingerprints the molecules that are proven agonists of the ER pathway have in common. This would then yield a useful first approximation of motifs that might be correlated to ER pathway activation.


The advent of quantitative high-throughput screening (qHTS), where thousands of molecules can be tested for toxicity or agonism on human cell lines with ease, allowing us to answer the big questions: why — owing to what structural motif — is one chemical toxic, but another isn’t? And beyond toxicity screening, peeking inside such models might help us understand more about why molecules do what they do in a biological context, opening up new vistas from drug discovery to creating better tasting synthetic flavourants. The possibilities are as endless as the human imagination — which is quintessentially about asking why?, all the time.

Even if it annoys teachers a lot.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Chris von Csefalvay

Chris von Csefalvay

VP of Special Projects at Starschema, clinical computational epidemiologist, rower, Golden Retriever dad, Fellow of the Royal Society for Public Health.