Finding drugs for Bile Duct Cancer using machine learning

5 min read · Jul 16, 2021

Cancer is weird. It’s as if a small part of you has gone rogue. Unfortunately, I have a sudden interest in trying to help find a cure, specifically for bile duct cancer. My background is in building AI apps to detect bank fraud, so diving into the oncology / pharmacology space is a bit foreign.

For machine learning algorithms to work, there needs to be a target variable: something to predict. In bank fraud, that’s whether or not a transaction is fraudulent. It’s pretty straightforward: identify good and bad transactions, add in some features, and the algorithms take care of the rest. With cancer there’s not exactly a “cure” target variable to find, and there’s a layer of biology on top that likes to complicate things. The closest thing to a target variable is the IC50 score: the concentration of a compound it takes to kill (or stop the growth of) 50% of the cancer cells in a Petri dish. The lower the IC50, the more potent the compound.
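
To make the target variable concrete, here is a minimal sketch of how an IC50 could be read off a set of dose-response measurements by interpolating on a log-concentration scale. The function name and the numbers are made up for illustration; real screening pipelines typically fit a sigmoid curve to the data instead.

```python
import math

def estimate_ic50(concentrations, viabilities):
    """Interpolate the concentration at which viability crosses 50%.

    concentrations: increasing doses (e.g. in micromolar), all > 0
    viabilities: fraction of cells still alive at each dose (1.0 = untreated)
    Returns None if viability never crosses 50% in the tested range.
    """
    points = list(zip(concentrations, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            # Linear interpolation in log10(concentration) space
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None

# Hypothetical measurements, not real data
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
alive = [0.98, 0.90, 0.70, 0.30, 0.05]
ic50 = estimate_ic50(doses, alive)
```

In this made-up example the 50% crossing falls between the 1.0 and 10.0 doses, so the estimate lands between them.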

Large pharmaceutical companies use high-throughput virtual screening with machine learning to come up with viable drug candidates to synthesize and test. These drug compounds are trade secrets, so unfortunately there’s not much opportunity to help out other than to sit on the sidelines and watch the pipelines (https://www.modernatx.com/pipeline, https://www.astrazeneca.com/our-therapy-areas/pipeline.html, https://www.pfizer.com/science/oncology-cancer/pipeline, etc.). The only way I have found so far to contribute is through PaccMann.

PaccMann is an open source pipeline for finding potential cancer drug candidates, created by an IBM team based in Switzerland: https://www.zurich.ibm.com/paccmann/

I would strongly recommend reading their papers: two on finding cancer compounds and one on COVID compounds.

https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.9b00520
https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6
https://iopscience.iop.org/article/10.1088/2632-2153/abe808/meta

TLDR:

The PaccMann team first built a deep learning model to predict a chemical compound’s IC50 score from the genetic data of the tumor and the SMILES string of the compound. SMILES is a way to write a chemical structure as a string of text that a computer can process more easily. For example, glucose (C6H12O6) is OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1 in SMILES.
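
To illustrate how a SMILES string can be fed to a model, here is a rough sketch that splits a string into chemically meaningful tokens and maps them to integer ids. The regular expression is a simplified pattern I am assuming for illustration; it is not PaccMann’s tokenizer and does not cover the full SMILES grammar.

```python
import re

# Simplified token pattern (an assumption for illustration only)
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms such as [C@@H]
    r"|Br|Cl"                 # two-letter atoms
    r"|[BCNOPSFI]"            # one-letter atoms
    r"|[bcnops]"              # aromatic atoms
    r"|%\d{2}|\d"             # ring-closure labels
    r"|[()=#\-+/\\.:~*$@]"    # branches, bonds, and other symbols
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary as new tokens appear."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

glucose = "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1"
tokens = tokenize_smiles(glucose)
vocab = {}
ids = encode(tokens, vocab)
```

Keeping bracket atoms like [C@@H] as single tokens preserves the stereochemistry markers instead of scattering them across individual characters.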

The data used by the PaccMann team came from a database called GDSC (Genomics of Drug Sensitivity in Cancer), which contains cell lines of different cancers and the IC50 scores of different chemical compounds against them.

Now, the IC50 score is only part of the problem. Some compounds, such as bleach, have an incredibly good IC50 score but kill more than just the cancer: they kill everything (but only 99.9% of germs).

To find the right compound, the PaccMann team expanded beyond the IC50 predictor model and implemented a reinforcement learning pipeline with several stages and models.

Stage one is a molecular generator of SMILES strings trained on a database called ChEMBL, which contains a large collection of known bioactive, drug-like compounds. This steers the generator toward feasible compounds that can be created more easily.

Stage two is a toxicology model that predicts whether or not the compound is dangerous to humans. I wasn’t actually able to get the toxicology model to work, so unfortunately my results don’t include any toxicity data.

Stage three then predicts the compound’s IC50 score.

An agent watches the entire process and adjusts parameters based on the resulting IC50 score. This is the reinforcement learning aspect of the machine learning model.
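
The generate, screen, score, and adjust loop described above can be sketched as a toy example. Everything here is a hypothetical stand-in: the three "models" are simple made-up rules, the function names are mine, and the update is a crude policy-gradient-style nudge rather than PaccMann’s actual training procedure.

```python
import math
import random

def generate_candidate(rng, bias):
    """Hypothetical stand-in for the SMILES generator (stage one)."""
    # Toy "molecule": a carbon chain whose length depends on a learnable bias
    length = max(1, round(rng.gauss(bias, 1.0)))
    return "C" * length

def predict_toxic(smiles):
    """Hypothetical stand-in for the toxicology model (stage two)."""
    return len(smiles) > 12  # arbitrary toy rule

def predict_ic50(smiles):
    """Hypothetical stand-in for the IC50 predictor (stage three)."""
    return 100.0 / len(smiles)  # toy rule: longer chain, lower IC50

def reward(smiles):
    """Lower predicted IC50 gives a higher reward; toxic candidates get none."""
    if predict_toxic(smiles):
        return 0.0
    return math.exp(-predict_ic50(smiles) / 10.0)

def train(steps=200, seed=0, learning_rate=0.5):
    """Nudge the generator toward candidates that beat the running average reward."""
    rng = random.Random(seed)
    bias, baseline = 3.0, 0.0
    for _ in range(steps):
        candidate = generate_candidate(rng, bias)
        r = reward(candidate)
        bias += learning_rate * (r - baseline) * (len(candidate) - bias)
        baseline = 0.9 * baseline + 0.1 * r
    return bias

final_bias = train()
```

The point of the sketch is the shape of the loop: the reward only depends on the downstream predictors, so the generator can learn to exploit whatever those predictors happen to favor.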

Unfortunately, because bile duct cancer is rare, there were only 6 cell lines of data to run the pipeline against, compared to the few hundred cell lines for cancers like lung or breast.

Results:

Full CSV of drug compounds here: https://s3.amazonaws.com/intellivendo.com/results/generated.csv

The compound CC(C)C4=CC3C2C(C)CC1(O)CC(O)CCC1(C)C2CC=C3C4, mission accomplished?

If you look at the training loss, there’s a massive drop-off at 45 training steps. You can see this in the generated compounds CSV file: past that point the generator appears to almost just add additional carbons for little to no reward. I suspect this isn’t a local minimum in the error reduction but a global minimum caused by a lack of data.

I’m not a biochemist, so I can’t really interpret the molecules themselves in much detail, but the generator does seem to focus on the sequence “C(=O)NC”, which appears in 191 of the 259 results. My guess is that this may inhibit some essential protein somewhere. Another interesting sequence is “CC1CCCCC1C(=O)”, which is methylcyclohexanecarbaldehyde (C8H14O) and appears 32 times.
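
Counts like these can be reproduced with a plain substring search over the generated SMILES strings. The sketch below uses a small made-up sample rather than the actual results CSV, and the function name is my own. Note that text matching is only an approximation: the same substructure can be written in several different ways in SMILES, so a cheminformatics toolkit would be needed for true substructure matching.

```python
def count_motif(smiles_list, motif):
    """Count how many SMILES strings contain `motif` as a literal substring."""
    return sum(1 for smiles in smiles_list if motif in smiles)

# Hypothetical sample, not taken from the results CSV
sample = [
    "CC(=O)NC1CCCCC1",
    "CC1CCCCC1C(=O)NC",
    "CCCCCC",
]
amide_hits = count_motif(sample, "C(=O)NC")
ring_hits = count_motif(sample, "CC1CCCCC1C(=O)")
```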

I spoke with Jannis of the IBM team and he recommended a few things that could be done to improve the results. The first is to add more bile duct cancer data from GDSC, which is updated over time. The second is to fine-tune the molecular generator on compounds that are known to be effective, meaning compounds that already have a strong IC50 score.
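
As a rough sketch of the second suggestion, a fine-tuning set could be assembled by keeping only compounds whose measured IC50 falls under some cutoff. The record layout, the 1.0 micromolar threshold, and the IC50 values below are placeholders I made up for illustration; they are not GDSC’s schema or real measurements.

```python
def select_potent(records, max_ic50_um=1.0):
    """Keep compounds at or below the IC50 cutoff, most potent first."""
    potent = [r for r in records if r["ic50_um"] <= max_ic50_um]
    return sorted(potent, key=lambda r: r["ic50_um"])

# Placeholder records: the IC50 values are invented, not real measurements
records = [
    {"smiles": "CCO", "ic50_um": 250.0},
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "ic50_um": 0.8},
    {"smiles": "c1ccccc1", "ic50_um": 0.05},
]
fine_tune_set = [r["smiles"] for r in select_potent(records)]
```

The resulting list of SMILES strings would then be used as extra training data for the generator.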

The future

The next step is to actually synthesize and test some of these compounds and retrain the predictor based on the outputs. IBM has a tool called IBM RXN for Chemistry that uses AI to print out the steps to create a compound. There is another open drug discovery project, Open Source Drug Discovery (http://www.osdd.net/), that has built a full pipeline for malaria and Zika. Essentially, the more data we have, the better the results the pipeline can output, creating a feedback loop.

AI will have a massive impact on oncology. With genetic testing of tumors being commonplace, we will soon have pipelines that can predict which existing drugs will have the most impact.

Does anyone know anyone who runs a lab so that we can start testing IC50 scores and get more data for the model?

Written by Jeffrey Morris