Rethinking Drug Discovery With AI & Big Data — Using AlphaFold predictions to structurally align cancer driver proteins

6 min read · Jul 9, 2023

Each time I open LinkedIn, I’m greeted by the same headlines, all saying something like “AI technology is advancing what’s possible in [insert industry]…” The tagline is increasingly common these days, especially in business operations and content creation, but it feels like the breakthroughs of ChatGPT have left many disciplines, including my own, largely untouched.

The state of ML and AI in biotech R&D can’t be accurately assessed by looking at LLMs like GPT-4; we computational biologists face unique challenges in data consistency, model interpretability, and experimental validation. Protein structure prediction is one task where models historically underperformed, until AlphaFold reignited discussion about the benefits and use cases of such models.

AlphaFold is DeepMind’s AI model that predicts a protein’s three-dimensional conformation from its amino acid sequence. It predicts folds in the peptide backbone and interactions between amino acid residues by searching a database of protein sequences, building a multiple sequence alignment (MSA), and comparing known conformations and their underlying biochemistries. The global distance test (GDT) compares a model’s predicted structures to experimentally determined structures, atom by atom, to see how closely the predictions match experimental data. At the CASP14 assessment, AlphaFold 2 scored a median GDT of 92.4 out of 100. AlphaFold lets us perform big data analyses with proteome-wide coverage that weren’t possible before because of missing or inconsistent experimental data.

Cancer is driven by genetic mutations, and since genes dictate the structure of a protein, these mutations can change the conformation of the protein they encode. The common approach to drug development is to find binding pockets on these cancer driver proteins and design chemical inhibitors specific to that one protein.

Drug development is costly. Wouters and colleagues note that recent estimates of the cost of bringing a drug to market reach as high as $2.8 billion. But if we already have a chemical inhibitor (i.e., a “drug”) that targets one protein, can we find other similarly structured proteins that might be candidates for the same inhibitor?

Conventional approaches use BLAST sequence alignment to find proteins with similar amino acid sequences for new target discovery. This is generally a good approach: since high sequence similarity is indicative of high structural similarity, comparing peptide sequences is a good proxy for comparing peptide conformations. However, dissimilar protein sequences can also fold into similar structures, a concept known as remote homology. Bioinformatics devotes a lot of attention to detecting remote homology through sequence alignment, but what if we sidestepped the problem by structurally aligning proteins instead? This would let us efficiently steer drug discovery and development toward clusters of similarly structured cancer driver proteins and impact a larger patient population.

Using AlphaFold-predicted structures, I set out to perform this structural clustering analysis on cancer driver proteins and develop a visualization that overlays biologically relevant data on these protein families.

As I mentioned, my protein data came from AlphaFold-predicted structures. To ensure only high-confidence backbone predictions were used, I subset the human proteome (~23,000 PDB structures) with a pLDDT > 70 threshold, which left me with around 17,000 proteins. I cross-referenced this high-confidence set with OncoKB, a shortlist of 1074 oncogenic and tumor suppressor genes, yielding a set of 616 cancer driver proteins with high-confidence predictions from AlphaFold.
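A minimal sketch of what such a confidence filter could look like. AlphaFold's PDB files store the per-residue pLDDT in the B-factor column; everything else here is an assumption for illustration: the function names are hypothetical, and averaging pLDDT over C-alpha atoms is only one possible way to apply the threshold to a whole protein.

```python
from statistics import mean


def mean_plddt(pdb_text: str) -> float:
    """Average pLDDT over C-alpha atoms of one AlphaFold PDB file.

    Assumes fixed-width PDB ATOM records, where columns 13-16 hold the
    atom name and columns 61-66 hold the B-factor (pLDDT for AlphaFold).
    """
    scores = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return mean(scores)


def high_confidence(pdb_texts: dict, threshold: float = 70.0) -> list:
    """Return the names of proteins whose mean pLDDT exceeds the threshold."""
    return sorted(
        name for name, text in pdb_texts.items() if mean_plddt(text) > threshold
    )
```

Intersecting the resulting names with a cancer gene list would then be a plain set operation.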

With the data sufficiently cleaned, the next step was performing pairwise comparisons of every protein in the set (comparing the structure of each protein against all 615 others, plus itself). To assess structural similarity, I needed an algorithm that maps a 3D protein structure to a 2D distance matrix and aligns distance matrices against each other to calculate a weighted sum of intramolecular distances between equivalent pairs of atoms in amino acid residues.
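To make the "3D structure to 2D distance matrix" idea concrete, here is a small sketch, not DALI's implementation. It assumes one representative coordinate per residue (for example, the C-alpha atom) and the function name is hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates.

    coords is a sequence of (x, y, z) tuples, one per residue. Entry [i][j]
    is the distance between residues i and j, so the matrix is symmetric
    with zeros on the diagonal.
    """
    return [[math.dist(a, b) for b in coords] for a in coords]
```

Because the matrix depends only on distances within one molecule, it does not change when the protein is rotated or translated, which is what makes it convenient for comparing structures.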

Lucky for me, DALI’s distance matrix alignment algorithm does exactly this, so I didn’t have to reinvent the wheel. DALI calculates an RMSD and a Z-score (unrelated to Z-scores from hypothesis testing). RMSD, or root-mean-square deviation, measures the average distance between corresponding atoms in the aligned protein backbones. The Z-score between two proteins is computed after aligning the backbones, comparing overlapping residues, and summing the differences in intramolecular distances between corresponding residues of the two proteins. The Z-score is more robust than RMSD when comparing proteins of varying size and acts as a better indicator of structural similarity. It’s worth noting there is some discussion of whether a similarity or a dissimilarity score is the better metric of protein alignment, but for this project I stuck with the Z-score as my primary metric for structural similarity.
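As a rough illustration of the RMSD idea only (DALI's own scoring is more involved), the sketch below assumes the two backbones are already superimposed and that residue i in one list corresponds to residue i in the other. The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length coordinate lists.

    Assumes the structures are already superimposed and paired residue by
    residue; no alignment or rotation is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

Because a few badly matched residues can inflate this value, and because it only covers the aligned residues, RMSD alone says little about how much of two differently sized proteins actually matches.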

I wrote a few Python scripts to generate a list of 379,456 (616 x 616) pairwise comparisons for DALI, partition the list so the pairwise alignment could run in parallel across ten nodes (AWS EC2), mine the relevant data from each generated output directory, and build a 616 x 616 matrix with proteins across the rows and columns and the corresponding Z-score in each cell. Also worth noting is that DALI is significantly slower than sequence alignment alternatives. DALI took 10 hours to perform 38,000 structural similarity comparisons, while BLAST took under an hour for sequence alignment on the same dataset.
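The pair generation and partitioning steps could be sketched roughly as follows. The function names and the even-split strategy are assumptions for illustration, not the original scripts.

```python
from itertools import product


def all_pairs(protein_ids):
    """Every ordered pair of proteins, including each protein with itself."""
    return list(product(protein_ids, repeat=2))


def partition(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# 616 proteins give 616 * 616 = 379,456 pairs, or roughly 38,000 per node
# when split ten ways.
pairs = all_pairs([f"protein_{i}" for i in range(616)])
jobs = partition(pairs, 10)
```

Each chunk can then be written to its own file and handed to a separate worker, since the comparisons are independent of one another.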

I performed Ward clustering (agglomerative hierarchical clustering) using sklearn to group proteins by Z-score, wrote a (painstaking) script to generate a tree of nested tuples from the output 2D array, and visualized this with Phylocanvas in JavaScript. The outputs were fascinating.
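One way the "2D array to nested tuples" step could work is sketched below. It assumes the array follows the convention of the `children_` attribute of sklearn's `AgglomerativeClustering`: row i merges two nodes, indices below the number of samples are leaves, and index `n_samples + i` refers to the node created by row i. The function names are hypothetical, and the Newick export is an added assumption about what a tree viewer would consume.

```python
def children_to_tree(children, labels):
    """Turn an (n_samples - 1) x 2 merge array into nested tuples of labels."""
    n = len(labels)
    nodes = {i: label for i, label in enumerate(labels)}
    for step, (left, right) in enumerate(children):
        # Row `step` creates internal node number n + step.
        nodes[n + step] = (nodes[int(left)], nodes[int(right)])
    return nodes[n + len(children) - 1]


def to_newick(tree):
    """Serialize nested tuples as a Newick string (topology only, no branch lengths)."""
    def render(node):
        if isinstance(node, tuple):
            return "(" + ",".join(render(child) for child in node) + ")"
        return str(node)
    return render(tree) + ";"
```

For example, with labels `["KRAS", "NRAS", "TP53"]` and merges `[(0, 1), (2, 3)]`, the first row joins KRAS and NRAS into node 3, and the second row joins TP53 with that node.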

A correlation analysis between BLAST (sequence alignment) percent identity and DALI (structure alignment) Z-score on this cancer driver dataset revealed a moderately high coefficient of determination (R²). It was strong enough to validate DALI’s reliability (assuming BLAST isn’t completely wrong in its assumptions), and while probing further, I discovered a unique feature of structural alignment.
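For reference, a coefficient of determination for paired scores can be computed as below. This sketch assumes a simple straight-line relationship, in which case R² equals the squared Pearson correlation; whether the original analysis used exactly this fit is an assumption, and the function name is hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return sxy ** 2 / (sxx * syy)
```

Here `xs` would hold the percent identity for each protein pair and `ys` the Z-score for the same pair, in the same order.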

Structural alignment revealed remote homologies where sequence alignment couldn’t. In numerous oncogenic families, I could clearly identify peptides with dissimilar amino acid sequences displaying similar protein structure.

Clusters form around proteins whose alignments with each other have high Z-scores. Within superfamily clusters, the BLAST and DALI alignments contain many of the same protein members. However, there were many discrepancies in which specific companion protein a given cancer driver protein clustered (aligned) most closely with. These small differences in orientation and membership within families were due to remote homologies that BLAST missed but DALI detected. Without going into the undisclosed specifics, the analysis produced many insights about cancer driver proteins of interest to my company, as well as a few new priority targets given the high Z-scores of structurally similar proteins.

An example of a polar dendrogram! Made by Sam Roberts at MathWorks. For confidentiality reasons, I cannot share the interactive dendrograms I made, though they were objectively prettier :)

The output polar dendrogram looked like this, but a little cleaner. I generated one for structural alignment with DALI and one for sequence alignment with BLAST to visualize alignment clusters. Phylocanvas offers some awesome customization; you can set the shape and color of each node individually, so I created a colorimetric scale from blue to red to encode additional data about a protein’s mutational frequency in patient populations. Using the Genie dataset, I computed the proportion of the patient population that tested positive for mutations in a given protein across different cancer histotypes (e.g., CRC, NSCLC, pancreatic cancer).
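A blue-to-red scale like the one described can be sketched as a simple linear interpolation. The linear mapping, the hex output format, and the function names are all assumptions for illustration; the original scale may have been built differently.

```python
def mutation_frequency(mutated_patients: int, total_patients: int) -> float:
    """Fraction of patients with a mutation in a given protein."""
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    return mutated_patients / total_patients


def blue_to_red(fraction: float) -> str:
    """Map a fraction in [0, 1] to a hex color, from blue (0) to red (1).

    Values outside the range are clamped to the nearest end of the scale.
    """
    clamped = min(max(fraction, 0.0), 1.0)
    red = round(255 * clamped)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"
```

Each node's color string can then be attached to its protein label in whatever styling structure the tree viewer expects.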

Conclusion

All in all, this project was a success. It resulted in a dynamic data tool to support decisions about prioritizing the next generation of targets that reuse RevMed’s unique tricomplex platform, so we don’t have to “reinvent the wheel” chasing new cancer mutations. It also provides motivation to develop novel inhibitors for clusters of proteins that are highly similar and highly mutated in the patient population (whose families shall be unnamed in this write-up). Being able to overlay biological data of interest with node shape, size, or color is a fantastic feature, and can be used in the future to assess tissue distribution or CRISPR screen data. My analysis validated the reliability of DALI on cancer driver datasets like OncoKB and justified the benefit of AlphaFold: 30% of the predicted structures used in this project had no experimental crystallography data, meaning those proteins would have been lost from this structural clustering analysis without AlphaFold. I learned how to multithread data processing pipelines, became fluent at scripting in annoying vim editors, and, in a project requiring over 20 scripts, practiced developing software pipelines for reusability and dynamism. Huge thank you to RevMed for letting me run point on such a fascinating project!

Written by Vivek Kanpa