How does drug candidate screening work?

A primer for the data scientist.

Published in

Towards Data Science

9 min readJul 5, 2020

**A compound (‘ligand’) docking into an enzyme protein (left: ‘cartoon’ view, right: ‘surface’ view).** Source: EMSKE Phytochem, Software: PyMol

When it comes to drug trials, it’s necessary to shortlist high-potential candidates before incurring the extraordinary expenses of in-laboratory (in vitro) and clinical (in vivo) trials. Most pharmaceutical companies do “in silico” screenings in advance of real-world trials: They want the best possible chance of filtering through a ‘hit’ with as few ‘miss’ costs as possible.

In background research for our most recent pieces on antivirals efficacy that fundamentally rely on data science and drug-docking, I realized that there aren’t a lot of drug docking “how-to’s” for the data scientist audience. This apparent gap in the data science arsenal made me feel kind of sad because data scientists could be among the most prolific users of these tools to great effect. It would be perfectly OK if there was a generation of ‘biochemical data scientists’ wielding biochemical software tools towards solving some of humanity’s greatest health challenges.

This would be possible if data scientists could overcome any fear or unfamiliarity of dipping into the open-source tools of the trade. Definitely embrace your fellow biochemist/molecular biologist/virologist, but don’t fall into the trap of believing you couldn’t possibly have unique scope on their domains. Quite to the contrary, practitioners of these sciences can benefit enormously from your support. From experience, the smartest among them even welcome you with open arms.

Then to drop an early pebble into that pond, here is our primer on drug-docking for the data science community, as illustrated by our approach in applying these tools toward one application in particular — virology.

Background

At EMSKE we developed an automated screening process to assay candidate plant compounds in silico on the main protease (‘Mpro’)of the SARS-CoV-2 coronavirus. (This protease is not to be confused with the virus’s 2nd protease, called PLpro).

A good antivirals screen unlocks the capacity to do early, high-volume “thumbs-up/thumbs-down” indications for targeted viral enzymes. Why target the main protease? So long as a safe drug candidate (that’s demonstrated in vitro to have inhibitory efficacy on this critical enzyme) can make its way through the human metabolism and cross cell membranes, then it’s a one-shot-kill for the virus. Not the only viral protein worth targeting, to be sure, but a better choice than, say, a spike inhibitor where you need to tackle a plurality of the spikes to have any efficacy against its entering cells.

In silico screening fundamentals

Our screen rests atop a drug docking tool called Autodock Vina. Autodock and Autodock Vina are open-source tools put out by Scripps Research Institute. Rooted in machine-learning, it uses gradient-descent to find conforming, polarity-attractive “poses” of candidate compounds to conform onto any of a protein’s receptor sites that are under study. The Autodock family of docking software has been extensively applied for screening by life sciences researchers across the body of published biochemistry literature for over a decade. Vina is a cost-function and scoring refinement of Autodock which improves the hit-miss accuracy and significantly improves computation speed compared with the original Autodock. Here is the general workflow of autodocking tools:

Flow chart for Autodock (Source: MTIOpenScreen, (CC BY 4.0) modified for generalized autodocking)

Download Autodock Vina
Download the (larger) pdb files from the Protein Database.
Download the compounds (called ‘ligands’) in 3D SDF format from an online database, typically PubChem or ZINC.
For Autodock Vina to work with them, you need to convert the compounds and ligands into .pdbqt format. Converting is accomplished through a tools package from the MGLTools download page from the Scripps website (if you’re a Mac user, you’ll want to make sure to run that on an un-upgraded 32-bit Mac OS version).
Use MGLTools’ prepare_receptor4.py to convert the protein pdb file to .pdbqt; We prefer using the combination of obabel (obabel is its own installation effort, unfortunately, but ibabel is a pleasant front-end to it for Mac users) and prepare_ligand4.py to convert the .sdf file into .pdb and subsequently to .pdbqt respectively.

Now you’re almost ready to do a docking run! The remaining item to sort out is the search box. Vina does its gradient-descent search by iterating across the many-dimensional space of 3D location, orientation, and ligand rotational bond conformations. It searches within a three-dimensional volume search box that the user specifies in units of Angstroms. The user also specifies a center for the search. Together, these parameters allow the selection of either a ‘blind’ search, which means a search volume encompassing the entire protein (we use 150 Angstroms³). For targeted docking to active sites, a tighter search volume (we use 30 Angstroms³) narrows the search and speeds up computation.

Using a 3D visualization program like PyMol, it is easy to determine the coordinates for blind docking (typically the estimated geometric center of the protein). For a targeted site, just open up the human-readable pdbqt file itself and find the target site’s amino acids of interest. The active site’s amino acids are ascertained from published literature for the protein of interest. For SARS-CoV’s Mpro, we usually look at the active site defined by A-histidine-41 & A-cysteine-145. (‘A’ refers to one of the two chains of amino acid that combine to form the totality of this enzyme protein).

Under gradient-descent with a cost function defined by the ligand-protein complex’ energy level, Vina iterates the ligand’s position, rotation, and structural pose vis-a-vis the protein’s surface. Rather than directly taking into account the attraction of opposite charges on the ligand molecule and the protein receptor surface, it looks pairwise at the atomic bonds between the ligand and receptor that would be created in a given pose. Vina generates results along a -4 to -12 scale (formally: in units of kilocalories per mole). In other words, docking software like Autodock Vina thinks in terms of minimizing the “energy level” of a set of chemical bonds determined by a compound’s pose against a protein receptor site.

Our application of Vina essentially maintains this scale but applies some basic logic and averaging over an appropriate number of trials over a combination of protein receptor sites. That way our results (where the more negative number is more likely to be more inhibitory) may better reflect a compound’s propensity to defeat the virus’ ability to replicate itself, as illustrated next.

How do we effectively apply this tool to the coronavirus’ main protease?

When the earlier SARS virus first struck in 2003, there was sufficient time and funding in follow-on years for researchers to do in vitro inhibitory studies of the virus’ main protease Mpro (often referred to as 3CLpro in the older SARS literature). Researchers were studying to determine which compounds would inhibit this proteases’ growth in a petri dish (referred to as in vitro).

Why target a protease? Because it’s a one-shot-kill for viruses. Proteases’ efficacy as a target was well-established especially thanks to the HIV protease inhibitors that came out in the 1990s.

Critically, the SARS 2003 coronavirus is very, very similar to the current 2019 coronavirus. The two virus’ main proteases specifically are 95% identical in amino acid sequence, and differ primarily in their three-dimensional folding structure.

Treating in vitro results as our “source of truth”, we compared the results of our Autodock studies on the 2003 SARS Mpro with researchers’ in vitro results compound-for-compound over the 16 years since that epidemic. However it is difficult to know which target site in real life the in vitro researchers managed to bind their inhibitory compounds to. And just as a real compound can bind to different sites with the potential to inhibit, so too can we target autodocking at different sites.

Therefore we must explore different combinations of the autodock targeting results. By comparing combinatorially results from blind, A, and B docking studies (first, in isolation and then second, in different combinations with each other) and compare against real-world in vitro data for the same compounds, we arrive at the following resulting combination that features the highest correlation with real-world data:

Correlation of processed Autodock results with in vitro study

We achieve an R² of 0.362, a value that is not atypical for biochemical virology investigation.

Time to form the classifier

Correlation in hand, we can now develop a classifier for separating out the inhibitory ‘hit’ ligands from the worthless ‘miss’ ligands.

In vitro researchers treat any compound that inhibits with a concentration of 100 uM (micromolar) or less to be the low-side cutoff for an in vitro ‘hit’. That corresponds to a log10 value of 2.0. Therefore we trained our classifier by comparing Autodock Vina results of 27 compounds as simulated against multiple active sites of the similar, original SARS (2003) coronavirus Mpro with those actual, in vitro lab inhibitory results of the same compounds. Our best-fit-to-reality formulation of Vina results across several active sites yields the following hit-miss chart:

The 27-compound training set for our ML classifier of Autodock Vina results applied to SARS-2003 compounds

Benefiting from this training set, we assayed our formulation of Vina results against an additional 8-compound test set. The classifier demonstrated that anything with a -7.7 kcal/mol score or above should be classified as a ‘hit’.

Results of our ML classifier’s test set of 8 additional compounds, also based on SARS-2003.

Based on the test set, our classifier demonstrates a classification accuracy of 87.5%.

Parting thoughts

Please consider yourself encouraged to tool up and try your own docking studies at home. There are many viral enzymes (and normal biochemical process proteins) of interest out in the wild that one can try to dock to. As demonstrated above, just make sure to keep your simulations grounded by comparing results against published real-world in vitro findings. You want to keep the false-positives and true-negatives to a minimum, but humility requires accepting there will be some of each of these even so.

And as alluded to in the beginning of this post, even once you have in vitro hits, there’s still the entire pharmacokinetics modeling process that needs to be applied for any substance that’s ingested (this is called ADME modeling), and cellular membrane entry modeling in case your drug candidate needs to enter inside of a cell. There’s even drug-interaction modeling to avoid the unintended consequences of a drug in the human body. These tools are at an earlier level of maturity than drug docking, but perhaps biochemically-equipped data scientists can contribute to enabling those efforts to reach maturity faster.

But overall the computer is a perfectly safe environment to run simulated biochemical assays. (It’s when it comes time to draw conclusions and start the lab tests that the concern begins). But till you reach that point, then, by all means, D̶o̶n̶’̶t̶ Try this at home.

As the global community of drug-dockers are fond of saying — — “Happy docking!” :)

The author would like to recognize 1.the originators and contributors to the open-source Autodock, Vina, MGLTools 2. the MTiOS hosted docking initiative; and 3. Yu Wai Chen, PhD, Protein Crystallography, Cambridge University for his efforts spanning decades to build awareness of computational drug-docking tools.

Update 3 September: This description earlier stated that Vina takes into account the polarity of atomic pairs between the receptor and ligand. In fact that is not the case; rather Vina applies a score based on the type of bond created between two atoms, but without performing a physics-based polarity model between the atoms. This is intentional, as it matches up better with real-world protein-ligand binding affinities than physics-based models in place prior to Vina’s development.