DATA STORIES | DRUG DESIGN | KNIME ANALYTICS PLATFORM

Molecular Filtering for Drug Design

Enrich the chemical space and reduce the time for post-filtering methods with KNIME

Marko Jukic

Published in

Low Code for Data Science

9 min read · Jun 15, 2022

Co-authors: Sebastjan Kralj and Urban Bren.

This article is an adapted version of the scientific paper Comparative Analyses of Medicinal Chemistry and Cheminformatics Filters with Accessible Implementation in Konstanz Information Miner (KNIME) (2022), published in the International Journal of Molecular Sciences.

All drug design efforts begin by defining a biological target (enzyme, receptor, protein, etc.). The early phase of drug discovery focuses on identifying leads, that is, compounds that exhibit pharmacological activity against this specific target. Virtual compound libraries used in early drug design span from 10⁷ to 10¹⁸ molecules and store compounds either as strings (SMILES) or in spatial formats such as SDF (structure-data file). High-throughput virtual screening (HTVS) is employed to sift through virtual libraries and identify potential lead compounds.

The large chemical space associated with virtual compound libraries is a double-edged sword. On the one hand, the probability of finding potential leads is greater when screening larger libraries. On the other hand, screening entire libraries, even with the aid of computational methods, may not be economically viable or even feasible in a timely manner. Once the biological target has been identified and its binding site defined in the early steps of the drug discovery process, the chemical space relevant for further lead design becomes more specific. Medicinal chemists can use this fact to focus libraries on specific chemical spaces. So what is the Occam’s razor for speeding up the early drug design process and focusing the chemical space on drug-like compounds?

Figure 1: Visual representation of a ligand compound (orange) bound to its binding site on the protein (yellow).

Molecular filters

Molecular filters increase the low (1%) hit rates of drug development campaigns and are the simplest and most elegant way to do so [1]. The logic behind them is to eliminate molecules with a low probability of becoming leads [2]. Filtering removes both unwanted chemical structures and unwanted chemical properties [3].

The application of molecular filters in drug design was pioneered by Chris Lipinski and coworkers, who compared early HTS and combinatorial chemistry drug hits at Pfizer with a subset of 2245 known drugs from the World Drug Index [4]. The aim was to understand the common molecular features (e.g., molecular weight) of orally available drugs. They came to several conclusions on the factors affecting poor drug absorption and devised the term Rule-of-5 to describe molecular properties for drug-likeness [5]. Although the term drug-likeness is used in different ways by different authors, it generally refers to molecules that have properties or contain functional groups consistent with the majority of known drugs [6–8]. Typical drug-like compounds exhibit desirable properties such as oral bioavailability, low toxicity, membrane permeability, and reasonable clearance rates [9]. Filters that adopt the same knowledge-based approach in their design but expand beyond the scope of classic drug-like filtering exist as well. The Ro4 (rule-of-4), designed to focus libraries on protein-protein interaction inhibitors, uses descriptor cut-offs opposite to what is traditionally defined as drug-like and attests to the universal nature of molecular filters [10].

Using compound filters to design drugs faster

In this article, we show how we integrated medicinal chemistry and cheminformatics filters into KNIME Analytics Platform and explain several key steps that users must be aware of when applying the filters in their early drug design efforts. The details of the work can be found in our peer-reviewed article [14].

1. What type of input should I use?

To successfully apply filters in HTVS, the selected compound library must use a supported data format, for example, the string representation SMILES (simplified molecular-input line-entry system) or 3D representations such as SDF (structure-data file format) or MOL (MDL Molfile) [11]. As the majority of the compound libraries online are available in the SDF format, the workflows are designed with the SDF Reader node as the first step. After reading in the SDF, the RDKit Canon SMILES node is used throughout the workflows to standardize the input format before filtering. Besides SDF, the SMILES format can be used as well. We recommend the Excel Reader node for SMILES in Excel sheets and the CSV Reader node for SMILES in .txt format, both in conjunction with the Molecule Type Cast node (Figure 2). Inside metanodes, the input data is converted using the RDKit From Molecule node to ensure data consistency.
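The two input routes above can be illustrated outside KNIME with a minimal Python sketch. It relies only on the SDF convention that each record ends with a `$$$$` line, and it assumes a hypothetical delimited text file with a `smiles` column. The function names and toy data are ours for illustration; this is not what the SDF Reader or CSV Reader nodes do internally, and no canonicalization is performed.

```python
import csv
import io


def split_sdf_records(sdf_text):
    """Split SDF text into per-molecule records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks its closing delimiter, if it has content.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def read_smiles_column(text, column="smiles", delimiter=","):
    """Read one column of SMILES strings from delimited text, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [(row.get(column) or "").strip()
            for row in reader
            if (row.get(column) or "").strip()]


# Toy inputs (hypothetical; the SDF bodies are placeholders, not valid molblocks).
example_sdf = "ethanol\n  placeholder\n\nM  END\n$$$$\nbenzene\n  placeholder\n\nM  END\n$$$$\n"
example_csv = "id,smiles\n1,CCO\n2,c1ccccc1\n3,\n"

records = split_sdf_records(example_sdf)
titles = [record.splitlines()[0] for record in records]  # first line of a record is its title
smiles = read_smiles_column(example_csv)
```

In a real pipeline the parsed structures would still need to be standardized, which is the role of the RDKit Canon SMILES node in the workflows.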

Figure 2: Example of various input workflows.

2. When to filter?

An important question to consider is when to apply the filters in the early drug design process. Opponents of filtering point out that any rule-based filtering system ignores the fact that exceptions exist, and that blind use of such restrictive filters would eliminate potential drugs such as cyclosporine and erythromycin, for which the majority of the drug-like rules break down [7]. Such exceptions highlight an important distinction between the properties of useful lead-like molecules and those of drugs. In general, lead compounds exhibit less molecular complexity and are less hydrophobic than drugs. This indicates that optimizing simple leads into drugs is a favorable process, which supports the idea of filtering libraries before assays or other screening and optimizing the resulting leads into drugs later [12].

3. What filter should I use?

We implemented 11 filters (REOS, PAINS, Aggregators, Rule-of-5, Rule-of-4, Rule-of-3, Veber filter, Mozziconacci filter, Egan filter, Van de Waterbeemd filter, Murcko filter) into our Compound Filters for Drug Design workflow. All filters are publicly available and accessible for free on the KNIME Hub.

We designed all the workflows in a similar fashion, each with a read-in section (orange), a filtering section (yellow), and a write section (green). The filtering section provides the user with information on the rules used, the recommended application, and the original literature on which the filter is based. The nodes that form the filter are all placed inside metanodes to keep the workflows easy to survey (Figure 3).

Figure 3: Workflow example of our Veber filter implementation for effective design of compound libraries. Black lines represent the expanded metanode that contains nodes used for filtering.

The PAINS and REOS filters are both based on the RDKit Substructure Counter node and compare the substructures present in the input compounds with a list of problematic functional groups. A Rule-based Row Filter node then removes the hits from the database. The aggregation propensity filter, called the “aggregator filter”, evaluates aggregation propensity with the Similarity Search node, which calculates the Tanimoto coefficients of the given molecules against a database of 12,641 known aggregators [5]. The user can control how strict the filter is with the low, medium, and high propensity Row Filter nodes provided. The remaining filters are knowledge-based, simple property-counting filters that first calculate descriptor values using the RDKit Descriptor Calculation node or the molecule property nodes and then apply a Rule-based Row Filter node. The exception among the property filters is the Rule-of-5, which allows one rule break. To allow rule breaks, we built the filter from several Rule Engine nodes that assign a value of 1 for each rule break of one descriptor, a Math Formula node that sums all the values, and finally a Rule-based Row Filter node that filters out compounds with more than one rule break.
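The rule-break counting used for the Rule-of-5 can be sketched in a few lines of Python. The sketch assumes the descriptors have already been calculated and uses the commonly cited Lipinski cut-offs (molecular weight ≤ 500, logP ≤ 5, hydrogen-bond donors ≤ 5, hydrogen-bond acceptors ≤ 10). The dictionary keys, function names, and compound values are hypothetical, and the code mirrors the logic of the node chain rather than reproducing the KNIME implementation.

```python
# Lipinski Rule-of-5 cut-offs: descriptor name -> maximum allowed value.
RO5_LIMITS = {
    "mol_weight": 500.0,
    "logp": 5.0,
    "h_donors": 5,
    "h_acceptors": 10,
}


def count_rule_breaks(descriptors, limits=RO5_LIMITS):
    """Score 1 for each violated rule and sum the scores.

    The per-rule score corresponds to the Rule Engine step,
    the sum to the Math Formula step.
    """
    return sum(1 for name, limit in limits.items() if descriptors[name] > limit)


def passes_ro5(descriptors, max_breaks=1):
    """Keep a compound unless it has more than `max_breaks` rule breaks (row filter step)."""
    return count_rule_breaks(descriptors) <= max_breaks


# Hypothetical compounds with made-up descriptor values.
compounds = [
    {"name": "cmpd_A", "mol_weight": 320.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    {"name": "cmpd_B", "mol_weight": 540.6, "logp": 4.2, "h_donors": 3, "h_acceptors": 8},
    {"name": "cmpd_C", "mol_weight": 712.9, "logp": 6.3, "h_donors": 4, "h_acceptors": 12},
]

breaks = {c["name"]: count_rule_breaks(c) for c in compounds}
kept = [c["name"] for c in compounds if passes_ro5(c)]
```

Here `cmpd_B` breaks only the molecular weight rule and is kept, while `cmpd_C` breaks three rules and is removed. Setting `max_breaks=0` would turn the same code into a strict filter like the other property filters.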

When combining several filters, users can expand the metanodes and delete redundant steps (e.g., the duplicate generation of canonical SMILES in the linked workflow), which results in faster workflows (Figure 4).

Figure 4: An example of a multi-filter workflow that combines several metanodes.

4. How does my filter impact the compound library?

To test and demonstrate the functionality of the filters implemented and their effects on the chemical space, we applied the filtering workflow to a general ZINC database consisting of 9,216,175 compounds (a large non-specific chemical library).

Note. The database was obtained by accessing the ZINC website (https://zinc.docking.org/tranches/home/ accessed on 21 June 2021) selecting the following parameters (representation “2D”, reactivity “standard”, purchasability “in-stock”) and downloading the SMILES wget command file.

Using the Row Sampling node, 1% of the total database was sampled and run through a selection of the implemented filters. To depict the effect of these filters on the chemical space, we used the Statistics node followed by the 2D/3D Scatterplot node.
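As a rough illustration of this sampling step, the Python sketch below draws a reproducible 1% random sample from a synthetic library and summarizes one descriptor. It assumes simple random sampling without replacement; the sampling mode and seed used in the KNIME workflow are not stated here, and the library, seed, and names are hypothetical.

```python
import random
import statistics


def sample_fraction(rows, fraction=0.01, seed=42):
    """Draw a reproducible random sample (without replacement) of the given fraction."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


# Synthetic stand-in for a compound library (values are random, not real data).
rng = random.Random(0)
library = [{"id": i, "mol_weight": rng.uniform(150.0, 650.0)} for i in range(10_000)]

subset = sample_fraction(library)
weights = [row["mol_weight"] for row in subset]

# Basic summary of the sampled descriptor, similar in spirit to a statistics table.
summary = {
    "count": len(weights),
    "min": min(weights),
    "mean": statistics.mean(weights),
    "max": max(weights),
}
```

Fixing the seed makes the sample repeatable, which helps when the same subset has to be passed through several filters for comparison.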

Figure 5: 3D scatterplot of exact molecular weight, SlogP, and topological polar surface area (TPSA) for the unfiltered library (red) and for the libraries filtered with Rule-of-3 (blue), Rule-of-4 (purple), and Rule-of-5 (green). The space occupied by the compounds changes drastically.

The Ro3 (rule of three) and Ro4 (rule of four) filters are stringent filters that define specific chemical spaces, filtering out 97% and 94% of the database, respectively. Despite the similar percentages filtered out, they operate in opposite ways. The Ro3 is a strict filter designed to support “hit identification” and “fragment-based” drug research and only accepts molecules with a molecular weight of less than 300 [13]. It supports the paradigm that small compounds still capture the desired chemical space yet leave ample room for future optimization towards leads. The Ro4 attempts to capture the chemical space of protein-protein interaction inhibitors and retains molecules with a molecular weight above 400, as such larger molecules are able to form multiple interactions. Morelli et al. designed the filter with the aim of establishing guidelines for druggable protein-protein inhibitors, since these most often break traditional property filter rules [10]. The Lipinski Rule-of-5 is a set of rules for drug-likeness and oral bioavailability [4]. It removes 9% of the input database but does not strongly shift the chemical space in the observed descriptors.
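The opposite directions of the two filters can be seen in a small Python sketch that applies only the molecular weight criteria named above (below 300 for Ro3, above 400 for Ro4) to a toy list of weights. The full filters also use other descriptors, which are omitted here, and the values are made up, so the percentages do not reproduce the 97% and 94% reported for the ZINC sample.

```python
def percent_removed(rows, keep):
    """Return the percentage of rows that a keep-predicate filters out."""
    kept = sum(1 for row in rows if keep(row))
    return 100.0 * (len(rows) - kept) / len(rows)


def ro3_weight_rule(row):
    """Molecular weight criterion of the rule of three: keep small molecules."""
    return row["mol_weight"] < 300


def ro4_weight_rule(row):
    """Molecular weight criterion of the rule of four: keep large molecules."""
    return row["mol_weight"] > 400


# Toy molecular weights (hypothetical values).
toy_library = [{"mol_weight": mw}
               for mw in (180, 250, 310, 350, 390, 420, 450, 480, 510, 620)]

ro3_removed = percent_removed(toy_library, ro3_weight_rule)
ro4_removed = percent_removed(toy_library, ro4_weight_rule)

# No compound can satisfy both weight criteria at once.
overlap = [row for row in toy_library if ro3_weight_rule(row) and ro4_weight_rule(row)]
```

On this toy list the Ro3 criterion removes 80% of the entries and the Ro4 criterion 50%, and the two retained sets do not overlap, which is the sense in which the filters select opposite regions of chemical space.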

Conclusion — Quickly enhance hit rate and increase processing speed in drug design efforts

We implemented 11 compound filters into workflows compatible with small or large compound databases, with the purpose of bridging the gap between molecular filters and their accessibility to the public. They provide the researcher with a simple, fast, and robust way to enrich the chemical space and reduce the time associated with post-filtering methods. They are easy to use and can be customized to particular preferences of the studied chemical space. However, the user must be aware of the properties used for filtering, as some, such as REOS and PAINS, were not designed with covalent chemistry in mind. In such cases, it is better to flag the compounds for a later evaluation. For further details, we recommend reading our peer-reviewed published article [14].

Acknowledgments. We thank KNIME for their support of our work!

References

1. Macarron, R.; Banks, M.N.; Bojanic, D.; Burns, D.J.; Cirovic, D.A.; Garyantes, T.; Green, D.V.S.; Hertzberg, R.P.; Janzen, W.P.; Paslay, J.W.; et al. Impact of High-Throughput Screening in Biomedical Research. Nat Rev Drug Discov 2011, 10, 188–195, doi:10.1038/nrd3368.

2. Thorpe, D.S.; Edith Chan, A.W.; Binnie, A.; Chen, L.C.; Robinson, A.; Spoonamore, J.; Rodwell, D.; Wade, S.; Wilson, S.; Ackerman-Berrier, M.; et al. Efficient Discovery of Inhibitory Ligands for Diverse Targets from a Small Combinatorial Chemical Library of Chimeric Molecules. Biochemical and Biophysical Research Communications 1999, 266, 62–65, doi:10.1006/bbrc.1999.1775.

3. Oprea, T.I. Property Distribution of Drug-Related Chemical Databases. J Comput Aided Mol Des 2000, 14, 251–264, doi:10.1023/a:1008130001697.

4. Lipinski, C.A. Drug-like Properties and the Causes of Poor Solubility and Poor Permeability. Journal of Pharmacological and Toxicological Methods 2000, 44, 235–249, doi:10.1016/S1056-8719(00)00107-6.

5. Oprea, T. Virtual Screening in Lead Discovery: A Viewpoint. Molecules 2002, 7, 51–62, doi:10.3390/70100051.

6. Walters, W.P.; Stahl, M.T.; Murcko, M.A. Virtual Screening — an Overview. Drug Discovery Today 1998, 3, 160–178, doi:10.1016/S1359-6446(97)01163-X.

7. Walters, W.P.; Murcko, M.A. Prediction of “Drug-Likeness.” Adv Drug Deliv Rev 2002, 54, 255–271, doi:10.1016/s0169-409x(02)00003-0.

8. Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv Drug Deliv Rev 2001, 46, 3–26, doi:10.1016/s0169-409x(00)00129-0.

9. Muegge, I.; Heald, S.L.; Brittelli, D. Simple Selection Criteria for Drug-like Chemical Matter. J. Med. Chem. 2001, 44, 1841–1846, doi:10.1021/jm015507e.

10. Morelli, X.; Bourgeas, R.; Roche, P. Chemical and Structural Lessons from Recent Successes in Protein–Protein Interaction Inhibition (2P2I). Current Opinion in Chemical Biology 2011, 15, 475–481, doi:10.1016/j.cbpa.2011.05.024.

11. Dalby, A.; Nourse, J.G.; Hounshell, W.D.; Gushurst, A.K.I.; Grier, D.L.; Leland, B.A.; Laufer, J. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244–255, doi:10.1021/ci00007a012.

12. Oprea, T.I.; Davis, A.M.; Teague, S.J.; Leeson, P.D. Is There a Difference between Leads and Drugs? A Historical Perspective. J Chem Inf Comput Sci 2001, 41, 1308–1315, doi:10.1021/ci010366a.

13. Congreve, M.; Carr, R.; Murray, C.; Jhoti, H. A “rule of Three” for Fragment-Based Lead Discovery? Drug Discov Today 2003, 8, 876–877, doi:10.1016/s1359-6446(03)02831-9.

14. Kralj, S.; Jukič, M.; Bren, U. Comparative Analyses of Medicinal Chemistry and Cheminformatics Filters with Accessible Implementation in Konstanz Information Miner (KNIME). IJMS 2022, 23, 5727, doi:10.3390/ijms23105727.