What is Chemoinformatics?

Patrick Chirdon
Jul 16 · 11 min read


Chemoinformatic approaches, which combine chemistry with programming, are now an integral part of the drug discovery pipeline. Algorithms can make predictions about target discovery, toxicity, and binding efficacy, among other things, before any lab testing is done. While not a substitute for lab experiments, chemoinformatic approaches are useful because they allow an enormous chemical space to be screened, which reduces costs and aids hypothesis generation (Faulon et al, p155). To begin a discussion of chemoinformatics, we need to introduce some fundamental concepts.

An algorithm is a recipe: a series of steps taken to complete a task. Traditional algorithms rely on a set of pre-set rules written by a programmer. The first type of algorithm we will use is called a genetic algorithm. A genetic algorithm takes a structure, mutates it randomly, and selects the mutated structures that have a desired property. Each selected molecule is then added to the list of valid children, and the loop repeats to generate more molecules.

Machine learning algorithms are a class of algorithms that detect patterns in data to derive rules for making predictions. They do not require pre-set rules to be programmed. Instead, these algorithms learn as they process data (Géron, p4). A data-point consists of a vector of attributes called features. Machine learning algorithms can be classified as supervised, unsupervised, or semi-supervised. Supervised learning uses labeled data to learn. A label is the desired output associated with a data-point. Given enough input data, supervised algorithms get better at predictions according to some performance measure (Géron, p8). This is the type of learning that our algorithms rely upon.

Supervised algorithms can be used to make classifications or to predict a numerical value. An example of classification is labeling a set of images as cats or not cats. A popular classification algorithm is the support vector machine (SVM). An SVM seeks a hyperplane that acts as a boundary separating two classes of data points. The vectors (cases) that define the hyperplane are the support vectors (Géron, p8). Along with classification, SVMs can be employed for regression, which allows them to handle continuous variables. Regression determines the best fit equation for the data and then predicts values on a continuous scale. For example, one can create a regression equation to predict stock prices based on market variables. Another type of algorithm that we will make use of is the Artificial Neural Network (ANN), which can also do both classification and regression (Géron, p260). These methods are discussed in more detail in the sections below.
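To make the idea of supervised regression concrete, here is a minimal sketch that fits a best fit line to labeled data points using ordinary least squares. The data values and the function name fit_line are made up for illustration and are not taken from any of the cited sources.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the feature, and how the feature and the label vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Each data-point has one feature (x) and one label (y).
# These illustrative labels were generated from y = 2x + 1.
features = [1.0, 2.0, 3.0, 4.0]
labels = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(features, labels)

# Predict the label for a feature value the model has not seen.
prediction = slope * 5.0 + intercept
```

The labels play the role of the "desired output" described above: the fit is judged by how closely the line reproduces them, and the fitted equation is then used on new inputs.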

1.1 Virtual Screens

Searching for new drugs is an expensive endeavor. Bringing a new drug to market can take 10–15 years and cost around 1 billion dollars. The cost of drug development is high because a large number of candidate molecules need to be screened for efficacy in many studies, from in-vitro to in-vivo. Scientific American (Jogalekar, 2014) outlines three reasons why drug discovery is so challenging:

(a) It is difficult to identify which proteins are involved in a disease and to determine whether their activity can be altered by a drug.

(b) There are no straightforward guiding principles for identifying small molecules that will bind to a target protein specifically and strongly (Jogalekar, 2014).

(c) It is challenging to ensure that a drug molecule is able to enter a cell.

Virtual screening of molecules to reject the ones with undesired characteristics is a widely used methodology for accelerating the drug discovery process. In recent years, many start-up companies have emerged that focus on artificial intelligence (AI) based computer models, which we will refer to as machine learning methods, for drug discovery. Some common goals for AI algorithms in drug discovery are to: (a) aggregate and synthesize physicochemical information; (b) understand mechanisms of diseases; (c) generate data and models for toxicity; (d) repurpose existing drugs; (e) generate novel drug candidates; and (f) validate drug candidates. Here is a list of start-up companies that employ these approaches for drug discovery applications:

https://blog.benchsci.com/startups-using-artificial-intelligence-in-drug-discovery (Smith).

In drug discovery, targets can be receptors, proteins, genes, or enzymes. Models can be built to capture quantitative structure activity relationships (QSAR), which relate a drug’s structural properties to its biological properties. QSAR takes into consideration properties like molecular weight, molecular volume, electronegativity, partition coefficients, and hydrogen bond donors and acceptors. Neural networks have been employed to model absorption, distribution, metabolism, excretion, toxicity, side effects, and drug delivery. Two broad principles are generally employed for classification in virtual screening:

(a) Ligand based: This method is used when structural information about the target is scarce. Examples are a QSAR model (which summarizes the relationship between a chemical structure and its biological activities), ligand based pharmacophores (a pharmacophore describes the features a molecule needs in order to be recognized at a binding site), and similarity calculations.

(b) Structure based: Examples of this method are docking (finding the preferred orientation of one molecule relative to another so that they form a stable complex) and structure based pharmacophores.

1.2 Machine Learning Models and Drug Discovery

Methods that use machine learning can make predictions before experiments and rule out less promising molecules, using publicly available databases like PubChem for each property under investigation (Elkins, 2016). Molecules are represented in SMILES format (Simplified Molecular-Input Line-Entry System). SMILES encodes atom types, charges, attached hydrogens, bond types, aromaticity, and stereochemistry. These are 2D representations in which atoms are represented by their atomic symbols. Bonds are represented as follows: “-” for single, “=” for double, and “#” for triple. Branches are represented by parentheses (Faulon, 2010). The SMILES strings can be used to train various models, as described below:
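As a rough illustration of the notation just described, the sketch below splits a SMILES string into tokens and counts bond and branch symbols. It is a simplified pattern written for this article, not a full SMILES grammar, and the names tokenize and summarize are hypothetical. In practice a toolkit such as RDKit would be used to parse SMILES properly.

```python
import re

# Simplified SMILES token pattern (a sketch, not a complete grammar).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]"      # bracket atom, e.g. [NH4+] or [C@@H]
    r"|Br|Cl"           # two-letter atoms that may appear without brackets
    r"|[BCNOPSFI]"      # one-letter aliphatic atoms
    r"|[bcnops]"        # aromatic atoms are written in lower case
    r"|[-=#:/\\]"       # bond symbols: single, double, triple, aromatic, stereo
    r"|[()]"            # branch open / close
    r"|%\d{2}|\d"       # ring-closure labels
    r"|\.)"             # separator between disconnected fragments
)

AROMATIC = set("bcnops")


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def summarize(smiles):
    """Count a few of the features that the notation makes explicit."""
    tokens = tokenize(smiles)
    return {
        "double_bonds": tokens.count("="),
        "triple_bonds": tokens.count("#"),
        "branches": tokens.count("("),
        # Aromatic atoms written inside brackets are not counted here.
        "aromatic_atoms": sum(t in AROMATIC for t in tokens),
    }


# Aspirin: two C=O groups in branches and a six-membered aromatic ring.
aspirin_summary = summarize("CC(=O)Oc1ccccc1C(=O)O")
```

Note that single bonds are usually left implicit in SMILES, so counting "-" tokens would undercount them; the sketch only reports the symbols that are written out.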

a. Support Vector Machine (SVM): An SVM is a machine learning method for classifying data. Support vector machines have been employed for developing QSAR models (Varnek, 191). SVMs are used for binary (yes/no) classification (Varnek, 225). A support vector machine creates a hyperplane that separates data into two classes. The hyperplane is defined by the equation H(x) = Wx + b, where W is the weight and b is the bias. When training an SVM model, the objective is to maximize the distance between the hyperplane and the training instances labeled as “active” or “inactive” (Varnek, 223). A penalty for training errors is set by the user and can be adjusted depending on the results of the training. A confusion matrix can be used to assess the results of the training using the counts of true positives, true negatives, false negatives, and false positives (Géron, 87). The predictions used to create the matrix are generated on a test set taken from the randomized data.

b. Neural Networks: For target classification, a simple yes/no classification for a given protein target may be adequate. We can even calculate a probability based on the fingerprint similarity. However, in some cases, such as predicting a solubility value, a regression needs to be performed. For this purpose, a popular non-linear technique is an artificial neural network (ANN). An ANN uses neurons to relate inputs to the output value. In the training of an ANN, the weights on the inputs to each neuron are adjusted to determine the influence each input has on the output (Sejnowski, 111). The weights converge when the neuron is able to correctly predict the output values. Multiple neurons are needed to combine inputs and make decisions in problems like ours, where the categories are not linearly separable (Sejnowski, 112). To prevent the network from just memorizing the training data without being able to generalize to new examples, regularization techniques are used to restrict the number of parameters (Sejnowski, 120).

c. Genetic Algorithms: A genetic algorithm takes a structure, mutates it randomly, and selects the mutated structures that have a desired property. Each selected molecule is then added to the list of valid children, and the loop repeats to generate more molecules. In this work, this was done with Data Warrior, as discussed in Section 1.3.
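Items a and b both rest on the same building block: a weighted sum of the inputs plus a bias, H(x) = Wx + b. The sketch below is a minimal illustration of that building block, under stated assumptions. It evaluates H(x) and the distance of a point from the hyperplane, and it trains a single artificial neuron with the classic perceptron update rule on a toy, linearly separable problem (logical AND). This is not how an SVM is trained (an SVM maximizes the margin), and the function names and data are hypothetical.

```python
import math


def decision_value(weights, bias, x):
    """H(x) = W.x + b for one feature vector x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Step activation: 1 on the positive side of the hyperplane, else 0."""
    return 1 if decision_value(weights, bias, x) > 0 else 0


def distance_to_hyperplane(weights, bias, x):
    """Geometric distance |H(x)| / ||W||, the quantity an SVM tries to enlarge."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, x)) / norm


def train_neuron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
                errors += 1
        if errors == 0:  # every training example is predicted correctly
            break
    return weights, bias


# Toy data: output is 1 only when both inputs are 1 (linearly separable).
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

weights, bias = train_neuron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

A single neuron like this can only draw one straight boundary. It cannot fit labels such as [0, 1, 1, 0] (exclusive OR) for the same four inputs, which is why networks with multiple neurons are needed when the categories are not linearly separable.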

1.3 Generating new molecules

To generate new molecules using genetic algorithms, we used a package called Data Warrior. We generated new molecules starting from 16 scaffolds of molecules listed in a US Patent owned by Ohio University (Goetz et al., 2019). We chose Data Warrior because it is an open source, easy-to-use platform for generating new molecules based on the scaffolds provided. In addition to creating compound libraries, Data Warrior can be used to calculate physicochemical properties, create graphs, and visualize data (Sander, 2019).

Briefly, the genetic algorithm of Data Warrior works in the following way: (a) Input a user provided structure of a molecule, called a scaffold; (b) Mutate it randomly by changing fragments on the molecule and select the structures that are most similar to the original structure; (c) Generate a pre-specified number of children for every scaffold; (d) Select the most structurally similar molecules from the population. These molecules become the starting structures for the next generation. The above steps are repeated until the desired number of generations is reached. Since we wanted to generate compounds that were more druglike, we had to start with compounds that had a high drug score. We used the evolutionary algorithm to generate compounds similar to the originals.
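The loop in steps (a) to (d) can be sketched in a few lines. The toy version below is not Data Warrior's implementation: it uses short strings as stand-ins for molecules, a position-by-position match as a stand-in for structural similarity, and hypothetical names and parameter values throughout. It only illustrates the mutate, generate, select, repeat cycle.

```python
import random

ALPHABET = "CNOS"  # toy "fragments"; real tools mutate chemical structures


def similarity(candidate, reference):
    """Fraction of positions where two equal-length strings agree."""
    matches = sum(a == b for a, b in zip(candidate, reference))
    return matches / len(reference)


def mutate(parent, rng):
    """Step (b): replace one randomly chosen position with a random symbol."""
    pos = rng.randrange(len(parent))
    return parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:]


def evolve(scaffold, reference, generations=20, children_per_parent=8,
           survivors=4, seed=0):
    rng = random.Random(seed)
    population = [scaffold]  # step (a): start from the user-provided scaffold
    history = []
    for _ in range(generations):
        # Step (c): a fixed number of children for every parent.
        children = [mutate(parent, rng)
                    for parent in population
                    for _ in range(children_per_parent)]
        # Parents stay in the pool, so the best score can never get worse.
        pool = set(population) | set(children)
        # Step (d): keep the most similar candidates for the next generation.
        ranked = sorted(pool, key=lambda s: (-similarity(s, reference), s))
        population = ranked[:survivors]
        history.append(similarity(population[0], reference))
    return population, history


population, history = evolve("CCCCCCCC", "CNOSCNOS")
```

Here history records the best similarity after each generation. Swapping the similarity function for a property score (such as a drug score) would turn the same loop into a search for that property instead.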

Drug Score is a numerical score given to a molecule to predict how drug-like it is. Drug Score is based on statistics of how frequently the fragments that make up the molecule appear in known drugs. In addition, Drug Score considers desirable physical properties, such as solubility and molecular weight. See the appendix for the formula used to calculate the drug score.

2. Post-Processing

The Lipinski rules, from which the expression for drug score is derived, state that druglike compounds have a molecular weight less than 500, logP [...] 9) and performed a second round to generate more. This resulted in a larger number of compounds with scores greater than 9. We also found that the higher the drug score, the less likely a compound was to contain problematic functional groups, based on frequency.
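For readers who want to see what a Lipinski check looks like in code, here is a minimal sketch. The thresholds are the commonly quoted "rule of five" limits from Lipinski et al. (molecular weight over 500, logP over 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors each count as a violation), not values taken from our own workflow. The descriptor values in the examples are hypothetical; in practice they would be computed by a chemoinformatics toolkit.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of rule-of-five limits that a molecule exceeds."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("hydrogen bond donors > 5")
    if h_acceptors > 10:
        violations.append("hydrogen bond acceptors > 10")
    return violations


# Hypothetical descriptor values, for illustration only.
small_molecule = lipinski_violations(mol_weight=350.0, logp=2.5,
                                     h_donors=2, h_acceptors=5)
large_molecule = lipinski_violations(mol_weight=650.0, logp=6.2,
                                     h_donors=2, h_acceptors=5)
```

Returning the list of violated limits, rather than a single yes/no answer, makes it easy to apply either a strict filter (no violations) or a looser one (at most one violation).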


Pan-assay interference compounds (PAINS) contain problematic functional groups that produce false positives in bioassays: they appear to bind strongly to a protein target, but are really not very selective in their binding. The PAINS screen used in this work is the NIH filter in RDKit. For a review of PAINS, see Baell and Walters (2014). We found that 82 percent of marketed drugs did not contain PAINS as defined by the NIH filter. We also screened for the fraction of sp3 hybridized carbons (Fsp3), since a ratio greater than 0.47 is associated with more selective binding (Baell and Walters, 2014). We found that 48 percent of FDA approved drugs have Fsp3 > 0.47 and 65 percent have Fsp3 > 0.36. Compounds with a higher Fsp3 ratio were more complex and did not contain as many double bonds (Baell and Walters, 2014).
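The Fsp3 screen is simple arithmetic: the number of sp3 hybridized carbons divided by the total number of carbons. The sketch below applies the two cutoffs mentioned above to pre-counted carbons. The function names and the example counts are hypothetical; with RDKit the fraction itself can be obtained from a parsed molecule via rdMolDescriptors.CalcFractionCSP3.

```python
def fraction_sp3(sp3_carbons, total_carbons):
    """Fsp3 = sp3 hybridized carbons / all carbons (0.0 if there are none)."""
    if total_carbons == 0:
        return 0.0
    return sp3_carbons / total_carbons


def fsp3_band(fsp3):
    """Place a value relative to the 0.36 and 0.47 cutoffs used in the text."""
    if fsp3 > 0.47:
        return "above 0.47"
    if fsp3 > 0.36:
        return "between 0.36 and 0.47"
    return "at or below 0.36"


# Hypothetical molecule with 10 carbons, 6 of them sp3 hybridized.
example_fsp3 = fraction_sp3(6, 10)
example_band = fsp3_band(example_fsp3)
```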

3. Protein ligand docking with Rosetta

Docking programs predict the best pose for a ligand in its binding site on a protein. They can be used to screen out ligands that are not good binders. We used Rosetta because it is a widely used program for energy-based calculations grounded in thermodynamics. The scores it returns are not actual energies but are roughly correlated with them, so it is difficult to predict IC50 or delta G from these scores. The compounds that bind the target with the lowest energy are considered the best binders. Docking makes several assumptions. Proteins are treated as rigid or mainly rigid (Varnek, 2017). Therefore, when a molecule is flexible, the predictions might not be accurate. Moreover, ligands are limited to the conformations in 3D space that the user supplies (Varnek, 2017), so it is important to supply a variety of conformations. In addition, the program may or may not be able to simulate conformational changes in the protein due to ligand binding. Other considerations include ionization and protonation, which may alter the binding capacities (Varnek, 2017).

4. Combinatorial Approach

After finding the best binders for each scaffold using Rosetta, a user can take the top binders, fragment them into pieces, and recombine the pieces into new molecules. The goal is to produce compounds with lower binding scores (that is, better binders that require less energy for binding) than the originals by recombining the best fragments.
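The recombination idea can be sketched as a simple enumeration. The example below treats each top binder as a tuple of fragment labels, one per position, and lists every new combination of fragments that was not already among the originals. The labels and the function name recombine are hypothetical, and the sketch ignores the chemistry-aware rules that real fragmentation and reassembly require. Each new candidate would still need to be docked and scored.

```python
from itertools import product


def recombine(top_binders):
    """Enumerate new fragment combinations not present among the originals.

    top_binders: list of tuples of fragment labels, one label per position.
    """
    positions = list(zip(*top_binders))            # fragments seen at each position
    options = [sorted(set(p)) for p in positions]  # unique choices per position
    originals = set(top_binders)
    return [combo for combo in product(*options) if combo not in originals]


# Hypothetical top binders, each described as (core, substituent 1, substituent 2).
top_binders = [
    ("coreA", "R1_methyl", "R2_chloro"),
    ("coreA", "R1_ethyl", "R2_fluoro"),
    ("coreB", "R1_methyl", "R2_fluoro"),
]

new_candidates = recombine(top_binders)
```

With two choices at each of three positions there are eight possible combinations, three of which are the original binders, so five new candidates remain.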

5. Target Prediction

Docking methods are often combined with ligand-based methods like quantitative structure activity relationships (QSAR). We found it useful to use large databases of ligands to predict the targets of our compounds. QSAR models can be created with high sensitivity, specificity, and accuracy. Sensitivity is the ability to correctly identify positive test results. Specificity is the ability to correctly rule out negative test results. Accuracy is the ability to correctly identify both true positives and true negatives. These models are quick to create, over a million targets are available, and no protein structure is needed. Only ligands are required to build a model. Ligands that bind to a target are considered positive and labeled 1. Ligands that do not bind the target are labeled 0.
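The three measures defined above follow directly from the counts of true and false positives and negatives. Here is a minimal sketch using the 1 (binds) and 0 (does not bind) labels described in this section; the label lists are made up for illustration and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # share of true binders that were found
        "specificity": tn / (tn + fp),  # share of non-binders correctly rejected
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Illustrative labels: 1 = ligand binds the target, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
```

In this made-up example the model finds 3 of the 4 binders and correctly rejects 4 of the 6 non-binders, so sensitivity is 0.75, specificity is about 0.67, and accuracy is 0.7.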

6. Model Architecture

This section shows the architectures of the machine learning models and strategies discussed in the previous sections.

Figure 1. QSAR Model Creation
Figure 2. Multiclass Neural Network


Baell, Jonathan, and Michael A. Walters. “Chemistry: Chemical Con Artists Foil Drug Discovery.” Nature, vol. 513, no. 7519, 2014, pp. 481–483., doi:10.1038/513481a.

Bidwell, Bradley N, et al. “Silencing of Irf7 Pathways in Breast Cancer Cells Promotes Bone Metastasis through Immune Escape.” Nature Medicine, vol. 18, no. 8, 2012, pp. 1224–1231., doi: 10.1038/nm.2830.

ChEMBL-og. “Using Autoencoders for Molecule Generation.” 12 July 2017, http://chembl.blogspot.com/2017/07/using-autoencoders-for-molecule.html.

“Compound Classification Using the Scikit-Learn Library.” Tutorials in Chemoinformatics, by Alexandre Varnek, John Wiley & Sons, 2017, pp. 223–227.

Faulon, Jean-Loup and Bender, A. “Algorithms to Store and Retrieve Two-Dimensional (2D) Chemical Structures.” Handbook of Chemoinformatics Algorithms Chapman & Hall/CRC, 2010.

Géron, A. (n.d.). Hands-on machine learning with Scikit-Learn and TensorFlow.

Goetz, D, Bergmeier, Stephen, McMills, M, Orac, Cina. Prevention and Treatment of Nonalcoholic Fatty Liver Disease. 20180362521, 2015.

Lipinski, Christopher A., et al. “Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings.” Advanced Drug Delivery Reviews, vol. 64, 2012, pp. 4–17., doi:10.1016/j.addr.2012.09.019.

Mccall, Kelly D, and Frank L Schwartz. “A Novel Small Molecule Drug Derived from Methimazole (Phenylmethimazole) That Targets Aberrant Toll-like Receptor Expression and Signaling for the Potential Prevention or Treatment of Diabetes Mellitus and Non-Alcoholic Fatty Liver Disease.” US Endocrinology, vol. 11, no. 01, 2015, p. 17., doi:10.17925/use.2015.11.1.17.

Sander, Thomas. “Data Warrior.” Openmolecules.org, 5, 2019, www.openmolecules.org/about.html. Accessed 2019.

Sejnowski, Terrence J. The Deep Learning Revolution. The MIT Press, 2018.

“Protein-Ligand Docking.” Tutorials in Chemoinformatics, by Alexandre Varnek, John Wiley & Sons, 2017.

“Sulfhydryl Compounds.” DrugBank, www.drugbank.ca/categories/DBCAT000316.

Varnek, Alexandre. Tutorials in Chemoinformatics. John Wiley & Sons, 2017.

York, Autumn G., et al. “Limiting Cholesterol Biosynthetic Flux Spontaneously Engages Type I IFN Signaling.” Cell, vol. 163, no. 7, 2015, pp. 1716–1729., doi:10.1016/j.cell.2015.11.045.

UniProt Consortium, European Bioinformatics Institute, Protein Information Resource, SIB Swiss Institute of Bioinformatics. “Interferon Regulatory Factor 3.” UniProt, 16 Oct. 2019, https://www.uniprot.org/uniprot/Q14653.
