Introduction to Cheminformatics — Circular Fingerprints

Hacer Tilbec
3 min read · Apr 17, 2018

Since the start of the spring semester, I have been working on cheminformatics. Because it is a new research area for me, I started by reading papers and trying to understand its core concepts. Here, I will share what I learn as I read and practice.

Cheminformatics is one of the application areas of machine learning. A good deal of research has been done in this field, mainly on similarity searching and on predicting molecular properties such as toxicity, solubility, and absorption.

If you want to work in this field, you can find several datasets that researchers have prepared and shared publicly. You can browse and examine various datasets from this link.

These datasets may use different formats to represent molecules. For example, SMILES (paper) is a popular notation that represents a molecule as a single string. The string encodes the atoms of the molecule as well as the bonds between them. As an example, the SMILES representation of methyl isocyanate (CH3–N=C=O) is CN=C=O.
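To make the string-to-structure idea concrete, here is a minimal sketch that reads the example string CN=C=O into a list of atoms and a list of bonds. It is not a real SMILES parser: the function name `parse_linear_smiles` is hypothetical, and it assumes a linear string with single-letter atoms, no branches, no rings, and no brackets. In practice you would rely on a cheminformatics toolkit instead.

```python
def parse_linear_smiles(smiles):
    """Parse a branch-free, ring-free SMILES string of single-letter atoms.

    Illustrative only: real SMILES also has branches, ring closures,
    bracket atoms, aromatic lowercase atoms, and two-letter elements.
    """
    bond_orders = {"-": 1, "=": 2, "#": 3}
    atoms = []           # element symbol of each heavy atom, in string order
    bonds = []           # (atom_index_a, atom_index_b, bond_order)
    pending_order = 1    # an omitted bond symbol means a single bond

    for ch in smiles:
        if ch in bond_orders:
            pending_order = bond_orders[ch]
        elif ch in "BCNOPSFI":
            atoms.append(ch)
            if len(atoms) > 1:
                # In a linear string, each atom bonds to the previous one.
                bonds.append((len(atoms) - 2, len(atoms) - 1, pending_order))
            pending_order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


atoms, bonds = parse_linear_smiles("CN=C=O")
print(atoms)  # ['C', 'N', 'C', 'O']
print(bonds)  # [(0, 1, 1), (1, 2, 2), (2, 3, 2)]
```

The output mirrors CH3–N=C=O: a single bond from the methyl carbon to nitrogen, followed by two double bonds. Hydrogens are implicit in the string, so only heavy atoms appear.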

Before algorithms can investigate molecules, the molecules must be represented in a form the algorithms can process. In early studies, molecules were represented based on predetermined structural properties: important and unimportant features are selected manually, and then vectorization is performed by looking at the neighboring atom relations in a molecule. This representation is called a circular fingerprint. A circular fingerprint is more representative than earlier methods because it can capture local structural properties of a molecule.

Let's look at the steps of generating a circular fingerprint for a molecule:

Figure 1. Definition of a circular fingerprint: incorporating information on the arrangement of heavy atoms around each central atom. (Obtained from https://www.ncbi.nlm.nih.gov/pubmed/16523386)
Figure 2. Circular fingerprint — Layer 2
  1. Each heavy atom is used as a starting point
  2. For each heavy atom, an atom type is assigned. Atom types vary with the technique used. For example, Sybyl mol2 (REF) uses the elemental atom type and hybridization information. SciTegic (REF) atom types, on the other hand, use the electronic configuration of the atoms. Figure 1 shows force field atom types, which use the hybridization state of the atoms in addition to the elemental atom type.
  3. Then, for each layer, atom types are assigned to the neighboring atoms. For example, in Figure 2, layer 1 is determined by looking at the neighboring atoms (highlighted in blue) of the central atom (highlighted in red). Two of the neighboring atoms have the same atom type, aromatic carbon, and the third is an aliphatic carbon.
  4. At each distance from the central atom, the number of atoms of each atom type is recorded in order to calculate the descriptor values.
  5. To create the fingerprint for an entire molecule, the previous steps are repeated for each heavy atom in the molecule.
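The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the exact algorithm from the cited paper: it assumes the molecule is already given as a list of atom types plus a neighbor list over heavy atoms, and it records only the per-layer atom type counts from step 4. The function names, the toluene example, and the Sybyl-style labels `C.ar` (aromatic carbon) and `C.3` (aliphatic carbon) are my own choices, picked to match the situation described for Figure 2.

```python
from collections import Counter, deque


def layer_counts(atom_types, neighbors, center, max_layer):
    """Count atom types at each bond distance (layer) from `center`."""
    distance = {center: 0}
    queue = deque([center])
    layers = [Counter() for _ in range(max_layer + 1)]

    # Breadth-first search visits atoms in order of bond distance.
    while queue:
        atom = queue.popleft()
        d = distance[atom]
        layers[d][atom_types[atom]] += 1
        if d == max_layer:
            continue  # do not expand beyond the outermost layer
        for nb in neighbors[atom]:
            if nb not in distance:
                distance[nb] = d + 1
                queue.append(nb)
    return layers


def circular_fingerprint(atom_types, neighbors, max_layer=2):
    """Repeat the layer counting with every heavy atom as the center."""
    return {
        atom: layer_counts(atom_types, neighbors, atom, max_layer)
        for atom in range(len(atom_types))
    }


# Toluene, heavy atoms only: ring carbons 0-5, methyl carbon 6 on atom 0.
atom_types = ["C.ar"] * 6 + ["C.3"]
neighbors = {
    0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4],
    4: [3, 5], 5: [4, 0], 6: [0],
}

fingerprint = circular_fingerprint(atom_types, neighbors, max_layer=2)
for layer, counts in enumerate(fingerprint[0]):
    print(layer, dict(counts))
```

With atom 0 as the central atom, layer 1 contains two aromatic carbons and one aliphatic carbon, the same pattern described for Figure 2. Turning these counts into a fixed-length vector (the descriptor values) is a separate step that this sketch leaves out.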

Hacer Tilbec

Computer Engineer, Data Scientist, working on Deep Learning applications in various fields.