MolPlotly, a wonderful tool for the scientist

Analytical chemistry with python programming

Qco_Juan_David_Marin
8 min readApr 14, 2023

By Juan David Marín, e-mail: qcojuandavidmarin@gmail.com, LinkInd: https://www.linkedin.com/in/qco-juan-david-marin/, Notio, GitHub

Have you ever listened about Molplotly? If you say yes, you know that is a wonderful library, and if you don´t know, let me tell you that is a library witch you can visualize molecules into a plotly plot. You can find more information in this amazing post:

https://www.wmccorkindale.com/post/introducing-Molplotly

When I worked with chromatography–mass spectrometry (GC-MS), Molplotly becomes in a special ally to represent the translation between a gas chromatography analysis and data analysis with python.

https://github.com/wjm41/molplotly

This library lets us translate the thing that happen into a complex world like GC-MS in a simple way, for example, the molecular compounds that are identified by mass spectrometry could by represented analyzing its time retention, molecular weight, vapor pressure, and all the molecular descriptors that you can calculate.

In the scientific paper https://tobaccocontrol.bmj.com/content/27/1/105#F2 the authors worked with tobacco samples and GC-MS, and you can find the data on this link. The paper talks about how throw a chemical analysis like GC-MS the flavors additives in tobacco products can be identified.

These data are appropriate to show how Molplotly could work as a bridge between analytical chemistry science and data science with python.

Let us begin.

The libraries and python version needed are: matplotlib = 3.5.3, Pandas version = 1.4.3, molplotly = 1.1.4, rdkit-pypi = 2022.9.5, scikit-learn = 1.2.2, numpy = 1.23.2, plotly = 5.10.0.

First load the libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from rdkit import Chem
from rdkit.Chem import PandasTools, AllChem, Descriptors
from rdkit import DataStructs
import molplotly

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import plotly.express as px
import plotly.io as pio
pio.renderers
pio.renderers.default = "notebook_connected"

The data were slightly modified for practical purposes. This dataset contains a description of the molecules identified by GC-MS of each tobacco sample with their respective retention time, their SMILES structure, vapor pressure, and olfactory descriptor.

odor = pd.read_excel('tobaccocontrol_2018.xlsx', sheet_name = 'data', engine='openpyxl')

Them, we need to tramform the data

tabacco_df = pd.melt(odor, value_vars=['RYO_TP2', 'Cig_TP13', 'Cig_TP2', 'Cig_TP1',
'Cig_TP14', 'Cig_TP15', 'Cig_TP5', 'Cig_TP16', 'RYO_TP3', 'RYO_TP1',
'Cig_TP4', 'Cig_TP8', 'Cig_TP7', 'Cig_TP9', 'Cig_TP12', 'Cig_TP10',
'Cig_TP11', 'Cig_TP17', 'Cig_TP18', 'Cig_TP19', 'Cig_TP6', 'Cig_TP3',
'Cig_RP1', 'Cig_RP2', 'Cig_RP3', 'Cig_RP4', 'Cig_RP5', 'Cig_RP6', 'CM6',
'University_Kentucky', 'American_Virginia_flue_cured_213',
'Uncut_Burley', 'Semi_Oriental_456'],id_vars=['class', 'Name', 'Cas', 'smiles', 'vp', 'RT', 'Description',
'Odour threshold'], value_name='signal_noise_ratio', var_name='tobacco')

We will eliminate the NA values, and we will review the distribution of the data for each sample.

tabacco_df = tabacco_df.dropna(subset=['signal_noise_ratio'])
tabacco_df['tobacco'].value_counts().plot(kind='bar')
plt.show()

We need to order the values by tobacco samples and retention time. Then we will calculate the molecular weight of each molecule, this molecular descriptor is important because its value increases as the vapor pressure decreases. In general, as mentioned in my previous post, the molecules with the highest vapor pressure are detected first. Therefore, they could be the molecules that give the first characteristics of the products, in this case, tobacco.

tabacco_df['RT'] = round(tabacco_df['RT'],2)
tabacco_df = tabacco_df.sort_values(by=['tobacco', 'RT'])
tabacco_df['mw'] = tabacco_df['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Descriptors.ExactMolWt(x))

Here, is when Molplotly come into play. The first graph is the representation of the vapor pressure of each molecule in each tobacco sample.

fig_scatter = px.scatter(data_frame=tabacco_df, x = 'RT', y = 'vp', color = 'tobacco',template = "simple_white",
title='Vapor pressure', animation_frame='tobacco', size='mw')
fig_scatter.update(layout_showlegend=False)
fig_scatter = molplotly.add_molecules(fig=fig_scatter, df = tabacco_df,
smiles_col='smiles',
color_col = 'tobacco',
title_col='Name',
caption_cols= ['mw'],
caption_transform={'mw': lambda x: f"{x:.2f}"})
fig_scatter.run_server(mode='inline',port=7999)

In the graphs above, we can see how the first molecular structures present higher vapor pressures, more volatility, and this pattern is the same in all tobacco samples

As I have mentioned above, the molecules that have a higher vapor pressure have lower molecular weights. The relationship is shown in the following Molplotly graph

fig_scatter = px.scatter(data_frame=tabacco_df, x='RT', y = 'mw', color='tobacco', size=tabacco_df['vp']**0.2,
template = "simple_white",trendline="ols",
title='Molecular Weith', animation_frame='tobacco')
fig_scatter.update(layout_showlegend=False)
fig_scatter.update_layout(xaxis_title='TR min', yaxis_title='Molecular Weith')
fig_scatter = molplotly.add_molecules(fig=fig_scatter, df = tabacco_df,
smiles_col='smiles',
color_col = 'tobacco',
title_col='Name',
caption_cols= ['mw','vp'],
caption_transform={'mw': lambda x: f"{x:.2f}"},
show_coords=False)
fig_scatter.run_server(mode='inline',port=1003)

The olfactory descriptors of each molecule is another important aspect that allows the classification and description of each tobacco sample. This facilitates its characterization

polar_1 = px.bar_polar(data_frame=tabacco_df, r='RT',
theta='Description',
color='tobacco',
animation_frame='tobacco',
color_discrete_sequence=px.colors.sequential.Plasma_r,
title='Odor distribution by tabacoo class and retention time',
template="plotly_dark")
polar_1.update(layout_showlegend=False)
polar_1 = molplotly.add_molecules(fig=polar_1, df = tabacco_df,
smiles_col='smiles',
color_col = 'tobacco',
title_col='Description',
caption_cols= ['RT'],
show_coords=False)
polar_1.run_server(mode='inline',port=1056)

So far, we have been able to see how Molplotly easily represents a chromatographic behavior. But how could it help us in a segmentation analysis using a cluster algorithm?

First we will see how to represent molecules in machine language using Morgan fingerprints which are a type of circular fingerprints that represent the local chemical environment around each atom in the molecule. They are commonly used for tasks such as finding molecular similarities, machine learning models to predict various properties, and classification of molecules.

The GetMorganFingerprintAsBitVect function generates Morgan fingerprints as a binary vector, where each bit in the vector corresponds to a specific substructure or atom environment present in the molecule. The function takes several arguments, including the molecule object, the radius of the fingerprint, and the number of bits in the fingerprint. The output of the function is a binary vector of 1s and 0s, where a 1 indicates the presence of the corresponding substructure or environment in the molecule.

Overall, the AllChem.GetMorganFingerprintAsBitVect function is a useful tool for generating Morgan fingerprints for molecules in Python, which can be used for a variety of cheminformatics and machine learning applications.

You can find more information about Morgan fingerprints in this link https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf

import random
tabacco_df_mfp = tabacco_df.copy(deep=True).reset_index(drop=True)
tabacco_df_mfp['mfp'] = tabacco_df_mfp['smiles'].apply(lambda x:Chem.MolFromSmiles(x))

mol = random.choice(tabacco_df_mfp.mfp.values)

morg_fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits = 32)
print(morg_fp1)
arr = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(morg_fp1, arr)
print(arr)
print('Number of zeros: ', np.count_nonzero(arr==0))
print('Number of ones: ', np.count_nonzero(arr==1))
mol

The code above shows the binary representation (Moragan finger print) of a random molecule from some tobacco sample.

So, each molecule is represented by a vector, and this is useful for finding similar molecular structures with similar physicochemical characteristics.

Based on this information, we calculate the Morgan Fingerprints for each molecule identified by MS using the following function:

tabacco_df_mfp = tabacco_df.copy(deep=True).reset_index(drop=True)
def create_fp_df_from_mols(df):
morg_fps = df['smiles'].apply(lambda x:Chem.MolFromSmiles(x)).apply(lambda x: AllChem.GetMorganFingerprintAsBitVect(x, radius = 2,nBits = 32))
np_fps = []
for fp in morg_fps:
arr = np.zeros((1,))
DataStructs.ConvertToNumpyArray(fp, arr)
np_fps.append(arr)
np_fps = np.vstack(np_fps)
fpdf = pd.DataFrame(np_fps, dtype=int)
return fpdf
df_mfp = create_fp_df_from_mols(tabacco_df_mfp)

We are going to make a grouping of molecules through their molecular structure (binary representation), we will use t-Distributed Stochastic Neighbor Embedding (t-SNE) that is a non-linear dimensionality reduction technique used for visualizing and exploring high-dimensional data. It maps the data from high-dimensional space to low-dimensional space for visualization. The algorithm models the similarity between data points in high-dimensional and low-dimensional spaces using probability distributions. It uses an objective function to minimize the divergence between the two probability distributions. t-SNE is widely used in fields such as bioinformatics, genomics, data visualization, and machine learning. It provides a powerful tool for exploring and understanding complex patterns and relationships in high-dimensional data.

tsne = TSNE(random_state = 0, n_components=3,perplexity=70) # perplexity=50
tabacoo_tsne = tsne.fit_transform(df_mfp)
# Concat the TSNE result with the dataframe
tsne_tabacoo = pd.concat([tabacco_df_mfp[['class','Name','smiles','vp','Description','tobacco']], pd.DataFrame(tabacoo_tsne) ], axis=1).reset_index(drop=True).rename(columns={0:'tsne_1',1:'tsne_2',2:'tsne_3'})
tsne_tabacoo.head()

Now, by means of a k-means grouping, we group the similar structures, and we will visualize the groups using Molplotly

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=123)
sc = StandardScaler()
df_mfp_sc = sc.fit_transform(df_mfp)
clus = kmeans.fit(df_mfp_sc)
# Concat the K-Means result with the dataframe
tsne_tabacoo_kmean = pd.concat([tsne_tabacoo, pd.DataFrame(clus.labels_)], axis=1).rename(columns={0:'K_Means'})
tsne_tabacoo_kmean.head(5)

The size of each point that represents a molecular structure will depend on the value of vapor pressure.

scatter_tsnekmean = px.scatter(data_frame=tsne_tabacoo_kmean, x = 'tsne_1', y = 'tsne_2', size='vp',color='K_Means',
title='tsne_tabacoo_kmean')

scatter_tsnekmean = molplotly.add_molecules(fig=scatter_tsnekmean, df = tsne_tabacoo_kmean,
smiles_col='smiles',
color_col ='K_Means',
caption_cols=['tobacco','vp','Description','K_Means'],
caption_transform={'vp': lambda x: f"{x:.4f}"},
show_coords=False)

scatter_tsnekmean.run_server(mode='inline',port=8068)

The 2D representation gives us a good idea about how the molecules are classified, but, with a 3D representation is easier to see how they are distributed

scatter_tsnekmean_3d = px.scatter_3d(data_frame=tsne_tabacoo_kmean, x = 'tsne_1', y = 'tsne_2', z = 'tsne_3',  size='vp', color='K_Means',
template = "simple_white")
scatter_tsnekmean_3d.update_layout(margin=dict(l=0, r=0, b=0, t=0))
scatter_tsnekmean_3d = molplotly.add_molecules(fig=scatter_tsnekmean_3d, df = tsne_tabacoo_kmean,
smiles_col='smiles',
color_col = 'K_Means',
title_col='Name',
caption_cols=['tobacco','vp','Description','K_Means'],
caption_transform={'vp': lambda x: f"{x:.4f}"},
show_coords=False)
scatter_tsnekmean_3d.run_server(mode='inline',port=8005)

It is interesting how MolPlotly can help us to visualize the information about the results of an algorithm like TSNE and K-Means

It is also possible to see the distribution of molecules with their respective vapor pressure value with a molplotly bar graph.

fig_bar = px.bar(cluster_vp, x = 'tobacco',y ='vapor_pressere'  ,color='K_Means',barmode='group',
template = "simple_white", title = "Vapor pressure by Tabacoo cluster")
fig_rd_bar = molplotly.add_molecules(fig = fig_bar,
df = cluster_vp,
smiles_col= 'smiles',
title_col= 'Name',
color_col= 'K_Means',
caption_cols=['Name'])
fig_rd_bar.run_server(mode='inline', port=7590)

A brief analysis of the results using Molplotly allows us to see that the molecules with lower vapor pressures are in group 0, and in general, molecules with larger vapor pressures are in group 2.

And that’s all for now, it’s interesting how one of the many libraries can help us solve or analyze any type of information with a high level of detail, also make decisions based on data

--

--

Qco_Juan_David_Marin

Chemist with experience in python and statistics. Interested in data science, chemistry and statistics to propose innovative solutions that save time and money