SQL, Python, R and PowerBI for molecular analysis data in the same Jupyter notebook

Qco_Juan_David_Marin
18 min readMay 19, 2023

--

By Juan David Marín, e-mail: qcojuandavidmarin@gmail.com, LinkInd: https://www.linkedin.com/in/qco-juan-david-marin/, Notio, GitHub

Sometimes when I work with lab information like molecular descriptor for molecules obteined by GC/MSD I like having options like SQL to manage and store the information, python for all the possibilities it offers, and R for statistical analysis And Power BI for a fast visualization and presentation data. I know, It is not necessary have to know all of them, but in this example I would like to show how we can integrate them in to solve a challenge in laboratory data.

In the industry of fragrance, there is a lot of effort to understand the behavior of the molecules and their olfactory descriptors, in this way, to be able to classify or predict the behavior of a product made with these molecules, and why not, predict its duration of odor or find patterns that allow us to define characteristics. However, It’s worth noting that odor perception is subjective and can vary among individuals. Therefore, the accuracy of the classification may depend on the availability and quality of labeled data as well as the complexity of the odor descriptions.

As is known, the data is an important part for making machine learning and statistical analysis and its management in databases is essential, and also the visualization and presentation of data.

In this example I will use three SQL databases with the next information:
1. Molecular odor descriptor with its CAS number, name and SMILES notation respectively

2. Molecular SMILES notation and its vapor pressure value

3. Molecular Name with its CAS number and SMILE notation

With this database, it is possible to generate a SQL query to create a dataset containing Number, CAS, Name, Odor description, SMILES notation and Vapor pressure value for each molecule. After that, I calculate some molecular descriptors and choose the most relevant of them.

To consider: The relationship between molecular structure and odor is complex and not fully understood. Odor perception is subjective and can vary among individuals. Certain molecular descriptors, such as the number of carbon and oxygen atoms, specific functional groups (e.g., aldehydes, ketones, esters, alcohols, ethers, etc.), molecular weight, polarity, refractive indices, octanol-water partition coefficient (LogP), surface area, and molecular volume, have been associated with odor. However, the relationship between molecular structure and odor is highly complex and influenced by various factors. The choice of descriptors depends on the dataset and type of analysis.

That being said, the data obtained is adequate to show the example and make a statistical analysis and non supervised machine learning

The libraries needed for this example are:

# Commons python libraries for handle data and plots
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import numpy as np
import plotly.io as pio
# Cheminformatics toolkit library
from rdkit import Chem
from rdkit.Chem import PandasTools, AllChem, Descriptors
from rdkit import DataStructs
import plotly.express as px
import cirpy
import molplotly
# Machine learning library
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Manage an SQLite database file in python
import sqlite3

Let start. The database is loaded with its tables and then the SQL query is made for integrating them in the data of interest

con = sqlite3.connect('smiles_cas_pv.db')
cur = con.cursor()
raw_data = pd.read_sql_query("SELECT DISTINCT A.CAS,A.Name, max(A.Primary_Odor) as odor, B.smiles,C.pv
FROM odor AS A
LEFT JOIN name_cas_smiles as B
ON A.CAS=B.CAS
INNER JOIN vp as C
ON B.smiles=C.smiles
GROUP BY A.CAS, A.Name
ORDER BY pv", con)
print(raw_data.shape)
raw_data.tail()

This is how the odor data is distributed

As you can see in the data, the largest count value belongs to Woody descriptor, moreover there are 66 odor descriptor, there are many, so, to be more practice, I will work with three odors descriptors [‘Sweet’, ‘Waxy’, ‘Woody’]. I need to create a new column [‘odor_categorical’] for the molplotly graph in the next steps

data_odor = raw_data[raw_data['odor'].isin(['Sweet', 'Waxy', 'Woody'])].reset_index(drop = True)
data_odor['odor_categorical'] = data_odor['odor'].apply(lambda x: 'Wo' if x == 'Woody' else('SW' if x == 'Sweet' else 'Wa'))
print(data_odor.shape)
data_odor.head()
freq_odor = data_odor.groupby('odor').agg(count=('odor', 'count'), mean_pv = ('pv', 'mean') ).reset_index(drop=False).sort_values(by='mean_pv', ascending=False)
bar = px.bar(data_frame=freq_odor, x = 'odor', y = 'mean_pv', color='odor', text_auto=True, text='count',
template = "simple_white", title = 'odor distribution')
bar.update_layout(autosize=False,width=800,height=400)
bar.show()
This plot shows the mean of vapor pressure by odor description
This plot shows the mean of vapor pressure by odor description

With molplotly (As I showed in this blog) it is possible to find which molecules has the lower and higher vapor pressure value in a plotlyplot

bar = px.bar(data_frame=data_odor, x = 'odor', y = 'pv', 
color = 'odor_categorical',
template = "simple_white", title = 'odor distribution')
bar1 = molplotly.add_molecules(fig = bar, df = data_odor,
smiles_col='smiles',
color_col= 'odor_categorical',
caption_cols=['Name'])
bar1.run_server(mode='inline',port=7796)

We can see that the highest average vapor pressure values belong to Sweet odor descriptor and the lowest belong to the Waxy odor description. However, the new data set is unbalanced, there are more molecules with a woody odor descriptor than Waxy and Sweet. But for this example I will leave it that way, because the purpose of this work is to investigate whether there is a behavior that discriminates molecular behavior by its vapor pressure value within olfactory descriptors or clusters using unsupervised machine learning techniques. But also show how we can integrate SQL, Python, R and Power BI in a single jupyter notebook

As was mentioned before, some of the molecular descriptors that have been related to odor include:

Number of carbon and oxygen atoms. Specific functional groups (e.g., aldehydes, ketones, esters, alcohols, ethers, etc.). Molecular weight. Polarity. Refractive indexes. Octanol-water partition coefficient (LogP). Surface area. Molecular volume. However, it is important to keep in mind that the relationship between molecular structure and odor is very complex and influenced by many different factors

So, With RDKIT library some molecular descriptors will be calculated and then most important of them will be selected

carboxylic_acid = Chem.MolFromSmiles('CC(=O)[OH]')
ester = Chem.MolFromSmiles('C[C](=O)OC')
benzene = Chem.MolFromSmarts('c1ccccc1')
eter = Chem.MolFromSmiles('COC')
ketone = Chem.MolFromSmarts('CC(=O)C')
Aldehyde = Chem.MolFromSmiles('C(=O)C')

molecular_query = pd.DataFrame([carboxylic_acid,ester,benzene,eter,ketone,Aldehyde]).rename(columns ={0:'molecules'})
molecular_query['smiles'] = molecular_query['molecules'].apply(lambda x: Chem.MolToSmiles(x))
molecular_query['names'] = molecular_query['smiles'].apply(lambda x: cirpy.resolve(x, 'names')[0])
PandasTools.FrameToGridImage(molecular_query, column='molecules', molsPerRow=6,legendsCol='names')
These are some Chemical functional groups that will be counted for each molecule in the data set

Descriptors will also be added as: The molar refractivity of the molecule [‘MolMR’], Molecular weight [‘MW’], Partition coefficient [‘MolLogP’], Topological polar surface area [‘TPSA’], Fraction of heavy atoms that are aromatic [‘CSP3’] and Partial charges [‘Part_charg’]

data_odor['MolMR'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Descriptors.MolMR(x))
data_odor['MW'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Descriptors.ExactMolWt(x))
data_odor['MolLogP'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Chem.Crippen.MolLogP(x))
data_odor['TPSA'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Chem.rdMolDescriptors.CalcTPSA(x))
data_odor['CSP3'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: Descriptors.FractionCSP3(x))

Partial_charges = []
san = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x))
#------
for i in san:
AllChem.ComputeGasteigerCharges(i)
for j in san:
value = j.GetAtomWithIdx(0).GetDoubleProp('_GasteigerCharge')
Partial_charges.append(value)
data_odor['Part_charg'] = Partial_charges
#------
data_odor["n_Atoms"] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: x.GetNumAtoms())
data_odor['n_car_acid'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(carboxylic_acid)))
data_odor['n_ester'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(ester)))
data_odor['n_benzene'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(benzene)))
data_odor['n_eter'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(eter)))
data_odor['n_ketone'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(ketone)))
data_odor['n_Aldehyde'] = data_odor['smiles'].apply(lambda x: Chem.MolFromSmiles(x)).apply(lambda x: len(x.GetSubstructMatches(Aldehyde)))
raw_dara_odor = data_odor.copy(deep=True)
data_odor.head()

Some characteristics are correlated, but the most different behavior is for Woody molecular descriptors. Apparently, the physico-chemical characteristics are similar for the odour descriptors Sweet and Waxy.

The variables with the greatest contribution to the total variance are selected. This can be done in several ways, but a simple way is to select variables whose contribution to the total variance is greater than a given threshold. For example, you can set a threshold of X% and select the variables whose contribution to the variance is greater than this value. Once you have selected the most representative variables, use the PCA model to get the new variables that best represent the variance of the original data.

X = data_odor.iloc[:,4:].drop('odor_categorical', axis = 1)
y = data_odor['odor']

X_norm = StandardScaler().fit_transform(X)
model_pca = PCA()
model_pca.fit_transform(X_norm)
prop_var = model_pca.explained_variance_ratio_

threshold = 0.06
selected_var = []
for i in range(len(prop_var)):
if prop_var[i] >= threshold:
selected_var.append(i)

X_selec = X.iloc[:, selected_var]
X_selec.head()
Those are the principal characteristics that describe “Better the data” so, the data set is reduced

An explained variance of 6% is selected. As we saw in the chart above, the data does not contribute much to the classification of odour classes. But it is fine for the proposed example

data_odor = pd.concat([data_odor.iloc[:,0:4], X_selec], axis=1)
data_odor.head()
This is the resulting dataset

Now, with the data prepared in python, these will be transformed to work in R

rpy2 is a Python library that provides seamless integration between Python and R, allowing you to use R functions and objects within your Python code. It enables data scientists and researchers to leverage the power of both languages, combining the flexibility and ease of Python with the statistical capabilities of R.

import os
os.environ["R_HOME"] = r"C:\Program Files\R\R-4.2.3"
os.environ["PATH"] = r"C:\Program Files\R\R-4.2.3\bin\x64" + ";" + os.environ["PATH"]
import rpy2
print(rpy2.__version__)
from rpy2.robjects import pandas2ri, packages
pandas2ri.activate()
stats = packages.importr('stats')
%load_ext rpy2.ipython

'''3.5.11
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython'''

If you have problems with load the rpy2 library, you should set the directory of R version as is shown in the above code. In each jupyther cell is needed start with %%R command.

First the R libraries are loaded

%%R
library(dplyr)
library(tidyverse)
library(tidyr)
library(rstatix)
library(ggpubr)
library(ggstatsplot)
library(WRS2)
library(FactoMineR)
library(factoextra)

The python data frame is converted to an R data frame

%%R -i data_odor
r_data_odor <- as.data.frame(data_odor)
r_data_odor$odor <- as.factor(r_data_odor$odor)
row.names(r_data_odor) <- r_data_odor$Name

print(class(r_data_odor))
print(class(r_data_odor$odor))

[1] "data.frame"
[1] "factor"

The PCA algorithm is performed and then clustering with HCLUS is performed

%%R 
data_pca <- r_data_odor[,c(3,5:9)]
row.names(data_pca) <- r_data_odor$Name
r_pca_odor <- PCA(data_pca, scale.unit = T, graph = F, quali.sup = 1)
get_eigenvalue(r_pca_odor)

[,1] [,2] [,3]
[1,] 2.982431114 59.6486223 59.64862
[2,] 1.292633938 25.8526788 85.50130
[3,] 0.665402106 13.3080421 98.80934
[4,] 0.054032510 1.0806502 99.88999
[5,] 0.005500332 0.1100066 100.00000
%%R -w 1500 -h 700 --type=cairo
hcpc <- HCPC(r_pca_odor, nb.clust = -1, graph = F, min= 3)
fviz_dend(hcpc, rect = T, rect_fill = T)

These are the most significant variables for dimensions 1 and 2

%%R -w 1300 -h 500 --type=cairo
q<- fviz_contrib(r_pca_odor, choice='var', axes = 1) +
theme(text= element_text(size=20), plot.subtitle = element_text(size=20))

p <- fviz_contrib(r_pca_odor, choice='var', axes = 2) +
theme(text= element_text(size=20), plot.subtitle = element_text(size=20))

r <- fviz_pca_var(r_pca_odor, col.var = factor(c("TPSA", "MW", "MolMR", "MolLogP", 'pv')),
legend.title = list(fill = "Col", color = "variables")) +
theme(text= element_text(size=20), plot.subtitle = element_text(size=20))
q|p|r
%%R -w 1300 -h 500 --type=cairo
library("corrplot")
var <- get_pca_var(r_pca_odor)
corrplot(var$cos2, is.corr=FALSE)

The variables with the greatest contribution to a PC are the most correlated with it. MolMR, MolLogP and MW contribute to dimension 1, for TPSA and PV both have the best contribution in dimension 2

Description of the dimensions identifying the variables associated with each component

%%R
res_desc <- dimdesc(r_pca_odor, axes = c(1,2), proba=0.05)
res_desc$Dim.1

________________
Link between the variable and the continuous variables (R-square)
=================================================================================
correlation p.value
MolMR 0.9792868 2.305256e-207
MolLogP 0.9287778 1.278260e-129
MW 0.9229810 8.816131e-125
TPSA -0.3279343 6.696648e-09
pv -0.5859296 7.317022e-29

Link between the variable and the categorical variable (1-way anova)
=============================================
R2 p.value
odor 0.03973219 0.002528567

Link between variable and the categories of the categorical variables
================================================================
Estimate p.value
odor=Woody 0.2905562 0.0135279300
odor=Sweet -0.6810859 0.0005586908
%%R
res_desc$Dim.2

________________
Link between the variable and the continuous variables (R-square)
=================================================================================
correlation p.value
TPSA 0.8935910 5.566395e-105
MW 0.2616673 4.692049e-06
MolLogP -0.3250756 9.199324e-09
pv -0.5226796 2.715452e-22

Link between the variable and the categorical variable (1-way anova)
=============================================
R2 p.value
odor 0.0364622 0.004174858

Link between variable and the categories of the categorical variables
================================================================
Estimate p.value
odor=Sweet 0.2826154 0.002314026
odor=Woody -0.2952893 0.001534208

The associated variables are sorted according to the correlation coefficients with each component. For the dimension one, the variable MolMR presents greater correlation and PV presents the lowest correlation and negatively (greater MolMR lower PV). The odor variable is significantly related to the first PC and explains 3.97% of the data variance. For Woody odor descriptors, these have a higher mean value on the first axis.

The associated variables are sorted according to the correlation coefficients with each component. For the dimension one, the variable MolMR presents greater correlation and PV presents the lowest correlation and negatively (greater MolMR lower PV). The odor variable is significantly related to the first PC and explains 3.97% of the data variance. For Woody odor descriptors, these have a higher mean value on the first axis.

For dimension for the TPSA variable presents greater correlation, and PV presents the lowest correlation and negatively. The odor variable is significantly related to the second PC and explains 3.64% of the data variance. Sweet odour descriptors have a higher mean value on the second axis.

Molecules with similar profiles are identified. Molecules that are in the center of fviz_pca_ind graph has an average value of all variables, molecules far away from the center stand out for some variable

%%R -w 1700 -h 1000 --type=cairo
n <- fviz_contrib(r_pca_odor, choice='ind', axes = 1, top =110) +
theme(text= element_text(size=7), plot.subtitle = element_text(size=10))

m <- fviz_contrib(r_pca_odor, choice='ind', axes = 2, top = 100) +
theme(text= element_text(size=7), plot.subtitle = element_text(size=10))

p <- fviz_pca_ind(r_pca_odor, geom.ind =c("point"), repel = F, habillage = 1,
pointsize = "cos2")
n|m|p

With HCPC Hierarchical clustering on main components, the data is classified by clusters with similar characteristics

%%R -w 1300 -h 500 --type=cairo
fviz_cluster(hcpc, show.clust.cent =T, repel = F) +
theme(text = element_text(size = 10),
axis.title = element_text(size =15),
axis.text = element_text(size = 15))
%%R
hcpc$desc.var$category

----------------
$`1`
NULL

$`2`
Cla/Mod Mod/Cla Global p.value v.test
odor=Woody 45.41485 71.72414 76.84564 0.04297437 -2.023959

$`3`
Cla/Mod Mod/Cla Global p.value v.test
odor=Woody 50.65502 82.85714 76.84564 0.02100434 2.307906
odor=Sweet 31.11111 10.00000 15.10067 0.02099339 -2.308103

Categorical variables that characterize each group.

45.41% of olfactory descriptor Woody belong to group two, 71.72% of the molecules in the group to belong to the olfactory descriptor Woody.

50.66% and 31.11% of the descriptors Woody and Sweet respectively belong to group three, 82.86% and 10% of the molecules of group three are Woody and Sweet respectively

Quantitative variables that describe better each group:

%%R
hcpc$desc.var$quanti
$`1`
v.test Mean in category Overall mean sd in category Overall sd
pv 15.049088 6.562308 0.5235317 2.058488 1.476951
MolLogP -4.259740 1.741194 3.1428279 1.026677 1.211097
MolMR -5.927077 32.349708 55.1443768 9.157728 14.155337
MW -6.403422 105.316777 181.7162358 23.586637 43.914267
p.value
pv 3.500257e-51
MolLogP 2.046649e-05
MolMR 3.083748e-09
MW 1.519326e-10

$`2`
v.test Mean in category Overall mean sd in category Overall sd
TPSA 5.650932 24.572138 19.955738 13.353444 13.705657
MW -11.072426 152.733997 181.716236 26.203642 43.914267
MolMR -12.259791 44.800419 55.144377 7.507538 14.155337
MolLogP -12.369186 2.249926 3.142828 0.694216 1.211097
p.value
TPSA 1.595805e-08
MW 1.707146e-28
MolMR 1.488839e-34
MolLogP 3.836887e-35

$`3`
v.test Mean in category Overall mean sd in category Overall sd
MolMR 14.703518 67.97441000 55.1443768 6.05550438 14.155337
MolLogP 14.130697 4.19777071 3.1428279 0.63560517 1.211097
MW 13.709361 218.82778992 181.7162358 23.51225302 43.914267
TPSA -5.092572 15.65321429 19.9557383 12.74161294 13.705657
pv -5.581699 0.01535013 0.5235317 0.02128214 1.476951
p.value
MolMR 6.119576e-49
MolLogP 2.457016e-45
MW 8.924064e-43
TPSA 3.532393e-07
pv 2.381807e-08

The variables that define each grouping significantly. For example: cluster 1 presents higher values in the global average of the PV variable and lower values for MW. This has the logic because generally higher vapor pressure lower molecular weight. Group 2 is composed of molecules with the highest value of the topological polar surface area (TPSA), and in group 3 are the molecules with lower vapor pressure

Molecules which are the most representative of each group.

%%R
hcpc$desc.ind
$para
Cluster: 1
4-Heptanone 2-Methylpyrazin 1-Butanol 3-Pentanol Thioacetic acid
0.4718251 1.3149548 1.5841648 1.5965202 1.8697976
------------------------------------------------------------
Cluster: 2
gamma-Nonalactone 1-Phenylethyl acetate Prenyl isobutyrate
0.3626067 0.4018175 0.4019961
4-Thujanol 4-Methylbenzyl acetate
0.4106790 0.4494803
------------------------------------------------------------
Cluster: 3
2,6,10-Trimethylundec-9-enal
0.2760547
13-Oxabicyclo(10.1.0)trideca-4,8-diene, trimethyl-
0.3040667
Caryophyllene oxide
0.3377409
10-Epi-gamma-Eudesmol
0.3608474
Butylated hydroxytoluene
0.3804981

$dist
Cluster: 1
Pyrazine Hexanal 3-Pentanol 1-Butanol 2-Methylpyrazin
6.795043 6.395202 5.897309 5.644292 5.117037
------------------------------------------------------------
Cluster: 2
Triethyl citrate 2-Oxobutyric acid
7.059849 5.186209
4-Hydroxy-5-methylfuran-3(2H)-one Ethyl 5-oxooxolane-2-carboxylate
5.155769 4.996329
4-Hydroxybenzoic acid
4.970779
------------------------------------------------------------
Cluster: 3
Methyl 2-[[[4-(4-hydroxy-4-methylpentyl)-3-
6.768996
Isopropyl myristate
4.704395
1-Propanol, 2-[1-(3,3-dimethylcyclohexyl)ethoxy]-2-methyl-, 1-
4.410206
Pentadecane
4.331606
Eugenyl phenylacetate
4.215927

The “$para” option gives us information about the molecules closest to the group data center. Thus, “4-Heptanone” is the closest molecule to the barycentre of cluster one, “gamma-nonalactone” for cluster two and “2,6,10-trimethylan-undec-9-enal” for cluster 3.

The “$dist” option allows to reference the most specific molecules of each cluster that could not be classified in another group, (the molecules that would be further from the center of another cluster). So “Pyrazine” is the most specific molecule for cluster one, “Triethyl citrate” for cluster 2 and “Methyl 2-[[[4-(4-hydroxy-4-methylpentyl)-3-” for cluster 3

Now, the clusters for each molecule is attached to the data set

%%R
r_pca_odor_clus <- data.frame(data_pca, data.frame(clus = hcpc$data.clust$clust))
head(r_pca_odor_clus)

---------------
odor pv MolMR MW
Linalyl acetate Woody 0.00e+00 59.0330 196.1463
Hexyl acetate Woody 0.00e+00 40.7750 144.1150
Methyl 2-[[[4-(4-hydroxy-4-methylpentyl)-3- Sweet 7.79e-10 101.5063 343.2147
1,4-Dioxacycloheptadecane-5,17-dione Woody 4.38e-07 72.7050 270.1831
Eugenyl phenylacetate Sweet 2.94e-06 82.6760 282.1256
4-Hydroxybenzoic acid Woody 7.03e-06 35.0661 138.0317
MolLogP TPSA clus
Linalyl acetate 3.2406 26.30 3
Hexyl acetate 2.1298 26.30 2
Methyl 2-[[[4-(4-hydroxy-4-methylpentyl)-3- 4.8432 58.89 3
1,4-Dioxacycloheptadecane-5,17-dione 3.3775 52.60 3
Eugenyl phenylacetate 3.5718 35.53 3
4-Hydroxybenzoic acid 1.0904 57.53 2

With a pca_biplot is possible see the information given by “$para” and “$dist”. A classification cooperation between odor and groups is performed

%%R -w 1600 -h 500 --type=cairo

odor_plot <- fviz_pca_biplot(pca_hcpc, repel = F, labelsize = 4,
pointshape = 21, pointsize = 2,
#habillage = 1,
addEllipses =TRUE, ellipse.alpha = 0.3,
fill.ind = r_pca_odor_clus$odor,
ellipse.type = "confidence",
#geom.ind =c("point","text"),
geom.ind =c("point"),
geom.var = c("text", "arrow"),
title = 'PCA BIPLOT by ODOR',
col.var = factor(c("TPSA", "MW", "MolMR", "MolLogP", 'pv')),
legend.title = list(fill = "odor", color = "variables")) +
theme(text = element_text(size = 12),
axis.title = element_text(size =12),
axis.text = element_text(size = 12))

clust_plot <- fviz_pca_biplot(pca_hcpc, repel = F, labelsize = 4,
pointshape = 21,
pointsize = 2,
#habillage = 7,
addEllipses =TRUE, ellipse.alpha = 0.5,
fill.ind = r_pca_odor_clus$clus,
ellipse.type = "confidence",
#geom.ind =c("point","text"),
geom.ind =c("point"),
geom.var = c("text", "arrow"),
title = 'PCA BIPLOT by PCA',
col.var = factor(c("TPSA", "MW", "MolMR", "MolLogP", 'pv')),
legend.title = list(fill = "clus", color = "variables")) +
theme(text = element_text(size = 12),
axis.title = element_text(size =12),
axis.text = element_text(size = 12))

odor_plot|clust_plot

As it is possible to see, the biggest group separation is given by Cluster analysis. Through odor descriptions is more difficult separate the groups by its physical-chemical behaviour.

This could mean that the molecules classified by the odor descriptor group have similar physical-chemical properties (With the descriptors that we have already calculated)

A brief ANOVA analysis will allow us to find out if there are significant differences in the vapour pressure (PV) values between the molecules classified by clusters or by odor descriptors. (Some supposed stages will be avoided to not make the text so long)

Using a box plot we can compare the data of the groups using odour descriptors and clusters

%%R -w 1700 -h 500 --type=cairo
clus_box <- ggboxplot(r_pca_odor_clus,
x = 'clus',
y = 'pv',
color = "clus",
add = c("jitter"),
fill = "clus", alpha = 0.2) +
stat_summary(fun = mean, geom = "point", shape = 18, size = 7,
color = "darkred",
position = position_dodge(width = 1)) +
stat_summary(fun = mean, colour = "red",
position = position_dodge(width = 1),
geom = "text", vjust = -0.7, size = 7,
aes(label = round(..y.., digits = 5))) +
ggtitle('Vapor pressure mean among clusters')
odor_clus <- ggboxplot(r_pca_odor_clus,
x = 'odor',
y = 'pv',
color = "odor",
add = c("jitter"),
fill = "odor", alpha = 0.2) +
stat_summary(fun = mean, geom = "point", shape = 18, size = 7,
color = "darkred",
position = position_dodge(width = 1)) +
stat_summary(fun = mean, colour = "red",
position = position_dodge(width = 1),
geom = "text", vjust = -0.7, size = 7,
aes(label = round(..y.., digits = 5))) +
ggtitle('Vapor pressure mean among odor descriptors')
clus_box|odor_clus
Agrupando los datos por clusters y por descriptores de olor,es posible observar que el cluster uno presenta mayores presiones de vapor con descriptores de olor Sweet y Woody. En el Cluster dos estan resentes los tres descriptores con valores de presion de vapor mas bajos que el cluster uno. Cluster three contains the lowest vapour pressures

As mentioned above, by separating molecules by cluster analyses these can be discriminated much better than by grouping them by odour descriptors. Vapor pressure values are more different from cluster than by odor descriptors

A 2-way anova allows us to evaluate if there are differences in the vapour pressure value between clusters and odor types, and if there is significant interaction

First 2-way anova analysis is observed for odour descriptor groups and clusters

%%R -w 1800 -h 500 --type=cairo
anova_clus <- ggbetweenstats(data = r_pca_odor_clus, x = clus, y = pv, type = "nonparametric",
results.subtitle = T, message = T, var.equal = F, p.adjust.method = "bonferroni",
ggsignif.args = list(textsize = 5, tip_length = 0.01)) +
theme(text = element_text(size = 10), plot.subtitle = element_text(size = 13))
anova_odor <- ggbetweenstats(data = r_pca_odor_clus, x = odor, y = pv, type = "nonparametric",
results.subtitle = T, message = T, var.equal = F, p.adjust.method = "bonferroni",
ggsignif.args = list(textsize = 5, tip_length = 0.01)) +
theme(text = element_text(size = 10), plot.subtitle = element_text(size = 13))
anova_clus|anova_odor

So, the data grouped by clusters are statistically different while by odor descriptors they are not.

%%R -w 1700 -h 700 --type=cairo
grouped_ggbetweenstats(data = r_pca_odor_clus, x = clus, y = pv,
grouping.var = odor, type = "nonparametric",
results.subtitle = T, message = T, var.equal = F, p.adjust.method = "bonferroni",
ggsignif.args = list(textsize = 6, tip_length = 0.01)) +
theme(text = element_text(size = 10), plot.subtitle = element_text(size =10))
A comparison between groups shows that the vapor pressure value of the odour descriptors Sweet and Woody are statistically different in the three clusters, Waxy has no significant difference.

With an Ancova analysis we can check if there are differences in the values of VP between olfactory groups or clusters taking into account or controlling the molecular MW

%%R -w 1700 -h 500 --type=cairo
anco_clus <- ggscatter(r_pca_odor_clus, x = 'MW', y = 'pv', color= 'clus', add = c('reg.line'), size = 4, title = 'by cluster',
shape = 'odor', palette = "jco")+
stat_cor(aes(color =clus))
anco_odor <- ggscatter(r_pca_odor_clus, x = 'MW', y = 'pv', color= 'odor', add = 'reg.line', size = 4, title = 'odor',
shape = 'odor',palette = "jco")+
stat_cor(aes(color = odor))

anco_clus| anco_odor

A clearer difference remains for clustering by Clusters. Higher molecular weight lower vapor pressure. the slopes of the graph grouped by odor are very similar, therefore, they present the same behavior

These results are verified with a permanova analysis, when evaluating the vapour pressure by olfactory groups controlling the molecular weight there are no significant statistical differences

%%R -w 1700 -h 500 --type=cairo
res_anova_clus = r_pca_odor_clus %>%
anova_test(pv ~ MW*clus)
pwc11 <- pwc1 %>% add_xy_position (x="clus", fun="mean_se")
Perma_clus <- ggline(get_emmeans(pwc11), x = "clus", y = "emmean", title='Permanova by Cluster') +
geom_errorbar(aes(ymin= conf.low, ymax = conf.high), width = 0.2) +
stat_pvalue_manual(pwc11, hide.ns=TRUE, tip.length= FALSE) +
labs(subtitle = get_test_label(res_anova_clus, detailed = T),
caption = get_pwc_label(pwc11))

res_anova_odor = r_pca_odor_clus %>%
anova_test(pv ~ MW+odor)
pwc22 <- pwc %>% add_xy_position (x="odor", fun="mean_se")
Perma_odor <- ggline(get_emmeans(pwc22), x = "odor", y = "emmean", title='Permanova by Odor') +
geom_errorbar(aes(ymin= conf.low, ymax = conf.high), width = 0.2) +
stat_pvalue_manual(pwc22, hide.ns=TRUE, tip.length= FALSE) +
labs(subtitle = get_test_label(res_anova_odor, detailed = T),
caption = get_pwc_label(pwc22))

Perma_clus|Perma_odor
Statistical differences are seen for permanova analysis by Cluster. Group one with group two and group one with group 3

Finally, the R dataset will be converted to python pandas dataset and then processed in Power BI using the powerbiclient library

%%R -i raw_dara_odor
r_raw_dara_odor <- as.data.frame(raw_dara_odor)
%%R -o data_R_pd
pca_odor <- prcomp(r_data_odor[,c(5:9)], scale = T)
scores = as.data.frame(pca_odor$x)
data_R_pd <- cbind(r_raw_dara_odor, clus = r_pca_odor_clus$clus, scores)
from rpy2.robjects import pandas2ri
data_r_python = pandas2ri.PandasDataFrame(data_R_pd)
data_r_python = pd.DataFrame(data_r_python).reset_index(drop=True)
print('Columns', data_r_python.columns, '\n')
print('NULL VALUES:', data_r_python.columns.isnull().sum())
data_r_python.head()

A fast POWER BI analysis

auth = DeviceCodeLoginAuthentication()

-------------
'''Performing device flow authentication. Please follow the instructions below.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code xxxxxx to authenticate.

Device flow authentication successfully completed.
You are now logged in .

The result should be passed only to trusted code in your notebook'''
PBI_visualize = QuickVisualize(get_dataset_config(data_r_python), auth = auth)
PBI_visualize

With this dashboard it is possible to change the information showed in the each graph easier and present the data analysis

As a brief conclusion, the analysis of the molecules by their olfactory descriptors is not feasible by means of the calculated molecular descriptors, We could look for other physical-chemical alternatives or variables that allow us to find larger differences.

This example had a very brief statistical and behavioral analysis of the data, its purpose was to show how these technologies are trying to solve the daily challenges we could face. ¡It is Another option!.

best regards !!!!!

--

--

Qco_Juan_David_Marin

Chemist with experience in python and statistics. Interested in data science, chemistry and statistics to propose innovative solutions that save time and money