Interactive R Markdown for the analysis of correlations among products in the context of Competitive Intelligence. Part 2

Analytical chemistry with R programming

Qco_Juan_David_Marin
7 min read · Jul 10, 2023

By Juan David Marín, e-mail: qcojuandavidmarin@gmail.com, LinkedIn: https://www.linkedin.com/in/qco-juan-david-marin/, Notion, GitHub

This is the second part of the post. In this section, we continue the competitive intelligence analysis of skin care products using instrumental chemical analysis. In the previous post, we explored the correlations among various products based on their physicochemical behavior and saw how they grouped into clusters by similarity. Now, with these data, we can perform correlation analysis and assess the potential for predicting the properties of new products.

First, we will start with the correlation analysis.

```{r}
library(shiny)
library(dplyr)

# Keep only the numeric variables
numeric <- data %>%
  select_if(is.numeric)

selectInput(inputId = 'corr', label = 'Select a numeric variable',
            choices = names(numeric), selected = names(numeric)[5], multiple = F)

# p-values of the Pearson correlation between the selected variable
# and every numeric variable
renderPrint(
  sapply(numeric, function(x) round(cor.test(x, numeric[[input$corr]])$p.value, 5))
)
# The corresponding correlation coefficients
renderPrint(
  sapply(numeric, function(x) round(cor.test(x, numeric[[input$corr]])$estimate, 2))
)
```
# Correlation plot

```{r}
library(corrplot)

numeric <- data_clus %>%
  select_if(is.numeric)

selectInput(inputId = 'numerical_df', label = 'Select a numeric variable',
            choices = names(numeric), selected = names(numeric), multiple = T)

# Build a data frame with the variables chosen in the select input
df_interactive <- reactive({
  data_clus[input$numerical_df]
})

renderPlot(
  df_interactive() %>%
    cor() %>%
    corrplot(method = "ellipse", type = "lower", tl.cex = 1,
             order = "hclust", insig = "pch", tl.col = 'black',
             addCoef.col = 'black', number.cex = 0.9, number.font = 5,
             col = COL2('PiYG'))
)
```

The p-values of the correlations among the variables are below 0.05, indicating that the correlations are statistically different from zero. In addition, the correlations are positive and high, except for those involving var_A3.
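As a non-interactive cross-check, the full correlation and p-value matrices can be computed in a single call. A minimal sketch, assuming the data frame `data` from part 1 is loaded and the Hmisc package is installed:

```{r}
# Cross-check: correlation and p-value matrices in one call.
# Hmisc::rcorr() expects a numeric matrix.
library(Hmisc)
library(dplyr)

num_mat <- data %>%
  select_if(is.numeric) %>%
  as.matrix()

corr_res <- rcorr(num_mat, type = "pearson")
round(corr_res$r, 2)  # Pearson correlation coefficients
round(corr_res$P, 5)  # p-values (NA on the diagonal)
```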

With the correlation analysis complete, the next step is to perform simple linear regression models.

```{r warning=FALSE}
numeric <- data %>%
  select_if(is.numeric)

selectInput(inputId = 'lm', label = 'Select a numeric variable for LM',
            choices = names(numeric), selected = names(numeric)[5], multiple = F)

# R-squared of a simple linear regression of every numeric variable
# on the selected predictor
renderPrint(
  sapply(numeric, function(x) round(summary(lm(x ~ numeric[[input$lm]]))$r.squared, 2))
)
```
```{r}
library(shiny)
library(ggpubr)

selectInput(inputId = 'lmx', label = 'Select a numeric variable for the x axis',
            choices = names(numeric), selected = names(numeric)[5], multiple = F)
selectInput(inputId = 'lmy', label = 'Select a numeric variable for the y axis',
            choices = names(numeric), selected = names(numeric)[1], multiple = F)
checkboxInput(inputId = 'tf', label = 'Show ellipse', value = F)

# Scatter plot with the fitted regression line, R-squared, p-value and
# regression equation; ellipses and mean points can be toggled
renderPlot(
  data_clus %>%
    ggscatter(x = input$lmx, y = input$lmy, add = 'reg.line', color = 'clus',
              palette = "jco", ellipse = input$tf, ellipse.level = 0.95, ellipse.alpha = 0.1,
              ellipse.border.remove = T, mean.point = input$tf, ellipse.type = "norm",
              add.params = list(color = "red")) +
    stat_cor(label.x.npc = "left", label.y.npc = "top",
             aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
    stat_regline_equation(label.x.npc = 'center', label.y.npc = 'top')
)
```

The F-test for all the linear regressions among the variables was significant, with p-values below 0.05 at the 95% confidence level. Overall, the models show good predictive ability, except for var_A3, which follows a different pattern.

To predict the remaining variables, I will use var_B5 as the independent variable for the type B physicochemical variables and var_A5 for the type A variables.
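This choice can be motivated programmatically. A minimal sketch, again assuming `data` from part 1; the helper `fit_simple_lms()` is hypothetical and simply collects, for one candidate predictor, the R-squared and overall F-test p-value of each simple regression:

```{r}
# Hypothetical helper: regress every other numeric variable on one
# predictor and report R-squared and the overall F-test p-value.
fit_simple_lms <- function(df, predictor) {
  responses <- setdiff(names(df), predictor)
  sapply(responses, function(y) {
    fit <- summary(lm(reformulate(predictor, response = y), data = df))
    f <- fit$fstatistic
    p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
    c(r.squared = round(fit$r.squared, 2),
      f.p.value = round(unname(p), 5))
  })
}

numeric <- data %>% select_if(is.numeric)
fit_simple_lms(numeric, "var_A5")  # candidate for the type A variables
fit_simple_lms(numeric, "var_B5")  # candidate for the type B variables
```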

Let’s suppose now that we have a new product and, instead of performing all 10 physicochemical measurements, we perform only 2 of them, var_A5 and var_B5. The results obtained are 1.4 and 295.3, respectively. We will see how this new sample is classified using the predicted data.

Linear models for the variables measured with type A chemical equipment

```{r}
# Value of var_A5 measured on the new product
var_A5 <- data.frame(var_A5 = 1.4)

# Simple regressions of each type A variable on var_A5
model_A1 <- lm(var_A1 ~ var_A5, data = data)
model_A2 <- lm(var_A2 ~ var_A5, data = data)
model_A3 <- lm(var_A3 ~ var_A5, data = data)
model_A4 <- lm(var_A4 ~ var_A5, data = data)

# Predict the type A variables for the new product
pred_A1 <- predict(model_A1, var_A5)
pred_A2 <- predict(model_A2, var_A5)
pred_A3 <- predict(model_A3, var_A5)
pred_A4 <- predict(model_A4, var_A5)
```

Linear models for the variables measured with type B chemical equipment

```{r}
# Value of var_B5 measured on the new product
var_B5 <- data.frame(var_B5 = 295.3)

# Simple regressions of each type B variable on var_B5
model_B1 <- lm(var_B1 ~ var_B5, data = data)
model_B2 <- lm(var_B2 ~ var_B5, data = data)
model_B3 <- lm(var_B3 ~ var_B5, data = data)
model_B4 <- lm(var_B4 ~ var_B5, data = data)

# Predict the type B variables for the new product
pred_B1 <- predict(model_B1, var_B5)
pred_B2 <- predict(model_B2, var_B5)
pred_B3 <- predict(model_B3, var_B5)
pred_B4 <- predict(model_B4, var_B5)
```

Create a dataset with the predicted data.

```{r}
# Combine the two measured values with the eight predicted ones
prediction = cbind(var_A1 = pred_A1, var_A2 = pred_A2, var_A3 = pred_A3, var_A4 = pred_A4, var_A5 = var_A5,
                   var_B1 = pred_B1, var_B2 = pred_B2, var_B3 = pred_B3, var_B4 = pred_B4, var_B5 = var_B5)
row.names(prediction) <- 'pred'
prediction
```
```
       var_A1   var_A2   var_A3   var_A4 var_A5   var_B1   var_B2   var_B3   var_B4 var_B5
pred 15.75512 551.1726 1.763584 4.100321    1.4 34.15665 3505.221 647.0743 4294.472  295.3
```

We then assign the new sample to a cluster using the predicted data.


```{r}
# Mean and standard deviation stored when the original data were scaled
original_mean <- attr(data_std, "scaled:center")
original_sd <- attr(data_std, "scaled:scale")

# Standardize the prediction with the statistics of the original data
prediction_sc <- scale(prediction, center = original_mean, scale = original_sd) %>%
  as.data.frame()

# data_km is the clustering model fitted in part 1; it is assumed to
# support predict() on new data
clus_pred <- predict(data_km, newdata = prediction_sc)
prediction <- cbind(prediction, clus = as.factor(clus_pred))

clus_pred
```

```
> clus_pred
[1] 3
```

We scaled the predictions obtained for each variable using the mean and standard deviation of the original dataset, and then performed the classification.
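For a single variable this standardization is just a z-score, so it can be sanity-checked by hand (a sketch using the measured var_A5 value):

```{r}
# Manual z-score of the measured var_A5 value; it should match the
# var_A5 column of prediction_sc computed above.
(1.4 - original_mean["var_A5"]) / original_sd["var_A5"]
prediction_sc$var_A5
```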


```{r}
library(FactoMineR)
library(factoextra)

# Append the prediction to the clustered data (numeric variables + clus)
new_data_clus_pred <- data_clus %>%
  dplyr::select(-class, -brand) %>%
  rbind(prediction)

selectInput(inputId = 'label', label = 'Labels', multiple = T,
            choices = c("ind", "ind.sup", "quali", "var", "quanti.sup"), selected = "ind.sup")

selectInput(inputId = 'invisible', label = 'Invisible', multiple = T,
            choices = c("ind", "ind.sup", "quali", "var", "quanti.sup"), selected = "ind")

# PCA with the cluster column as a qualitative supplementary variable
# and the new sample (the last row, number 30) as a supplementary individual
res_pca3 <- PCA(new_data_clus_pred, graph = F,
                quali.sup = 11,
                ind.sup = 30)

renderPlot(
  fviz_pca_biplot(res_pca3, pointshape = 21, pointsize = 2,
                  habillage = 11, label = c(input$label),
                  invisible = c(input$invisible),
                  addEllipses = TRUE,
                  ellipse.alpha = 0.1,
                  ellipse.type = "norm",
                  alpha.var = "contrib",
                  col.ind.sup = "red")
)
```

In the gif above, you can see the classification of the new sample, whose “pred” label is highlighted in red.

Now, it is possible to visualize the predicted profile of the new sample through a radar plot, which allows for a comparison of the profiles of each cluster with the new sample.

```{r}
library(tidyr)
library(rstatix)
library(plotly)

## Long-format, scaled data summarized by cluster
data_long_by_clus_scale <- new_data_clus_pred %>%
  select_if(is.numeric) %>%
  scale() %>%
  cbind(clus = new_data_clus_pred$clus) %>%
  data.frame() %>%
  tibble::rownames_to_column(var = 'names') %>%
  gather(key = 'var', value = 'values', 2:11, factor_key = T) %>%
  group_by(clus, var) %>%
  get_summary_stats(values, type = "mean_sd")

## The scaled prediction in the same long format
prediction_scale_long <- prediction_sc %>%
  cbind(clus = as.factor(clus_pred)) %>%
  tibble::rownames_to_column(var = 'names') %>%
  gather(key = 'var', value = 'values', 2:11) %>%
  group_by(clus, var) %>%
  get_summary_stats(values, type = "mean")

# One profile per cluster
clus_1 <- subset(data_long_by_clus_scale, data_long_by_clus_scale$clus == 1)
clus_2 <- subset(data_long_by_clus_scale, data_long_by_clus_scale$clus == 2)
clus_3 <- subset(data_long_by_clus_scale, data_long_by_clus_scale$clus == 3)

# Radar (polar) chart with one trace per cluster plus the prediction
fig <- plot_ly(type = 'scatterpolar', mode = 'lines', fill = 'toself')

fig <- fig %>%
  add_trace(type = 'scatterpolar', mode = 'lines+markers',
            r = clus_1$mean, theta = clus_1$var,
            fill = 'toself', name = 'Clus 1')
fig <- fig %>%
  add_trace(type = 'scatterpolar', mode = 'lines+markers',
            r = clus_2$mean, theta = clus_2$var,
            fill = 'toself', name = 'Clus 2')
fig <- fig %>%
  add_trace(type = 'scatterpolar', mode = 'lines+markers',
            r = clus_3$mean, theta = clus_3$var,
            fill = 'toself', name = 'Clus 3')
fig <- fig %>%
  add_trace(type = 'scatterpolar', mode = 'lines+markers',
            r = prediction_scale_long$mean, theta = prediction_scale_long$var,
            fill = 'toself', name = 'prediction')

fig <- fig %>%
  layout(polar = list(radialaxis = list(visible = T)))

renderPlotly(fig)
```

The new sample has been classified into cluster 3, indicating that its predicted physicochemical profile is most similar to the products in that cluster.
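This assignment can be double-checked by computing the Euclidean distance from the scaled prediction to each cluster centroid; the nearest centroid should be that of cluster 3. A minimal sketch, assuming `data_km` exposes its centroids in a `$centers` matrix (as a base `kmeans` object does) with columns in the same order as `prediction_sc`:

```{r}
# Distance from the scaled prediction to each cluster centroid;
# the new sample should be nearest to the cluster-3 centroid.
dists <- apply(data_km$centers, 1, function(center) {
  sqrt(sum((as.numeric(prediction_sc[1, ]) - center)^2))
})
dists
which.min(dists)
```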

Conclusion: In these posts, the intention was to demonstrate an interactive application for a competitive intelligence problem. It was shown how, through chemical and data analysis, companies can become more efficient, which is of great importance: cost and time savings are constant areas for improvement. A brief summary of machine learning models and their respective evaluation statistics was provided. However, some assumptions were overlooked, especially for the simple linear regressions, where linearity, homoscedasticity, normality of the residuals, and so on, were not checked here. Of course, these assessments should accompany the practical work.
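As an illustration of the checks that were skipped, base R already provides the standard residual diagnostics for any of the fitted models; a short sketch for `model_A1`:

```{r}
# Diagnostic plots: residuals vs fitted (linearity), Q-Q plot
# (normality), scale-location (homoscedasticity), residuals vs leverage.
par(mfrow = c(2, 2))
plot(model_A1)

# A formal normality test of the residuals.
shapiro.test(residuals(model_A1))
```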

Still, I hope this has been a good example of what my work as a scientist looks like, and that it helped clarify ideas for those who needed it.
