Interactive R Markdown for the analysis of correlations among products in the context of Competitive Intelligence. Part 1

Analytical chemistry with R programming

Qco_Juan_David_Marin
Jul 9, 2023 · 17 min read

By Juan David Marín, e-mail: qcojuandavidmarin@gmail.com, LinkedIn: https://www.linkedin.com/in/qco-juan-david-marin/, Notion, GitHub

Competitive intelligence is a crucial practice in business that provides valuable insights into competitors’ strategies, market trends, and risks. It helps businesses make informed decisions, develop effective strategies, and stay ahead in the market. By monitoring competitors, businesses can identify opportunities, mitigate risks, and enhance innovation. Competitive intelligence also improves market positioning by enabling businesses to differentiate themselves and target their audience effectively.

The work done by scientists helps companies in intelligent competitive processes. For example, through comparative analyses of competitive products in the market, we can evaluate the physicochemical behavior of our product formulations in comparison to those of the market. This enables us to identify areas for improvement and adapt quickly to market changes, facilitating the development of innovative products with better quality and efficiency. Analytical chemistry plays a crucial role in these evaluations.

Another important factor is data analysis. Combined with the results obtained from analytical chemistry, it becomes a powerful competitive tool that allows us to examine the results thoroughly, extract more information, and make predictions for future analyses.

A brief example of how these fields can be integrated is shown in this post. You can choose software for statistics and data analysis, such as Python, RStudio, Minitab, SPSS, etc. The most popular are Python and R, both excellent tools for data analysis and visualization. Moreover, they offer many possibilities for creating interactive interfaces, facilitating user interaction.

For example, R offers R Markdown, which makes it possible to showcase the results of statistical models, machine learning models, and their respective graphics in an interactive Rmd document. I will use it to analyze some skincare products, some manufactured by Market 1 and others by Market 2. The goal is to determine which products exhibit better performance based on their physicochemical properties. The results (simulated for this example) were obtained through instrumental chemical analysis in the lab. The analyses were conducted with two pieces of chemical analysis equipment that together provide ten different measurements of physicochemical behavior. The measurements are labeled Var_A for the variables obtained with equipment A and Var_B for the variables obtained with equipment B.

This analysis has consumed a significant amount of time and resources. However, we will explore how to conduct future analyses in a more cost-effective and time-efficient manner.

Let's start the analysis by loading the data and libraries and adjusting the output of the R Markdown document.

---
title: "publication"
author: "Juan David Marin"
date: "2023-07-09"
output: html_notebook
runtime: shiny
---

```{r warning=FALSE, message=FALSE}
library(readxl)
library(tidyr)
library(dplyr)
library(tidyselect)
library(tibble)
library(rstatix)
library(FactoMineR)
library(factoextra)
library(DT)
library(patchwork)
library(shiny)
library(ggstatsplot)
library(plotly)
```

```{r warning=F}
data <- read.csv('brand_study.csv', sep = ';', dec = '.', header = T, stringsAsFactors = T)
data <- column_to_rownames(data, var = 'sample')
data
```

So, the data consists of 29 products, with 10 numeric variables and 2 categorical variables. The numeric variables are associated with physicochemical measurements conducted in the lab. The categorical variable ‘class’ indicates whether the product is manufactured by company A (Market_1) or other markets (Market_2). The other categorical variable, ‘brand,’ represents the name of the manufacturing company.


Now, some basic statistics.

```{r}
data %>% 
  dplyr::select_if(is.numeric) %>%
  gather(key = 'var', value = 'values') %>%
  group_by(var) %>%
  get_summary_stats(values, type = 'mean_sd') %>%
  dplyr::select(-variable) %>%
  DT::datatable()
```
Mean and standard deviation
```{r}
data %>% 
  dplyr::select_if(is.numeric) %>%
  gather(key = 'var', value = 'values') %>%
  group_by(var) %>%
  identify_outliers(values) %>%
  DT::datatable()
```

Some statistics to assess the adequacy of the data for performing a PCA.

# Correlation matrix
```{r warning=FALSE, message=FALSE}
library(GGally)  # needed for ggpairs()
renderPlot(
  data %>%
    dplyr::select(-brand) %>%
    ggpairs(columns = 1:10, progress = FALSE, aes(color = class))
)
```

As we can see, there are correlations among the variables, despite the presence of a few outliers. However, we still need to assess whether it is appropriate to perform a PCA.

We can evaluate whether PCA is appropriate for analyzing the study data using the following 3 methods that analyze the correlation structure between the variables:

• Bartlett’s test of sphericity evaluates the nature of the correlations.
If it is significant, it indicates that the correlation matrix is not an identity matrix, i.e., the observed correlations are not due merely to sampling error.

• The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is based on the common variance.
It assesses whether there is an appropriate number of observations relative to the number of variables being evaluated.
There is an overall score and a score for each variable.
It ranges from 0 (poor adequacy) to 1 (good adequacy).

• Determinant positivity test evaluates multicollinearity.
The result should preferably fall below 0.00001.

## Bartlett's test
```{r}
library(psych)
data %>%
dplyr::select_if(is.numeric) %>%
cortest.bartlett()
```
R was not square, finding R from data
$chisq
[1] 640.0806

$p.value
[1] 3.335575e-106

$df
[1] 45

For a significance level of 5%, we can conclude that the correlation matrix is not an identity matrix (the correlations are not due solely to sampling error), as the p-value is less than 0.05. Therefore, based on the significant result from Bartlett’s test, it is appropriate to apply PCA.

## Kaiser-Meyer-Olkin (KMO)
```{r}
data %>%
dplyr::select_if(is.numeric) %>%
cor() %>% KMO()
```
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = .)
Overall MSA = 0.74
MSA for each item =
var_A1 var_A2 var_A3 var_A4 var_A5 var_B1 var_B2 var_B3 var_B4 var_B5
0.71 0.72 0.09 0.79 0.79 0.91 0.89 0.67 0.68 0.69

This measure provides information on whether the variables have sufficient correlation to extract meaningful principal components. Variables with a value <0.5, such as var_A3, could potentially be eliminated. However, it is important to note that the overall KMO index value is 0.74, which indicates good suitability of the data for factor analysis.
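As a quick, optional check (not part of the original measurements), we could drop var_A3 and confirm that the overall MSA holds up; a minimal sketch:

```{r}
# Recompute the KMO index without the low-MSA variable var_A3
data %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::select(-var_A3) %>%
  cor() %>%
  KMO()
```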

## Determinant positivity test evaluates multicollinearity
```{r}
options(scipen = 100, digits = 6)
data %>%
dplyr::select_if(is.numeric) %>%
cor() %>%
det()
```
[1] 0.00000000000216949

The determinant of the correlation matrix is positive and very close to 0 (below 0.00001), reflecting the strong correlations among the variables.

Now we perform the PCA. The variables ‘brand’ and ‘class’ will be used as supplementary categorical variables.

```{r}
res_pca <- data %>%
PCA(scale.unit = T, graph = F, quali.sup = c('brand', 'class'))
summary(res_pca)
```
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
Variance 7.858 1.140 0.638 0.153 0.127 0.042 0.032 0.006 0.004 0.001
% of var. 78.575 11.402 6.377 1.531 1.269 0.425 0.317 0.059 0.037 0.007
Cumulative % of var. 78.575 89.978 96.355 97.887 99.156 99.581 99.898 99.956 99.993 100.000

Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
product_1 | 4.613 | -4.431 8.615 0.923 | -1.022 3.158 0.049 | 0.412 0.917 0.008 |
product_2 | 3.921 | -3.869 6.569 0.974 | 0.248 0.186 0.004 | 0.519 1.455 0.017 |
product_3 | 3.707 | -3.664 5.890 0.977 | 0.277 0.231 0.006 | 0.429 0.995 0.013 |
product_4 | 2.655 | -2.632 3.041 0.983 | -0.111 0.038 0.002 | -0.230 0.286 0.008 |
product_5 | 2.269 | -2.245 2.211 0.979 | 0.130 0.051 0.003 | -0.275 0.408 0.015 |
product_6 | 2.010 | -1.643 1.185 0.668 | -0.568 0.975 0.080 | -0.844 3.849 0.176 |
product_7 | 3.036 | -2.991 3.925 0.970 | 0.320 0.310 0.011 | 0.306 0.508 0.010 |
product_8 | 4.148 | -2.129 1.989 0.263 | 3.258 32.095 0.617 | 0.994 5.340 0.057 |
product_9 | 2.354 | -2.269 2.259 0.929 | -0.372 0.418 0.025 | -0.410 0.907 0.030 |
product_10 | 1.926 | -1.517 1.011 0.621 | -1.112 3.740 0.334 | -0.110 0.065 0.003 |

Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
var_A1 | 0.852 9.235 0.726 | -0.065 0.366 0.004 | 0.478 35.770 0.228 |
var_A2 | 0.782 7.780 0.611 | -0.433 16.428 0.187 | 0.375 22.092 0.141 |
var_A3 | 0.072 0.065 0.005 | 0.953 79.658 0.908 | 0.275 11.817 0.075 |
var_A4 | 0.976 12.126 0.953 | -0.026 0.061 0.001 | 0.034 0.176 0.001 |
var_A5 | 0.972 12.019 0.944 | -0.034 0.102 0.001 | 0.056 0.497 0.003 |
var_B1 | 0.954 11.592 0.911 | 0.080 0.566 0.006 | 0.027 0.114 0.001 |
var_B2 | 0.966 11.885 0.934 | 0.069 0.414 0.005 | -0.234 8.567 0.055 |
var_B3 | 0.964 11.826 0.929 | 0.133 1.545 0.018 | -0.190 5.656 0.036 |
var_B4 | 0.957 11.652 0.916 | 0.020 0.037 0.000 | -0.214 7.204 0.046 |
var_B5 | 0.964 11.820 0.929 | 0.097 0.824 0.009 | -0.227 8.106 0.052 |

Supplementary categories
Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2 v.test
market_1 | 0.338 | -0.219 0.420 -0.427 | 0.191 0.321 0.981 | 0.132 0.153 0.905 |
market_2 | 0.362 | 0.234 0.420 0.427 | -0.205 0.321 -0.981 | -0.141 0.153 -0.905 |
brand_1 | 0.602 | -0.514 0.727 -0.703 | 0.100 0.027 0.358 | 0.232 0.149 1.117 |
brand_2 | 1.498 | 1.339 0.799 0.688 | -0.081 0.003 -0.109 | 0.090 0.004 0.162 |
brand_3 | 2.123 | 1.905 0.805 1.641 | -0.910 0.184 -2.057 | 0.174 0.007 0.527 |
brand_4 | 1.271 | -0.955 0.565 -0.491 | -0.104 0.007 -0.140 | -0.659 0.269 -1.189 |
brand_5 | 1.269 | -1.122 0.781 -0.847 | 0.578 0.207 1.146 | -0.024 0.000 -0.063 |
brand_6 | 1.252 | -1.107 0.781 -0.569 | -0.506 0.163 -0.682 | -0.257 0.042 -0.463 |
brand_7 | 2.447 | 1.991 0.662 1.277 | 1.214 0.246 2.044 | -0.620 0.064 -1.395 |
brand_8 | 4.613 | -4.431 0.923 -1.581 | -1.022 0.049 -0.957 | 0.412 0.008 0.516 |

The eigenvalues measure the amount of variation retained by each principal component. With 2 dimensions, we are retaining 90% of the variability in our data.
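A scree plot makes this retention easier to see at a glance; a minimal sketch with factoextra (already loaded above):

```{r}
# Scree plot: percentage of variance explained by each principal component
renderPlot(
  fviz_eig(res_pca, addlabels = TRUE, ylim = c(0, 85))
)
```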

Quality of variable representation for each component.

```{r fig.height=5, fig.width=15}
library(patchwork)
renderPlot(
  fviz_cos2(res_pca, choice = 'var', axes = 1, fill = 'yellow') / fviz_contrib(res_pca, choice = 'var', axes = 1, fill = 'yellow') |
    fviz_cos2(res_pca, choice = 'var', axes = 2, fill = 'blue') / fviz_contrib(res_pca, choice = 'var', axes = 2, fill = 'blue') |
    fviz_pca_var(res_pca, repel = T,
                 col.var = factor(c("var_A1", "var_A2", "var_A3", "var_A4", "var_A5",
                                    "var_B1", "var_B2", "var_B3", "var_B4", "var_B5")),
                 legend.title = list(fill = "Col", color = "variables"),
                 geom = c("arrow", "text"))
)
```
Quality of variable representation for each component

cos2 represents the quality of representation of the variables on each PC.
It corresponds to the squared coordinate (coord²).
contrib represents the contribution (%) of the variable to each PC.
It is calculated as the variable’s cos2 divided by the total cos2 of the component (cos2/sum(cos2)).

The angles (between variables, or between a variable and a PC) can be read as an indication of the correlation between them.
The length of the arrow represents the magnitude of the correlation.
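Both quantities can be read directly from the PCA object, so the relationships above are easy to verify (a small sketch using the first dimension):

```{r}
# cos2 is the squared coordinate; contrib is cos2 rescaled to percentages
res_pca$var$coord[, 1]^2                                  # matches res_pca$var$cos2[, 1]
res_pca$var$cos2[, 1] / sum(res_pca$var$cos2[, 1]) * 100  # matches res_pca$var$contrib[, 1]
```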

We can identify the variables significantly associated with the first 2 principal components based on their correlations, at a 5% significance level.

```{r}
res_pca %>%
dimdesc(axes = c(1,2), proba=0.05)
```
$Dim.1

Link between the variable and the continuous variables (R-square)
=================================================================================
correlation p.value
var_A4 0.9761142 1.951360e-19
var_A5 0.9717958 1.793345e-18
var_B2 0.9663531 1.880352e-17
var_B3 0.9639856 4.643262e-17
var_B5 0.9637021 5.153217e-17
var_B4 0.9568662 5.083155e-16
var_B1 0.9543994 1.061293e-15
var_A1 0.8518285 4.616555e-09
var_A2 0.7818571 5.493111e-07

$Dim.2

Link between the variable and the continuous variables (R-square)
=================================================================================
correlation p.value
var_A3 0.9530472 1.562002e-15
var_A2 -0.4327988 1.902550e-02

Link between variable and the categories of the categorical variables
================================================================
Estimate p.value
brand=brand_7 1.305609 0.03846791
brand=brand_3 -0.818287 0.03712282

Quality and contribution of the representation of the individuals

```{r fig.height=5, fig.width=15}
renderPlot(
  fviz_cos2(res_pca, choice = 'ind', axes = 1, fill = 'yellow', top = 15) / fviz_contrib(res_pca, choice = 'ind', axes = 1, fill = 'yellow', top = 15) |
    fviz_cos2(res_pca, choice = 'ind', axes = 2, fill = 'blue', top = 15) / fviz_contrib(res_pca, choice = 'ind', axes = 2, fill = 'blue', top = 15) |
    fviz_pca_ind(res_pca, repel = T, col.ind = data$class)
)
```
Quality and contribution of the representation of the individuals

• Individuals with average values in all variables will be placed at the center of the graph. Conversely, individuals with extreme values will be far from the center.
• Individuals with similar values (in the studied variables) will be grouped together on the map. Conversely, individuals with different profiles will be distant from each other.

# Biplot 
```{r fig.height=6, fig.width=13}
renderPlot(
fviz_pca_biplot(res_pca, repel = T, habillage = 11,
fill.ind = data$class,
pointshape = 21, pointsize = 2,
addEllipses =TRUE,
ellipse.alpha = 0.1,
ellipse.type = "confidence",
geom.ind =c('text',"point"),
geom.var = c("text", "arrow"),
alpha.var ="contrib",
col.var = 'black',
legend.title = list(fill = "Class", color = "variables"),title = 'PCA BIPLOT')
)
```

We can see that the separation by market (the ‘class’ variable) is not good.

Cluster Analysis

The purpose of cluster analysis is to partition the dataset into groups of observations: observations within the same group should be as similar as possible, while observations in different groups should be as different as possible.
Since the observations are unlabeled, clustering is an unsupervised machine learning technique, used here as an exploratory analysis of the multivariate data.

Before conducting a cluster analysis, it is important to evaluate the clustering tendency of the data. There are statistical and visual methods to assess the clustering tendency:

• Hopkins statistic: it evaluates whether the dataset exhibits a significant clustering structure or whether it is essentially random.
• Visual Assessment of Tendency (VAT): It generates a heat map using distances between observations, rearranging the observations so that similar ones are placed close to each other.

# Hopkins statistic
```{r warning=FALSE}
library(clustertend)
set.seed(123)
data_std <- data %>%
dplyr::select_if(is.numeric) %>%
scale()
clustertend::hopkins(data_std, n=nrow(data_std)-1)
```
$H
[1] 0.2639197

The Hopkins statistic is well below 0.5, indicating that the data exhibits a clustering tendency.

# Visual Assessment of Tendency
```{r}
renderPlot(
data_std %>%
dist() %>%
fviz_dist(lab_size = 0.1, show_labels = F)
)
```
We can observe a clustering tendency of around three groups.
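Before fixing the number of clusters, we can sanity-check the choice of k with the usual elbow and average-silhouette criteria; a minimal sketch on the standardized data:

```{r}
# Within-cluster sum of squares (elbow) and average silhouette width vs. k
renderPlot(
  fviz_nbclust(data_std, kmeans, method = "wss") |
    fviz_nbclust(data_std, kmeans, method = "silhouette")
)
```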

K-means algorithm

```{r}
set.seed(123)
(data_km <- kmeans(data_std, centers = 3))
```
K-means clustering with 3 clusters of sizes 9, 11, 9

Cluster means:
var_A1 var_A2 var_A3 var_A4 var_A5 var_B1 var_B2 var_B3 var_B4 var_B5
1 1.0706735 0.90640511 0.2476437 1.1590971 1.1322577 1.233807 1.2220000 1.2341190 1.16773886 1.2164815
2 -0.2089852 -0.05976085 -0.4035685 -0.1776409 -0.1474333 -0.181420 -0.1545585 -0.2156286 -0.02948453 -0.1454128
3 -0.8152472 -0.83336408 0.2456067 -0.9419804 -0.9520614 -1.012072 -1.0330952 -0.9705730 -1.13170221 -1.0387547

Clustering vector:
product_1 product_2 product_3 product_4 product_5 product_6 product_7 product_8 product_9 product_10 product_11 product_12 product_13
3 3 3 3 3 2 3 3 3 2 2 3 1
product_14 product_15 product_16 product_17 product_18 product_19 product_20 product_21 product_22 product_23 product_24 product_25 product_26
2 2 2 2 2 2 1 1 2 2 1 1 1
product_27 product_28 product_29
1 1 1

Within cluster sum of squares by cluster:
[1] 57.06298 13.23896 19.63200
(between_SS / total_SS = 67.9 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"

The quality of the partition (between_SS / total_SS) is 67.9%, which is fairly good.
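This ratio can be recomputed directly from the components of the kmeans object:

```{r}
# Share of the total variance explained by the partition (between_SS / total_SS)
data_km$betweenss / data_km$totss
```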

```{r fig.height=6, fig.width=13}
data_clus <- data %>%
cbind(clus = as.factor(data_km$cluster))
res_pca2 <-data_clus %>%
PCA(scale.unit = T, graph = F, quali.sup = c('brand', 'class', 'clus'))
catergorical <- data_clus %>%
select_if(is.factor)

sliderInput(inputId = 'contriVar', label = 'Contribution Var', min = 1,
            max = dim(data_clus)[2], value = dim(data_clus)[2], step = 1)
sliderInput(inputId = 'contriInd', label = 'Contribution Ind', min = 1,
            max = dim(data_clus)[1], value = dim(data_clus)[1], step = 1)
selectInput(inputId = 'categorical',label = 'Select a categorical variable',
choices = names(catergorical), selected = names(catergorical)[1])

renderPlot(
fviz_pca_biplot(res_pca2, repel = F, habillage = input$categorical,
pointshape = 21, pointsize = 2,
addEllipses =TRUE,
ellipse.alpha = 0.1,
ellipse.type = "norm",
geom.ind =c('text',"point"),
geom.var = c("text", "arrow"),
alpha.var ="contrib",
col.var = 'black',
select.var = list(contrib = input$contriVar),
select.ind = list(contrib = input$contriInd),
legend.title = list(fill = "clus"), title = 'K groups'
)
)
```

With interactivity, it is possible to choose variables and select the most important variables and individuals.

K-means algorithm results

Products belonging to each cluster.
```{r}
data_km$cluster
```
product_1 product_2 product_3 product_4 product_5 product_6 product_7 product_8 product_9 product_10 product_11 product_12 product_13
3 3 3 3 3 2 3 3 3 2 2 3 1
product_14 product_15 product_16 product_17 product_18 product_19 product_20 product_21 product_22 product_23 product_24 product_25 product_26
2 2 2 2 2 2 1 1 2 2 1 1 1
product_27 product_28 product_29
1 1 1
```{r}
data_km$centers
```
var_A1 var_A2 var_A3 var_A4 var_A5 var_B1 var_B2 var_B3 var_B4 var_B5
1 1.0706735 0.90640511 0.2476437 1.1590971 1.1322577 1.233807 1.2220000 1.2341190 1.16773886 1.2164815
2 -0.2089852 -0.05976085 -0.4035685 -0.1776409 -0.1474333 -0.181420 -0.1545585 -0.2156286 -0.02948453 -0.1454128
3 -0.8152472 -0.83336408 0.2456067 -0.9419804 -0.9520614 -1.012072 -1.0330952 -0.9705730 -1.13170221 -1.0387547

Cluster 1 contains the products with the highest values, while cluster 3 contains those with the lowest values.

# Number of samples in each group.
```{r}
data_km$size
```
[1] 9 11 9
# Summary by group
```{r }
data_clus %>%
dplyr::select_if(is.numeric) %>%
aggregate(list(cluster = data_km$cluster), mean)
```
```{r fig.height=7, fig.width=8}
library(flexclust)
selectInput(inputId = 'byclus', label = 'By cluster', choices = c(FALSE, TRUE), selected = TRUE)
set.seed(123)
renderPlot(
  as.kcca(data_km, data_std) %>%
    barplot(bycluster = as.logical(input$byclus))
)
```

Importance of variables in the clustering.

```{r fig.height=15, fig.width=15}
library(FeatureImpCluster)
library(data.table)  # provides as.data.table()
renderPlot(
  as.kcca(data_km, data_std) %>%
    FeatureImpCluster(as.data.table(data_std)) %>%
    plot()
)
```

The most important variables are those of type B, i.e., the physicochemical characteristics measured with equipment B.

Let’s observe how the data is separated based on the variables.

```{r fig.height=6, fig.width=13}
renderPlot(
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_B1'))|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_B2'))|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_B3'))|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_B5'))
)
```
Variables of type B
```{r fig.height=6, fig.width=13}
renderPlot(
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_A1'), labelsize = 6)|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_A2'), labelsize = 6)|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_A3'), labelsize = 6)|
fviz_cluster(data_km, data_std, choose.vars = c('var_B4', 'var_A5'), labelsize = 6)
)
```
Variables of type A

It is also possible to plot variable by variable, facilitating the visualization of which variables better separate the groups.

```{r}
catergorical <- data_clus %>%
select_if(is.factor)

numeric <- data_clus %>%
select_if(is.numeric)

## select input: first numeric variable
selectInput(inputId = 'num1', label = 'Select a numeric variable (x)',
            choices = names(numeric), selected = names(numeric)[1])

## select input: second numeric variable
selectInput(inputId = 'num2', label = 'Select a numeric variable (y)',
            choices = names(numeric), selected = names(numeric)[2])

renderPlot(
fviz_cluster(data_km, data_std, choose.vars = c(input$num1, input$num2))
)
```

Through cluster separation, the differentiation of products based on their measured physicochemical characteristics becomes more evident.

```{r fig.height=6, fig.width=13, message=F}
data_std_2 <- data_std %>%
cbind(clus = as.factor(data_clus$clus)) %>%
data.frame()

library(GGally)
renderPlot(
ggpairs(data_std_2, columns = c("var_B1","var_B2","var_B3","var_B4","var_B5"), aes(col = as.factor(data_clus$clus)))
)
```
Correlations by cluster variables of type B
```{r fig.height=6, fig.width=13, message=F}
renderPlot(
ggpairs(data_std_2, columns = c("var_A1","var_A2","var_A3","var_A4","var_A5"), aes(col = as.factor(data_clus$clus)))
)
```
Correlations by cluster variables of type A

K-Means algorithm representation

```{r}
library(plotly)
renderPlotly(
ggplotly(data_clus %>%
select(where(is.numeric)) %>%
scale() %>%
cbind(data_clus[11:length(data_clus)]) %>%
tibble::rownames_to_column(var = 'names') %>%
gather(key = 'var', value = 'values',2:11, factor_key = T) %>%
ggplot(aes(x=var, y = values, group = clus, color = clus, label = names)) +
stat_summary(geom = 'line', fun = 'mean') +
stat_summary(fun = mean, size = 3,
geom = "point") +
geom_point(size = 0.9)+
theme_classic() +
theme(text = element_text(size = 14), legend.position = 'bottom',
axis.text.x = element_text(angle = 90, hjust = 1,size = 8))+
labs(title = 'K-Means algorithm representation',
subtitle = 'Features of each cluster',
x = '', y = 'value') )
)
```

We can observe that the discrimination between the groups is most evident for the variables Var_B5 and Var_A5.

# Measures of stability.

```{r}
library(cluster)
renderPlot(
silhouette(data_km$cluster, dist(data[1:10])) %>%
plot(cex.names = 0.7, col = 1:3)
)

# The two samples with negative silhouette widths in the plot
rownames(data_clus[9,])
rownames(data_clus[24,])
```
[1] "product_9"
[1] "product_24"

In both cluster 1 and cluster 3 there are samples with negative silhouette widths, i.e., samples that may be assigned to the wrong cluster.
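Instead of hard-coding row positions as above, the flagged samples can be listed programmatically (a small sketch; a negative silhouette width marks an observation that sits closer to a neighboring cluster):

```{r}
# Samples with a negative silhouette width
sil <- silhouette(data_km$cluster, dist(data[1:10]))
rownames(data_std)[sil[, "sil_width"] < 0]
```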

Two-way ANOVA

So far, we have observed some correlations between the measured physicochemical variables and their relationship with the studied products through PCA.
Additionally, we have clustered the most similar products using K-Means.
As a competitive intelligence tool, data analysis enables us to understand and evaluate market comparisons more effectively.
However, we can obtain even more information by developing a predictive model that can forecast the remaining variables using only a few measured variables.
This approach would streamline processes, save time on measurements, and ultimately reduce costs, leading to increased efficiency.

One option is to select the variables that exhibit the greatest statistical difference between markets and clusters using a two-way ANOVA.
This would make it easier to establish a clear separation between them.
Furthermore, these variables would be used as predictors.

We evaluate which physicochemical characteristic differs the most between clusters and market type (class).

Verify assumptions of ANOVA:

# Normality
```{r}
data_clus %>%
  select(-brand) %>%
  dplyr::group_by(class, clus) %>%
  shapiro_test(var_A1, var_A2, var_A3, var_A4, var_A5) %>%
  DT::datatable()
```
Normality of variables type A
```{r}
data_clus %>%
  select(-brand) %>%
  dplyr::group_by(class, clus) %>%
  shapiro_test(var_B1, var_B2, var_B3, var_B4, var_B5) %>%
  DT::datatable()
```
Normality of variables type B
# Homogeneity of variance
```{r select_input}
selectInput(inputId = 'vars', label = 'Select a variable for the Levene test',
            choices = colnames(data_clus)[1:10], selected = colnames(data_clus)[1])

renderPrint(
  data_clus %>%
    levene_test(as.formula(paste(input$vars, '~ class * clus')))
)
```

Most of the variables show a normal trend and homogeneity of variance, although some have p-values slightly below 0.05. For this example, we will assume a parametric statistic, and through a two-way ANOVA, we will establish significant statistical differences among the obtained clusters, classes, and measured variables. This way, we can statistically identify which variables better discriminate between the groups.
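The two-way ANOVA itself can be run per variable with rstatix; a minimal sketch (assuming both classes occur in each cluster, otherwise the interaction term should be dropped):

```{r}
# Two-way ANOVA per physicochemical variable: class, cluster and their interaction
data_clus %>%
  dplyr::select(-brand) %>%
  gather(key = 'var', value = 'values', 1:10, factor_key = TRUE) %>%
  group_by(var) %>%
  anova_test(values ~ class * clus) %>%
  DT::datatable()
```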

```{r fig.height=12, fig.width=20}
library(ggpubr)
# Long-format data used for the faceted boxplots (assumed; its construction
# was not shown in the original post)
data_long <- data_clus %>%
  gather(key = 'var', value = 'values', 1:10, factor_key = TRUE)

renderPlot(
  ggboxplot(data_long, x = 'clus', y = 'values',
            color = 'class', add = c("mean", "jitter"),
            fill = "class", alpha = 0.2) +
    stat_summary(fun = mean, colour = "black", size = 3,
                 position = position_dodge(width = 0.75),
                 geom = "text", vjust = -0.7,
                 aes(label = round(..y.., digits = 2), group = class)) +
    facet_wrap(~var, ncol = 2, scales = "free")
)
```

The variables Var_B5 and Var_A5 could be the only measurements performed on a new product and used to classify it into one of the clusters.
These variables effectively separate the groups. However, it is necessary to perform a statistical analysis to determine if there are significant differences.
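As a preview of what Part 2 will formalize, here is a minimal sketch of how a hypothetical new product, measured only on var_A5 and var_B5 (the raw values below are invented for illustration), could be assigned to the nearest k-means centroid:

```{r}
# Hypothetical new product measured only on var_A5 and var_B5 (invented values)
new_raw <- c(var_A5 = 3.1, var_B5 = 2.8)

# Standardize with the same centers/scales used for data_std
ctr <- attr(data_std, "scaled:center")[c("var_A5", "var_B5")]
scl <- attr(data_std, "scaled:scale")[c("var_A5", "var_B5")]
new_std <- (new_raw - ctr) / scl

# Euclidean distance to each cluster centroid, keeping only these two variables
centers_sub <- data_km$centers[, c("var_A5", "var_B5")]
which.min(sqrt(rowSums(sweep(centers_sub, 2, new_std)^2)))
```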


```{r}
## select input: categorical variables
selectInput(inputId = 'cate1', label = 'Select a categorical variable 1',
            choices = names(catergorical), selected = names(catergorical)[1])
selectInput(inputId = 'cate2', label = 'Select a categorical variable 2',
            choices = names(catergorical), selected = names(catergorical)[3])

## select input: numeric variable
selectInput(inputId = 'nume1', label = 'Select a numeric variable',
            choices = names(numeric), selected = names(numeric)[1])

renderPlot({
  grouped_ggbetweenstats(data = data_clus,
                         x = !!rlang::sym(input$cate1),
                         y = !!rlang::sym(input$nume1),
                         grouping.var = !!rlang::sym(input$cate2),
                         results.subtitle = FALSE, messages = FALSE,
                         var.equal = TRUE, p.adjust.method = 'holm')
})
```

Indeed, the physicochemical variables Var_B5 and Var_A5 exhibit statistically significant differences across clusters and markets.

Conclusion: In this first part, we have demonstrated the utility of instrumental chemical analysis and data analysis in identifying characteristics and patterns in products, allowing for segmentation based on their physicochemical properties. Furthermore, we have shown that future analyses can achieve classification using just two out of the ten variables, leading to time and cost savings, as well as increased efficiency. These scientific approaches foster a strong synergy within companies, enabling data-driven decision-making.

In the second part, we will delve into making predictions for classifying new products of interest.
