What is Canonical Correlation Analysis?

Vahid Naghshin
Published in Analytics Vidhya · 7 min read · Aug 29, 2020

Canonical correlation analysis (CCA) is a statistical technique for deriving the relationship between two sets of variables. One way to understand CCA is through the lens of multiple regression. In multiple regression, the relationship between one single dependent variable and a set of independent variables is investigated. CCA extends this concept to more than one dependent variable. In some applications, we are confronted with several dependent variables that are inter-correlated, so it is not sensible to ignore their dependency. For example, in a depression study, the Centre for Epidemiological Studies Depression (CESD) score and health status are inter-correlated, and both can be formulated as dependent variables that depend on variables such as sex, age, education and income. So, we aim at deriving a relation between a function of one set of variables and a function of the other set. This function is usually a linear function, with weights to be specified based on some constraints.

CCA: two sets of variables connected through canonical variables

As you can see in the figure above, we have one set of variables on the left side and the other set on the right side. Without loss of generality, let’s assume the numbers of independent and dependent variables are p and q, respectively. All variables on each side are lumped into a single combined variable, shown as a yellowish circle in the figure. The aim of CCA is to find the relationship between the two lumped variables such that the correlation between them is maximal. Obviously, there are many possible linear combinations of the variables, but the aim is to pick only those linear functions which best express the correlations between the two variable sets. These linear functions are called the canonical variables, and the correlations between corresponding pairs of canonical variables are called canonical correlations.

The basic idea behind CCA is finding a linear combination of the Ys which has the maximum correlation with a linear combination of the Xs. Say

U = a1·Y1 + a2·Y2 + … + aq·Yq
V = b1·X1 + b2·X2 + … + bp·Xp

Canonical equations, Eq. (1)

For every choice of weights, we can derive the value pair of U and V, and then the correlation between U and V. There are k = min(p, q) canonical variates, each corresponding to a different set of weights, which gives us k different correlations. These correlations can be shown to be the square roots of the eigenvalues of the product of two matrices given by Eq. (2), where X and Y are column-centred:

(XᵀX)⁻¹ XᵀY (YᵀY)⁻¹ YᵀX

Eq. (2)

Note that X is a matrix with n rows and p columns, and Y is a matrix with n rows and q columns: n is the number of observed samples, and p and q are the numbers of features of X and Y, respectively. The eigenvalues and eigenvectors give us the canonical correlations and the corresponding canonical variables U and V. As mentioned, there are k different canonical variables. The first pair of canonical variables is the most important one, as its correlation is the maximum. The correlation coefficients come out in descending order: the first canonical correlation is the greatest, the second canonical correlation is the second greatest, and so on. Now that we grasp the basics behind CCA, let’s see how we can use it in analysing real-world data. Note that CCA is less popular than multiple regression, as its results are harder to interpret.
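To make Eq. (2) concrete, here is a minimal base-R sketch that computes the canonical correlations directly as the square roots of the eigenvalues of that matrix product and checks them against base R’s cancor(). The data are synthetic, purely for illustration.

```r
# Canonical correlations as square roots of the eigenvalues of
# Rxx^-1 Rxy Ryy^-1 Ryx, checked against base R's cancor().
# Synthetic data, for illustration only: n = 200, p = 3, q = 2.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
Y <- cbind(X[, 1] + rnorm(n), rnorm(n))   # first column of Y tracks X1

Rxx <- cor(X); Ryy <- cor(Y)
Rxy <- cor(X, Y); Ryx <- t(Rxy)
M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% Ryx

k   <- min(ncol(X), ncol(Y))              # number of canonical correlations
rho <- sqrt(sort(Re(eigen(M)$values), decreasing = TRUE)[1:k])

all.equal(rho, cancor(X, Y)$cor)          # the two computations agree
```

Using correlation rather than covariance matrices makes no difference here, since canonical correlations are invariant to rescaling of the individual variables.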

Application of CCA in Real-World Case

In order to demonstrate the usefulness of CCA, we apply it to real-world data to see how it helps in getting insight. To this end, we use the data set mmreg.csv, with 600 observations on eight variables. The psychological variables are locus_of_control, self_concept and motivation. The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). Additionally, the variable female is a zero-one indicator variable with one indicating a female student. The code snippet can be obtained from here.

library(CCA)    # canonical correlation analysis: cc(), matcor()
library(CCP)    # significance tests for canonical correlations
require(ggplot2)
require(GGally) # ggpairs() for pairwise correlation plots
mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")

summary(mm)

The summary of the data variables is as follows:

locus_of_control    self_concept       motivation         read
Min.   :-2.2300   Min.   :-2.6200   Min.   :0.000   Min.   :28.3
1st Qu.:-0.3725   1st Qu.:-0.3000   1st Qu.:0.330   1st Qu.:44.2
Median : 0.2100   Median : 0.0300   Median :0.670   Median :52.1
Mean   : 0.0965   Mean   : 0.0049   Mean   :0.661   Mean   :51.9
3rd Qu.: 0.5100   3rd Qu.: 0.4400   3rd Qu.:1.000   3rd Qu.:60.1
Max.   : 1.3600   Max.   : 1.1900   Max.   :1.000   Max.   :76.0

     write           math          science         female
Min.   :25.5   Min.   :31.8   Min.   :26.0   Min.   :0.000
1st Qu.:44.3   1st Qu.:44.5   1st Qu.:44.4   1st Qu.:0.000
Median :54.1   Median :51.3   Median :52.6   Median :1.000
Mean   :52.4   Mean   :51.9   Mean   :51.8   Mean   :0.545
3rd Qu.:59.9   3rd Qu.:58.4   3rd Qu.:58.6   3rd Qu.:1.000
Max.   :67.1   Max.   :75.5   Max.   :74.2   Max.   :1.000

The correlation coefficients within each of the two sets of variables can be obtained through the following code snippet:

psych <- mm[, 1:3]  # psychological variables
acad <- mm[, 4:8]   # academic variables plus female
ggpairs(psych)
ggpairs(acad)

The output would be

The correlation coefficients within psychological variables
The correlation coefficients within academic variables

Now, we turn to obtaining the canonical correlations. As mentioned, the number of canonical variates is k = min(p, q) = min(3, 5) = 3. So, we will have three canonical correlations, which for our case can be obtained as

matcor(psych, acad)
cc1 <- cc(psych, acad)
cc1$cor

The output of the code snippet would be

0.4640861 0.1675092 0.1039911

As can be seen, the first canonical correlation is 0.464, which is the correlation between the first pair of canonical variables. This correlation coefficient is greater than any single correlation between a psychological and an academic variable (not shown here). The coefficients that generate the canonical variables can be obtained through the following code snippet

cc1$ycoef
cc1$xcoef

The output will be

> cc1$ycoef
[,1] [,2] [,3]
read -0.044620600 -0.004910024 0.021380576
write -0.035877112 0.042071478 0.091307329
math -0.023417185 0.004229478 0.009398182
science -0.005025152 -0.085162184 -0.109835014
female -0.632119234 1.084642326 -1.794647036
> cc1$xcoef
[,1] [,2] [,3]
locus_of_control -1.2538339 -0.6214776 -0.6616896
self_concept 0.3513499 -1.1876866 0.8267210
motivation -1.2624204 2.0272641 2.0002283

For example, the coefficient for read variable for first canonical variable (U1) is -0.044.

If you want to look at the canonical variables themselves, you can call cc1$scores$yscores and cc1$scores$xscores. They give you the canonical variables Ui and Vi, respectively.
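Under the hood, those score matrices are simply the centred data multiplied by the coefficient matrices, and correlating the first columns recovers the first canonical correlation. A base-R sketch on synthetic data (the names here are illustrative, not the article’s data set):

```r
# Canonical variates = centred data %*% coefficient matrix;
# cor(U1, V1) recovers the first canonical correlation.
# Synthetic data, base R only.
set.seed(2)
n <- 300
X <- matrix(rnorm(n * 3), n, 3)
Y <- cbind(X[, 2] + rnorm(n), X[, 3] + rnorm(n))

cc <- cancor(X, Y)
U <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
V <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef

cor(U[, 1], V[, 1])   # equals cc$cor[1]
```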

We can also look at the correlations between the academic and psychological variables and the canonical variables. These correlations are called canonical variable loadings. The code snippet and its corresponding outputs are as follows:

> cc1$scores$corr.X.xscores
                        [,1]       [,2]       [,3]
locus_of_control -0.90404631 -0.3896883 -0.1756227
self_concept     -0.02084327 -0.7087386  0.7051632
motivation       -0.56715106  0.3508882  0.7451289
> cc1$scores$corr.Y.xscores
              [,1]        [,2]        [,3]
read    -0.3900402 -0.06010654  0.01407661
write   -0.4067914  0.01086075  0.02647207
math    -0.3545378 -0.04990916  0.01536585
science -0.3055607 -0.11336980 -0.02395489
female  -0.1689796  0.12645737 -0.05650916
> cc1$scores$corr.X.yscores
                         [,1]        [,2]        [,3]
locus_of_control -0.419555307 -0.06527635 -0.01826320
self_concept     -0.009673069 -0.11872021  0.07333073
motivation       -0.263206910  0.05877699  0.07748681
> cc1$scores$corr.Y.yscores
              [,1]        [,2]       [,3]
read    -0.8404480 -0.35882541  0.1353635
write   -0.8765429  0.06483674  0.2545608
math    -0.7639483 -0.29794884  0.1477611
science -0.6584139 -0.67679761 -0.2303551
female  -0.3641127  0.75492811 -0.5434036

Here, the canonical variable loadings show the correlations between the original variables and the canonical variables. For example, to see the correlation between read and the first canonical variable of the academic set, we look at the first column of cc1$scores$corr.Y.yscores, which gives -0.84.
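Before interpreting any of these quantities, it is worth checking that the canonical correlations are statistically different from zero; that is what the CCP package loaded earlier is for. A minimal sketch of the classic Bartlett chi-square approximation to Wilks’ lambda, done by hand on synthetic data so it runs standalone (CCP’s p.asym provides the same family of tests):

```r
# Bartlett's chi-square approximation for H0: all canonical
# correlations are zero. Synthetic independent data, so the
# statistic should be unremarkable here.
set.seed(3)
n <- 500; p <- 3; q <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)

rho   <- cancor(X, Y)$cor
wilks <- prod(1 - rho^2)                        # Wilks' lambda
chisq <- -(n - 1 - (p + q + 1) / 2) * log(wilks)
df    <- p * q
pval  <- pchisq(chisq, df, lower.tail = FALSE)
```

A small p-value would reject the hypothesis that all k canonical correlations are zero; sequential versions of the test drop the leading correlations one at a time.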

Interpreting the Results

Interpreting the results of CCA is a little trickier than multiple regression. In order to interpret the results, we look at the standardised coefficients. A standardised coefficient is obtained by multiplying the unstandardised coefficient by the standard deviation of the corresponding variable. The standardised coefficients can be obtained through the following code snippet.

# standardized psych canonical coefficients: diagonal matrix of psych sd's
s1 <- diag(sqrt(diag(cov(psych))))
s1 %*% cc1$xcoef
# standardized acad canonical coefficients: diagonal matrix of acad sd's
s2 <- diag(sqrt(diag(cov(acad))))
s2 %*% cc1$ycoef

The output would be

> s1 %*% cc1$xcoef
                       [,1]       [,2]       [,3]
locus_of_control -0.8404196 -0.4165639 -0.4435172
self_concept      0.2478818 -0.8379278  0.5832620
motivation       -0.4326685  0.6948029  0.6855370
> s2 %*% cc1$ycoef
               [,1]        [,2]        [,3]
read    -0.45080116 -0.04960589  0.21600760
write   -0.34895712  0.40920634  0.88809662
math    -0.22046662  0.03981942  0.08848141
science -0.04877502 -0.82659938 -1.06607828
female  -0.31503962  0.54057096 -0.89442764

The interpretation of CCA is very similar to multiple regression. For example, as can be seen, the standardised coefficient of locus_of_control is -0.84. This means that a one standard deviation increase in locus_of_control leads to a 0.84 decrease in the first canonical variable when all the other variables in the model are held constant. Looking at the three canonical coefficients, we see that locus_of_control has a negative effect on all canonical variables. Each canonical variable gives a different interpretation based on its coefficients. However, since the first canonical correlation is the maximum, the interpretation based on the first set of coefficients is the most sensible and reliable.
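As a sanity check on this standardisation step, multiplying the raw coefficients by the variables’ standard deviations gives, up to column sign conventions, the same coefficients you would get by standardising the variables first and rerunning the analysis. A base-R sketch with synthetic data:

```r
# Standardised coefficients = diag(sd) %*% raw coefficients, which
# matches cancor() run on scale()d data up to column signs.
# Synthetic data, for illustration only.
set.seed(4)
n <- 250
X <- matrix(rnorm(n * 3), n, 3)
X[, 2] <- X[, 2] * 5 + X[, 1]            # give the columns unequal scales
Y <- cbind(X[, 1] + rnorm(n), rnorm(n))

cc_raw   <- cancor(X, Y)
std_coef <- diag(sqrt(diag(cov(X)))) %*% cc_raw$xcoef
cc_std   <- cancor(scale(X), Y)

max(abs(abs(std_coef) - abs(cc_std$xcoef)))   # should be ~0
```

This works because canonical correlations are invariant to rescaling the variables, so only the coefficients need adjusting, by exactly the factor each variable was scaled by.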

Further Reading:

[1] Afifi, A., May, S., & Clark, V. A. (2003). Computer-aided multivariate analysis. CRC Press.

[2] González, I., Déjean, S., Martin, P. G., & Baccini, A. (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12), 1–14.

[3] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553.
