Canonical Correlation Analysis and Neural Network Representation Similarities
Part 2 of 5: Canonical Correlation Analysis (CCA) and its use to measure representation similarities of neural networks
Our similarities bring us to a common ground.
In a previous post, I talked about what an internal representation learned by a neural network is and why researchers are focused on exploring aspects of representation similarity.
Now I would like to talk about Canonical Correlation Analysis (CCA) and how it emerged as a tool of choice for measuring representation similarities of neural networks. Introduced in 1936 by Harold Hotelling, CCA is a statistical method that investigates relationships among two or more sets of variables, where each set consists of at least two variables. Why require at least two variables per set? Because with fewer than two variables in a set, the canonical logic reduces to something like a t-test or a regression analysis.
CCA is a multivariate method that considers all the variables simultaneously in a single analysis. It honors the reality that in nature variables can all interact with each other, which tends to yield more faithful results and larger variance-accounted-for effect sizes than analyzing each variable in isolation. It can be seen as a multivariate form of the general linear model, which recognizes that all such analyses are correlational and yield variance-accounted-for effect sizes.
So what runs under the hood of CCA? The first step in a CCA is to compute a product-moment correlation matrix involving all the variables from both sets (remember, two variable sets). CCA then finds the linear combination of one set of variables that is most highly correlated with a linear combination of the other set. The resulting compact linear representations are called canonical variates. Each canonical variate is a weighted sum of the original variables in its set, and the weights that define it are collected in a canonical vector.
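To make these mechanics concrete, here is a minimal NumPy sketch (my own illustration, not from the original post) of how canonical correlations fall out of the covariance blocks of the two variable sets: whiten each set, then take the singular values of the cross-covariance. The helper name canonical_correlations and the small regularization term reg are illustrative choices.

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_correlations(X1, X2, reg=1e-8):
    """Canonical correlations between two sets of variables.

    X1: (n_samples, p) observations of the first variable set.
    X2: (n_samples, q) observations of the second variable set.
    Returns min(p, q) correlations sorted in decreasing order.
    """
    # Center each variable, since CCA works on (co)variances
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]

    # Within-set and between-set covariance blocks
    S11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / (n - 1)

    # Whiten each set, then take the SVD of the cross-covariance:
    # the singular values are exactly the canonical correlations.
    M = _inv_sqrt(S11) @ S12 @ _inv_sqrt(S22)
    return np.linalg.svd(M, compute_uv=False)
```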
Let us take an example with a set of demographic factors X1 (= age, sex, diet) and a set of health measurements X2 (= heart rate, hemoglobin, blood pressure). CCA can be used to estimate the possible association between X1 and X2 by quantifying the correlation between the two sets of multidimensional variables.
Consider a sample of N = 50 survey participants in a study attempting to determine which factors influence the health measurements in X2. Two collections of variables were measured: the first set X1 contained the age, sex, and diet of each participant, and the second set X2 comprised the heart rate, hemoglobin, and blood pressure measured for each participant. CCA then re-expresses the two datasets as multiple pairs of canonical variates that are highly correlated with each other across participants, as shown in the figure above.
In each domain of data, the resulting canonical variate is the weighted sum of that domain's variables, with the weights given by the canonical vector. In the scatter plot shown above, each participant is then described by two canonical variates that are maximally correlated. The linear correspondence between the canonical variates of X1 and X2 is the canonical correlation, the primary performance metric used in CCA modeling.
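A hedged, synthetic version of this survey example can be run with scikit-learn's CCA implementation. The data below are random stand-ins for the real measurements (the shared latent signal is only there so the two sets are actually related), and the printed correlations are the per-pair canonical correlations that would appear on the scatter-plot axes.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
N = 50  # number of survey participants

# Synthetic stand-ins for the two measurement sets (illustrative only)
X1 = rng.normal(size=(N, 3))              # age, sex, diet
shared = X1 @ rng.normal(size=(3, 3))     # shared signal linking the two sets
X2 = shared + rng.normal(size=(N, 3))     # heart rate, hemoglobin, blood pressure

# Fit CCA and project each participant onto the canonical variates
cca = CCA(n_components=3)
U, V = cca.fit_transform(X1, X2)

# Canonical correlation for each pair of canonical variates
for i in range(3):
    r = np.corrcoef(U[:, i], V[:, i])[0, 1]
    print(f"canonical correlation {i + 1}: {r:.2f}")
```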
Canonical Correlation Analysis on Neural Network Representations
Coming back to the use of CCA to gauge representation similarities between neural networks: the underlying process is a neural network trained on some task. The multidimensional variates in this case are neuron activation vectors over some dataset X. As explained in Part 1, a neuron activation vector denotes the outputs of a single neuron (z) on X.
In short, one multidimensional variate = a single neuron activation vector
Then, a set of multidimensional variates = a layer consisting of neurons
We can then consider two layers, L1 and L2, of a neural network as two sets of observations, and apply CCA to determine the similarity between the two layers. Most importantly, this also enables comparisons between different neural networks, which is not naively possible due to the absence of any neuron-to-neuron alignment.
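Putting the pieces together, here is a minimal sketch of that comparison, assuming the activations of each layer have already been collected into a matrix of shape (datapoints, neurons), so that each column is one neuron activation vector. It reuses the canonical_correlations helper sketched earlier; the toy layer sizes and the mean-correlation summary are illustrative choices, not prescribed by the post.

```python
import numpy as np

# Toy activation matrices: rows are datapoints from X, columns are neurons,
# so each column is one neuron activation vector (one multidimensional variate).
rng = np.random.default_rng(1)
n_datapoints = 1000
acts_L1 = rng.normal(size=(n_datapoints, 64))                 # layer L1, 64 neurons
acts_L2 = np.tanh(acts_L1 @ rng.normal(size=(64, 32)))        # layer L2, 32 neurons
acts_L2 += 0.1 * rng.normal(size=acts_L2.shape)               # small noise for realism

# Canonical correlations between the two layers,
# using the canonical_correlations helper defined above.
rhos = canonical_correlations(acts_L1, acts_L2)

# A common scalar summary of layer similarity: the mean canonical correlation.
print(f"mean CCA similarity between L1 and L2: {rhos.mean():.3f}")
```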