Not Your Vanilla Canonical Correlation Analysis

Gatha Varma, PhD · Published in WiCDS · 5 min read · Jan 22, 2021

Part 3 of 5: A look into flavors of Canonical Correlation Analysis and their applications to convolution layers

Photo by Julia Volk from Pexels

Accepting help is its own kind of strength.

In the previous parts of this series, I talked about the internal representations learned by neural networks, and how Canonical Correlation Analysis (CCA) emerged as a potential candidate for comparing the internal representations of different neural networks. Now let us see how a variant of CCA was put to use in the scheme of things.

Existing applications of CCA included the study of brain activity and the training of multi-lingual word embeddings in language models. In 2017, it was proposed as a means to compare deep representations, but in combination with Singular Value Decomposition (SVD). Now you may ask why a two-step comparison method was devised when CCA by itself should be sufficient. While CCA is powerful enough to allow comparisons across different types of architectures, it suffers from a specific shortcoming: it cannot tell how many directions are actually important to the original space X. If the learned representations are spread across many dimensions, CCA needs an accomplice to capture the full picture. Therefore SVD and CCA were brought together to form Singular Vector Canonical Correlation Analysis (SVCCA).

In a previous article, I introduced the vector of outputs of a neuron i in layer l as zˡᵢ = (zˡᵢ(x₁), …, zˡᵢ(xₘ)), i.e. the neuron's responses over the m input datapoints.

So, with the representations of all the neurons of a layer collected in this multidimensional space, SVD first finds the singular vectors of the subspace spanned by the neuron output vectors z₁, z₂, … of that layer. The subsequent CCA step then computes linear transforms that align the two SVD-reduced subspaces along orthogonal, canonically correlated directions.

The outputs of SVCCA can be condensed into a single value called the SVCCA similarity ρ̄ that encapsulates how well the representations of two layers are aligned with each other.

ρ̄ = (1 / min(m₁, m₂)) Σᵢ ρᵢ

The SVCCA similarity measure.

where min(m₁, m₂) is the size of the smaller of the two layers under comparison. The SVCCA similarity ρ̄ is the average correlation across aligned directions and is a direct multidimensional analog of the Pearson correlation.
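
To make the pipeline concrete, here is a minimal NumPy sketch of the two steps, SVD pre-processing followed by CCA, and of the similarity ρ̄. It is an illustration under simplifying assumptions (a fixed 99% variance threshold, a QR-based CCA); the function names and shapes are my own, not the authors' reference implementation.

```python
import numpy as np


def svd_reduce(acts, keep_variance=0.99):
    """SVD step: keep the top singular directions that explain `keep_variance`
    of the variance. `acts` has shape (neurons, datapoints)."""
    acts = acts - acts.mean(axis=1, keepdims=True)      # center each neuron
    u, s, vt = np.linalg.svd(acts, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, keep_variance)) + 1
    return np.diag(s[:k]) @ vt[:k]                      # (k, datapoints)


def cca_correlations(x, y):
    """CCA step: canonical correlations between two (directions, datapoints)
    matrices, via orthonormal bases and one SVD."""
    qx, _ = np.linalg.qr(x.T)
    qy, _ = np.linalg.qr(y.T)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)    # canonical correlations
    return np.clip(rho, 0.0, 1.0)


def svcca_similarity(acts_x, acts_y):
    """SVCCA similarity: the mean correlation over aligned directions."""
    return cca_correlations(svd_reduce(acts_x), svd_reduce(acts_y)).mean()


# Toy usage: two layers with 50 and 64 neurons, observed on 1000 inputs.
layer_x = np.random.randn(50, 1000)
layer_y = np.random.randn(64, 1000)
print(svcca_similarity(layer_x, layer_y))
```

Identical layers give a similarity of 1, while the unrelated random layers in the toy example land well below that.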

M. Raghu et al., who proposed SVCCA, applied it to convolutional layers in two different ways to probe the nature of the learned representations.

1. For same layer comparisons

In this case, X and Y were the same layer, but at different training time-steps or across random initializations, receiving the same input. For convolutional layers, neurons at different pixel coordinates see different patches of the input image from each other (a small reshaping sketch for this case follows the list).

2. For different layer comparisons

This is the case when X and Y were not the same layer; the image patches seen by different neurons therefore had no natural correspondence.
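
As a concrete illustration of the same-layer case, here is one way the convolutional activations could be arranged before running the SVCCA sketch above: every (image, spatial position) pair is treated as a datapoint and the channels play the role of neurons. The helper name and shapes are assumptions of mine, not the paper's preprocessing code.

```python
import numpy as np


def conv_acts_to_matrix(acts):
    """acts: (images, height, width, channels) activations of one conv layer.
    Returns a (channels, images * height * width) matrix, i.e. channels act as
    neurons and every spatial position of every image is a datapoint."""
    n, h, w, c = acts.shape
    return acts.reshape(n * h * w, c).T


# Same layer, same 128 inputs, two training runs with different random seeds.
run_a = np.random.randn(128, 8, 8, 32)
run_b = np.random.randn(128, 8, 8, 32)

# Reusing svcca_similarity from the earlier sketch.
similarity = svcca_similarity(conv_acts_to_matrix(run_a),
                              conv_acts_to_matrix(run_b))
```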

SVCCA and its applications

Learning Dynamics

SVCCA gave a peek into learning dynamics by comparing a layer's representation at different points during training with its final representation. The figure below shows the learning dynamics plots for conv and res nets trained on CIFAR-10. Each pane is a matrix of size layers × layers, with each entry showing the SVCCA similarity ρ̄ between the two layers (a layer at that point in training versus a layer of the final trained network).

Learning dynamics plots for conv (top) and res (bottom) nets trained on CIFAR-10 dataset. Source: At the end of the article.

For the convolutional layers, it was found that learning broadly happens ‘bottom up’, i.e. the layers closer to the input solidify into their final representations much earlier than the topmost layers. This knowledge helped develop a simple, computationally cheaper method of training networks called Freeze Training. In freeze training, the lower layers are sequentially frozen after a certain number of time steps. The method is cheaper because training concentrates on higher and higher layers, so no gradients need to be computed, and no updates applied, for the already-frozen lower layers.
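
Below is a minimal PyTorch-flavoured sketch of that idea, written for illustration only: blocks closest to the input are frozen on a fixed schedule and the optimizer is rebuilt over whatever is still trainable. The model, schedule, and loop are assumptions, not the training recipe used in the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # block 0 (closest to input)
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # block 1
    nn.Flatten(), nn.Linear(32 * 32 * 32, 10),   # block 2 (classifier head)
)
blocks = [model[0:2], model[2:4], model[4:6]]
freeze_at_epoch = {0: 10, 1: 20}                 # freeze block 0 at epoch 10, block 1 at 20


def make_optimizer(m):
    # Only parameters that still require gradients are handed to the optimizer.
    return torch.optim.SGD((p for p in m.parameters() if p.requires_grad), lr=0.1)


optimizer = make_optimizer(model)
for epoch in range(30):
    for idx, freeze_epoch in freeze_at_epoch.items():
        if epoch == freeze_epoch:
            for p in blocks[idx].parameters():
                p.requires_grad = False          # stop computing its gradients
            optimizer = make_optimizer(model)    # keep only trainable params
    # ... one epoch of the usual forward/backward/step loop goes here ...
```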

When are classes learned?

The number of neurons in convolutional layers, especially the early ones, is very large. This makes SVCCA prohibitively expensive because of the size of the matrices involved. Now how do we handle this problem? Does this remind you of dimensionality reduction? For convolutional layers, however, irrespective of their size, a Discrete Fourier Transform (DFT) applied to each channel can be used in place of dimensionality reduction techniques: for translation-invariant inputs, the DFT block-diagonalizes the matrices involved, so the analysis can be carried out exactly, one block at a time. This DFT CCA was used to trace how knowledge about the target evolves through the network, by measuring how strongly the representations in each layer correlate with the logits of each target class.

Imagine this technique as a dye test, where the dye is the correlation of the representations with the logits. The flow of the dye through the neural network for different classes uncovers how those classes were learned by the network.
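
Here is a small NumPy sketch of the DFT pre-processing step as I understand it: a 2-D DFT is applied over the spatial axes of each channel, after which the comparison can be carried out one spatial frequency at a time, so each block is only of size images × channels. The shapes and names are illustrative assumptions.

```python
import numpy as np


def dft_per_channel(acts):
    """acts: (images, height, width, channels) conv-layer activations.
    Applies a 2-D DFT over the spatial axes of every channel; the result has
    the same shape but is complex-valued."""
    return np.fft.fft2(acts, axes=(1, 2))


acts = np.random.randn(256, 8, 8, 64)
freq = dft_per_channel(acts)

# One block per spatial frequency (i, j): a (images, channels) matrix on which
# the correlation analysis is run independently before aggregating the results.
block_00 = freq[:, 0, 0, :]            # shape (256, 64)
```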

The experiment considered five different classes: firetruck and four dog breeds, where the dog classes formed two pairs of visually similar breeds, a pair of terriers and a pair of huskies. For each layer of an ImageNet ResNet, the DFT CCA similarity was computed between that layer and the logit of each class.

CCA similarity, computed with the DFT, between the logits of the five classes and the layers of the ImageNet ResNet. Source: At the end of the article.

The CCA similarity captured these semantic properties when plotted for each of the five classes. We can see that the line corresponding to firetruck is clearly distinct from those of the two pairs of dog breeds, while the two lines within each pair of visually similar breeds stay very close to each other. Firetruck also appears to be easier for the network to learn, with a greater sensitivity displayed much sooner.
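
As a rough idea of what the dye test computes per layer, the sketch below compares each layer's activations with the logit of a single class over the same images, reusing the svcca_similarity function from the first sketch in place of the DFT variant. All arrays here are random placeholders, purely for illustration.

```python
import numpy as np

num_images = 500
class_logit = np.random.randn(1, num_images)        # (1, datapoints)

layers = [np.random.randn(64, num_images),          # a few layers, each shaped
          np.random.randn(128, num_images),         # (neurons, datapoints)
          np.random.randn(256, num_images)]

# One similarity per layer: how strongly that layer's representation already
# correlates with the class logit. Reuses svcca_similarity from the earlier sketch.
per_layer_similarity = [svcca_similarity(acts, class_logit) for acts in layers]
```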

In the coming articles, I will talk about the relevance of learned similarities in understanding training and generalization.

Sources:

Maithra Raghu, Justin Gilmer, Jason Yosinski, Jascha Sohl-Dickstein. “SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability.” NeurIPS 2017.

Research Scientist @Censius Inc. Find more of my ramblings at: gathavarma.com