Similarity Index Revisited

Gatha Varma, PhD
Published in WiCDS
Feb 19, 2021

Part 5 of 5: Revisiting the concept of similarity index and introduction of a new player.

Photo by Engin Akyurt from Pexels

A relationship is about two things. First, appreciating the similarities, and second, respecting the differences

Similarity Indices and Their Invariance Properties

For a trained neural network, the matrix of activations M can be written as:

M ∈ ℝ^(s×m)

where s is the number of examples and m is the number of neurons.

A similarity index s(X, Y) therefore compares two activation matrices X and Y, where X contains the activations of p1 neurons and Y the activations of p2 neurons, each evaluated on the same n examples. To be effective, a similarity index must possess certain invariance properties:

  • Invariance to orthogonal transformation: s(X, Y) = s(XU, YV) for full-rank orthonormal matrices U and V. This property is especially desirable for neural networks trained by gradient descent, because an orthogonal transformation of the input does not affect the dynamics of gradient-descent training. Invariance to orthogonal transformation also implies invariance to permutation, which is needed to accommodate the symmetries of neural networks.
  • Invariance to isotropic scaling: s(X, Y) = s(αX, βY) for any α, β ∈ ℝ⁺.
  • But not invariance to invertible linear transformation. A similarity index is said to be invariant to invertible linear transformation if it satisfies s(X, Y) = s(XA, YB) for any full-rank matrices A and B. Neural networks trained from different random initializations develop representations with similar large principal components and, consequently, similar Euclidean distances between examples. Invariance to invertible linear transformation would imply that the scale of directions in activation space is irrelevant, which directly contradicts the meaningfulness of those distances and ignores this important aspect of the representation. An index that is invariant to linear transformation is therefore not suitable for comparing learned representations.
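These invariance properties can be checked numerically. Below is a minimal sketch using linear CKA (introduced later in this article) as the candidate index; the matrix sizes and the `linear_cka` helper are illustrative assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 100, 20, 30            # n examples, p1 and p2 neurons
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))

def linear_cka(X, Y):
    # Center each neuron's activations across examples
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

# Random orthonormal matrices via QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((p1, p1)))
V, _ = np.linalg.qr(rng.standard_normal((p2, p2)))

base = linear_cka(X, Y)
print(np.isclose(base, linear_cka(X @ U, Y @ V)))      # orthogonal transform: invariant
print(np.isclose(base, linear_cka(3.0 * X, 0.5 * Y)))  # isotropic scaling: invariant
A = rng.standard_normal((p1, p1))  # almost surely full rank, not orthogonal
print(np.isclose(base, linear_cka(X @ A, Y)))          # general linear transform: not invariant
```

The last check illustrates the point above: a generic full-rank A rescales directions in activation space, and the index value changes accordingly.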

Related Similarity Indices

Now that we have listed down the three important invariance properties of similarity indices, let us review the primary methods that are used to compare similarities between neural network representations.

Image by the author

There are also other methods, such as comparing the alignment between individual neurons. This approach shifts the focus away from alignment between subspaces, but it does not yield promising results for the intermediate layers of neural networks. Another approach, mutual information, captures non-linear statistical dependencies between variables, in this case neuron alignment. The researchers, however, did not find mutual information suitable for comparing representations.

CKA as a Similarity Index

The notion of kernel alignment was introduced back in 2001. It measures the degree of agreement between a kernel and a learning task, and it was widely used for kernel selection thanks to its effectiveness and low computational complexity. Since training datasets can be assumed to be linearly separable in the feature space, kernel alignment can serve as an evaluation measure for kernel learning and model selection. To uncover the complex interaction between training dynamics and structured data, researchers at Google Brain, including deep-learning pioneer Geoffrey Hinton, proposed the use of Centered Kernel Alignment (CKA) as a similarity index in 2019.

Linear CKA is closely related to CCA and linear regression. It resembles CCA in that the eigenvectors that explain the variance in X or Y are weighted by their eigenvalues. While SVCCA and projection-weighted CCA were motivated by the idea that eigenvectors with small eigenvalues are not significant, CKA incorporates this weighting symmetrically and can be computed without a matrix decomposition.
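For reference, linear CKA reduces to a simple closed form on centered activation matrices. The sketch below is my own minimal NumPy version, not the authors' code; it shows that no matrix decomposition is needed, that the index is symmetric, and that it handles layers of different widths.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n x p1) and Y (n x p2),
    computed on the same n examples."""
    Xc = X - X.mean(axis=0)   # center each neuron over the examples
    Yc = Y - Y.mean(axis=0)
    # ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 64))       # a "wide" layer
Y = X @ rng.standard_normal((64, 16))    # a narrower layer derived from X
print(round(linear_cka(X, X), 4))        # a representation vs. itself -> 1.0
print(np.isclose(linear_cka(X, Y), linear_cka(Y, X)))  # symmetric
```

Note that only matrix products and Frobenius norms are involved, which is what makes linear CKA cheap to compute compared with decomposition-based indices such as SVCCA.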

The team was able to show that CKA could determine the correspondence between the hidden layers of neural networks trained from different random initializations and with different widths, the scenarios where previously proposed similarity indices had failed. Through the use of CKA, it was also determined that wider networks learn more similar representations; that the similarity of early layers saturates at fewer channels than that of later layers; and that early layers, but not later layers, learn similar representations on different datasets.

Circling back to the questions that were raised in Part 1 of this series, let us see how CKA answered them:

  • Do deep neural networks with the same architecture trained from different random initializations learn similar representations?
Top: Linear CKA between layers of individual networks of different depths on the CIFAR-10 test set. Titles show the accuracy of each network. Bottom: Accuracy of a logistic regression classifier trained on layers of the same networks is consistent with CKA. Source given at the end of the article

The above figure shows CKA between layers of individual CNNs with different depths, where layers were repeated 2, 4, or 8 times. Doubling depth improved accuracy, but greater multipliers hurt it. At 8x depth, CKA indicated that representations of more than half of the network were very similar to the last layer.

  • Can we establish correspondences between layers of different network architectures?

CKA is equally effective at revealing relationships between layers of different architectures.

Linear CKA between layers of networks with different architectures on the CIFAR-10 test set. Source given at the end of the article

CKA indicated that, as networks are made deeper, the new layers are effectively inserted in between the old layers. Other similarity indices had failed to reveal meaningful relationships between different architectures.

  • How similar are the representations learned using the same network architecture from different datasets?
Left: The similarity between the same layer of different models on the CIFAR-10 test set. Right: The similarity computed on CIFAR-100 test set. CKA was averaged over 10 models of each type (45 pairs). Source given at the end of the article

CKA can also be used to compare networks trained on different datasets. The figure shows that models trained on CIFAR-10 and CIFAR-100 developed similar representations in their early layers. These representations required training, but the similarity with untrained networks was found to be much lower.

CKA seems to be much better than previous methods at finding correspondences between the learned representations in hidden layers of neural networks. However, it remains an open question whether there exist kernels beyond the linear and RBF kernels that would be better for analyzing neural network representations.
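The general (kernel) form of CKA makes it straightforward to experiment with kernels other than the linear one. The sketch below is a hedged illustration rather than the authors' code: it computes CKA from arbitrary kernel matrices via HSIC, using an RBF kernel whose bandwidth follows a median-distance heuristic (the `sigma_frac` parameter is my own choice).

```python
import numpy as np

def rbf_kernel(X, sigma_frac=0.5):
    """RBF kernel matrix; bandwidth set as a fraction of the median pairwise distance."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = sigma_frac * np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cka(K, L):
    """CKA between two kernel matrices, via (biased) HSIC estimates."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    hsic = np.sum(Kc * Lc)
    return hsic / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 32))
Y = rng.standard_normal((100, 8))
print(round(cka(rbf_kernel(X), rbf_kernel(X)), 4))       # identical representations -> 1.0
print(0.0 <= cka(rbf_kernel(X), rbf_kernel(Y)) <= 1.0)   # bounded in [0, 1]
```

Passing the linear kernel K = XXᵀ into the same `cka` function recovers linear CKA, so this formulation is a natural starting point for exploring other kernels.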

With this story, I end the five-part series that explored the learned representations and their similarities among neural networks.



Research Scientist @Censius Inc. Find more of my ramblings at: gathavarma.com