Learned Representations to Understand Neural Network Training Dynamics

Gatha Varma, PhD · Published in WiCDS · Feb 12, 2021

Part 4 of 5: Using neural network representations to understand different types of training dynamics.

Photo by Karolina Grabowska from Pexels

What sets you apart might feel like a burden but it’s not; it’s what makes you great.

Problem-solving through the ages has been driven by the ability to recognize how a new problem fits the knowledge the solver already possesses, and to recall which solution worked in similar cases. Remember those science experiments where we marveled at how rewards and punishments taught chimps to push the right buttons? Before you go off to watch YouTube videos of smart chimps, let us first see how neural networks use similar strategies to solve the problems they are handed. It is no longer about opening the correct door or identifying dogs and cats: deep learning is being used to paint like artists long dead, write stories, and make people in photographs sing, laugh, cry, and dance.

So how do neural networks see the problem at hand? Machine learning models, neural networks included, sit somewhere between two behaviors. On one hand there is generalization: a model trained on accurately labelled data learns a solution that carries over to new, unseen data. On the other hand there is memorization: a network trained on randomized labels can still fit the training data, yet what it has memorized tells it little about new inputs. The line between the two is blurrier than it sounds. Generative Adversarial Networks (GANs), for example, lean heavily on memorization and are nonetheless known for generalization ability when it is measured with neural network-based distances. It has also been shown that the degree of memorization and generalization in deep neural networks depends not only on the architecture and training procedure but also on the training data itself.

Now you will ask whether we are headed towards a world divided over memorization versus generalization. I leave that answer for another time, since a lot of research is still going on to uncover the nitty-gritty of neural network learning dynamics. There are some puzzling results: large networks that memorized their training data yet achieved low generalization error, and distribution-learning models such as GANs that generate convincing new images despite being trained on a small number of samples. Such networks beat the curse of dimensionality; their training is riddled with issues like mode collapse and vanishing gradients, and yet they exhibit good generalization. All of these questions lead us to focus on the training dynamics of deep neural networks.

Enter the CCA

Coming back to our old friend Canonical Correlation Analysis (CCA), let us see how researchers used it to tease apart the training dynamics of generalizing and memorizing CNNs.

CCA and its use for comparing the layers of a neural network have already been discussed here. In this article, the focus shifts to different networks with different training dynamics. Researchers at DeepMind and Google Brain built upon SVCCA to develop projection-weighted CCA for comparing CNNs: instead of a plain average, the canonical correlations are combined with a weighted mean, where each weight reflects how strongly the corresponding CCA vector relates to the underlying representation. A CCA vector that accounts for more of the representation therefore receives a higher weight.
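
To make the idea concrete, here is a minimal NumPy sketch of projection-weighted CCA over two activation matrices. It follows the recipe described above, but it is not the authors' reference implementation, and the function names are my own.

```python
import numpy as np

def cca(X, Y, eps=1e-10):
    """Canonical correlations between two activation matrices.

    X, Y: arrays of shape (num_examples, num_neurons). Returns the sorted
    canonical correlations rho and the canonical variables H of X (one
    column per CCA direction). A minimal sketch, not reference code.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Whiten each view via its SVD, dropping near-zero directions for stability
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, Sx > eps], Uy[:, Sy > eps]
    # Singular values of the cross-product are the canonical correlations
    U, rho, _ = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    return rho, Ux @ U

def pwcca(X, Y):
    """Projection-weighted CCA similarity between X and Y (a sketch)."""
    rho, H = cca(X, Y)
    # Weight each CCA direction by how much of X's neurons it accounts for
    alpha = np.abs(H.T @ (X - X.mean(axis=0))).sum(axis=1)
    return float((alpha / alpha.sum()) @ rho)
```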

It must also be remembered that real-world training data contains noise. Since training dynamics are also shaped by the training data, how do the dynamics differ between the 'original signal' and the accompanying 'noise'? To answer this, CCA similarity was computed between layer L at each time t during training and the same layer L at the final time step T. The sorted CCA coefficients ρ were found to keep changing well after the network's performance had converged, which suggests that the non-converged coefficients and their corresponding vectors represent the 'noise'.
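
A rough sketch of that measurement, assuming activations of layer L have been saved at a series of checkpoints (the variable names are hypothetical, and pwcca is the function sketched above):

```python
# Hypothetical setup: acts_over_time[t] holds layer L's activations
# (num_examples x num_neurons) on a fixed batch at checkpoint t.
final = acts_over_time[-1]
similarity_to_final = [pwcca(acts, final) for acts in acts_over_time]
# Accuracy typically plateaus well before similarity_to_final does; the
# CCA directions that are still moving are the candidate "noise" components.
```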

The next question was whether the CCA vectors that stabilized early in training remained stable. To test this, CCA vectors were computed between layer L at an early time step tₑₐᵣₗᵧ and at time step T/2. The top 100 vectors, which had stabilized early, remained similar to the representation at all later training steps, while the bottom 100 vectors, which had not stabilized, continued to vary and therefore likely represented noise. These results suggest that task-critical representations are learned by midway through training, while the noise only approaches its final value towards the end.
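
Here is one way such a stability probe might look, again only as a sketch reusing NumPy and the cca helper from the earlier snippet; acts_early, acts_half, and acts_over_time are hypothetical names for the saved activations.

```python
# acts_early and acts_half: layer L's activations at an early checkpoint
# and at step T/2; acts_over_time: the later checkpoints.
k = 100
rho, H = cca(acts_early, acts_half)       # rho is sorted in descending order
top, bottom = H[:, :k], H[:, -k:]         # most / least correlated directions
top_track = [np.mean(cca(top, acts)[0]) for acts in acts_over_time]
bottom_track = [np.mean(cca(bottom, acts)[0]) for acts in acts_over_time]
# top_track stays high across training (signal learned early), while
# bottom_track keeps drifting (likely noise).
```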

Coming back to generalization and memorization dynamics, the convergence behavior of networks trained on the CIFAR-10 dataset was compared under the following conditions (a sketch of the group comparison follows the list):

  • when trained on the true labels versus identically randomized labels,
  • across different network widths, and
  • over a large sweep of 200 networks.
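
A hedged sketch of that comparison, assuming per-network activation matrices have been collected for the two label conditions (names like acts_true_labels are illustrative, and pwcca is the function sketched earlier):

```python
from itertools import combinations
import numpy as np

def mean_pairwise_distance(group_acts):
    """Mean pairwise (1 - PWCCA) distance over a group of trained networks.

    group_acts: a list of activation matrices, one per network, taken from
    the same layer on the same evaluation batch (hypothetical setup).
    """
    dists = [1.0 - pwcca(a, b) for a, b in combinations(group_acts, 2)]
    return float(np.mean(dists))

# Expected pattern from the experiments: at later layers the generalizing
# group is noticeably tighter than the memorizing group.
d_generalizing = mean_pairwise_distance(acts_true_labels)
d_memorizing = mean_pairwise_distance(acts_random_labels)
```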

It was found that generalizing networks converged to solutions more similar to one another than memorizing networks did, which can be attributed to the more constrained nature of generalizing networks. Moreover, memorizing networks were only about as similar to each other as they were to generalizing networks, suggesting that the solutions found by memorizing networks were as diverse as those found across entirely different dataset labelings.

Pairwise CCA distance for five networks (generalizing and memorizing), and the projection-weighted CCA coefficient. The source is at the end of the story.

It was also found that at early layers, all networks converged to equally similar solutions regardless of whether they generalized or memorized (as shown in the figure above). This makes sense, since the feature detectors found in the early layers of CNNs are useful regardless of how the dataset is labeled. At later layers, however, groups of generalizing networks converged to substantially more similar solutions than groups of memorizing networks. At the softmax layer, both groups converged to highly similar solutions when CCA distance was computed on training data. On test data, only the generalizing networks converged to similar softmax outputs, again reflecting that each memorizing network memorizes the training data using a different strategy.

a) Pairwise CCA distance for five trained networks. b) Test accuracy was highly correlated with the degree of convergent similarity. The source is at the end of the story.

It has been observed that networks initialized and trained from scratch with fewer parameters converge to poorer solutions than networks obtained by pruning larger ones. So are larger networks more likely to converge to similar solutions than smaller networks? To answer this question, groups of convolutional networks were trained with increasing numbers of filters at each layer, and projection-weighted CCA was used to measure the pairwise similarity within each group of networks of the same size. Larger networks were indeed found to converge to much more similar solutions than smaller ones.
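
A sketch of how such a width sweep might be scored, reusing the mean_pairwise_distance helper defined above (acts_by_width is a hypothetical mapping from filter count to a group of activation matrices):

```python
# acts_by_width[w]: a list of activation matrices for a group of networks
# that all use w filters per layer, taken at the same layer on the same batch.
similarity_by_width = {
    w: 1.0 - mean_pairwise_distance(group)    # higher = tighter group
    for w, group in acts_by_width.items()
}
# The reported finding: this similarity grows with width, i.e. wider networks
# converge to more similar solutions than narrower ones.
```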

So far, I have talked about CCA and how it can be used to delve into learned representations and training dynamics. Here I end my discussion of CCA; in the next post I will talk about another similarity index. I would really appreciate your feedback and suggestions in case I missed something. Until next time.

Sources

Ari S. Morcos, Maithra Raghu, and Samy Bengio. "Insights on representational similarity in neural networks with canonical correlation." NeurIPS 2018. arXiv:1806.05759.


Research Scientist @Censius Inc. Find more of my ramblings at: gathavarma.com