FAST AI JOURNEY: COURSE V3. PART 1. LESSON 7.

Documenting my fast.ai journey: PAPER REVIEW. VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS.

SUREN HARUTYUNYAN
13 min read · Jan 5, 2019

For the Lesson 7 Project, I decided to dive into the 2017 paper Visualizing the Loss Landscape of Neural Nets, by Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein, which was presented at the 2018 NeurIPS Conference. The authors have made the code available here.

Our objective here is to understand the presented paper, go through all the sections step by step, and be able to describe its contents using the notions we have learned during the course.

NeurIPS Logo. Source: https://twitter.com/nipsconference.

1. Introduction.

In the paper, the authors address the question of how neural network architecture design, optimizer choice, etc., affect the loss surface. Since most existing studies are theoretical and exhaustively evaluating the loss function has a prohibitive cost, they tackle the problem by visualizing the loss surface itself.

Figure 1. Source: https://arxiv.org/pdf/1712.09913.pdf.

The specific questions to answer here are:

  1. Why are we able to minimize highly non-convex neural loss functions?
  2. And why do the resulting minima generalize?

To answer these, the authors use visualizations of the neural loss functions, exploring how the network’s architecture affects the loss landscape. They also explore how the shape of the loss function affects trainability and the generalization error of the neural net.

1.1. Contributions.

The main contributions of the work are the following:

  1. Proof of failures of previous visualization methods for loss functions.
  2. Presenting a visualization method based on “filter normalization”, allowing side-by-side comparisons of minimizers.
  3. Observing that with enough layers, the loss landscape changes from nearly convex to highly chaotic, coinciding with poorer generalization and a lack of trainability.
  4. Observing that adding skip connections prevents this transition to a chaotic landscape.
  5. Quantitative method of measuring the non-convexity.
  6. Showing that SGD optimization trajectories lie in a low dimensional space.

2. Theoretical Background.

The researchers observe that, until now, studies of neural loss functions have been mostly theoretical: under restrictive assumptions (a single hidden layer, etc.), near-optimal solutions have been shown to exist, although counterexamples containing bad local minima have been constructed as well.

Other works have tackled the sharpness/flatness of local minima and tried several ways to define flatness, but to little avail: the quantitative measures of sharpness proposed so far either fail to determine generalization ability or are difficult to compute accurately.

3. The Basics of Loss Function Visualization.

First of all, the authors note that neural networks are trained on a corpus of feature vectors (for example, images) and their corresponding labels. Training is performed by minimizing a loss that measures how well the weights predict a label from a data sample. These networks contain many parameters, so their loss functions are high-dimensional. Since we can only plot in low dimensions (1D lines or 2D surfaces), we need methods to close this dimensionality gap.
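In the paper’s notation, this loss has the (standard) form below, where θ are the network weights, (x_i, y_i) are the m training samples, and ℓ is the per-sample loss:

$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(x_i, y_i; \theta)$$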

3.1. 1-Dimensional Linear Interpolation.

We can plot loss functions in a simple way by choosing two sets of parameters, θ and θ’, evaluating the loss function, and plotting its values along the line connecting these two points.

Using a scalar parameter α, we can define a weighted average of the form:

Weighted Average. Source: https://arxiv.org/pdf/1712.09913.pdf.

and plot its function, which has the form:

Function form of the Weighted Average. Source: https://arxiv.org/pdf/1712.09913.pdf.
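Reconstructed in the paper’s notation, the weighted average and the plotted function are:

$$\theta(\alpha) = (1-\alpha)\,\theta + \alpha\,\theta', \qquad f(\alpha) = L(\theta(\alpha)) = L\big((1-\alpha)\,\theta + \alpha\,\theta'\big)$$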

For example, in 2015 Goodfellow et al. took this approach to study the loss surface along a line. They obtained the first point from a random initial guess, while the second was a minimizer obtained by SGD.

This method has been applied in different works, such as studying the sharpness/flatness of minima, the dependence of sharpness on batch size, exploring the points between different minima, or obtaining different minima with different optimizers and plotting the line between them.
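As a minimal sketch of the idea (not the authors’ code), assuming we already have two flattened weight vectors theta_a and theta_b and a hypothetical helper eval_loss that loads a flat weight vector into the model and returns the loss on a fixed dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

def interpolation_curve(theta_a, theta_b, eval_loss, num_points=25):
    """Loss along the line (1 - alpha) * theta_a + alpha * theta_b."""
    alphas = np.linspace(-0.5, 1.5, num_points)   # extend a bit past both endpoints
    losses = [eval_loss((1 - a) * theta_a + a * theta_b) for a in alphas]
    return alphas, np.array(losses)

# Usage, assuming theta_a, theta_b and eval_loss are defined elsewhere:
# alphas, losses = interpolation_curve(theta_a, theta_b, eval_loss)
# plt.plot(alphas, losses); plt.xlabel("alpha"); plt.ylabel("loss"); plt.show()
```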

There are a couple of problems posed by the 1D linear interpolations:

  1. It is difficult to visualize non-convexities with them: Goodfellow et al. observed no local minima along the minimization trajectory, yet 2D methods show that extreme non-convexities do exist and that they correlate with the difference in generalization between networks with different architectures.
  2. It ignores batch normalization parameters and invariance symmetries in the network, which can make 1D plots misleading.

3.2. Contour Plots & Random Directions.

In this approach we choose a center point θ* and two direction vectors, δ and η. We can then plot the 1D (line) function, which has the form:

1D Function form. Source: https://arxiv.org/pdf/1712.09913.pdf.

The 2D (surface) function takes the form:

Formula 1. Source: https://arxiv.org/pdf/1712.09913.pdf.
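Written out in the paper’s notation, the 1D and 2D forms are:

$$f(\alpha) = L(\theta^* + \alpha\,\delta), \qquad f(\alpha, \beta) = L(\theta^* + \alpha\,\delta + \beta\,\eta)$$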

This method has been used to explore the paths of different minimization methods and to prove that depending on the optimizer, we could find different local minima within the 2D surface.

The main problem with this strategy has been the high computational cost of producing high-resolution 2D surfaces; in practice, only small regions have been captured at low resolution.

4. Proposed Visualization: Filter-Wise Normalization.

The authors use surface plots of the form described above, with a pair of direction vectors sampled from a random Gaussian distribution.

The main problem with random direction vectors is their failure to capture the intrinsic geometry of the loss surface, which makes it impossible to compare two different optimizers or networks.

This is caused by the scale invariance of network weights: with ReLUs, multiplying the weights in one layer by a number a and dividing the weights in the next layer by the same number leaves the network unchanged. This invariance is even more pronounced with batch normalization, since the outputs of each layer are re-scaled during normalization.

We have to remember that neural networks are scale invariant: a network with large weights and one with small weights can be equivalent (the latter being a rescaling of the former), so apparent differences between their loss plots may be due only to this scale invariance.

The authors remove this effect by plotting the loss functions using filter-wise normalized directions. The directions for a network with parameters θ are obtained as follows:

  1. Produce a random Gaussian direction vector d, whose dimensions are compatible with the weights θ.
  2. Normalize each filter in d so that it has the same norm as the corresponding filter in θ.

From the paper, we can see that the authors perform the second step filter by filter: each filter of d is divided by its own Frobenius norm and multiplied by the Frobenius norm of the corresponding filter in θ. As a refresher on the Frobenius norm, you can check out this explanatory video.

An important factor to take into account is that filter normalization is applied both to the convolutional layers and to the fully connected ones.
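Below is a minimal PyTorch-style sketch of this normalization, following the per-filter rule described above. This is my reading of the method, not the authors’ released code; zeroing the direction for biases and batch-norm parameters is an extra assumption.

```python
import torch

def filter_normalized_direction(parameters, eps=1e-10):
    """Random Gaussian direction, rescaled filter by filter so that each
    filter has the same Frobenius norm as the corresponding filter in theta."""
    direction = []
    for theta in parameters:                     # one tensor per layer
        d = torch.randn_like(theta)
        if theta.dim() <= 1:
            d.zero_()                            # biases / batch-norm params (assumption)
        else:
            # dim 0 indexes the filters (output channels / neurons)
            for d_f, theta_f in zip(d, theta):
                d_f.mul_(theta_f.norm() / (d_f.norm() + eps))
        direction.append(d)
    return direction
```

With two such directions δ and η, the contour plots are then obtained by evaluating L(θ* + αδ + βη) on a grid of (α, β) values.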

Next, the researchers show that surface plots of the form:

Formula 1. Source: https://arxiv.org/pdf/1712.09913.pdf.

are an accurate representation of the natural distance scale of the loss surface when the vectors δ and η are filter-normalized. They also show that the contour shape of filter-normalized plots correlates better with the generalization error.

5. The Sharp vs Flat Dilemma.

In this section they explore the difference between sharp and flat minimizers, since the claim that:

[…] small-batch SGD produces “flat” minimizers that generalize well, while large batches produce “sharp” minima with poor generalization.

has been disputed.

To explore the difference between sharp and flat minimizers, they trained a CIFAR-10 classifier with a 9-layer VGG network using batch normalization, for a fixed number of epochs.

The authors picked two batch sizes:

  1. Large: 8,192.
  2. Small: 128.

θl and θs represent the weights obtained by running SGD with the large and the small batch size, respectively.

Next, the authors use the linear interpolation approach described earlier to plot the loss on the training and test data sets along a direction containing both solutions, using a function of the form:

Formula 1. Source: https://arxiv.org/pdf/1712.09913.pdf.
Figure 2. Source: https://arxiv.org/pdf/1712.09913.pdf.
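In the paper’s notation, the interpolated loss between the two solutions is:

$$f(\alpha) = L\big(\theta^{s} + \alpha\,(\theta^{l} - \theta^{s})\big)$$

so that α = 0 corresponds to the small-batch solution θs and α = 1 to the large-batch solution θl.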

The results show that with the small batch size we obtain a flat minimum, while sharpness can be observed with the larger batch size. This result can be inverted once weight decay is used. In both cases, the small-batch runs generalize better, which suggests that naive comparisons of sharpness and generalization are misleading.

To explain these differences in sharpness, they examine the weights of each minimizer. Using the large batch size with zero weight decay produces histograms with smaller weights than using the smaller batch size does. As before, adding weight decay flips this result.

The inverted results arise because small batches update the weights more times per epoch than large batches, and these extra updates accentuate the effect of weight decay.

Moreover, what we observe as sharpness is not the inherent sharpness of the minimizers but an artifact of weight scaling, which is irrelevant since batch normalization re-scales the outputs to have unit variance.

5.1. Filter Normalized Plots.

The authors repeated the above experiment, but plotted the loss function using random filter-normalized directions, which removes the differences caused by weight scaling. There are still sharpness differences between the large- and small-batch minima, but they are much more subtle.

Figure 3. Source: https://arxiv.org/pdf/1712.09913.pdf.

In the end, we can observe that sharpness now correlates well with generalization error: large batches produce sharper minima with higher test errors, and the gap in test error remains whether or not weight decay is used.

6. What Makes Neural Networks Trainable? Insights on the (Non)Convexity Structure of Loss Surfaces.

Another interesting result is that the ability to find global minimizers to neural loss functions is related to the architecture and to the initial training parameters. For example, another group of researchers were able to train deep architectures using skip connections, but unable to train similar architectures without them.

Figure 4. Source: https://arxiv.org/pdf/1712.09913.pdf.

The authors continue the paper with the objective of answering the following questions using visualization methods:

  1. Do loss functions have significant non-convexity at all?
  2. If prominent non-convexities exist, why are they not problematic in all situations?
  3. Why are some architectures easy to train?
  4. And why are results so sensitive to the initialization?

The fundamental intuition is that architectures differ considerably in their non-convexity structure, and that these differences correlate with the generalization error.

6.1. Experimental Setup.

To tackle the above problems and understand how the architecture affects the loss surface structure, the authors trained three different networks and plotted the landscape surrounding the minimizers (obtained by applying the filter-normalized random-direction method).

The trained networks are the following:

  1. Standard ResNets: ResNets with different numbers of layers, namely 20, 56 and 110, optimized for the CIFAR-10 dataset.
  2. No-Skip ResNets: the same ResNets without skip connections, producing networks similar to VGGs.
  3. Wide ResNets: ResNets with more filters per layer than the CIFAR-10 optimized architectures above.

The networks were trained using:

  1. CIFAR-10 dataset.
  2. SGD with Nesterov Momentum. This is a simple explanation of the Nesterov Momentum.
  3. Batch size of 128.
  4. Weight decay of 0.0005, training for 300 epochs.
  5. Learning rate of 0.1, divided by 10 after epochs 150, 225, and 275. The No-Skip ResNets were initialized with a learning rate of 0.01.
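A rough PyTorch sketch of this setup follows; the momentum value of 0.9 and the stand-in model are my assumptions, while the rest follows the settings listed above.

```python
import torch
from torch import optim

model = torch.nn.Linear(10, 10)   # stand-in; the real models are the (No-Skip / Wide) ResNets

optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,                       # No-Skip ResNets start at 0.01 instead
    momentum=0.9,                 # assumed value, not stated in this post
    nesterov=True,
    weight_decay=5e-4,
)
# Divide the learning rate by 10 after epochs 150, 225 and 275.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225, 275], gamma=0.1)

# for epoch in range(300):
#     train_one_epoch(model, optimizer)   # hypothetical training loop
#     scheduler.step()
```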

The authors show the results in 2D contour plots, with the 2 axes representing 2 random directions with filter-wise normalization.

6.2. The Effect of Network Depth.

The authors observed that as the network depth of the No-Skip ResNets increases, the loss surface transitions from convex to chaotic.

Figure 5. Source: https://arxiv.org/pdf/1712.09913.pdf.

Observe especially No-Skip ResNet-56 and No-Skip ResNet-110, which display extreme sharpness around the center of the minimizer.

6.3. Shortcut Connections to the Rescue.

Notice that once we add skip connections, the transition described above is prevented. More importantly, the effect of skip connections is most pronounced for deeper networks. Compare the contours of (b) and (e), for example: the latter is jagged, with extreme sharpness at the center, while the former has a much smoother contour.

6.4. Wide Models vs Thin Models.

Figure 6. Source: https://arxiv.org/pdf/1712.09913.pdf.

The authors wanted to see the effect of the number of convolutional filters per layer. For this reason, they compare the CIFAR-optimized ResNets (bottom) with Sergey Zagoruyko’s and Nikos Komodakis’ 2016 Wide-ResNets (top), multiplying the number of filters per layer by k = 2, 4, and 8. As we can see, increased network width produces flatter minima and wider convex regions. Skip connections widen the minimizers, and this width prevents chaotic behavior. Also, observe that the more convex the contours, the lower the test error.

6.5. Implications for Network Initialization.

In Figure 5, we observed landscapes partitioned into regions of low loss value with convex contours, surrounded by regions of high loss value with non-convex contours. The authors suggest that these partitions explain the importance of initialization values, and why random initialization strategies can help avoid the non-convex contours. For example, SGD could not train a 156-layer network without skip connections, even with a low learning rate. The explanation is that sufficiently deep networks present non-convex landscapes whose convex regions (read: local minima) are shallow, so the optimizer gets stuck in them, making training impossible.

6.6. Landscape Geometry Affects Generalization.

The authors continue with the idea that landscape geometry correlates with generalization. We have observed that:

  1. Flatter contours correlate with lower test errors, lending weight to the filter normalization method.
  2. Networks without skip connections generalize worse than the more convex ones, while Wide-ResNets generalize best of all.

6.7. A note of caution: Are we really seeing convexity?

Since the loss surfaces presented in the paper have undergone a dramatic dimensionality reduction, the authors are careful about how the plots should be interpreted. Hence, they study the level of convexity of the loss function.

The method used was to calculate the eigenvalues of the Hessian (called principal curvatures in the paper). Convex functions have non-negative curvatures (i.e., a positive semi-definite Hessian), while non-convex functions have negative curvatures.

Keep in mind that the curvatures of a dimensionality-reduced plot are just weighted averages of the full-dimensional curvatures. The consequences of this are:

  1. Non-convexity in the reduced surface implies non-convexity in the full-dimensional one.
  2. Convexity in the reduced surface does not imply convexity in the full-dimensional one.

To explore whether there is residual non-convexity that the plots are not capturing, the authors computed the minimum and maximum eigenvalues of the Hessian, λmin and λmax.

Figure 7. Source: https://arxiv.org/pdf/1712.09913.pdf.

Mapping the ratio |λmin / λmax| across the loss surfaces, the blue regions indicate the presence of (near-)convexity. Observe that some convex-looking regions in the surface plots presented before correspond to regions with negative eigenvalues. These are insignificant in magnitude, but they represent non-convexities that the surface plots missed.
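One standard way to obtain these extreme eigenvalues without forming the full Hessian is power iteration on Hessian-vector products; the paper does not spell out its exact procedure in the main text, so treat the following as a generic sketch (λmin can be obtained by re-running the iteration on the shifted operator H − λmax·I):

```python
import torch

def flat_hvp(loss, params, vec):
    """Hessian-vector product H @ vec via double backpropagation."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def dominant_eigenvalue(loss, params, iters=100):
    """Power iteration: eigenvalue of largest magnitude (usually lambda_max here)."""
    v = torch.randn(sum(p.numel() for p in params))
    v = v / v.norm()
    eigval = 0.0
    for _ in range(iters):
        hv = flat_hvp(loss, params, v)
        eigval = torch.dot(v, hv).item()
        v = hv / (hv.norm() + 1e-12)
    return eigval
```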

7. Visualizing Optimization Paths.

In the final Section the authors explore methods to visualize trajectories of different optimizers.

It turns out that random directions fail to capture the variation in the trajectories. This is shown in two ways:

Figure 8. Source: https://arxiv.org/pdf/1712.09913.pdf.
  1. (a) represents SGD iterations plotted on the plane, with 2 random directions.
  2. (b) represents SGD iterations plotted on a plane where one direction is random and the other points straight towards the solution.

7.1. Why Random Directions Fail: Low-Dimensional Optimization Trajectories.

Random vectors in a high-dimensional space are nearly orthogonal to one another; the expected cosine similarity between two Gaussian random vectors in n dimensions is roughly:

Cosine similarity between Gaussian random vectors in n dimensions. Source: https://arxiv.org/pdf/1712.09913.pdf.

This is a problem because the optimization trajectories lie in low-dimensional spaces.
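A quick numerical check of the paper’s estimate of roughly √(2/(πn)) (a throwaway sketch, not from the paper’s code):

```python
import numpy as np

n = 100_000                              # dimensionality
rng = np.random.default_rng(0)
a, b = rng.standard_normal(n), rng.standard_normal(n)

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(abs(cos))                          # tiny: the two vectors are nearly orthogonal
print(np.sqrt(2 / (np.pi * n)))          # ~0.0025, the expected magnitude
```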

This means that a random vector will be nearly orthogonal to the low-dimensional space that contains the optimization trajectory, so projecting the trajectory onto a random direction shows almost no variation. The authors suspect that the optimization path is low-dimensional, because a random direction captures much less variation than a vector pointing along the path. To validate this, they use PCA.

7.2. Effective Trajectory Plotting using PCA Directions.

To capture this variation in the trajectories, the authors use non-random directions chosen with a PCA-based approach, which also measures how much variation is captured.

The notation is the following:

  1. θi denotes the weights at epoch i.
  2. θn denotes the final weights after the last epoch.
  3. n denotes the number of training epochs.

After n training epochs, they apply PCA to the following matrix M:

Matrix M to which we apply PCA. Source: https://arxiv.org/pdf/1712.09913.pdf.

and finally select the two most explanatory directions.
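A minimal sketch of this step, assuming the flattened weights were saved after every epoch in a list checkpoints (so checkpoints[-1] is θn); note that the paper uses normalized PCA directions for the plots, a detail this sketch omits:

```python
import numpy as np
from sklearn.decomposition import PCA

def trajectory_pca(checkpoints):
    """Project the optimization trajectory onto its two main PCA directions.

    checkpoints: list of 1D numpy arrays, the flattened weights at each epoch.
    """
    theta_n = checkpoints[-1]
    # Matrix M: one row per epoch, each row being theta_i - theta_n.
    M = np.stack([theta_i - theta_n for theta_i in checkpoints[:-1]])
    pca = PCA(n_components=2)
    coords = pca.fit_transform(M)            # trajectory in the 2D PCA plane
    return coords, pca.explained_variance_ratio_
```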

Figure 9. Source: https://arxiv.org/pdf/1712.09913.pdf.

The above plots represent the projected learning trajectories, using normalized PCA directions, for VGG-9. The left plots use a batch size of 128, while the right ones use 8,192.

The blue dots trace the optimizer’s path, while the red dots mark epochs where the learning rate was decreased. Each axis shows how much of the variation in the descent path was captured by that PCA direction.

We can observe that the trajectories move perpendicularly to the contours of the loss surface. Stochasticity becomes more pronounced during the later stages of training, especially when weight decay and small batches are used. In that case the trajectory starts moving nearly parallel to the contours, orbiting the solution while the learning rate is high. Once the learning rate is decreased (at the red dot), the trajectory turns and drops straight into the local minimizer.

Finally, observe the descent path: most of the variation in the trajectory lies in a space of only two dimensions. In Figure 9, we can see that these paths move mostly in the direction of a nearby attractor. This is consistent with what we saw in Section 6: non-chaotic landscapes are characterized by wide, nearly convex minimizers.

8. Conclusion.

The authors presented a visualization technique that reveals the consequences of the choices made when designing and training neural networks. These visualization methods, combined with advances in theory, can help achieve faster training and better generalization.

Appendix.

In the appendix of the paper, the researchers repeat the visualizations detailed above while varying the network architecture, optimizer, and batch size, and compare the results.

Table 2. Source: https://arxiv.org/pdf/1712.09913.pdf.
