Updates on Sharpness-Aware Minimization, part 3 (Machine Learning 2023)

Monodeep Mukherjee
2 min read · Nov 24, 2023
1. Sharpness-Aware Minimization and the Edge of Stability (arXiv)

Authors: Philip M. Long, Peter L. Bartlett

Abstract: Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size η, the operator norm of the Hessian of the loss grows until it approximately reaches 2/η, after which it fluctuates around this value. The quantity 2/η has been called the “edge of stability” based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an “edge of stability” for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
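The gradient-norm dependence comes from SAM's two-step update: ascend to a nearby worst-case point within an L2 ball of radius ρ, then take a GD step from the original weights using the gradient computed at the perturbed point. The sketch below is a minimal PyTorch illustration of one such step, assuming hypothetical `model`, `loss_fn`, `inputs`, and `targets` objects and illustrative values of ρ and η; it is not the authors' implementation.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, rho=0.05, eta=0.1):
    """One SAM update: ascend within an L2 ball of radius rho, then descend.
    A minimal sketch; `model`, `loss_fn`, `rho`, `eta` are illustrative."""
    # Gradient at the current weights w.
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()

    params = [p for p in model.parameters() if p.grad is not None]
    with torch.no_grad():
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                      # w -> w + eps (worst-case direction)

    # Gradient at the perturbed weights, used to update the original weights.
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                      # undo the perturbation
            p.sub_(eta * p.grad)           # plain GD step with the SAM gradient
    model.zero_grad()
```

The paper's observation is that, when training proceeds with an update of this form, the operator norm of the loss Hessian settles near a SAM-specific threshold that, unlike the fixed 2/η value for plain GD, varies with the norm of the gradient.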

2. Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer (arXiv)

Authors: Peng Mi, Li Shen, Tianhe Ren, Yiyi Zhou, Tianshuo Xu, Xiaoshuai Sun, Tongliang Liu, Rongrong Ji, Dacheng Tao

Abstract: Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes. Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when adding a perturbation to the weight. However, indiscriminate perturbation of SAM on all parameters is suboptimal and results in excessive computation, double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation by a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. We investigate the impact of different masks, including unstructured, structured, and N:M structured patterns, as well as explicit and implicit forms of implementing sparse perturbation. We theoretically prove that SSAM can converge at the same rate as SAM, i.e., O(log T/√T). Sparse SAM has the potential to accelerate training and smooth the loss landscape effectively. Extensive experimental results on CIFAR and ImageNet-1K confirm that our method is superior to SAM in terms of efficiency, and the performance is preserved or even improved with a perturbation of merely 50% sparsity. Code is available at https://github.com/Mi-Peng/Systematic-Investigation-of-Sparse-Perturbed-Sharpness-Aware-Minimization-Optimizer
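The core idea is easy to see in code: only a masked subset of the weights receives the SAM perturbation, so at 50% sparsity the ascent step touches roughly half of the parameters. Below is a hedged PyTorch sketch, assuming gradients have already been computed by a backward pass; the top-k squared-gradient score used here is only an illustrative stand-in for the paper's Fisher-information and dynamic-sparse-training mask constructions, and all names (`make_mask`, `sparse_perturb`, `sparsity`, `rho`) are hypothetical.

```python
import torch

@torch.no_grad()
def make_mask(model, sparsity=0.5):
    """Keep the (1 - sparsity) fraction of weights with the largest grad**2
    (a rough Fisher-style score; illustrative only). Call after backward()."""
    params = [p for p in model.parameters() if p.grad is not None]
    scores = torch.cat([(p.grad ** 2).flatten() for p in params])
    k = max(1, int((1.0 - sparsity) * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return [(p.grad ** 2 >= threshold).float() for p in params]

@torch.no_grad()
def sparse_perturb(model, masks, rho=0.05):
    """SAM ascent step restricted to coordinates where the binary mask is 1."""
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
    eps = []
    for p, m in zip(params, masks):
        e = m * rho * p.grad / (grad_norm + 1e-12)   # zero where mask == 0
        p.add_(e)
        eps.append(e)
    return eps  # caller undoes with p.sub_(e) after the second backward pass
```

The paper also studies structured and N:M structured masks, which would replace the unstructured top-k selection sketched here; the repository linked in the abstract contains the authors' actual implementation.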

