Understanding Batch Normalization part 1 (Machine Learning)

Monodeep Mukherjee
3 min read · Nov 14, 2022
1. LightNorm: Area and Energy-Efficient Batch Normalization Hardware for On-Device DNN Training (arXiv)

Authors: Seock-Hwan Noh, Junsang Park, Dahoon Park, Jahyun Koo, Jeik Choi, Jaeha Kung

Abstract: When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupies most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of these layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has decreased significantly. As a result, the proportion of execution time spent in other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce its runtime overhead. Backed by this thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. All of these approximation techniques are carefully applied not only to maintain the statistics of intermediate feature maps, but also to minimize off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for on-device training.
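Of the three approximation techniques LightNorm fuses, range batch normalization is the easiest to illustrate in isolation: instead of dividing by the per-channel standard deviation, it divides by the per-channel value range, rescaled by a constant that depends on the number of elements. The snippet below is a minimal sketch of that idea in PyTorch; the `C(n) = sqrt(2·ln n)` scaling, the NCHW layout, and the function name are assumptions made for illustration, not the paper's hardware implementation.

```python
import torch

def range_batch_norm(x, gamma, beta, eps=1e-5):
    """Sketch of range batch normalization on an NCHW tensor.

    The per-channel standard deviation is replaced by the per-channel
    value range scaled by sqrt(2 * ln(n)); details are assumptions.
    """
    n = x.shape[0] * x.shape[2] * x.shape[3]            # elements per channel
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    centered = x - mean
    value_range = (centered.amax(dim=(0, 2, 3), keepdim=True)
                   - centered.amin(dim=(0, 2, 3), keepdim=True))
    scale = value_range / (2.0 * torch.log(torch.tensor(float(n)))).sqrt()
    x_hat = centered / (scale + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```

In an actual LightNorm pipeline this normalization would additionally be combined with low bit-precision and block-floating-point arithmetic in hardware; here everything stays in full precision just to show the normalization math.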

2. An Adaptive Batch Normalization in Deep Learning (arXiv)

Authors: Wael Alsobhi, Tarik Alafif, Alaa Abdel-Hakim, Weiwei Zong

Abstract: Batch Normalization (BN) is a way to accelerate and stabilize training in deep convolutional neural networks. However, BN operates continuously within the network structure, even though some training data may not always require it. In this research work, we propose a threshold-based adaptive BN approach that separates the data that requires BN from the data that does not. The experimental evaluation demonstrates that the proposed approach achieves better performance than traditional BN, mostly at small batch sizes, on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. It also reduces the occurrence of internal variable transformation, increasing network stability.
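The core idea, deciding per batch whether normalization is worth applying, can be sketched as a thin wrapper around a standard BN layer. The criterion used below (mean absolute deviation of the per-channel batch means, compared against an arbitrary threshold) is purely illustrative and is not necessarily the paper's actual rule.

```python
import torch
import torch.nn as nn

class ThresholdAdaptiveBN(nn.Module):
    """Sketch of a threshold-based adaptive BN: normalize a batch only when
    its statistics deviate enough to warrant it. The deviation measure and
    threshold are illustrative assumptions, not the paper's exact criterion."""

    def __init__(self, num_features, threshold=0.1):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features)
        self.threshold = threshold

    def forward(self, x):
        # Mean absolute deviation of per-channel batch means from zero,
        # used here as a simple proxy for "does this batch need BN?".
        batch_mean = x.mean(dim=(0, 2, 3))
        if batch_mean.abs().mean() > self.threshold:
            return self.bn(x)   # batch deviates noticeably: apply BN
        return x                # batch already well-behaved: skip BN
```

The wrapper can drop in wherever `nn.BatchNorm2d` would normally sit; the interesting design question is the threshold, which trades normalization cost against stability.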

3. Understanding the Failure of Batch Normalization for Transformers in NLP (arXiv)

Authors: Jiaxi Wang, Ji Wu, Lei Huang

Abstract: Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we try to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between training and inference of BN is the leading cause of its failure in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and reveal that TID can indicate BN's performance, supported by extensive experiments including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID stays small throughout training. To suppress the explosion of TID, we propose Regularized BN (RBN), which adds a simple regularization term to narrow the gap between the batch statistics and the population statistics of BN. RBN improves the performance of BN consistently and outperforms or is on par with LN on 17 out of 20 settings, involving ten datasets and two common variants of Transformer. Our code is available at https://github.com/wjxts/RegularizedBN
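The regularization idea is straightforward to sketch: during training, measure how far the current batch statistics are from the running (population) statistics and add that gap, suitably weighted, to the training loss. The class below is a rough sketch built on `nn.BatchNorm1d` with an assumed squared-error penalty and weight; the authors' actual formulation is in the linked repository.

```python
import torch
import torch.nn as nn

class RegularizedBNSketch(nn.BatchNorm1d):
    """Sketch of Regularized BN: standard BN plus a penalty that narrows the
    gap between batch statistics and running (population) statistics.
    The penalty form and weighting are assumptions, not the paper's exact loss."""

    def __init__(self, num_features, reg_weight=0.1, **kwargs):
        super().__init__(num_features, **kwargs)
        self.reg_weight = reg_weight
        self.reg_loss = torch.tensor(0.0)

    def forward(self, x):
        # Assumes input of shape (N, C).
        if self.training:
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Penalize the training/inference discrepancy between batch and
            # population statistics; add self.reg_loss to the task loss.
            self.reg_loss = self.reg_weight * (
                (batch_mean - self.running_mean).pow(2).mean()
                + (batch_var - self.running_var).pow(2).mean()
            )
        return super().forward(x)
```

In a training loop one would add `module.reg_loss` to the task loss before calling `backward()`, so the network is nudged toward batch statistics that stay close to the statistics it will use at inference time.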


Monodeep Mukherjee

Universe Enthusiast. Writes about Computer Science, AI, Physics, Neuroscience and Technology, Front-End and Back-End Development.