How does Batch Normalization REALLY Work? (It's not about Internal Covariate Shift)

Guan

Published in

工人智慧

11 min read · Nov 16, 2020

Since Batch Normalization was introduced in 2015, it has been everywhere in Deep Learning. Variants such as Group Normalization have appeared since, but BN remains central, because experiments show that it can

Speed up training, so that the Neural Network converges faster
Provide Regularization as a side effect, so that the Neural Network generalizes better

On the first point, Ian Goodfellow has even said

Before BN, we thought that it was almost impossible to efficiently train deep models using sigmoid in the hidden layers. We considered several approaches to tackle training instability, such as looking for better initialization methods. Those pieces of solution were heavily heuristic, and way too fragile to be satisfactory. Batch Normalization makes those unstable networks trainable ; that’s what this example shows.(source)

However, how Batch Normalization turns unstable training into stable training remains a mystery.

Internal Covariate Shift

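Of course, we all know the official explanation: BN effectively reduces “Internal Covariate Shift”, and the deeper the network, the more pronounced the butterfly effect caused by this kind of perturbation.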

That sounds reasonable, but it leaves several questions open:

1. What exactly is Internal Covariate Shift?

The original paper describes ICS as follows:

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift.

But what kind of “change” counts as ICS? The paper also gives the following figure:

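(b) and (c) feed the layer input through a sigmoid and plot the 15th, 50th, and 85th percentiles of the result against the training step. (b) is visibly less stable, i.e. it shows larger ICS, while (c), with BN added, shows smaller ICS.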
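To make the measurement concrete, here is a minimal sketch of tracking those three percentiles over training steps. The drifting Gaussian layer input is a hypothetical stand-in for a real network, and the names `activation_quantiles` and `history` are made up for illustration; this is not the original paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def activation_quantiles(layer_input, percentiles=(15, 50, 85)):
    """Return the requested percentiles of sigmoid(layer_input)."""
    return np.percentile(sigmoid(layer_input), percentiles)

# Hypothetical stand-in for a training run: the layer input's mean drifts
# a little at every step, as it might when earlier layers are updated.
rng = np.random.default_rng(0)
history = []
for step in range(5):
    drift = 0.5 * step  # assumed drift, for illustration only
    layer_input = rng.normal(loc=drift, scale=1.0, size=10_000)
    history.append(activation_quantiles(layer_input))

history = np.stack(history)  # shape: (steps, 3), one column per percentile
```

Plotting each column of `history` against the step index would give three curves of the kind described above. Note that three percentiles summarize the distribution only coarsely, which is exactly the concern raised next.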

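In fact, this definition is still vague. Why only the 15th, 50th, and 85th percentiles? It is intuitive that different datasets will have different quantile distributions, so is this way of quantifying ICS still meaningful?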

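Moreover, the figure above does show a clear difference, but when comparing two distributions that are both smooth, how do we judge which one carries more ICS?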

2. Can Batch Normalization reduce Internal Covariate Shift?

Continuing from above, the most widely known definition of ICS is

Informally, ICS refers to the change in the distribution of layer inputs caused by updates to the preceding layers.
- How Does Batch Normalization Help Optimization? (2018, Shibani Santurkar et al.)

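Yet Fig. 1 of the original paper does not design its experiment around this definition. It only looks at the 15th, 50th, and 85th percentile values within a layer, which is a long way from that layer's “distribution”.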

We need to derive a formal definition of ICS from the informal one, and then check whether BN really reduces ICS.

3. Smaller ICS = more stable training?

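Set aside the informal definition of ICS discussed above, and even the question of whether BN really reduces ICS: the logic of the experiment itself is questionable. BN makes training more stable, and BN gives the model smaller ICS, but two effects sharing a cause does not mean that one causes the other. It does not follow that smaller ICS leads to more stable training.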

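We could design either of two experiments:
(i) Without BN, intervene on ICS directly, to establish a direct causal link between ICS and stable training.
or
(ii) With BN, intervene on ICS directly, to show that training remains stable even with larger ICS (Note 1).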

This would let us show either that ICS really is the source of unstable training, or that ICS has nothing to do with unstable training.

How Does Batch Normalization Help Optimization?

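There is a lot to untangle about how BN and ICS relate to training. Besides raising and resolving the three questions above, this paper also proposes a different hypothesis for why BN works.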

1. Does BatchNorm’s performance stem from controlling internal covariate shift?

The paper sets up an experiment on CIFAR-10 comparing

a. Standard VGG without BN
b. Standard VGG with BN
c. Standard VGG with BN, plus a layer of Random Noise after every BN layer, producing Internal Covariate Shift that cannot be learned away
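Variant (c) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the noise is assumed to be additive Gaussian, its mean and standard deviation are re-drawn on every call from ranges picked arbitrarily here, and `batch_norm` omits the learned scale and shift. The function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (no learned scale/shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def noisy_batch_norm(x, rng, eps=1e-5):
    """BN followed by i.i.d. noise whose distribution changes on every call."""
    normalized = batch_norm(x, eps)
    # Assumed noise parameters (non-zero mean, non-unit variance),
    # re-drawn per call so the network cannot learn to cancel them.
    noise_mean = rng.uniform(-1.0, 1.0)
    noise_std = rng.uniform(0.5, 2.0)
    noise = rng.normal(noise_mean, noise_std, size=x.shape)
    return normalized + noise

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(256, 4))  # a batch of 256 samples, 4 features
clean = batch_norm(x)
noisy_step1 = noisy_batch_norm(x, rng)   # "training step" 1
noisy_step2 = noisy_batch_norm(x, rng)   # "training step" 2, different noise
```

Even though `x` is identical in both calls, `noisy_step1` and `noisy_step2` follow different distributions, so the next layer sees its input distribution shift from step to step regardless of what the preceding layers do.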

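As shown, Standard + BN and Standard + Noisy BN are almost indistinguishable in training performance in the left panel. In the right panel, however, especially at Layer #13, the distribution of activations differs markedly from one step to the next; that is, a large ICS is present.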

The appendix provides another figure, which shows that ICS does not lead to unstable training

So the long-held belief that reducing Internal Covariate Shift is what stabilizes training is questionable.

2. Is BatchNorm reducing internal covariate shift?

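We redefine Internal Covariate Shift, rewriting it slightly: instead of the change in the distribution of the inputs, we focus on the Gradient. After all, what we ultimately care about is training performance, and training here means gradient descent-based training algorithms.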

Is there a broader notion of internal covariate shift that has such a direct link to training performance? And if so, does BatchNorm indeed reduce this notion? […] To answer this question, we consider a broader notion of internal covariate shift that is more tied to the underlying optimization task. (After all the success of BatchNorm is largely of an optimization nature.)

Formal Definition of Internal Covariate Shift

Define ICS as the difference between the Gradient of the i-th Layer before and after the weights of the layers preceding the i-th Layer are updated.
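Written out, the definition looks roughly like this. The block follows the paper's definition as best I can reproduce it, with k layers, weights W_1 … W_k, loss 𝓛, and the training batch (x, y) at step t; the shorthand ICS_{t,i} is my own label, not the paper's.

```latex
\begin{aligned}
G_{t,i}  &= \nabla_{W_i^{(t)}} \mathcal{L}\bigl(W_1^{(t)}, \ldots, W_k^{(t)};\; x^{(t)}, y^{(t)}\bigr) \\
G'_{t,i} &= \nabla_{W_i^{(t)}} \mathcal{L}\bigl(W_1^{(t+1)}, \ldots, W_{i-1}^{(t+1)},\, W_i^{(t)},\, W_{i+1}^{(t)}, \ldots, W_k^{(t)};\; x^{(t)}, y^{(t)}\bigr) \\
\mathrm{ICS}_{t,i} &= \bigl\lVert G_{t,i} - G'_{t,i} \bigr\rVert_2
\end{aligned}
```

Both gradients are taken with respect to the same weights W_i and on the same batch. The only difference is whether the layers before layer i have already taken their update.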

The difference between G and G′ thus reflects the change in the optimization landscape of Wi caused by the changes to its input. It thus captures precisely the effect of cross-layer dependencies that could be problematic for training.
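As a toy illustration of this gradient-based ICS, the sketch below computes G and G′ for the second layer of a two-layer linear network with a squared-error loss. The network, the data, and all function names are assumptions chosen to keep the gradients easy to write by hand; the paper measures this quantity on VGG, not on a model like this.

```python
import numpy as np

def grad_w1(x, y, w1, w2):
    """Gradient of L = 1/(2n) * sum ||(x @ w1) @ w2 - y||^2 w.r.t. w1."""
    h = x @ w1
    residual = (h @ w2 - y) / x.shape[0]
    return x.T @ (residual @ w2.T)

def grad_w2(x, y, w1, w2):
    """Gradient of the same loss w.r.t. w2."""
    h = x @ w1
    residual = (h @ w2 - y) / x.shape[0]
    return h.T @ residual

def ics_layer2(x, y, w1, w2, lr):
    """||G - G'||_2 for layer 2: G uses the current w1, G' the updated w1."""
    g = grad_w2(x, y, w1, w2)
    w1_updated = w1 - lr * grad_w1(x, y, w1, w2)  # preceding layer takes its step
    g_prime = grad_w2(x, y, w1_updated, w2)       # w2 itself is left unchanged
    return np.linalg.norm(g - g_prime)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
y = rng.normal(size=(64, 2))
w1 = rng.normal(size=(5, 3))
w2 = rng.normal(size=(3, 2))

ics_no_update = ics_layer2(x, y, w1, w2, lr=0.0)
ics_with_update = ics_layer2(x, y, w1, w2, lr=0.1)
```

With `lr=0.0` the preceding layer does not move, so G and G′ coincide and the ICS is zero. Any non-zero step on `w1` changes the input that layer 2 sees, and with it the gradient of `w2`.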

Using this definition, we compare Standard VGG and Standard VGG with BN, and obtain

The figure above shows that BN still has no effect under the newly defined ICS: not only does it fail to reduce ICS, it can even increase it.

With this, the Internal Covariate Shift hypothesis has been refuted on every count.

3. Why does BatchNorm work?

If so, why can BN stabilize training?

The paper proposes another hypothesis: The smoothing effect of BatchNorm

Indeed, we identify the key impact that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make its landscape significantly more smooth.

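In the figure above, (a) is the gradient of the loss at the i-th step during training, and (b) is the difference of the gradient of the loss between 2 steps. Training with BN has a very stable gradient at every step. In other words, the weight updates are very stable and do not change abruptly. Intuitively, this means we have a smoother loss landscape.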
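The link between a stable gradient and a smooth landscape can be seen on a toy problem. The sketch below is an illustration I am adding, not the paper's experiment: it takes one gradient descent step on a quadratic loss 0.5 · wᵀHw and measures how far the gradient moves, for a sharply curved H and a gently curved one. The matrices and the learning rate are arbitrary choices.

```python
import numpy as np

def gradient_change(hessian, w, lr):
    """l2 distance between the gradients before and after one gradient
    descent step on the quadratic loss 0.5 * w^T H w."""
    grad_before = hessian @ w
    w_next = w - lr * grad_before
    grad_after = hessian @ w_next
    return np.linalg.norm(grad_after - grad_before)

w0 = np.array([1.0, 1.0])
sharp = np.diag([10.0, 1.0])   # strongly curved along the first axis
smooth = np.diag([1.0, 1.0])   # gently curved everywhere

change_sharp = gradient_change(sharp, w0, lr=0.05)
change_smooth = gradient_change(smooth, w0, lr=0.05)
```

From the same starting point and with the same learning rate, the gradient on the gently curved loss barely moves, while on the sharply curved one it changes by roughly two orders of magnitude more. A gradient that stays put between steps is one that remains a reliable guide for the next update.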

As for (c), the paper proposes two metrics to quantify how smooth the loss function is.

a. Lipschitzness
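The better a function's Lipschitzness (that is, the smaller its Lipschitz constant), the smaller the range within which it can vary; see the figure below: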
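Formally, a function f is L-Lipschitz if |f(x₁) − f(x₂)| ≤ L · ‖x₁ − x₂‖ for all x₁ and x₂, so a smaller L means tighter control over how fast f can change. The sketch below estimates L empirically for a one-dimensional function by sampling random pairs of points; the helper `empirical_lipschitz` is a hypothetical name, and sampling can only give a lower bound on the true constant.

```python
import numpy as np

def empirical_lipschitz(f, low, high, num_pairs=10_000, seed=0):
    """Largest observed |f(a) - f(b)| / |a - b| over random pairs in [low, high].
    This is a lower bound on the true Lipschitz constant on that interval."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(low, high, size=num_pairs)
    b = rng.uniform(low, high, size=num_pairs)
    mask = a != b  # avoid division by zero for identical samples
    slopes = np.abs(f(a[mask]) - f(b[mask])) / np.abs(a[mask] - b[mask])
    return slopes.max()

lip_sin = empirical_lipschitz(np.sin, -3.0, 3.0)
lip_steep = empirical_lipschitz(lambda x: 5.0 * np.sin(x), -3.0, 3.0)
```

Since sin has Lipschitz constant 1, `lip_sin` stays at or just below 1, while scaling the function by 5 scales the estimate by the same factor. In the setting of this article, f is the loss as a function of the weights, and a smaller constant means the loss cannot jump abruptly when the weights move a little.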