An intro to self-normalising neural networks (SNN)
making vanilla nets cool again
I really like this paper by Klambauer et al. and wrote a summary of the theory. It reminds me of my studies in analysis in those heady days before I got into data science. I’ve also posted a very hacky gist to get a feel for SELUs (it’s embedded below). All errors are my own and comments/corrections are gratefully received.
Why should I care about this?
Most recent advances in deep learning have come from CNNs and RNNs. This is mostly because deep networks are better at learning complex models, and the training of deep CNNs/RNNs can be made robust.
State-of-the-art performance rarely comes from a ‘vanilla’ (fully connected) neural net because training by SGD becomes unstable after more than a few layers. Using SNNs, the authors train very deep vanilla nets to state-of-the-art performance on the 121 tasks in the UCI repository, drug discovery and an astronomy task. For most non-perception tasks, practitioners still mostly use approaches not based on neural nets. The SNN may shift this.
This field is largely empirical — mathematical proofs are a rarity and should be appreciated, encouraged and examined for the insight they provide into the mechanics of learning.
Normalization most often means transforming inputs to zero-mean and unit variance. This is often done as a pre-processing step. It speeds up learning and improves accuracy. Why?
- Normalization makes the values of different features comparable
- During training, the weights and parameters adjust those values
- This can mess up the scaling again, despite the pre-processing, which can cause the gradients to get out of control. This hurts learning. So normalisation needs to be applied during training as well
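As a concrete sketch of the pre-processing step (toy data and variable names are my own):

```python
import numpy as np

# Standardise each feature to zero mean and unit variance --
# the usual normalisation pre-processing step described above.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy data on an arbitrary scale

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # each feature now has mean ~0
print(X_norm.std(axis=0))   # and standard deviation ~1
```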
Many other normalization methods exist: batch normalization, layer normalization, weight normalization etc., but SGD and dropout perturb these kinds of normalisation (and they can be tricky to code), leading to high variance in training error. CNNs and RNNs get around this by sharing weights (though RNNs are still subject to exploding/vanishing gradients). The effect gets worse with depth, so deep vanilla networks tend to suck.
The key idea is to prove that there exists a fixed point for the mapping g that takes the mean and variance (μ, ν) of the activations in one layer to the mean and variance of the activations in the next.
The mathematical tool for this is the Banach fixed point theorem. Using the theorem requires proving that there is a domain Ω of (μ, ν) values on which the mapping g is a contraction and never maps to values outside of Ω. This latter part is usually the painful thing to prove. The authors have a great 90-page appendix, featuring a computationally assisted proof of this. Heroes.
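For intuition, here’s a toy illustration of the theorem (nothing to do with the paper’s specific map g): iterating any contraction converges to its unique fixed point.

```python
import math

# g(x) = cos(x) is a contraction on [0, 1] (its derivative is bounded
# below 1 in absolute value there), so by the Banach fixed point theorem
# iterating it from any starting point in the domain converges to the
# unique fixed point x* = cos(x*).
x = 0.5
for _ in range(100):
    x = math.cos(x)

print(x)  # converges to ~0.739085
```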
The authors introduce a new activation function and a new kind of dropout to make this fixed point mechanism work for them.
The SELU (scaled exponential linear unit) activation function is defined as

SELU(x) = λx if x > 0, and λα(exp(x) − 1) if x ≤ 0

Here α and λ are solved for in the equations resulting from finding a fixed point (μ, ν) = g(μ, ν). The SELU looks like this:
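Here’s a minimal NumPy sketch of the SELU and the self-normalising effect. The constants are the fixed-point solutions from the paper; the layer size, depth and sample size are my own arbitrary choices.

```python
import numpy as np

# Fixed-point constants alpha and lambda from the paper,
# chosen so that (mu, nu) = (0, 1) is a fixed point of g.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for x > 0, lambda * alpha * (exp(x) - 1) for x <= 0
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

# Push standard-normal data through several random dense layers.
# With weights drawn from N(0, 1/n) -- the initialisation the paper
# assumes -- the activation statistics stay close to mean 0, variance 1.
rng = np.random.default_rng(42)
n = 256
x = rng.normal(size=(10000, n))
for layer in range(10):
    W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))
    x = selu(x @ W)
    print(layer, round(float(x.mean()), 3), round(float(x.var()), 3))
```

The printed statistics hover near (0, 1) rather than drifting off layer by layer, which is the whole point.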
Normal dropout (randomly setting activations to 0 with some probability) would ruin the desired mean and variance, so it needs to be amended. Dropout works well for ReLUs because 0 lies in the low-variance region of that activation. For the SELU, the low-variance limit is SELU(x) → −λα =: α’ as x tends to −∞, so α-dropout randomly sets inputs to α’ instead of 0. Setting inputs to α’ shifts the mean and variance, so an affine transformation is then applied to restore them. This adds 2 more parameters, which are solved for the specific desired fixed point (μ, ν) = (0, 1).
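A hedged sketch of that α-dropout mechanism (the naming and structure are my own, not the paper’s reference code; the formulas for a and b follow from requiring mean 0 and variance 1 after dropping):

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA  # the low-variance limit of the SELU

def alpha_dropout(x, rate, rng):
    # Assumes x already has mean 0 and variance 1.
    q = 1.0 - rate                       # keep probability
    mask = rng.random(x.shape) < q
    y = np.where(mask, x, ALPHA_PRIME)   # drop to alpha', not to 0
    # Affine transform a*y + b restores mean 0 and variance 1:
    # Var[y] = q + alpha'^2 * q * (1 - q), E[y] = alpha' * (1 - q).
    a = (q + ALPHA_PRIME**2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * ALPHA_PRIME
    return a * y + b

rng = np.random.default_rng(0)
x = rng.normal(size=(100000,))
y = alpha_dropout(x, rate=0.1, rng=rng)
print(round(float(y.mean()), 3), round(float(y.var()), 3))  # stays near (0, 1)
```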
Constructing neural nets this way ensures that the distribution of neuron activations remains stable. That is, the mean and variance of the data flowing through each layer remain near 0 and 1 respectively.
Theorem 1 — Under some conditions on the weights, the map g has a stable and attracting fixed point, meaning that SNNs really are self-normalising.
The intuition here is that high variance in one layer is mapped to low variance in the next layer and vice versa. This works because the SELU decreases the variance for negative inputs and increases the variance for positive inputs. The decrease effect is stronger for very negative inputs, and the increase effect is stronger for near-zero values.
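You can check this damping effect numerically. A minimal sketch (the sample sizes and input variances are my own choices) showing that the SELU shrinks large input variances and inflates small ones:

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

# Feed zero-mean inputs of different variances through the SELU and
# measure the output variance: large variances are pulled down, small
# variances are pushed up, and variance near 1 roughly stays put.
rng = np.random.default_rng(1)
results = {}
for var_in in (0.25, 1.0, 4.0):
    x = rng.normal(scale=np.sqrt(var_in), size=1_000_000)
    results[var_in] = float(selu(x).var())
    print(var_in, "->", round(results[var_in], 3))
```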
Theorem 2 — the mapping of variance is bounded from above, so gradients cannot explode
Theorem 3 — the mapping of variance is bounded from below, so gradients cannot vanish
Lo and behold, it works: just look how smooth this training curve is! Check out the code below for more empirical evidence that things work out nicely.
Vanilla nets are cool again. Deeper networks competitive with more complex architectures can be trained using SNNs.
There is a TensorFlow implementation of SELU, so expect to see loads of SNNs coming up.