Swish Activation Function by Google

Random Nerd
3 min read · Oct 22, 2017


Swish Activation Function

The choice of activation functions in Deep Neural Networks has a significant impact on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU), which is f(x) = max(0, x). Although various alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. So the Google Brain team has proposed a new activation function, named Swish, which is simply f(x) = x · sigmoid(x). Their experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging data sets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
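For reference, here is a minimal NumPy sketch of the two formulas above; the function names and sample inputs are mine, chosen only for illustration.

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: f(x) = max(0, x)
    return np.maximum(0.0, x)

def swish(x):
    # Swish: f(x) = x * sigmoid(x), written as x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(relu(x))   # [0. 0. 0. 1. 2.]
print(swish(x))  # approx. [-0.238 -0.269  0.     0.731  1.762]
```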

[Figure: The Swish activation function]

With ReLU, the persistent problem is that its derivative is 0 for half of the values of the input x to the ramp function f(x) = max(0, x). Under a gradient-based optimizer such as Stochastic Gradient Descent, a neuron whose pre-activation stays negative receives a zero gradient, so the update step simply assigns each of its parameters back to itself (θ ← θ − α · 0 = θ) and those parameters are never updated again. This "dying ReLU" effect can leave a large fraction of neurons, figures close to 40% are often quoted, permanently dead. Various substitutes like Leaky ReLU or SELU (Self-Normalizing Neural Networks) have tried to rid ReLU of this issue without managing to replace it, but now there finally seems to be a promising alternative.
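To see the "no update" effect concretely, here is a small hand-written sketch (my own illustration, not from the paper) of a single SGD step for one ReLU neuron whose pre-activation is negative: the ReLU derivative is zero there, so every weight gradient is zero and each parameter is assigned straight back to itself.

```python
# One neuron: y = relu(w * x + b), with a squared-error loss against a target.
w, b = -1.0, -0.5
x, target = 2.0, 1.0
lr = 0.1

z = w * x + b                       # pre-activation: -2.5 (negative)
y = max(0.0, z)                     # ReLU output: 0.0
relu_grad = 1.0 if z > 0 else 0.0   # derivative of ReLU at z: 0.0
dL_dy = 2.0 * (y - target)          # gradient of (y - target)**2
dL_dw = dL_dy * relu_grad * x       # chain rule -> 0.0
dL_db = dL_dy * relu_grad           # -> 0.0

w_new = w - lr * dL_dw              # w - 0 == w: the parameter is assigned back to itself
b_new = b - lr * dL_db
print(w_new == w, b_new == b)       # True True -> this neuron never recovers
```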

[Figure: Comparison of Sigmoid, Hyperbolic Tangent, ReLU and Softplus]

Swish is a smooth, non-monotonic function that consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation. It is unbounded above and bounded below, and it is the non-monotonic attribute that actually makes the difference. With self-gating, it requires just a scalar input, whereas normal (multi-)gating would require multiple two-scalar inputs. The design was inspired by the use of the sigmoid function for gating in LSTMs (Hochreiter & Schmidhuber, 1997) and Highway networks (Srivastava et al., 2015), where "self-gated" means that the gate is simply the sigmoid of the activation itself.
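A quick numerical check (my own sketch, not from the paper) makes the "unbounded above, bounded below, non-monotonic" description concrete: on a dense grid of inputs, Swish dips slightly below zero, reaches a minimum near x ≈ -1.28, and then grows without bound.

```python
import numpy as np

def swish(x):
    # Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 12001)
y = swish(x)

print("minimum value:", y.min())                       # ~ -0.278 (bounded below)
print("minimum at x =", x[y.argmin()])                 # ~ -1.278 (the non-monotonic dip)
print("swish(6.0)   =", y[-1])                         # ~ 5.985  (keeps growing: unbounded above)
print("monotonic?   ", bool(np.all(np.diff(y) >= 0)))  # False
```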

We can train deeper Swish networks than ReLU networks when using BatchNorm (Ioffe & Szegedy, 2015), despite the gradient-squashing property of the sigmoid it contains. On the MNIST data set, when Swish and ReLU are compared, both activation functions achieve similar performance up to 40 layers. However, Swish outperforms ReLU by a large margin in the range between 40 and 50 layers, where optimization becomes difficult. In very deep networks, Swish achieves higher test accuracy than ReLU. In terms of batch size, the performance of both activation functions decreases as batch size increases, potentially due to sharp minima (Keskar et al., 2017). However, Swish outperforms ReLU at every batch size, suggesting that the performance difference between the two activation functions persists even when the batch size is varied.
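To experiment with this comparison yourself, the sketch below (a simplified setup of my own, not the paper's exact architecture or training protocol) builds a deep fully connected MNIST-style network in PyTorch with BatchNorm, where the activation can be switched between ReLU and Swish.

```python
import torch
import torch.nn as nn

def swish(x):
    # Swish: x * sigmoid(x)
    return x * torch.sigmoid(x)

class DeepMLP(nn.Module):
    """A stack of `depth` Linear + BatchNorm layers with either ReLU or Swish."""
    def __init__(self, depth=45, width=256, activation="swish"):
        super().__init__()
        self.act = swish if activation == "swish" else torch.relu
        dims = [784] + [width] * depth
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        self.norms = nn.ModuleList(nn.BatchNorm1d(d) for d in dims[1:])
        self.out = nn.Linear(width, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten 28x28 MNIST images
        for layer, norm in zip(self.layers, self.norms):
            x = self.act(norm(layer(x)))
        return self.out(x)

# Train two otherwise-identical models (e.g. with torchvision's MNIST loader)
# and compare their test accuracy as depth grows.
relu_net = DeepMLP(depth=45, activation="relu")
swish_net = DeepMLP(depth=45, activation="swish")
print(swish_net(torch.randn(8, 1, 28, 28)).shape)  # torch.Size([8, 10])
```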

Of course, the real potential can only be judged when we use it ourselves and analyze the difference. I find it simplest to use activation functions in a functional way, by defining a function that returns x * F.sigmoid(x), as sketched below. This article is just an overview for the moment. Please feel free to implement Swish your own way and do share your experience. Thank you for your time, and enjoy Machine Learning!
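For completeness, here is that one-liner in PyTorch, both as a plain function and wrapped as a module so it can be dropped into existing architectures (a minimal sketch; torch.sigmoid is used here in place of the older F.sigmoid alias).

```python
import torch
import torch.nn as nn

def swish(x):
    # Functional form: x * sigmoid(x)
    return x * torch.sigmoid(x)

class Swish(nn.Module):
    """Module wrapper, so Swish can replace nn.ReLU() inside nn.Sequential."""
    def forward(self, x):
        return swish(x)

# Example: drop-in replacement for ReLU in a small block.
block = nn.Sequential(nn.Linear(128, 128), Swish(), nn.Linear(128, 10))
print(block(torch.randn(4, 128)).shape)  # torch.Size([4, 10])
```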
