Swish: A Self-Gated Activation Function

Google Brain recently proposed a new activation function called Swish. Simply replacing ReLU with Swish yields a remarkable performance increase: top-1 classification accuracy on ImageNet improves by 0.6% for Inception-ResNet-v2 and by 0.9% for Mobile NASNet-A.

Aakash Bindal
Techspace
4 min read · Aug 23, 2019

Before diving into the mathematical details and results, let's first understand what an activation function is and why we use it.

At the heart of every deep network lies a linear transformation followed by an activation function such as ReLU or sigmoid. We need an activation function because we want our neural networks to learn the complex, non-linear functions they are built to approximate. Multiplying inputs by weights and adding biases only gives us a linear model, and a purely linear model cannot represent anything complex: stacking linear layers just collapses into another linear layer. Since we want our neural networks to be able to mimic any complex function, we need some form of non-linearity in the model.
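
A quick way to convince yourself of this is to stack two linear layers with no activation in between and check that the result is still one linear map. This is only an illustrative sketch with made-up shapes, not code from the original article:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                       # a small batch of 4 inputs with 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers collapse into a single linear layer...
two_linear = (x @ W1 + b1) @ W2 + b2
collapsed = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear, collapsed))         # True: still linear in x

# ...but inserting a non-linearity between them breaks that collapse.
relu = lambda z: np.maximum(0.0, z)
nonlinear = relu(x @ W1 + b1) @ W2 + b2
```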

Now that we have covered the basics, let's dig deeper into Swish.

Swish is defined as:

f(x) = x · σ(x)

where σ(x) is the sigmoid function, σ(x) = 1/(1 + exp(−x)).

You might be asking: why would I need another activation function if I already have a good one in the form of ReLU?

I know you love ReLU, but the problem with ReLU is that its derivative is 0 for negative values of x, so the corresponding neurons stop receiving updates, which is obviously bad. We use gradient descent precisely to update the weights, but if the gradient is 0 no update takes place. That is why Leaky ReLU came into existence, but it does not help to the extent we hoped.

ReLU is defined as: f(x) = max(0, x)

Like ReLU, Swish is bounded below and unbounded above. But unlike ReLU, Swish is a smooth, non-monotonic function that does not map every negative value to 0. Its success suggests that the gradient-preserving property of ReLU (a gradient of exactly 1 when x > 0) is not as important as we initially thought, even though that property was a big reason we adopted ReLU in the first place.
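
To make the gradient difference concrete, here is a small TensorFlow sketch (not from the original article; it assumes TensorFlow 2.x with eager execution) comparing the two activations at a few negative and positive points:

```python
import tensorflow as tf

x = tf.constant([-3.0, -1.0, -0.1, 0.5, 2.0])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    relu_out = tf.nn.relu(x)
    swish_out = tf.nn.swish(x)   # x * sigmoid(x)

# ReLU's gradient is exactly 0 for every negative input, so those units get no update.
print(tape.gradient(relu_out, x).numpy())
# Swish's gradient is small but non-zero for negative inputs, so learning can continue.
print(tape.gradient(swish_out, x).numpy())
```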

The derivative of Swish works out to:

f′(x) = σ(x) + x · σ(x)(1 − σ(x)) = f(x) + σ(x)(1 − f(x))
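
As a sanity check, the closed-form derivative above can be compared against a numerical finite-difference estimate. This is just a verification sketch, not part of the original article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # f'(x) = f(x) + sigmoid(x) * (1 - f(x))
    return swish(x) + sigmoid(x) * (1.0 - swish(x))

# Compare the closed-form derivative with a central finite difference.
xs = np.linspace(-6.0, 6.0, 25)
eps = 1e-5
numeric = (swish(xs + eps) - swish(xs - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - swish_grad(xs))))   # tiny, on the order of 1e-9 or smaller
```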

Remember that I wrote "self-gated" in the title of this story. Let's talk about what that means at a basic level.

Self-gating is a technique inspired by the use of the sigmoid function as a gate in LSTMs and Highway Networks. The advantage of self-gating is that it requires only a single input, whereas normal gates require multiple scalar inputs. Because of this, Swish can slot in as a drop-in replacement for ReLU, which also takes a single scalar input. A rough sketch of the contrast follows.
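
The sketch below contrasts the two gating styles; the weight names and shapes are made up purely for illustration and are not from any specific LSTM implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, h = rng.normal(size=4), rng.normal(size=4)            # hypothetical input and hidden state
W_g, U_g = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
candidate = rng.normal(size=4)

# LSTM/Highway-style gate: the sigmoid is computed from a learned projection
# of *several* inputs (x and h), and then multiplies a separate candidate value.
gate = sigmoid(W_g @ x + U_g @ h)
gated_value = gate * candidate

# Self-gate (Swish): the sigmoid of the pre-activation gates that same
# pre-activation, so only a single scalar input is needed.
self_gated = x * sigmoid(x)
```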

Properties of Swish:

  1. Unbounded above: Unlike the sigmoid and tanh functions, Swish is unbounded above. This helps avoid saturation: when an activation flattens out, its gradient approaches 0 and training slows down.
  2. Smoothness of the curve: Smoothness plays an important role in optimization and generalization. Unlike ReLU, Swish is a smooth function, which makes it less sensitive to weight initialization and to the learning rate.
  3. Bounded below: Like most activation functions out there, Swish is also bounded below, which helps with regularization. Unlike ReLU and softplus, Swish produces negative outputs for small negative inputs thanks to its non-monotonicity. This non-monotonicity increases expressivity and improves gradient flow, which matters because many pre-activations fall into that range. A short numeric check of these properties follows this list.
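
The following snippet checks these properties numerically; it is an illustrative sketch rather than anything from the original article:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

xs = np.linspace(-10.0, 10.0, 100001)
ys = swish(xs)

print(ys.max())                       # grows with x: unbounded above (close to 10 here)
print(ys.min(), xs[ys.argmin()])      # bounded below: minimum around -0.28 near x ≈ -1.28
print(swish(np.array([-0.5, -1.0, -2.0])))   # small negative outputs for small negative inputs
```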

Implementation:

We can easily implement the Swish function with just one line of code in TensorFlow.
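
The article's original code screenshot is not reproduced here, but a minimal sketch of what that one-liner can look like in TensorFlow 2.x is shown below, both via the built-in op and written out by hand:

```python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.0, 3.0])

# One-liner using the built-in op:
y = tf.nn.swish(x)

# Or written out explicitly, matching the definition f(x) = x * sigmoid(x):
y_manual = x * tf.sigmoid(x)

print(y.numpy())
print(y_manual.numpy())
```

In recent TensorFlow/Keras versions the same activation can also be selected by name inside a layer (for example, activation="swish" in a Dense layer), though exact availability depends on the version.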
