Swish Function: Experimentation on 6 Models

Sumeet Badgujar
Published in Analytics Vidhya · 3 min read · Jul 29, 2021

Recently I used EfficientNet for one of my projects, and the prediction accuracy was much better than that of the previously trained models. I knew that was going to happen; since EfficientNet is the latest architecture, it ought to be better. But why was it better? This question intrigued me a lot.

So I said to myself: time to look underneath. And lo and behold, underneath the new architecture, which includes compound scaling and more, there was one new thing: the Swish function.

A new activation function. It is represented as x*sigmoid(x).

Swish Function

A more general way of representing it is x*sigmoid(b*x), where b is a trainable parameter. What's good about the new function?

  • Smooth (differentiable everywhere), unlike ReLU, which is piecewise linear with a kink at zero.
  • Helps avoid the vanishing gradient problem.
  • A trainable parameter (b) to better tune the gradient flow.
  • Solves the dying ReLU problem, since small negative inputs still produce a non-zero output and gradient.
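As a quick illustration, here is a minimal sketch of the x*sigmoid(b*x) form as a custom Keras layer with a trainable b (the framework is my assumption; the article does not name one, and any framework with automatic differentiation works the same way):

import tensorflow as tf

class Swish(tf.keras.layers.Layer):
    """Swish activation: x * sigmoid(beta * x), with beta learned during training."""

    def build(self, input_shape):
        # One trainable scalar beta, initialised to 1.0 (i.e. plain x * sigmoid(x)).
        self.beta = self.add_weight(name="beta", shape=(),
                                    initializer="ones", trainable=True)

    def call(self, inputs):
        return inputs * tf.sigmoid(self.beta * inputs)

Recent TensorFlow versions also ship a fixed-b variant as tf.keras.activations.swish, usable by passing activation="swish" to most layers.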

Seeing the use of the Swish function in the EfficientNet paper, an idea occurred to me.

What would happen if ReLU was replaced by Swish?

Let the experiments begin. No chemicals were used in the experiments and no animals were harmed.

Photo by Bermix Studio on Unsplash

For the data, I chose Caltech-UCSD Birds-200-2011.

  • Number of categories: 200
  • Number of images: 11,788
  • Split: 70/20/10

As for the models, I selected DenseNet 121, MobileNet, MobileNetV2, ShuffleNet, SqueezeNet and ResNet50.

As for the training parameters, they were kept constant across all runs:

  • Optimizer: Adam
  • Batch size: 32

Each model was trained for 25 epochs from scratch, i.e. without 'imagenet' weights: first with ReLU, and then again with every activation function replaced by Swish.
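In code, each run looked roughly like this (a sketch assuming tf.keras; build_model is a hypothetical helper that assembles one of the six architectures with the chosen activation, and train_ds/val_ds are the dataset splits batched at 32):

import tensorflow as tf

def run_experiment(build_model, activation, train_ds, val_ds):
    # build_model is a hypothetical helper that creates one of the six
    # architectures from scratch (no 'imagenet' weights) with every ReLU
    # replaced by the given activation.
    model = build_model(activation=activation, num_classes=200)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_ds, validation_data=val_ds, epochs=25)

# One run with ReLU, one with Swish, per architecture:
# run_experiment(build_mobilenet, "relu", train_ds, val_ds)
# run_experiment(build_mobilenet, Swish(), train_ds, val_ds)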

The results were amazing; changing a simple activation function made this much of a difference in accuracy. (Regarding the ResNet model, there was a slight mistake in the convolutions while creating it, so its accuracy is low, but a model's a model.)

Comparison Table

If it's that good, then why isn't everybody using it?

With a great activation function comes great computing cost.

Processing time comparison

The Swish function contains a sigmoid, which in turn contains an exponential. Computing the exponential takes more time than ReLU's simple max(0, x), and this is clearly reflected in the training time.

The epoch time increased considerably. The values in the table are approximate averages over all epochs.
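To get a feel for where the overhead comes from, here is a rough CPU micro-benchmark with NumPy (my own illustration, not the measurement behind the table above; absolute numbers will vary with hardware and framework):

import timeit
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# ReLU is a single elementwise max; Swish, written as x / (1 + exp(-x)),
# needs an exponential, an addition and a division on top of that.
relu_time = timeit.timeit(lambda: np.maximum(x, 0.0), number=200)
swish_time = timeit.timeit(lambda: x / (1.0 + np.exp(-x)), number=200)

print(f"ReLU : {relu_time:.3f} s")
print(f"Swish: {swish_time:.3f} s")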

Deep learning models are all about trial and error: finding the model, fine-tuning it, adding more data, retraining. In all of this, time becomes an important constraint, so whether to use the Swish function depends on the use case and the trade-off you are willing to make.

Moreover, all the above-mentioned models have newer versions, released with Swish as the activation function or otherwise optimized for further improvement.
