Swish Function: Experimentation on 6 Models

Sumeet Badgujar
Published in Analytics Vidhya · 3 min read · Jul 29, 2021

Recently I used EfficientNet for one of my projects, and the prediction accuracy was much better than that of the previously trained models. I knew that was going to happen; since EfficientNet is the latest architecture, it ought to be better. But why was it better? This question intrigued me a lot.

So I said to myself: time to look underneath. And lo and behold, underneath the new architecture, which includes compound scaling and more, there was one new thing: the Swish function.

A new activation function. It is represented as x*sigmoid(x).

Swish Function

A more general way of representing it is x*sigmoid(b*x), where b is a trainable parameter. What's good about the new function?

  • Smooth (differentiable everywhere), unlike ReLU, which is piecewise linear with a kink at zero.
  • Helps avoid the vanishing gradient problem.
  • A trainable parameter (b) to better tune the gradient flow.
  • Solves the dying ReLU problem, since small negative inputs still produce a non-zero output and gradient.
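As a quick illustration, here is a minimal sketch of the x*sigmoid(b*x) form as a custom Keras layer with a trainable b (the framework is my assumption; the article does not name one, and any framework with automatic differentiation works the same way):

import tensorflow as tf

class Swish(tf.keras.layers.Layer):
    """Swish activation: x * sigmoid(beta * x), with beta learned during training."""

    def build(self, input_shape):
        # One trainable scalar beta, initialised to 1.0 (i.e. plain x * sigmoid(x)).
        self.beta = self.add_weight(name="beta", shape=(),
                                    initializer="ones", trainable=True)

    def call(self, inputs):
        return inputs * tf.sigmoid(self.beta * inputs)

Recent TensorFlow versions also ship a fixed-b variant as tf.keras.activations.swish, usable by passing activation="swish" to most layers.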

Seeing the use of the Swish function in the EfficientNet paper, an idea occurred to me.

What would happen if ReLU was replaced by Swish?

Let the experiments begin. No chemicals were used in the experiments and no animals were harmed.

Photo by Bermix Studio on Unsplash

For the data, I chose Caltech-UCSD Birds-200-2011.

  • Number of categories: 200
  • Number of images: 11,788
  • Split: 70/20/10

As for the models, I selected DenseNet 121, MobileNet, MobileNetV2, ShuffleNet, SqueezeNet and ResNet50.

As for the training parameters, they were kept constant across all runs:

  • Optimizer: Adam
  • Batch size: 32

Each model was trained for 25 epochs from scratch, i.e. without 'imagenet' weights: first with ReLU, and then again with every activation function replaced by Swish.
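In code, each run looked roughly like this (a sketch assuming tf.keras; build_model is a hypothetical helper that assembles one of the six architectures with the chosen activation, and train_ds/val_ds are the dataset splits batched at 32):

import tensorflow as tf

def run_experiment(build_model, activation, train_ds, val_ds):
    # build_model is a hypothetical helper that creates one of the six
    # architectures from scratch (no 'imagenet' weights) with every ReLU
    # replaced by the given activation.
    model = build_model(activation=activation, num_classes=200)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_ds, validation_data=val_ds, epochs=25)

# One run with ReLU, one with Swish, per architecture:
# run_experiment(build_mobilenet, "relu", train_ds, val_ds)
# run_experiment(build_mobilenet, Swish(), train_ds, val_ds)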

The results were amazing; changing a simple activation function made this much of a difference in accuracy. (Regarding the ResNet model, there was a slight mistake in the convolutions while creating it, so its accuracy is low, but a model's a model.)

Comparison Table

If it's that good, then why isn't everybody using it?

With a great activation function comes great computing cost.

Processing time comparison

The Swish function contains a sigmoid, which in turn contains an exponential. Computing the exponential takes more time than ReLU's simple max(0, x), and this is clearly reflected in the training time.

The epoch time increased considerably. The values in the table are approximate averages over all epochs.
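To get a feel for where the overhead comes from, here is a rough CPU micro-benchmark with NumPy (my own illustration, not the measurement behind the table above; absolute numbers will vary with hardware and framework):

import timeit
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# ReLU is a single elementwise max; Swish, written as x / (1 + exp(-x)),
# needs an exponential, an addition and a division on top of that.
relu_time = timeit.timeit(lambda: np.maximum(x, 0.0), number=200)
swish_time = timeit.timeit(lambda: x / (1.0 + np.exp(-x)), number=200)

print(f"ReLU : {relu_time:.3f} s")
print(f"Swish: {swish_time:.3f} s")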

Deep learning models are all about trial and error: finding the model, fine-tuning it, adding more data, retraining. In all of this, time becomes an important constraint, so whether to use the Swish function depends on the use case and the trade-off you are willing to make.

Moreover, all the above-mentioned models have newer versions, released with Swish as the activation function or otherwise optimized for further improvement.
