Swish in depth: A comparison of Swish & ReLU on CIFAR-10

Jaiyam Sharma
5 min read · Oct 22, 2017


Hi there, this is a continuation of my previous blog post about the Swish activation function, recently published by a team at Google. If you are not familiar with the Swish activation (mathematically, f(x)=x*sigmoid(x)), please check out the paper for an in-depth understanding or my blog post for a TL;DR. According to the paper, Swish often performs better than ReLU. But many people have pointed out that it is also more computationally intensive than ReLU. In my previous post I showed how Swish performs relative to ReLU and sigmoid on a 2 hidden layer neural network trained on MNIST. Although a simple 2 layer network is a good starting point, one cannot really generalize the results to most problems encountered in practical use cases. In this post, I will compare the performance of Swish and ReLU on convolutional neural networks (CNNs) and show exactly how much slower Swish is than ReLU. Let's get started.
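For reference, a minimal sketch of Swish alongside ReLU might look like the following in TensorFlow (this is just an illustration, not the exact code from my repo):

```python
import tensorflow as tf

def swish(x):
    # Swish: f(x) = x * sigmoid(x)
    return x * tf.sigmoid(x)

def relu(x):
    # ReLU for comparison: f(x) = max(0, x)
    return tf.nn.relu(x)
```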

BYOC (Build Your Own CNN)

Since I was going to make my code open source, I decided to write it so that someone with more computational resources can change just one line of code and train as deep a model as they like. This is the idea behind BYOC. To make this possible, I used a ResNet-like architecture with its famous computational bottleneck units, each consisting of three convolutional layers of 1x1, 3x3 and 1x1 convolutions, as shown below:

Structure of the computational bottleneck (source: ResNet paper)

With this idea, I designed my model such that adding convolutional layers is as simple as setting the variable n_layers to some number. Hence the acronym BYOC. Feel free to check out my GitHub repo and train a better model. For the purposes of this blog, I trained two models, a 6 layer one and a 12 layer one. These are end-to-end convolutional networks with no fully connected layer except at the output.
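As a rough illustration of the idea (not the exact code from the repo), a bottleneck block stacked to a given n_layers might look like this in tf.keras; the layer widths, names and the use of Keras here are illustrative choices of mine:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck(x, filters, activation):
    # 1x1 -> 3x3 -> 1x1 bottleneck, as in the ResNet paper.
    shortcut = x
    y = layers.Conv2D(filters, 1, padding='same')(x)
    y = layers.Activation(activation)(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Activation(activation)(y)
    y = layers.Conv2D(4 * filters, 1, padding='same')(y)
    if shortcut.shape[-1] != 4 * filters:
        # Project the shortcut so the residual addition is valid.
        shortcut = layers.Conv2D(4 * filters, 1, padding='same')(shortcut)
    y = layers.add([y, shortcut])
    return layers.Activation(activation)(y)

def build_model(n_layers=6, filters=16, activation='relu'):
    # Each bottleneck block contributes 3 conv layers, so n_layers
    # should be a multiple of 3 (6 and 12 in this post).
    inputs = layers.Input(shape=(32, 32, 3))      # CIFAR-10 images
    x = layers.Conv2D(filters, 3, padding='same')(inputs)
    for _ in range(n_layers // 3):
        x = bottleneck(x, filters, activation)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(10)(x)                 # logits for the 10 classes
    return tf.keras.Model(inputs, outputs)
```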

Results:

To recap, I am comparing three activations: ReLU, standard Swish and swish_beta (f(x)=x*sigmoid(beta*x), where beta is learned during training). Since the objective was to compare activation functions and not to build a great model, I did not tune the hyperparameters and trained for only 10 epochs. The results I got do not paint a rosy picture for the Swish activation. Below is a sketch of the swish_beta variant, followed by the training accuracy results from the six layer network:
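This is only an illustration of how a trainable beta could be wired up in tf.keras; the class name SwishBeta and the single scalar beta per layer are my assumptions here, not necessarily how the repo implements it:

```python
import tensorflow as tf

class SwishBeta(tf.keras.layers.Layer):
    """f(x) = x * sigmoid(beta * x), with beta learned during training."""

    def build(self, input_shape):
        # A single scalar beta per layer, initialized to 1.0 so training
        # starts from plain Swish.
        self.beta = self.add_weight(name='beta', shape=(),
                                    initializer='ones', trainable=True)

    def call(self, x):
        return x * tf.sigmoid(self.beta * x)
```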

Performance

It is clear that ReLU performs quite well here, much better than Swish. The variables for all networks were initialized from the same random seed, so initialization is not a factor here. I really wanted Swish to do better, especially considering that it is more computationally involved. A comparison of training time for these activations is shown below:

Training time

To be clear, I compared the time of a full forward and backward pass through the whole network, as opposed to just the time of applying the activations, since one won't use the activations in isolation. On my AWS g2.2xlarge instance, with a batch size of 128, ReLU took 200 milliseconds on average to make one full pass, Swish took 11.2% more time, and swish_beta took 12% more time than ReLU.
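For reference, a full forward plus backward pass could be timed roughly as sketched below, assuming a tf.keras model and optimizer (illustrative only; the actual timing code in the repo may differ):

```python
import time
import tensorflow as tf

def time_train_step(model, images, labels, n_runs=50):
    # Times one full forward + backward pass (loss, gradients, update),
    # averaged over n_runs after a warm-up run.
    optimizer = tf.keras.optimizers.SGD()
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    train_step(images, labels)                 # warm up (build the graph)
    start = time.time()
    for _ in range(n_runs):
        train_step(images, labels)
    return (time.time() - start) / n_runs      # seconds per full pass
```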

Inference time

For practical applications, it is more important to know the inference time of a network, since a network deployed in a product only needs to do inference. Here again, as expected, Swish is slower. With a batch size of 100 samples, ReLU took 44 milliseconds on average, whereas Swish took ~21% more time and swish_beta took ~28% more time.
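Inference timing can be sketched the same way, with a forward pass only (again, illustrative rather than the repo's exact code):

```python
import time
import tensorflow as tf

def time_inference(model, images, n_runs=50):
    # Times the forward pass only, averaged over n_runs after a warm-up.
    infer = tf.function(lambda x: model(x, training=False))
    infer(images)                              # warm up
    start = time.time()
    for _ in range(n_runs):
        infer(images)
    return (time.time() - start) / n_runs      # seconds per forward pass
```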

12 layer Network:

The results from the 12 layer network are similar: ReLU has higher accuracy and is much faster during both training and inference. The results are shown below:

Accuracy

Train time:

Inference time:

Interestingly, the gap between ReLU and Swish in both training and inference time has increased as the network has gotten deeper. This is not a great sign: since even a 'lightweight' ResNet is about 50 layers deep, it would be terrible if the gap between ReLU and Swish keeps increasing with the number of layers. But one cannot really extrapolate from just two data points (6 and 12 layers). I wanted to train an 18 layer network as well, but the results from these two cases made me less motivated to do so. I really wanted Swish to work better than ReLU but, at least in these experiments, it didn't. I would be happy to be proven wrong as more people apply it to real world problems, because I, like everyone else, want to be able to train better models.

Takeaways:

Here are a few takeaways from these results:

  1. Swish does not perform as well in these experiments as I expected. ReLU consistently beats Swish on accuracy.
  2. There is a large difference in the training time required for these activations. Even if Swish performs better than ReLU on a problem, the time required to train a good model will be about 15–20% more than with ReLU.
  3. More importantly, the run-time performance of Swish seems to be much slower than ReLU, by 20–30% or more. This is a non-trivial slowdown for cases where real time performance is needed.
  4. Between standard Swish and swish_beta, the beta version certainly performs better. If you find that Swish works for your problem and you care deeply about accuracy, it is a good idea to learn the parameter beta during training as well. Of course, the higher accuracy comes at the cost of higher training and inference time, as we saw from the graphs above.

Code

All the code for reproducing these results and training more models with BYOC is uploaded on my GitHub. If you find any bugs or have difficulty in understanding the code, feel free to contact me.

