Soft Sign Activation Function with Tensorflow [ Manual Back Prop with TF ]


So the paper “Quadratic Features and Deep Architectures for Chunking” has been around since 2009; however, I only recently found out about the activation function introduced in that paper while reading “Understanding the difficulty of training deep feed-forward neural networks”. (I also did a paper summary on that paper; if anyone is interested, please click here.)

Just for fun, let’s compare a few different cases to see which gives us the best results.

Case a) Tanh Activation Function with AMS Grad
Case b) Soft Sign Activation Function with AMS Grad
Case c) ELU Activation Function with AMS Grad
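
All three cases pair their activation with the AMSGrad optimizer. As a rough reminder of what that optimizer does, here is a plain NumPy sketch of the update rule (this is only an illustration, not the manual TensorFlow implementation used in the linked code): AMSGrad is Adam with the second-moment estimate replaced by its running maximum.

```python
import numpy as np

def amsgrad_update(param, grad, m, v, v_hat,
                   lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-style moment estimates
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # AMSGrad: keep the running maximum of v so the effective step size never grows back
    v_hat = np.maximum(v_hat, v)
    param = param - lr * m / (np.sqrt(v_hat) + eps)
    return param, m, v, v_hat
```

The hyper-parameter values above are just the usual Adam defaults, used here as placeholders.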



Soft Sign Activation Function

Red Line → Soft Sign Activation Function
Blue Line → Tanh Activation Function
Green Line → Derivative for Soft Sign Function
Orange Line → Derivative for Tanh Activation Function

As seen above, we can directly observe that the soft sign activation function is smoother than the tanh activation function. (Specifically, its tails approach their asymptotes polynomially rather than exponentially.) This gentler non-linearity can actually result in faster learning. In a bit more detail, researchers have found that soft sign prevents neurons from becoming saturated, resulting in more effective learning. (For more information, please read this blog post.)
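
To make the curves above concrete, here is a small NumPy sketch of soft sign, its derivative, and the tanh derivative. The formulas f(x) = x / (1 + |x|), f′(x) = 1 / (1 + |x|)², and tanh′(x) = 1 − tanh²(x) follow directly from the definitions; the numbers in the comments are only illustrative.

```python
import numpy as np

def softsign(x):
    # f(x) = x / (1 + |x|): approaches +/-1 polynomially
    return x / (1.0 + np.abs(x))

def d_softsign(x):
    # f'(x) = 1 / (1 + |x|)^2: decays polynomially, so the gradient
    # vanishes far more gently than tanh's exponentially shrinking gradient
    return 1.0 / (1.0 + np.abs(x)) ** 2

def d_tanh(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-6, 6, 7)
print(d_softsign(x))  # still ~0.02 at |x| = 6
print(d_tanh(x))      # already ~2e-5 at |x| = 6
```

TensorFlow also exposes the forward pass as tf.nn.softsign, but since this post does the back propagation manually, the derivative above is the part that actually matters.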


Network Architecture / Data Set

Red Box → Input Image
Black Box → Convolution Operation with Different Activation Functions
Orange Box → Soft Max Operation for Classification

As seen above, the base network architecture we are going to use is the all convolutional network presented at ICLR 2015. Finally, the dataset we are going to use to evaluate our network is CIFAR-10.
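
As a rough, TensorFlow 1.x-style sketch of how the three cases differ only in the activation passed to each convolution (the filter sizes and layer names below are placeholders, not the exact all convolutional configuration, and the manual back-propagation part is omitted):

```python
import tensorflow as tf

def conv_block(x, in_ch, out_ch, act=tf.nn.softsign, stride=1, name="conv"):
    # one convolution + activation; swap `act` for tf.nn.tanh, tf.nn.softsign,
    # or tf.nn.elu to reproduce cases a), b), and c)
    w = tf.Variable(tf.random_normal([3, 3, in_ch, out_ch], stddev=0.05),
                    name=name + "_w")
    layer = tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding="SAME")
    return act(layer)

# CIFAR-10 images are 32x32 RGB; the all convolutional design stacks blocks
# like this and ends with global average pooling plus a soft max classifier.
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
h = conv_block(x, 3, 96, act=tf.nn.tanh, name="block1")  # case a)
```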


Results: Case a) Tanh Activation Function with AMS Grad (CNN)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

As seen above, for the image classification task, using tanh with the all convolutional network seems to take longer to converge. At the end of the 21st epoch the model was only able to achieve 69% accuracy on the training images (with 68% accuracy on the testing images); however, the model seems to do a great job at regularization.


Results: Case b) Soft Sign Activation Function with AMS Grad (CNN)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

I was actually surprised that the soft sign activation function performed worse than tanh. Taking longer to converge does not always mean that an activation function is bad; however, for this setup the soft sign activation function might not be the optimal choice.


Results: Case c) ELU Activation Function with AMS Grad (CNN)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

With image classification it seems like traditional ReLU-like activations are the best choice, since they not only achieve the highest accuracy on the training/testing images but also converge faster.
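
For reference, ELU keeps the identity for positive inputs and only saturates (exponentially, toward −α) on the negative side, so unlike tanh and soft sign it never squashes large positive activations. A minimal NumPy version, with α = 1 as an assumed default:

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for x > 0, smooth exponential saturation toward -alpha otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def d_elu(x, alpha=1.0):
    # the gradient is exactly 1 for x > 0, so positive activations never saturate
    return np.where(x > 0, 1.0, alpha * np.exp(x))
```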


Interactive Code

For Google Colab, you would need a Google account to view the code, and you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! Also, for transparency, I have uploaded all of the training logs to my GitHub.

To access the code for case a click here, for the logs click here.
To access the code for case b click here, for the logs click here.
To access the code for case c click here, for the logs click here.


Final Words

Also, this blog post did an amazing job of comparing many other activation functions, so please check it out if you are interested.

If any errors are found, please email me at jae.duk.seo@gmail.com, if you wish to see the list of all of my writing please view my website here.

Meanwhile follow me on my twitter here, and visit my website, or my Youtube channel for more content. I also implemented Wide Residual Networks, please click here to view the blog post.


Reference

  1. Turian, J., Bergstra, J., & Bengio, Y. (2009). Quadratic features and deep architectures for chunking. Proceedings Of Human Language Technologies: The 2009 Annual Conference Of The North American Chapter Of The Association For Computational Linguistics, Companion Volume: Short Papers, 245–248. Retrieved from https://dl.acm.org/citation.cfm?id=1620921
  2. Softsign as a Neural Networks Activation Function — Sefik Ilkin Serengil. (2017). Sefik Ilkin Serengil. Retrieved 28 May 2018, from https://sefiks.com/2017/11/10/softsign-as-a-neural-networks-activation-function/
  3. Deep study of a not very deep neural network. Part 2: Activation functions. (2018). Towards Data Science. Retrieved 28 May 2018, from https://towardsdatascience.com/deep-study-of-a-not-very-deep-neural-network-part-2-activation-functions-fd9bd8d406fc
  4. (2018). Proceedings.mlr.press. Retrieved 28 May 2018, from http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
  5. [ Paper Summary ] Understanding the difficulty of training deep feed-forward neural networks. (2018). Medium. Retrieved 28 May 2018, from https://medium.com/@SeoJaeDuk/paper-summary-understanding-the-difficulty-of-training-deep-feed-forward-neural-networks-ee34f6447712
  6. Derivative Hyperbolic Functions. (2018). Math2.org. Retrieved 28 May 2018, from http://math2.org/math/derivatives/more/hyperbolics.htm
  7. Implementation of Optimization for Deep Learning Highlights in 2017 (feat. Sebastian Ruder). (2018). Medium. Retrieved 28 May 2018, from https://medium.com/@SeoJaeDuk/implementation-of-optimization-for-deep-learning-highlights-in-2017-feat-sebastian-ruder-61e2cbe9b7cb
  8. Wolfram|Alpha: Making the world’s knowledge computable. (2018). Wolframalpha.com. Retrieved 28 May 2018, from http://www.wolframalpha.com/input/?i=f(x)+%3D+x%2F(1%2B%7Cx%7C)
  9. Numpy Vector (N, ). (2018). Numpy Vector (N,1) dimension -> (N,) dimension conversion. Stack Overflow. Retrieved 29 May 2018, from https://stackoverflow.com/questions/17869840/numpy-vector-n-1-dimension-n-dimension-conversion
  10. CIFAR-10 and CIFAR-100 datasets. (2018). Cs.toronto.edu. Retrieved 29 May 2018, from https://www.cs.toronto.edu/~kriz/cifar.html
  11. [ ICLR 2015 ] Striving for Simplicity: The All Convolutional Net with Interactive Code [ Manual…. (2018). Towards Data Science. Retrieved 29 May 2018, from https://towardsdatascience.com/iclr-2015-striving-for-simplicity-the-all-convolutional-net-with-interactive-code-manual-b4976e206760
  12. [ Google ] Continuously Differentiable Exponential Linear Units with Interactive Code [ Manual Back…. (2018). Medium. Retrieved 29 May 2018, from https://medium.com/@SeoJaeDuk/google-continuously-differentiable-exponential-linear-units-with-interactive-code-manual-back-2d0a56dd983f