Exotic Flora: Activation Functions You Haven’t Heard Of
So recently, while working on my project to understand ChatGPT by reading the papers it recommended, I came across something interesting. In the paper “Improving Language Understanding by Generative Pre-Training” they used an activation function I didn’t recognize.
I was obviously aware that there are activation functions beyond the usual suspects (namely: Step, Sigmoid, Tanh, ReLU and Softmax), but this was the first time I had seen one not only in use, but used in a landmark paper.
So I decided to find some more strange activation functions and collect three of them in a list. My goal here is to find at least one you’ve never heard of before. If I succeed, you have to leave a comment telling me which one was new to you. If you know all of them, tell me one I missed.
If this is popular then I’ll do the same for Optimizers and Objective functions as well. Make sure you subscribe to get updates on all my articles!
GELU
GELU, or Gaussian Error Linear Units, is the activation function from Generative Pre-Training that inspired me to write this article. Its formulation is incredibly interesting. First proposed in the 2016 paper of the same name [1], GELU claims to accomplish two tasks at once by combining the benefits of dropout and ReLU-style activation.
It accomplishes this by taking the neuron’s output and multiplying it by either 0 or 1, with the choice drawn from a Bernoulli distribution whose probability is Φ(x), the standard Gaussian CDF of the input. GELU itself is the expected value of this stochastic transformation:
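$$\text{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$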
Where erf, the error function, is defined as below. The important thing to note is that this function is bounded between -1 and 1.
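$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\,dt$$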
What this means is that the probability of a neuron’s output being dropped increases as x gets smaller, so neurons with smaller values are more likely to be zeroed out. In practice we can approximate GELU with either of the equations below to speed up computation.
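$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$

$$\text{GELU}(x) \approx x\,\sigma(1.702\,x)$$

where σ denotes the logistic sigmoid.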
Results and Uses
Testing in the paper found that GELU outperformed ReLU and another less common activation function, ELU, on many different tasks. Using GELU in deep neural networks improved results on several image classification benchmarks and on tagging parts of speech in Tweets (a task I’m not sure I could do at all).
Plus, as was mentioned above, it was used to great effect in the Generative Pre-Training paper, which is one of the foundations of Language Models. A pretty ringing endorsement to me.
Is there a PyTorch implementation?
There sure is!
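It ships as torch.nn.GELU, with a functional counterpart in torch.nn.functional.gelu. A minimal usage sketch (the layer sizes here are just placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Module form: drop nn.GELU() in wherever you'd normally put nn.ReLU().
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)

x = torch.randn(4, 128)
print(model(x).shape)  # torch.Size([4, 10])

# Functional form, for hand-written forward() methods.
print(F.gelu(torch.linspace(-3.0, 3.0, steps=7)))
```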
Swish
First of all, if you’re the kind of person who is reading this article, you probably want to check out the 2017 paper “Searching for Activation Functions” [3]. In addition to the fact that it has a title like a Pixar film, it’s a really interesting dive into activation functions generally.
Sometimes it can feel like choosing an activation is more art than science. The paper is a great discussion of how activation functions are structured and grouped, and of the factors that appear to make some of them more successful than others.
All that aside, the function they land on is simple: multiply the output x of each neuron by the sigmoid of x scaled by a parameter β.
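$$\text{Swish}(x) = x\,\sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$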
The authors argue that this function can be thought of as a smooth transition between a linear function and ReLU: as β approaches 0 it approaches a (scaled) linear function, and as β approaches infinity it becomes ReLU-like. They propose that β could be learned as an extra parameter during training, although in the paper’s experiments the trained values of β mostly end up near one.
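To make that concrete, here is a minimal sketch of a Swish module with a single shared trainable β (my own toy version, not an official implementation):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish(x) = x * sigmoid(beta * x), with beta learned during training."""

    def __init__(self, beta: float = 1.0):
        super().__init__()
        # One shared scalar beta; the optimizer updates it like any other weight.
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# beta = 1 recovers SiLU; very large beta approaches ReLU.
act = Swish()
print(act(torch.linspace(-3.0, 3.0, steps=7)))
```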
Results and Uses
Testing in the paper found that Swish outperformed activation functions like ReLU, ELU, GELU and others on many different tasks. The problem? The best value of β varied across tasks. That makes Swish a bit awkward to actually use, because it turns out people aren’t especially fond of having an extra parameter in their activation function to worry about.
Is there a PyTorch implementation?
Sort of? There is an implementation as long as we drop the trainable parameter in Swish and just fix β to one. That is equivalent to a function called SiLU, the Sigmoid Linear Unit, which is essentially GELU with the Logistic distribution’s CDF in place of the Gaussian’s. There is also an implementation in TensorFlow, though it is unclear whether it exposes a tunable β either.
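Here is the built-in version in action, next to the formula written out by hand:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

silu = nn.SiLU()             # Swish with beta fixed at 1
print(silu(x))
print(F.silu(x))             # functional equivalent
print(x * torch.sigmoid(x))  # the same formula by hand
```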
Hardswish
Just like the name suggests, this function is a modification of Swish. It comes from an image modeling paper called “Searching for MobileNetV3” [2]. If you’re interested in computer vision as a discipline, you’ve probably heard of the MobileNet models: a family of image models built to run on CPUs (specifically, on mobile phones), so they need to be both accurate and light.
As such, Hardswish is an attempt to preserve the accuracy gains of Swish over ReLU while removing the need to compute the sigmoid function, which can be costly on limited hardware like a mobile phone CPU. Instead, they propose a piece-wise function:
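$$\text{Hardswish}(x) = \begin{cases} 0 & \text{if } x \le -3 \\ x & \text{if } x \ge 3 \\ \dfrac{x\,(x + 3)}{6} & \text{otherwise} \end{cases}$$

which is just $x \cdot \mathrm{ReLU6}(x + 3)/6$, built entirely from cheap operations.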
This looks a lot like Swish over its range.
Results and Uses
As previously mentioned, this function is much cheaper to compute on a CPU than Swish, and it was used extensively in the MobileNetV3 paper. The main gain is not accuracy but speed: compared to Swish it was 6 ms faster (about 10% of the total runtime), and it was only 1 ms slower than ReLU while still providing more accurate results.
Is there a PyTorch implementation?
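There is: torch.nn.Hardswish, with a functional form in torch.nn.functional.hardswish. A quick sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=9)

hswish = nn.Hardswish()
print(hswish(x))
print(F.hardswish(x))          # functional form

# The same piece-wise formula written out with ReLU6:
print(x * F.relu6(x + 3) / 6)
```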
Conclusion
Well, I sure hope you found this exploration of some of the lesser known activation functions useful (or at the very least entertaining).
Don’t forget to leave a comment if at least one of these was new to you, and let me know if you have any strange ones you think I should know about. Be sure to check out my ongoing project to understand ChatGPT and make sure you subscribe to see more of my stuff!
References
[1] Hendrycks, D., and Gimpel, K. “Gaussian Error Linear Units (GELUs).” arXiv preprint arXiv:1606.08415 (2016).
[2] Howard, A., et al. “Searching for MobileNetV3.” Proceedings of the IEEE/CVF International Conference on Computer Vision (2019).
[3] Ramachandran, P., Zoph, B., and Le, Q. V. “Searching for Activation Functions.” arXiv preprint arXiv:1710.05941 (2017).