Meet Mish — New State of the Art AI Activation Function. The successor to ReLU?
A new paper by Diganta Misra titled “Mish: A Self Regularized Non-Monotonic Neural Activation Function” introduces the deep learning world to a new activation function that shows improvements in final accuracy over both Swish (+0.494%) and ReLU (+1.671%).
Our small FastAI team used Mish in place of ReLU as part of our efforts to beat the previous accuracy scores on the FastAI global leaderboard. Combining Ranger optimizer, Mish activation, Flat + Cosine anneal and a self attention layer, we were able to capture 12 new leaderboard records!
As part of our own testing, for 5 epoch testing on the ImageWoof dataset, we can say that:
Mish beats ReLU at a high significance level (p < 0.0001). (FastAI forums, @Seb)
Mish has been tested on over 70 benchmarks, spanning image classification, segmentation, and generation, and compared against 15 other activation functions.
I made a PyTorch implementation of Mish, dropped it in for ReLU with no other changes and tested it with a broad spectrum of optimizers (Adam, Ranger, RangerLars, Novograd, etc) on the difficult ImageWoof dataset.
I found Mish delivered across-the-board improvements in training stability, average accuracy (1–2.8%), and peak accuracy (1.2–3.6%), matching or exceeding the results in the paper.
Below is Ranger Optimizer + Mish compared to the FastAI leaderboards:
This was achieved by simply replacing ReLU with Mish in FastAI’s XResNet50 and running with various optimizers (Ranger results above). Nothing else was changed, including the learning rate. **Note:** it’s very likely better results can be achieved with learning rates tuned for Mish; the paper suggests using lower learning rates than for ReLU.
Mish checks all the boxes of what an ideal activation function should be (smooth, handles negatives, etc.), and delivers in a broad suite of initial testing. I have tested a large number of new activation functions over the past year, and most fall down when moving from the papers’ trivial networks on MNIST to more realistic datasets. Thus, it appears Mish may finally deliver a new state-of-the-art activation function for deep learning practitioners, with a solid chance of overtaking the long-reigning ReLU.
I provide Mish via PyTorch code link below, as well as a modified XResNet (MXResNet) so you can quickly drop Mish into your code and immediately test for yourself!
Let’s step back though, and understand what Mish is, why it likely improves training over ReLU, and some basic steps on using Mish in your neural networks.
What is Mish?
I think it’s simpler to see Mish in code, but the simple summary is Mish =
x * tanh(ln(1+e^x)).
For reference, ReLU is max(0, x) and Swish is x * sigmoid(x).
The PyTorch implementation of Mish:
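A minimal version, written directly from the formula above (the repos linked below add an inlined variant for speed):

```python
import torch
import torch.nn.functional as F
from torch import nn

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```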
The Mish function in Tensorflow:
x = x * tf.math.tanh(tf.math.softplus(x))
How does Mish compare to other activation functions?
The Mish image from the paper shows testing results of Mish versus a number of other activations. This is the result of up to 73 tests on a variety of architectures for a number of tasks:
Why does Mish perform well?
Being unbounded above (i.e. positive values can go to any height) avoids saturation due to capping. The slight allowance for negative values in theory allows for better gradient flow vs a hard zero bound as in ReLU.
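A quick numeric check makes these two properties concrete (plain Python, with mish written straight from the formula):

```python
import math

def mish(x):
    # mish(x) = x * tanh(ln(1 + e^x)) = x * tanh(softplus(x))
    return x * math.tanh(math.log1p(math.exp(x)))

print(mish(10.0))   # ~10.0: unbounded above, no saturation from capping
print(mish(-1.0))   # ~-0.30: small negative values pass through (ReLU gives 0)
print(mish(-5.0))   # ~-0.03: the negative tail decays smoothly toward zero
```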
Finally, and likely most importantly, current thinking is that smooth activation functions allow for better information propagation deeper into the neural network, and thus better accuracy and generalization.
That being said, I’ve tested a number of activation functions that also met many of these ideals, and most failed to perform. The main difference here is likely the smoothness of the Mish function at nearly all points on the curve.
This ability to push information through via the smoothness of the activation curve is exhibited below, in a simple test from the paper where more and more layers, without identity connections, were added to a test neural network. As the layer depth increases, ReLU accuracy rapidly declines, followed by Swish. By contrast, Mish preserves accuracy far better, likely due to its ability to propagate information more effectively:
How you can put Mish to work in your neural nets!
Source code for Mish in PyTorch and FastAI is available on GitHub in three places:
1 — Official Mish github: https://github.com/digantamisra98/Mish
2 — Unofficial Mish, with an inlined version for speed: https://github.com/lessw2020/mish
3 — Our repo used to set the FastAI leaderboard records with Ranger and Mish (includes MXResNet):
Copy mish.py into your relevant directory, import it, and then point your network’s activation function to it:
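For illustration, here is a hypothetical drop-in (Mish is defined inline so the snippet is self-contained; in practice you would import it from the copied mish.py):

```python
import torch
import torch.nn.functional as F
from torch import nn

# Stand-in definition; in practice: from mish import Mish
class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# Swap Mish in anywhere an activation layer appears:
model = nn.Sequential(
    nn.Linear(784, 128),
    Mish(),             # was nn.ReLU()
    nn.Linear(128, 10),
)
```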
Alternatively, FastAI users can load and test with FastAI’s XResNet modified to use Mish instead of ReLU. Copy the file mxresnet.py to your local directory or path and import it.
Next, specify the relevant ResNet size (18, 34, 50, 101, 152) and pass it as the architecture when building the learner (cnn_learner, etc.). Below is how I loaded it as mxresnet50 with the Ranger optimizer:
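My exact notebook isn’t reproduced here, but a sketch along FastAI v1 lines looks roughly like the following; the names `mxresnet50`, `Ranger`, and the `c_out` argument are assumptions based on the repos above, so adjust them to whatever you actually cloned:

```python
def make_mish_learner(path, size=128):
    """Hypothetical FastAI v1 setup: MXResNet50 (Mish) + Ranger."""
    from functools import partial
    from fastai.vision import ImageDataBunch, Learner, accuracy  # FastAI v1 API
    from mxresnet import mxresnet50  # assumed: XResNet50 with Mish, from mxresnet.py
    from ranger import Ranger        # assumed: a Ranger optimizer implementation

    data = ImageDataBunch.from_folder(path, size=size)
    return Learner(data, mxresnet50(c_out=data.c),
                   opt_func=partial(Ranger),
                   metrics=[accuracy])
```

From there, `learn.fit_one_cycle(5)` gives a 5-epoch run comparable to the tests above.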
And you are up and running with a Mish XResNet!
ReLU has some known weaknesses, yet it performs well and is computationally lightweight. Mish has a stronger theoretical pedigree, and in testing delivers, on average, superior performance over ReLU in terms of both training stability and accuracy.
There is only a minor increase in computational cost (on a V100 GPU, Mish adds about 1 second per epoch versus ReLU), and given the improved training stability and final accuracy, it seems well worth the minor time increase.
Ultimately, after having tested a large number of new activation functions this year, Mish has the lead here, and I suspect has a strong chance of becoming the new ReLU for AI going forward.
Put Mish to the test in your deep learning and feel free to provide feedback below, positive or negative!
PyTorch drop-in code — Mish function and Mish XResNet: https://github.com/lessw2020/mish