Mish Activation Function In YOLOv4

Miracle R · Published in Clique Community · Jun 24, 2020

This is my first blog, and I have decided to write a short description of one of the activation functions used in YOLOv4. YOLO, or You Only Look Once, is a one-stage object detection technique introduced by Joseph Redmon and Ali Farhadi in 2016, and there are already four versions of it. Here, we will look at YOLOv4, specifically its performance optimizers: the two “Bags” of optimization techniques it uses, the “Bag of Freebies (BoF)” applied during training and the “Bag of Specials (BoS)” applied during inference.

The Bag of Specials contains low-computational-cost modules for both the backbone and the detector of the YOLOv4 architecture. These are listed in the figure below:

Source: YOLOv4 — Part 3: Bag of Specials | VisionWizard

Here, we can see that the Mish activation function is present in both the backbone and the detector. So, what makes it “special”? Let’s understand a bit more about this activation function.

Mish Activation Function:

Mish is a smooth, non-monotonic activation function that can be defined as:

f(x) = x · tanh(ς(x))

where ς(x) = ln(1 + e^x) is the softplus activation function.

Source: YOLOv4 — Part 3: Bag of Specials | VisionWizard
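To make the definition concrete, here is a minimal pure-Python sketch that simply evaluates the formula x · tanh(ln(1 + e^x)) at a few points; the values printed are only illustrative:

import math

def mish(x):
    # f(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * math.tanh(math.log1p(math.exp(x)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")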

This is very similar to another activation function called Swish, which can be defined as f(x) = x · σ(βx), where σ is the sigmoid function:

Source: Swish: Booting ReLU from the Activation Function Throne
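For comparison, here is a small sketch of Swish with β = 1 (x · sigmoid(x)) next to Mish; again, the printed values are only illustrative:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Swish with beta = 1 (also known as SiLU): f(x) = x * sigmoid(x)
    return x * sigmoid(x)

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:+.1f}  swish = {swish(x):+.4f}  mish = {mish(x):+.4f}")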

Mish is used in YOLOv4 because of its low cost and its properties: it is smooth and non-monotonic, unbounded above, and bounded below, which improves its performance compared with other popular functions like ReLU (Rectified Linear Unit) and Swish. The properties of Mish are explained in detail below:

  1. Unbounded above and bounded below: Being unbounded above is a desirable property for any activation function, since it avoids the saturation that slows training down drastically, and therefore speeds up the training process. Being bounded below helps achieve a strong regularization effect (the model fits properly). Mish shares these properties with ReLU and Swish; its range is [≈ -0.31, ∞). (A numeric check of this lower bound is sketched after this list.)
  2. Non-monotonic function: This property helps preserve small negative values, stabilizing the network’s gradient flow. A commonly used activation function like ReLU [f(x) = max(0, x)] fails to preserve negative values: its output and its derivative are both 0 for negative inputs, so many neurons never get updated, while Leaky ReLU [f(x) = max(αx, x), with a small slope α] only scales negative values linearly rather than shaping them.
  3. Infinite order of continuity and smoothness: Being a smooth function, Mish tends to improve results, since it generalizes better and is easier to optimize. In the figure below, the drastic difference in the smoothness of the output landscape of a randomly initialized neural network between ReLU and Mish is visible; between Swish and Mish, however, the landscape remains more or less the same.
Source: Mish: A Self Regularized Non-Monotonic Neural Activation Function

  4. High computational cost, but better performance: Mish is more expensive to compute than ReLU, but it gives better results in deep neural networks.

Source: Mish: A Self Regularized Non-Monotonic Neural Activation Function

  5. Self-gating: This property is inspired by the Swish function: the input is multiplied by the output of a non-linear function of that same input, which acts as a gate. Because the gate takes a single scalar input, Mish can replace point-wise activation functions like ReLU without requiring any change to the network parameters.
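The first two properties can be checked numerically. Below is a minimal pure-Python sketch (values are only illustrative) that scans a grid to approximate Mish’s lower bound and uses a finite-difference derivative to show that, unlike ReLU, Mish passes a non-zero gradient for a negative input:

import math

def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

def relu(x):
    return max(0.0, x)

def deriv(f, x, h=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# Property 1: Mish is bounded below at roughly -0.31.
grid = [i / 1000.0 for i in range(-10000, 10001)]
print("approximate minimum of Mish:", min(mish(x) for x in grid))

# Property 2: for a negative input, ReLU's gradient is 0 while Mish's is not.
print("d mish/dx at x = -1:", deriv(mish, -1.0))   # small but non-zero
print("d relu/dx at x = -1:", deriv(relu, -1.0))   # exactly 0.0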

Python Implementation:

The Mish function can be implemented in Python using PyTorch as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # f(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
        return x * torch.tanh(F.softplus(x))
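As a quick sanity check (assuming a recent PyTorch install, 1.9 or later, where the same activation also ships as torch.nn.Mish / F.mish), the custom module can be applied to a tensor and compared against the built-in version:

act = Mish()
x = torch.randn(4)
print(act(x))
print(torch.allclose(act(x), F.mish(x)))  # expected: True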

Conclusion:

The Mish function has outperformed popular activation functions like ReLU and Swish on over 70 different benchmarks, on challenging datasets such as CIFAR-10, CIFAR-100, CalTech-256, and ASL. The figure below shows the performance of Mish, Swish, and ReLU on the CIFAR-10 dataset across various models; it can be seen that Mish outperforms Swish by approximately 0.494% and ReLU by 1.671%, making it the most accurate of the three:

Source: Mish: A Self Regularized Non-Monotonic Neural Activation Function

In YOLOv4, the Mish function is combined with the CSPDarknet53 backbone; although this is slightly more costly, it improves the detection accuracy by a significant amount, making Mish one of the “Specials”.

Source: YOLOv4: Optimal Speed and Accuracy of Object Detection

REFERENCES:

[1] Mish: A Self Regularized Non-Monotonic Neural Activation Function.

[2] YOLOv4: Optimal Speed and Accuracy of Object Detection.

[3] Mish Code.
