Flatten-T Swish: A Thresholded ReLU-Swish-like Activation Function for Deep Learning

Joshua Chieng
4 min read · Aug 1, 2018

Activation functions are essential for deep learning methods to learn and perform complex tasks such as image classification. Since the rise of deep learning in 2012, the Rectified Linear Unit (ReLU) has been widely used and chosen as the default activation function by the deep learning community. Although ReLU has been the community's favorite, its hard-zero property heavily hinders negative values from propagating through the network. Consequently, deep neural networks have not benefited from negative representations. Here, an activation function called Flatten-T Swish (FTS) that leverages the benefit of negative values is proposed.

To verify its performance, I evaluated FTS against ReLU and several recent activation functions. Each activation function was trained on the MNIST dataset using five different deep fully connected neural networks (DFNNs) with depths varying from five to eight layers. For a fair evaluation, all DFNNs use the same configuration settings. Based on the experimental results, FTS with a threshold value of T = -0.20 has the best overall performance. Compared with ReLU, FTS (T = -0.20) improves MNIST classification accuracy by 0.13%, 0.70%, 0.67%, 1.07% and 1.15% on the wider 5-layer, slimmer 5-layer, 6-layer, 7-layer and 8-layer DFNNs respectively. Apart from this, I also noticed that FTS converges twice as fast as ReLU. Although there are many other existing activation functions, I selected ReLU as the baseline activation function for comparison.

Here is how FTS (T=0.00) and ReLU look:

FTS (T=0.00) vs. ReLU

Mathematically, ReLU is defined as:

ReLU(x) = max(0, x), i.e. ReLU(x) = x for x ≥ 0 and 0 for x < 0
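A one-line NumPy version of this (my own illustration, not code from the paper):

```python
import numpy as np

def relu(x):
    # ReLU: keep positive inputs unchanged, clamp everything below zero to 0
    return np.maximum(0.0, x)
```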

To construct the FTS activation function, first amend the original ReLU function by multiplying its linear identity part (where x ≥ 0) by the Sigmoid activation function. The idea can be expressed simply as FTS(x) = ReLU(x) * Sigmoid(x), or:

FTS(x) = x / (1 + e^(-x)) for x ≥ 0, and 0 for x < 0 (FTS function, without threshold)
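As a quick sketch of this definition in NumPy (my own translation, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def fts_no_threshold(x):
    # FTS(x) = ReLU(x) * sigmoid(x):
    #   x * sigmoid(x) for x >= 0, exactly 0 for x < 0
    return np.maximum(0.0, x) * sigmoid(x)
```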

To tackle ReLU's hard-zero problem, particularly during forward propagation, a threshold value T is added to FTS so that negative values can flow through the network. The value of T is suggested to be set below zero so that the network benefits from representations in negative form. With T added, the function returns T for every input in the x < 0 domain. Ultimately, FTS with T added is expressed as follows:

FTS(x) = x / (1 + e^(-x)) + T for x ≥ 0, and T for x < 0 (FTS function with threshold T)

The illustration of FTS at T < 0
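A minimal NumPy sketch of the thresholded version (my own code; it assumes, as in the formula above, that the same shift T is applied to both branches so the function stays continuous at x = 0):

```python
import numpy as np

def fts(x, T=-0.20):
    # Thresholded Flatten-T Swish:
    #   x * sigmoid(x) + T   for x >= 0
    #   T                    for x <  0
    # T = -0.20 is the value that performed best in the experiments below.
    sig = 1.0 / (1.0 + np.exp(-x))
    return np.where(x >= 0, x * sig + T, T)
```

For example, fts(np.array([-2.0, 0.0, 2.0])) returns roughly [-0.20, -0.20, 1.56]: every negative input collapses to T, while positive inputs follow the Swish-like curve shifted by T.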

Let's see what the derivative of FTS looks like.

Derivative of ReLU vs. derivative of FTS
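To make this concrete: for x ≥ 0 the derivative of FTS is sigmoid(x) · (1 + x · (1 − sigmoid(x))) (the constant T contributes nothing), and for x < 0 it is exactly 0, just like ReLU. Here is a small sketch of my own that checks this analytic derivative against a finite-difference approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fts(x, T=-0.20):
    return np.where(x >= 0, x * sigmoid(x) + T, T)

def fts_grad(x):
    # d/dx [x * sigmoid(x) + T] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))  for x >= 0
    # For x < 0 the function is the constant T, so the derivative is 0.
    s = sigmoid(x)
    return np.where(x >= 0, s * (1.0 + x * (1.0 - s)), 0.0)

# Finite-difference sanity check at a few points away from the kink at x = 0
x = np.array([-3.0, -1.0, 0.5, 2.0, 5.0])
eps = 1e-5
numeric = (fts(x + eps) - fts(x - eps)) / (2 * eps)
print(np.allclose(fts_grad(x), numeric, atol=1e-4))  # True
```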

I ran FTS on five DFNNs with depths varying from 5 to 8 layers. Here are the descriptions of the models.

Network architectures of five different DFNNs
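The actual layer widths are listed in the architecture table above. Purely as an illustration (the hidden sizes below are placeholders I picked, not the paper's configuration), here is a PyTorch sketch of a deep fully connected MNIST classifier that uses FTS as its hidden activation:

```python
import torch
import torch.nn as nn

class FTS(nn.Module):
    """Flatten-T Swish activation with threshold T (default T = -0.20)."""
    def __init__(self, T=-0.20):
        super().__init__()
        self.T = T

    def forward(self, x):
        return torch.where(x >= 0, x * torch.sigmoid(x) + self.T,
                           torch.full_like(x, self.T))

def make_dfnn(depth=8, hidden=256, num_classes=10):
    # depth-1 hidden Linear layers, each followed by FTS, then a linear output layer.
    # 784 = flattened 28x28 MNIST image; the width 256 is illustrative only.
    layers, in_features = [], 784
    for _ in range(depth - 1):
        layers += [nn.Linear(in_features, hidden), FTS(T=-0.20)]
        in_features = hidden
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)

model = make_dfnn(depth=8)
logits = model(torch.randn(32, 784))  # a batch of 32 flattened images
```

Swapping FTS(T=-0.20) for nn.ReLU() in make_dfnn gives the ReLU baseline that FTS is compared against.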

FTS was also compared with other activation functions. The experiments show that FTS with T = -0.20 gives the best overall performance across all five DFNNs on MNIST classification. The results are as follows:

The test accuracy of FTS with varying T and of the existing activation functions. The "score" column counts how many times each FTS variant outperforms the best result obtained by the existing activation functions. Note: values marked with an asterisk (*) are the best results obtained by the existing activation functions, while values in bold are FTS results that outperform the asterisked result on the respective DFNN model.

Apart from that, I also observed that FTS (T = -0.20) converges about two times faster than ReLU.

The training and testing curves of FTS (T = -0.20) and ReLU on the DFNN-8 network

In conclusion, the FTS activation function adds a small threshold value T that allows negative representations to flow through the entire network, particularly during forward propagation. This property lets the network benefit from negative representations and leads to better predictive performance. Meanwhile, FTS retains the sparsity property during backpropagation, since its derivative returns zero for x < 0, which is an important element in reducing computational complexity. The experiments show that FTS, particularly with T = -0.20, consistently outperformed the other existing activation functions across all five DFNNs of various depths. Specifically, compared with the ReLU baseline, FTS (T = -0.20) improves MNIST classification accuracy by 0.13%, 0.70%, 0.67%, 1.07% and 1.15% on DFNN-5a, DFNN-5, DFNN-6, DFNN-7 and DFNN-8 respectively. In addition, the experiments also showed that FTS converges about two times faster than ReLU.

Publication:
Chieng, H. H., Wahid, N., Pauline, O., & Perla, S. R. K. (2018). Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning. International Journal of Advances in Intelligent Informatics, 4(2).

URL: http://www.ijain.org/index.php/IJAIN/article/view/249


Joshua Chieng

PhD candidate (Information Technology) @ Universiti Tun Hussein Onn Malaysia. Research area: Deep Learning & Activation Function.