Gulcehre’s noisy activation function

vc · Published in vclab · Aug 7, 2016

This post is part of an experiment studying the applicability of neural networks.

Saturation of activation functions

In deep learning, learning becomes harder when the activation function saturates: the gradient back-propagated through a saturated unit is close to zero, so the weights feeding that unit receive almost no useful update and training slows down. The solution proposed by Gulcehre et al. is to inject noise into the activation function so that units can escape the saturation regime more easily, with the amount of noise proportional to the magnitude of saturation of the nonlinearity. Keeping activations out of saturation preserves the gradient signal and helps stabilize learning.
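The post does not show how the noisy tanh activation ntanh used in the experiment below is defined; the authors' reference implementation is the noisy_units repository listed in the references. As a rough illustration only, here is a minimal, simplified sketch of the idea with the Keras backend: Gaussian noise whose magnitude is proportional to how far tanh is from its hard-saturating linearization. The paper's full formulation additionally learns a noise-scale parameter and biases the noise to push activations back toward the non-saturated regime, which this sketch omits.

from keras import backend as K

def ntanh( x ):
    # Simplified noisy tanh, loosely following Gulcehre et al. (not the paper's full formulation).
    # u(x) = hard tanh, the clipped first-order Taylor expansion of tanh around 0.
    hard = K.clip( x, -1., 1. )
    soft = K.tanh( x )
    # |tanh(x) - hardtanh(x)| is ~0 near the origin and grows as the unit saturates.
    sat = K.abs( soft - hard )
    # Gaussian noise scaled by the saturation measure, injected only at training time;
    # at test time the plain tanh is used.
    noisy = soft + sat * K.random_normal( K.shape( x ) )
    return K.in_train_phase( noisy, soft )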

Experiment results

A neural network with six hidden layers, using the noisy hyperbolic tangent as the activation function for every hidden layer except the last one, is trained with categorical cross-entropy as the objective function and Adam as the optimizer.

from keras.models import Sequential
from keras.layers import Dense

m = Sequential()
m.add( Dense( 512, input_dim=X.shape[ 1 ], init='glorot_normal', activation=ntanh ) )
m.add( Dense( 512, init='glorot_normal', activation=ntanh ) )
m.add( Dense( 512, init='glorot_normal', activation=ntanh ) )
m.add( Dense( 512, init='glorot_normal', activation=ntanh ) )
m.add( Dense( 512, init='glorot_normal', activation=ntanh ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( Y.shape[ 1 ], init='glorot_normal', activation='softmax' ) )
m.compile( loss='categorical_crossentropy', optimizer='adam' )

Another neural network with the same settings, but with the ordinary hyperbolic tangent as the activation function, is trained for comparison.

m = Sequential()
m.add( Dense( 512, input_dim=X.shape[ 1 ], init='glorot_normal', activation='tanh' ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( 512, init='glorot_normal', activation='tanh' ) )
m.add( Dense( Y.shape[ 1 ], init='glorot_normal', activation='softmax' ) )
m.compile( loss='categorical_crossentropy', optimizer='adam' )
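The training call itself is not shown in the post; presumably both models are fit on the same data along the lines of the sketch below. The epoch count and batch size here are placeholders, not values taken from the post.

# Hypothetical training call (Keras 1.x style, matching the init= usage above);
# the actual batch size and number of epochs are not stated in the post.
history = m.fit( X, Y, nb_epoch=50, batch_size=128, verbose=2 )
# history.history[ 'loss' ] records the training loss per epoch.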

Figure: training loss for the neural network with the noisy activation function (green) and the one without noise (blue).

Figure: 25th and 75th percentiles of the activation values in each hidden layer during training, for the network with the noisy activation function.

Figure: 25th and 75th percentiles of the activation values in each hidden layer during training, for the network with the ordinary hyperbolic tangent.
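The post does not show how these per-layer activation statistics were collected. One way to gather them is a Keras backend function that returns every hidden layer's output for a fixed probe batch, called periodically during training; the probe batch size and the once-per-epoch schedule below are assumptions, not details from the post.

import numpy as np
from keras import backend as K

# Hypothetical probe: evaluate every hidden layer's output on a fixed batch of inputs
# (here simply the first 1024 rows of X) in test phase, so no noise is injected.
probe = K.function( [ m.layers[ 0 ].input, K.learning_phase() ],
                    [ layer.output for layer in m.layers[ :-1 ] ] )
acts = probe( [ X[ :1024 ], 0 ] )
# Calling this at the end of every epoch and recording the 25th/75th percentiles
# of each layer's activations produces curves like the ones above.
percentiles = [ np.percentile( a, [ 25, 75 ] ) for a in acts ]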

References

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with gradient descent is difficult.” IEEE Transactions on Neural Networks 5.2 (1994): 157–166.

Gulcehre, Caglar, Marcin Moczulski, Misha Denil, and Yoshua Bengio. “Noisy Activation Functions.” Proceedings of The 33rd International Conference on Machine Learning. 2016.

Noisy units. https://github.com/caglar/noisy_units
