GELU activation

Shaurya Goel
2 min read · Jul 21, 2019



Activations like ReLU, ELU and PReLU have enabled neural networks to converge faster and to better solutions than sigmoid activations did.

Also, Dropout regularizes the model by randomly multiplying a few activations by 0.

Both of the above methods together decide a neuron’s output. Yet, the two work independently from each other. GELU aims to combine them.

Also, a new RNN regularizer called Zoneout stochastically multiplies the input by 1.

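We want to merge all three functionalities by stochastically multiplying the input by 0 or 1, while still obtaining the output value of the activation function deterministically. Concretely, the input x is multiplied by a mask m ~ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X ~ N(0, 1).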

We use the standard normal distribution here because neuron inputs tend to follow a normal distribution, especially after Batch Normalization.
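The stochastic step described above can be sketched in a few lines. This is a minimal illustration, assuming the mask is drawn as m ~ Bernoulli(Φ(x)) with Φ the standard normal CDF, as in the GELU paper. The names `std_normal_cdf` and `stochastic_gate` are hypothetical and chosen only for this example.

```python
import math
import random

def std_normal_cdf(x):
    """Phi(x) = P(X <= x) for X ~ N(0, 1), computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stochastic_gate(x, rng):
    """Hypothetical helper: keep x with probability Phi(x), otherwise output 0."""
    keep = rng.random() < std_normal_cdf(x)  # mask m = 1 with probability Phi(x)
    return x if keep else 0.0

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [stochastic_gate(1.0, rng) for _ in range(5)]  # each entry is 1.0 or 0.0
```

Large positive inputs are almost always kept (like Zoneout's multiply by 1), while large negative inputs are almost always zeroed (like Dropout's multiply by 0).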

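But the output of any activation function should be deterministic, not stochastic. So, we take the expected value of our transformation. With m denoting the random 0/1 mask, which equals 1 with probability Φ(x):

E[m·x] = Φ(x)·1·x + (1 − Φ(x))·0·x = x·Φ(x)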
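As a sanity check on the expectation, one can average many stochastic outputs and compare the result with x·Φ(x). The sketch below assumes the mask is Bernoulli(Φ(x)) as in the GELU paper. The helper name `monte_carlo_mean`, the sample count and the seed are arbitrary choices made for illustration.

```python
import math
import random

def std_normal_cdf(x):
    """Phi(x) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def monte_carlo_mean(x, n_samples=100_000, seed=0):
    """Average of m * x over n_samples draws of m ~ Bernoulli(Phi(x))."""
    rng = random.Random(seed)
    p = std_normal_cdf(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < p)
    return x * kept / n_samples

x = 1.0
estimate = monte_carlo_mean(x)      # empirical mean of the stochastic output
expected = x * std_normal_cdf(x)    # closed form: x * Phi(x)
print(f"estimate={estimate:.4f}  closed form={expected:.4f}")
```

The two numbers should agree up to sampling noise, which is why the deterministic activation can stand in for the stochastic one.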

Since Φ(x) is the cumulative distribution function of the Gaussian distribution and is often computed with the error function, we define the Gaussian Error Linear Unit (GELU) as:

GELU(x) = x·Φ(x) = 0.5·x·(1 + erf(x/√2))
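A minimal sketch of the definition in plain Python, using only the standard library. The exact form writes Φ(x) with `math.erf`. The second function is the tanh-based approximation given in the GELU paper, included here as an addition to this article. The function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi expressed through the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU from the GELU paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  exact={gelu(x):+.5f}  tanh approx={gelu_tanh(x):+.5f}")
```

Deep learning frameworks usually ship a vectorized version of this function, so a scalar implementation like this is mainly useful for checking values by hand.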

Figure: GELU (μ=0, σ=1), ReLU and ELU (α=1)
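To reproduce the comparison numerically, the three activations can be evaluated at a few points. This is a rough sketch using the standard definitions of ReLU and ELU (with α=1) and the erf form of GELU. The sample points are arbitrary.

```python
import math

def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    """GELU with mu=0, sigma=1: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}  elu={elu(x):+.4f}")
```

Unlike ReLU, GELU dips slightly below zero for negative inputs before returning toward zero, and it approaches the identity for large positive inputs.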

