Data Scientist Interview Guide: Understanding Activation Functions

What are activation functions? When to use which one? How to implement each one in neural networks?

Karun Thankachan
CodeX
16 min read · May 3, 2023



A common set of questions in data science and machine learning interviews focuses on model architecture and design choices. In this post we cover one such design choice: activation functions. What are they, when should you use which one, what are the advantages and disadvantages of each, and how do you implement them in TensorFlow?

Activation Functions

Activation functions are mathematical functions used in neural networks to introduce non-linearity into the model. In the following image, the function ‘g’ would be the activation function.

Fig. 1. Image credits to stanford.edu

We need activation functions because a stack of purely linear transformations of the input (i.e. ‘g’ being the identity function, f(x) = x) in a neural network produces a linear output, regardless of the number of layers or neurons in the model. This limits the model’s ability to learn complex patterns in the data. For example, in a binary classification problem, a model without non-linearity can only fit a linear decision boundary between the two classes, as shown in Fig. 2 (left side).
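As a quick sanity check, here is a minimal NumPy sketch (with arbitrary random weights) showing that two stacked linear layers collapse into a single linear layer:

import numpy as np

# Two linear layers with no activation in between...
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)
x = np.random.randn(3)

two_layers = W2 @ (W1 @ x + b1) + b2
# ...are equivalent to one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, collapsed))  # True: depth adds nothing without non-linearity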

However, by adding non-linearity through activation functions, the model is able to learn more complex, non-linear decision boundaries that fit more complex patterns in the data. For example, it can learn decision boundaries that are curved or otherwise more complex than a straight line, as shown in Fig. 2 (right side).

Fig.2. Image credit to Research Gate

Now, let’s look at some of the most popular activation functions, their advantages and disadvantages, and how to use them.

Sigmoid

The sigmoid activation function is a mathematical function that takes any input value and maps it to a value between 0 and 1. It is defined as:

f(x) = 1 / (1 + e^-x)

Where e is Euler’s number (2.71828…) and x is the input value.

Sigmoid activation. Image credit to PyTorch

The sigmoid function is commonly used in binary classification problems, where the goal is to predict a binary output, such as yes or no, 0 or 1, true or false. The output of the sigmoid function can be interpreted as the probability of the input belonging to the positive class.
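To make the formula concrete, here is a small NumPy sketch of the sigmoid on a few arbitrary inputs:

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))
# approximately [0.0067, 0.5, 0.9933]: every input is mapped into (0, 1)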

Advantages
1. The output of the sigmoid function is always between 0 and 1, which can be interpreted as a probability.
2. It is differentiable, which means it can be used in gradient-based optimization algorithms, such as stochastic gradient descent.

Disadvantages
1. The output of the sigmoid function saturates and becomes flat when the input is very large or very small, which causes the gradients to become very small and can effectively stall optimization, a problem known as the vanishing gradient problem.
2. The output of the sigmoid function is not zero-centered, which can cause convergence problems in some neural network architectures.

The sigmoid activation function is best used in situations where the output of the neural network needs to be interpreted as a probability, such as binary classification problems. Here is an example of how the sigmoid activation function can be used as part of a neural network in TensorFlow.

import tensorflow as tf

# Define a neural network with a single hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(784,), activation='sigmoid'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model with binary crossentropy loss and the Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model on some data
model.fit(x_train, y_train, epochs=10, batch_size=32)

For further documentation on how to use the sigmoid activation function in TensorFlow, you can refer to the official TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/activations/sigmoid

ReLU (Rectified Linear Unit)

ReLU (Rectified Linear Unit) is the most commonly used activation function in deep neural networks. It maps any negative input to zero and passes any positive input through unchanged. Mathematically, ReLU is defined as:

f(x) = max(0, x)

where x is the input to the activation function, and f(x) is the output. When x is negative, f(x) will be zero, and when x is positive, f(x) will be x itself.
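A quick NumPy illustration of this definition, using arbitrary sample inputs:

import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))
# [0. 0. 0. 3.]: negative inputs are clipped to zero, positive inputs pass through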

ReLU activation. Image credit to PyTorch

Advantages
1. ReLU is computationally efficient and easy to implement
2. It helps to avoid the vanishing gradient problem, which can occur when using other activation functions
3. ReLU has been shown to be effective in deep learning models, achieving state-of-the-art results in many applications

Disadvantages
1. When the input is negative, the output is always zero, which can lead to the “dead neuron” problem, where the neuron stops learning and no longer contributes to the model’s performance
2. ReLU is not a smooth function, which can cause some optimization algorithms to fail

ReLU is a good choice for hidden layers in deep neural networks and it can also be used in the output layer for regression problems where the output range is not restricted.

Here is an example of how the ReLU activation function can be used as part of a neural network in TensorFlow.

import tensorflow as tf

# Define a neural network with a single hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model with sparse categorical crossentropy loss
# (10-class softmax output; integer class labels assumed) and the Adam optimizer
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model on some data
model.fit(x_train, y_train, epochs=10, batch_size=32)

The above code creates a neural network with a dense layer of 32 neurons using the ReLU activation function and an output layer of 10 neurons using the softmax activation function (discussed later). For more information on using ReLU in TensorFlow, you can refer to the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/ReLU

Tanh (Hyperbolic Tangent)

The Tanh (Hyperbolic Tangent) activation function is commonly used in hidden layers. It maps any input value to a value between -1 and 1, which makes it useful for tasks that require outputs bounded between -1 and 1. The Tanh function is defined as:

f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Here, exp(x) represents the exponential function, which is e raised to the power of x.
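A small NumPy check of this formula on a few arbitrary inputs shows the bounded, zero-centered output:

import numpy as np

def tanh(x):
    # f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(tanh(np.array([-3.0, 0.0, 3.0])))
# approximately [-0.995, 0.0, 0.995]: bounded in (-1, 1) and centered at zero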

Tanh activation. Image credit to PyTorch

Advantages
1. Its output is zero-centered and its gradients are steeper than the sigmoid’s, which can make it more effective in deeper neural networks.
2. It produces negative outputs, which can be useful in some cases. For example, in neural networks used for image classification the input data is often normalized to have zero mean and unit variance, so the input values can be both positive and negative. The Tanh function can be used in the hidden layers of such networks to help keep the activations in a similar range as the input data.

Disadvantages
1. It is prone to the vanishing gradient problem when used in very deep neural networks.
2. It can lead to bias shift and overfitting

Note: What is bias shift? Since ReLU produces non-zero output only for positive inputs, its activations tend toward the positive side. If, in addition, the layer biases are initialized so that most of them are positive, the model is influenced by this net positive signal and can overfit to niche patterns in the data that cause it.

The Tanh activation function can be used in situations where the output of the neural network needs to be bounded between -1 and 1 (e.g. image classification or sentiment analysis) or when the hidden layer activation output needs to be kept normalized.

Here is an example of how to use the Tanh activation function in a neural network using TensorFlow.

import tensorflow as tf

# Define a neural network with two hidden layers
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(32, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model with sparse categorical crossentropy loss
# (10-class softmax output; integer class labels assumed) and the Adam optimizer
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model on some data
model.fit(x_train, y_train, epochs=10, batch_size=32)

In this example, the `Dense` layers of the neural network are using the Tanh activation function. The `softmax` activation function is used in the output layer for multi-class classification. Further documentation on the use of activation functions in TensorFlow can be found in the TensorFlow documentation — https://www.tensorflow.org/api_docs/python/tf/keras/activations

Leaky ReLU

Leaky ReLU (Rectified Linear Unit) is a variation of the ReLU activation function that overcomes the “dying ReLU” problem, where the neuron can become inactive during training and not recover. The Leaky ReLU allows a small, non-zero gradient when the input is negative, which helps to prevent the neuron from dying. The function is defined as follows:

f(x) = x if x > 0 else ax
where `a` is a small positive slope for the negative part of the function, typically set to 0.01.
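Here is a short NumPy sketch of this definition, assuming the typical slope a = 0.01 and arbitrary sample inputs:

import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = x if x > 0 else a * x
    return np.where(x > 0, x, a * x)

print(leaky_relu(np.array([-10.0, -1.0, 2.0])))
# approximately [-0.1, -0.01, 2.0]: negative inputs keep a small, non-zero signal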

LeakyReLU Activation. Image credit to PyTorch

Advantages
1. It avoids the “dying ReLU” problem, where the gradient of the neuron can become zero during training and the neuron will stop updating. The small negative slope allows the neuron to have a non-zero gradient, even for negative inputs.
2. It combats bias shift, since neurons are allowed to pass small negative signals to the output.

Disadvantages
1. The negative slope is a hyperparameter that needs to be tuned, which can add complexity to the model.
2. While Leaky ReLU keeps gradients alive for negative inputs, its output is unbounded, so activations can still grow very large for large positive inputs, which can make deep networks with many layers harder to train.

Leaky ReLU should be used when there is a risk of the “dying ReLU” problem, i.e. when many neurons in the network can receive negative inputs, or when negative inputs should still contribute a small signal. Here’s an example of how to use the Leaky ReLU activation function as part of a neural network in TensorFlow:

import tensorflow as tf

# Define the model architecture
# (tf.nn.leaky_relu uses a default negative slope alpha=0.2; wrap it in a lambda,
#  e.g. lambda x: tf.nn.leaky_relu(x, alpha=0.01), to use a different slope)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation=tf.nn.leaky_relu, input_shape=(input_dim,)),
    tf.keras.layers.Dense(64, activation=tf.nn.leaky_relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_x, train_y, epochs=10, validation_data=(val_x, val_y))

In this example, we’ve used the tf.nn.leaky_relu function as the activation function for the first two dense layers of the neural network. The input_shape argument in the first layer specifies the shape of the input data to the model. For further documentation on using the Leaky ReLU activation function in TensorFlow, you can refer to the official TensorFlow documentation — https://www.tensorflow.org/api_docs/python/tf/nn/leaky_relu

Softplus

The Softplus activation function is a smooth and continuous function that is a variation of the ReLU activation function. It maps any input value to a value between 0 and infinity. The math behind the Softplus function is:

f(x) = log(1 + exp(x))

where x is the input to the function.
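A short NumPy sketch of the formula on arbitrary inputs (TensorFlow’s built-in tf.math.softplus computes the same function in a numerically stable way):

import numpy as np

def softplus(x):
    # f(x) = log(1 + exp(x))
    return np.log(1.0 + np.exp(x))

print(softplus(np.array([-4.0, 0.0, 4.0])))
# approximately [0.018, 0.693, 4.018]: a smooth curve that approaches ReLU for large |x|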

Softplus activation. Image credit to PyTorch

Advantages
1. It has a range of output values between 0 and infinity, which can be useful in some cases.
2. It is smooth and differentiable everywhere (its derivative is the sigmoid function), unlike ReLU.

Disadvantages
1. It is not zero-centered, which can cause problems with convergence based on the neural network architectures.
2. It can be sensitive to the initial values of the weights in the network, which can affect the training process.

The Softplus activation function is best used in the hidden layers of neural networks, where its smoothness can be beneficial.

Here’s an example of how the Softplus activation function can be used in a neural network in TensorFlow:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='softplus'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In this example, the `Dense` layer with 64 neurons uses the Softplus activation function, while the output layer uses the Softmax activation function for multiclass classification.

For further documentation on using the Softplus activation function in TensorFlow, you can refer to the following link: https://www.tensorflow.org/api_docs/python/tf/keras/activations/softplus

ELU (Exponential Linear Unit)

The ELU (Exponential Linear Unit) is a smooth and continuous function that allows negative values. It is defined as:

f(x) = x if x > 0 else alpha * (exp(x) - 1)

where alpha is a hyperparameter that controls the value of the function for negative inputs. A common value for alpha is 1.0.
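A brief NumPy sketch of this definition with alpha = 1.0, on arbitrary inputs:

import numpy as np

def elu(x, alpha=1.0):
    # f(x) = x if x > 0 else alpha * (exp(x) - 1)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 2.0])))
# approximately [-0.993, -0.632, 2.0]: negative outputs saturate smoothly towards -alpha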

ELU activation. Image credit to PyTorch

Advantages
1. ELU can help to reduce the bias shift and avoid overfitting in neural networks.
2. It has been shown to outperform other activation functions like ReLU and its variants in some cases, such as regression (where the output should take negative values) or with imbalanced data (ELU can help mitigate the vanishing gradient problem when some inputs have very large positive or negative values).
3. It is a smooth and continuous function, which can help the convergence of gradient-based optimization algorithms.
4. ELU can help to avoid the dead neuron problem that can occur with ReLU activation function.

Disadvantages
1. The exponential function used in the function can be computationally expensive.
2. The value of alpha needs to be carefully chosen to balance the advantages of the function.

ELU activation function can be used in situations where the ReLU activation function and its variants are not performing well, such as in deep neural networks with many layers. It can also be used in any neural network architecture that requires a smooth and continuous activation function. Here’s an example of how ELU activation function can be used in a neural network built using TensorFlow:

import tensorflow as tf

# Define the neural network architecture.
# tf.nn.elu implements f(x) = x if x > 0 else alpha * (exp(x) - 1) with alpha = 1;
# the string 'elu' can be passed as the activation instead, with the same effect.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation=tf.nn.elu),
    tf.keras.layers.Dense(32, activation=tf.nn.elu),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Further documentation on how to use ELU activation function in TensorFlow can be found in the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/activations/elu

GELU (Gaussian Error Linear Unit)

The GELU (Gaussian Error Linear Unit) activation function is a smooth, non-linear function that is used in deep learning models. It is defined as:

`GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))`

where `erf` is the error function.
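As a quick numerical check, here is a small sketch of the formula using Python’s math.erf on a few arbitrary inputs:

import math

def gelu(x):
    # GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print([round(gelu(v), 4) for v in (-2.0, 0.0, 2.0)])
# approximately [-0.0455, 0.0, 1.9545]: small negative inputs are damped rather than cut to zero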

GELU activation. Image credit to PyTorch

Advantages
1. It has been shown to perform well in deep neural networks, especially in natural language processing (NLP) tasks.
2. It is computationally efficient and can be easily implemented in neural network architectures.

Disadvantages
1. It may not perform as well in image recognition tasks compared to ReLU and its variants.
2. It may not be as stable as other activation functions, especially when using large learning rates.

The GELU activation function can be used in situations where you have deep neural networks, and you are dealing with NLP tasks. For example, it can be used in text classification, sentiment analysis, and language translation tasks.

Here’s an example of how you can use the GELU activation function as part of a neural network in TensorFlow:

import tensorflow as tf

# Define the GELU activation function
# (recent TensorFlow versions also provide this built in as tf.keras.activations.gelu)
def gelu(x):
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
    return x * cdf

# Define a simple neural network with the GELU activation function
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation=gelu),
    tf.keras.layers.Dense(32, activation=gelu),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Here is a link to the TensorFlow documentation on the GELU activation function: https://www.tensorflow.org/api_docs/python/tf/keras/activations/gelu

Softmax

The softmax activation function is widely used in the output layer of neural networks for multiclass classification problems. It maps a vector of input values to a probability distribution over multiple classes, ensuring that the predicted probabilities sum to 1, which is a requirement in multiclass classification.

The math behind SoftMax can be described as follows: given a vector of inputs z = [z1, z2, …, zn], the SoftMax function computes the probability p_i of the i-th class as follows:

p_i = exp(z_i) / sum(exp(z_j)), for i = 1, 2, …, n

Here, exp(z_i) is the exponential function of z_i, which ensures that the probabilities are positive, and the denominator is the sum of exponential functions of all input values. The output of SoftMax is a vector of probabilities [p1, p2, …, pn] that adds up to 1.
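Here is a minimal NumPy sketch of this computation. Subtracting the maximum before exponentiating is a standard numerical-stability trick (not part of the formula above) that does not change the result:

import numpy as np

def softmax(z):
    # p_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting z by its maximum
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())
# approximately [0.659, 0.242, 0.099], summing to 1.0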

Advantages
1. It is useful in multiclass classification problems, where the goal is to predict the probability distribution over multiple classes.
2. It ensures that the predicted probabilities sum to 1, as required for a probability distribution, and it produces a smooth distribution that is easy to interpret.

Disadvantages
1. It can be sensitive to outliers in the input data, which can affect the predicted probabilities.
2. It assumes that the output categories are mutually exclusive, meaning that each input belongs to exactly one category. If the categories are not mutually exclusive (a multi-label problem), the softmax function is not appropriate, and a per-class sigmoid output is usually more suitable.

The best situation in which SoftMax should be used is when the output of the neural network is a probability distribution over multiple classes, and the goal is to predict the most probable class for a given input.

Here’s an example of how to use SoftMax activation function in the output layer of a neural network in TensorFlow:

import tensorflow as tf

# Define the neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In this example, we define a neural network model with two dense layers. The last layer has 10 output nodes, which correspond to the 10 classes in our multiclass classification problem. We set the activation function of the last layer to SoftMax to ensure that the output is a probability distribution over the 10 classes.

For further documentation on how to use SoftMax in TensorFlow, please refer to the official TensorFlow documentation:
https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax

Swish

Swish is a smooth activation function in which the input is multiplied by a sigmoid of itself, blending identity-like (ReLU-like) and sigmoid behaviour. It has been shown to outperform both sigmoid and ReLU in terms of training and generalization performance in some research papers. (Note: the performance of different activation functions can vary widely depending on the specific neural network architecture, input data, and other factors, so it is important to carefully evaluate different activation functions in each specific application.)

The math behind the Swish activation function is

f(x) = x * sigmoid(beta * x),

where x is the input to the activation function, and beta is a trainable parameter that controls the shape of the function. The sigmoid function scales the input to a value between 0 and 1, and the multiplication with the input value x results in a smooth, non-linear function.
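The TensorFlow example further below fixes beta = 1, which matches the built-in Keras swish activation. If you want beta to be learned during training, one option is a small custom layer; the following is only a sketch (the class name SwishBeta is made up for illustration, not a Keras API):

import tensorflow as tf

class SwishBeta(tf.keras.layers.Layer):
    # f(x) = x * sigmoid(beta * x) with a single trainable beta, initialized to 1
    def build(self, input_shape):
        self.beta = self.add_weight(name="beta", shape=(), initializer="ones", trainable=True)

    def call(self, x):
        return x * tf.sigmoid(self.beta * x)

# Usage sketch: drop it in like any other layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64),
    SwishBeta(),
    tf.keras.layers.Dense(10, activation='softmax')
])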

Swish activation. Image credits to lazyprogrammer

Advantages
1. It is computationally efficient and has a simple mathematical formulation.
2. It has been shown to outperform the ReLU and sigmoid activation functions in some cases.
3. It is smooth and has a non-monotonic property, which can help to prevent overfitting.

Disadvantages
1. It is a relatively new activation function and has not been extensively studied.
2. The value of the beta parameter can have a significant impact on the performance of the activation function, and it may require some tuning.

Swish may be particularly useful in deep neural networks as an alternative to other activation functions. (Note: research on Swish is still relatively young, so stay tuned!)

Here’s an example of how to use the Swish activation function in a neural network using TensorFlow:

import tensorflow as tf

# Define the Swish activation function with beta fixed at 1
# (equivalent to TensorFlow's built-in 'swish' activation)
def swish(x):
    return x * tf.sigmoid(x)

# Define a simple neural network with the Swish activation function in the hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation=swish),
    tf.keras.layers.Dense(10, activation='softmax')
])

Here’s a link to the official TensorFlow documentation on activation functions, including the Swish activation function: https://www.tensorflow.org/api_docs/python/tf/keras/activations/swish

Hardswish

Hardswish is similar to the Swish activation function but has been designed to be more computationally efficient. Hardswish is defined as follows:

f(x) = max(0, min(x + 3, 6)) * x / 6

In this function, x is the input and f(x) is the output. First, 3 is added to x, then the result is clipped to the range [0, 6] (this is ReLU6 applied to x + 3), and finally the clipped value is multiplied by x and divided by 6.
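A short NumPy sketch of this formula on arbitrary inputs:

import numpy as np

def hardswish(x):
    # f(x) = x * clip(x + 3, 0, 6) / 6, a piecewise stand-in for Swish
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

print(hardswish(np.array([-4.0, -1.0, 0.0, 4.0])))
# approximately [0.0, -0.333, 0.0, 4.0]: zero below -3, identity above +3, a quadratic ramp in between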

Hardswish activation. Image credit to PyTorch

Advantages
1. Hardswish is faster to compute than Swish because it replaces the sigmoid with a simple piecewise-linear expression.
2. Hardswish is easy to implement and does not require any additional parameters or tuning.

Disadvantages
1. Hardswish is not as expressive as some other activation functions, such as ELU and SELU.
2. Hardswish is less common than some other activation functions, so there may be less documentation and support available for it.

Hardswish is most useful in situations where computational efficiency is a primary concern, such as applications with limited computational resources.

Here is an example of how to use Hardswish as part of a neural network in TensorFlow:

import tensorflow as tf

# Hardswish written out from the formula above using tf.nn.relu6;
# some newer Keras releases also ship a built-in hard-swish activation,
# but defining it explicitly keeps the example portable
def hardswish(x):
    return x * tf.nn.relu6(x + 3.0) / 6.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation=hardswish),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

For further documentation on using Hardswish in TensorFlow, you can refer to the following link: https://www.tensorflow.org/api_docs/python/tf/nn/hardswish

Conclusion

This post introduced what activation functions are and why they are used in neural networks. We covered 10 of the most common activation functions and when to use them. It’s best to start from the suggestions in this post and experiment with suitable activation functions to find the best model architecture for your specific solution.

Credits: This post was written with help from ChatGPT. Check out some of the prompts I used below

You are an experience data scientist with years of practical experience building and tuning machine learning models. Suggest the 10 most useful activation function that someone new to data science should know about, to help build effective models.
Dont recommend function just because they are used in textbook or docs. Recommend functions that are actually used in code.

Explain XYZ activation function in simple term. Include all the math behind it.
Explain the advantage and disadvantages and list them as bullet point.
Explain the best situation in which this activation function should be used.
Show how it can be used as part of neural network in TensorFlow. Link to further documentation in TensorFlow as well
