Six Tools to Improve Your ANN (Part 1/2)

Nikita Volzhin
6 min read · Jun 17, 2024


Intro

Neural networks are a very powerful set of tools in machine learning. On the one hand, they provide huge flexibility and can solve numerous tasks; on the other hand, one often feels lost in this pile of hyperparameters, different layer types, and many other things to tune. I will talk about 6 instruments to improve your ANN, which every ML engineer must know, and show you how to use them in practice using Keras with the TensorFlow backend. Specifically, I will cover:

  • Selection of the activation function (this article)
  • Initialization (this article)
  • Normalization layer (this article)
  • Learning rate scheduling (next article)
  • Optimizer (next article)
  • Hyperparameter tuning (next article)

A Bit of Theory

The first three techniques listed above tackle the vanishing/exploding gradient problem. What is this problem about? Due to the way the gradient is calculated, it becomes too small or too large in the early layers of the network (the ones closer to the input layer). The weights of the neural network are updated during training by subtracting the learning rate multiplied by the gradient of the loss function with respect to the weights (at least in gradient descent). Thus, with a vanishing/exploding gradient the weights are either barely or drastically changed, both of which prevent the model from converging to the optimal solution.

Gradient descent update: Φ ← Φ − η ∇Φ J(Φ), where Φ is the matrix of weights, η is the learning rate, and ∇Φ J(Φ) is the gradient of the loss function with respect to the weights.

Researchers have shown that if the variance of the gradient stays the same after flowing through any layer, the problem of the unstable gradient may be solved. Although it sounds hard, it can be achieved effortlessly by using the right initialization. Reminder: initialization is the function that randomly creates the initial weights of the layers. There are research papers investigating which initialization works well with which activation function. To avoid diving into the (exciting) mathematical foundations, I will just give you the standard pairings below:

  • Glorot (Xavier) initialization — no activation, tanh, sigmoid (logistic), softmax
  • He initialization — ReLU and its variants (Leaky ReLU, ELU, etc.)
  • LeCun initialization — SELU

All these activation functions might seem overwhelming, but in reality, most of them are slightly modified copies of each other. ReLU (rectified linear unit) has been very popular for a long time because, in comparison with sigmoid or tanh, it reduces the risk of a vanishing gradient. However, it has some issues too. Namely, if it gets a negative number as input, the output of the ReLU will be just 0. In other words, some parts of your ANN may simply die, constantly outputting zeros. So Leaky ReLU, ELU, and SELU were created to fix this: for inputs less than 0, they output a small negative value. (I graphed these functions below.)

ReLU, ELU, Leaky ReLU, and SELU activation functions
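
For reference, here are the definitions of these functions (α is a small positive constant, and SELU uses fixed values of α and λ from the original paper):

  • ReLU(x) = max(0, x)
  • Leaky ReLU(x) = x if x > 0, else αx (α is a small slope, e.g. 0.01)
  • ELU(x) = x if x > 0, else α(eˣ − 1) (α is typically 1)
  • SELU(x) = λ · ELU(x) with fixed α ≈ 1.673 and λ ≈ 1.051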

In general, we can sort them by their performance in the following way:

SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic

NB: SELU is basically ELU, but SELU also self-normalizes the network, keeping the standard deviation of each layer's output equal to 1 and the mean equal to 0. That is great for avoiding the vanishing gradient problem, BUT for it to work, you must standardize the inputs and must not use batch normalization, any regularization, or Dropout layers (although you can use Alpha Dropout instead). So to make life easier, you can just keep calm and use ELU.
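
To illustrate, here is a minimal sketch of what such a self-normalizing block could look like in Keras (the layer sizes and dropout rate are arbitrary, chosen only for the example):

from tensorflow import keras

# with SELU, Dense layers should use LeCun initialization,
# and regular Dropout is replaced with AlphaDropout
self_normalizing_block = keras.models.Sequential([
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.AlphaDropout(rate=0.1),
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal')
])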

Also in this article, I will show how to add a normalization layer right into the model. It is no secret that the inputs to an ANN should be scaled, and preferably normalized. This can be done in the preprocessing step, but then in production one will have to normalize the inputs again before feeding data into the model to get predictions. I will show you how to avoid this by including a normalization layer directly in the model.

Practice

How about we start with a default ANN and then see how the performance changes after applying all of the above-mentioned strategies? I will provide the most important parts of the code, but you can find the whole .ipynb file in my GitHub repository:

Default setting

I will use the Keras Fashion MNIST dataset, which contains 70,000 28×28 pictures of garments in 10 different categories.

A 28×28 picture from the Fashion MNIST dataset
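
For context, here is roughly how the training, validation, and test sets can be obtained (a sketch assuming the standard keras.datasets loader and an arbitrary 5,000-image validation split; the exact code is in the notebook):

from tensorflow import keras

# load Fashion MNIST: 60,000 training + 10,000 test images
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# carve a validation set out of the training data
X_valid, y_valid = X_train_full[:5000], y_train_full[:5000]
X_train, y_train = X_train_full[5000:], y_train_full[5000:]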

The task is to predict the label of the garment (there are 10 types of them) based on its picture, which is a multiclass classification task. Each pixel in the pictures is encoded as a number from 0 to 255 depending on its darkness, so I divide everything by 255 to scale the values to the range from 0 to 1:

X_test = X_test / 255.0
X_train = X_train / 255.0
X_valid = X_valid / 255.0

Next, I create a simple ANN, compile it, and train it on this dataset.

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation='sigmoid'),
    keras.layers.Dense(100, activation='sigmoid'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10)

Note that here I provide validation data to measure the model’s performance on it after each epoch. This gives a clearer picture of the learning process and helps catch overfitting. Next, I evaluate the network on the test set:

model.evaluate(X_test, y_test)

And get an accuracy of 0.778. Sad, so let’s see how to do better.

Modifications

First of all, simply scaling the input to the range from 0 to 1 is not worth a penny! For the best performance, the input to the neural network should be normalized (i.e., have a mean of 0 and a standard deviation of 1, preferably with a roughly normal distribution). Luckily, there is a layer in Keras that can do this for you:

norm_layer = keras.layers.Normalization()

It will measure the mean and standard deviation of the data it is given and standardize every value based on them. Needless to say, using only the first batch for this would not be the best practice, as the first batch may not be representative; instead, we adapt this layer to the whole training set.

NB: we must adapt the layer before fitting the model!

model2 = keras.models.Sequential([
    norm_layer,
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation='sigmoid'),
    keras.layers.Dense(100, activation='sigmoid'),
    keras.layers.Dense(10, activation='softmax')
])

model2.compile(loss='sparse_categorical_crossentropy',
               optimizer='sgd',
               metrics=['accuracy'])

# adapt the normalization layer to the training set
norm_layer.adapt(X_train)

model2.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10)

After training the new model, let’s check the accuracy:

model2.evaluate(X_test, y_test)

I got an accuracy of 0.830, which seems more promising. But I know we can do better. Let’s get rid of this outdated sigmoid activation and set the right initializer (NB: by default, Keras uses Glorot initialization with a uniform distribution).

Earlier I said we could just use ELU and enjoy life, but since the inputs are normalized and so far we have not used any regularization/dropout, we can safely use SELU, and hence LeCun initialization:

model3 = keras.models.Sequential([
    norm_layer,
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.Dense(10, activation='softmax')
])

Compiling, training, and testing this model now gives an accuracy of 0.874! Almost 10 percentage points more than at the beginning!
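
For completeness, the compile, fit, and evaluate calls are a sketch assuming the same settings as for the previous models:

model3.compile(loss='sparse_categorical_crossentropy',
               optimizer='sgd',
               metrics=['accuracy'])

model3.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10)

# evaluate on the test set
model3.evaluate(X_test, y_test)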

To be continued…

We will try to boost this result even more in the next article using learning rate scheduling, the right optimizers, and hyperparameter tuning. Here is the link to it ;)

So take a short break and then go check it out as well! For now, thank you for reading, I hope it was helpful!
