Six Tools to Improve Your ANN (Part 2/2)

Nikita Volzhin
5 min read · Jun 17, 2024


Intro

In the previous article, I wrote about how to improve the performance of your multilayer perceptron model using various tools. Here is the link if you have not read it yet:

There I briefly explained the problem of unstable gradients, how to choose the right activation function and initialization to address it, and showed how to use a normalization layer. This is the second part, where I will explain three more tools to improve your model, namely

  • Optimizer
  • Learning rate scheduling
  • Hyperparameter tuning

and show how to use them in practice with the Keras library on the TensorFlow backend.

Theory

Imagine a golf ball rolling across a field. Eventually, it will stop in some pit or, preferably, in a golf hole. Optimizing the weights works in a similar way: the field is the surface of the loss function, the position of the ball is the current set of weights, and the optimizer is the set of physical laws describing the movement of the ball. On the internet, I found this amazing GIF showing this graphically:

The optimizer dictates how the weights should change at each training step. As in golf, where the ultimate goal is to get the ball into the hole, the goal of the optimizer is to converge to the lowest loss, that is, the optimal solution, as quickly as possible. Describing each optimizer mathematically would take a lot of time, so if you are interested, you can check out this Medium article:
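To make the optimizer's job concrete, here is a minimal toy sketch of a single update step (my own example, not code from this series): the optimizer looks at the gradient of the loss with respect to a weight and nudges the weight downhill.

import tensorflow as tf

w = tf.Variable(5.0)                                    # a single toy "weight"
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    loss = (w - 2.0) ** 2                               # toy loss, minimum at w = 2

# gradient descent step: w <- w - lr * dloss/dw = 5.0 - 0.1 * 2 * (5.0 - 2.0) = 4.4
grads = tape.gradient(loss, [w])
optimizer.apply_gradients(zip(grads, [w]))
print(w.numpy())  # ≈ 4.4

Real optimizers differ in how they use the gradient (momentum, adaptive per-parameter learning rates, and so on), but the update loop looks the same.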

In general, you should try adaptive optimizers first, as they perform best on most tasks: try Nadam or Adam first, and if you are not satisfied with their performance, fall back to classical SGD.
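For quick reference, these optimizers are created like this in Keras (the learning rates below are just the Keras defaults for Nadam and Adam and an illustrative setting for SGD, not tuned recommendations):

from tensorflow import keras

# adaptive optimizers to try first
nadam = keras.optimizers.Nadam(learning_rate=0.001)
adam = keras.optimizers.Adam(learning_rate=0.001)

# classical SGD (here with momentum) as a fallback
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)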

Another hyperparameter to tune is the learning rate. A low learning rate results in slow convergence, while a high learning rate may prevent the model from converging at all. However, keeping the learning rate constant throughout training may be suboptimal; instead, you can schedule it! For example, you set a high initial learning rate and schedule it to decrease as training proceeds. Returning to the golf-ball analogy, the ball then quickly moves to the area around the hole and slows down to avoid jumping over it.

I graphed some of the learning rate scheduling algorithms below.

In general, exponential scheduling works fine, and moreover, it is already implemented in tf.keras, so you don't have to define it yourself!
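If you want to reproduce a comparison like the graph above, here is a minimal sketch (my own, with illustrative and untuned parameters) that evaluates a few of the built-in schedules from tf.keras.optimizers.schedules and plots them:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

steps = np.arange(0, 40001, 100)

# a few built-in schedules with illustrative parameters
schedules = {
    "exponential": tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.1, decay_steps=10000, decay_rate=0.8),
    "polynomial": tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=0.1, decay_steps=40000, end_learning_rate=0.01),
    "piecewise constant": tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=[10000, 20000, 30000], values=[0.1, 0.05, 0.025, 0.0125]),
}

# each schedule is callable: it maps a training step to a learning rate
for name, schedule in schedules.items():
    plt.plot(steps, [schedule(s).numpy() for s in steps], label=name)
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.legend()
plt.show()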

In this article, I will also cover random search for hyperparameter tuning, but the best way to understand it is to see it in practice, so let's code!

Practice

As usual, you can find my full code in the GitHub repository.

Here I will continue improving the model we used in the previous article, so the setting is the same: we are using the Fashion MNIST dataset to predict the type of clothing from its image. Let's begin with the optimizer. You specify the optimizer when compiling the model. There are two ways to do this:

model.compile(..., optimizer='sgd', ...)
model.compile(..., optimizer=keras.optimizers.SGD(), ...)

These two lines produce identical results; however, in the second case, you can specify some hyperparameters of the optimizer, such as the learning rate. For example, you can set it to a number:

model.compile(..., optimizer=keras.optimizers.SGD(learning_rate=0.001), ...)

Or you can provide a learning rate schedule, which we just discussed. To do this, we must first define the schedule, for instance:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    0.1,    # initial_learning_rate
    10000,  # decay_steps
    0.8)    # decay_rate

I selected ExponentialDecay; however, you can find many more options in the TensorFlow API. In this case, you should provide the initial learning rate, the decay steps, and the decay rate (see the formula below).

Exponential decay formula: η = η₀ · r^(t / s), where η is the learning rate, η₀ is the initial learning rate, r is the decay rate, t is the training step, and s is the decay steps.
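As a quick sanity check (this snippet is mine, not from the original code), you can call the schedule directly with a step number and confirm that it follows the formula:

# with eta_0 = 0.1, r = 0.8, s = 10000:
print(lr_schedule(0).numpy())      # ≈ 0.1   (= 0.1 * 0.8**0)
print(lr_schedule(10000).numpy())  # ≈ 0.08  (= 0.1 * 0.8**1)
print(lr_schedule(20000).numpy())  # ≈ 0.064 (= 0.1 * 0.8**2)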

After we have defined a schedule, we just pass it as the learning rate parameter to the optimizer, for example:

model.compile(..., optimizer=keras.optimizers.SGD(learning_rate=lr_schedule), ...)

We are almost there. Now let's select the number of hidden layers and the number of neurons in each layer. Intuition is one way to go, but it is much smarter to use hyperparameter tuning. For this, you need to install the keras_tuner library if you do not already have it:

%pip install -q -U keras-tuner

Then we have to define a function that takes a keras_tuner.HyperParameters object and uses it to define the hyperparameters and their ranges of possible values. For example:

def build_model(hp):
    # number of hidden layers
    n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
    # number of neurons in each hidden layer
    n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
    # initial learning rate, sampled on a log scale
    learning_rate_initial = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2,
                                     sampling="log")

    # learning rate schedule
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        learning_rate_initial,
        10000,
        0.8)

    optimizer = keras.optimizers.Nadam(learning_rate=lr_schedule)

    model = tf.keras.Sequential()
    model.add(norm_layer)
    model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="selu",
                                        kernel_initializer="lecun_normal"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    return model

Now you can start the search with these two commands:

random_search_tuner = kt.RandomSearch(
    build_model, objective="val_accuracy", max_trials=10, overwrite=True,
    directory="my_fashion_mnist", project_name="my_rnd_search", seed=42)

random_search_tuner.search(X_train, y_train, epochs=10,
                           validation_data=(X_valid, y_valid))

These commands start the search: it randomly selects hyperparameters and trains a model with them for 10 epochs. After repeating this 10 times, the search saves the best models in terms of validation accuracy in the directory my_fashion_mnist/my_rnd_search, which it creates automatically. After running this, we can access the best hyperparameters:

top_params = random_search_tuner.get_best_hyperparameters(num_trials=1)
top_params[0].values

In my case, those are {'n_hidden': 6, 'n_neurons': 235, 'learning_rate': 0.00044412650444693207},

and we can access the best model without having to train it again:

best_model = random_search_tuner.get_best_models(num_models=1)[0]

Done! Now we are ready to estimate the final accuracy:

best_model.evaluate(X_test, y_test)

I got a result of 0.884. If you remember, in the previous article we started from 0.778, so we have improved the accuracy by almost 11 percentage points!

Summary

In these two articles, I showed six cool tools that can improve the quality of your model and prevent the unstable gradient problem: activation function, initialization, normalization layer, learning rate scheduling, optimizer, and hyperparameter tuning. Make sure to get familiar with them and play around. There are many more useful tools that can increase the accuracy of your ANN even further, such as overfitting prevention techniques, about which I will write a separate article. Continue exploring!

P.S. Love data, science, and data science!
