Neural Network Hyperparameter Optimization with Hyperopt

Icaro
Mar 6, 2024


A while back I wrote about using Machine Learning to predict if my favorite soccer team, Arsenal, would ever win the Premiership again. We created a Neural Network that would predict results of future matches and we trained it using a Kaggle dataset of all the Premiership matches ever played up to March 2020.

One of the results of going through this process was that our neural network overfitted to the training set, and one of the follow-up items I needed to look into was how I could make our model perform better. One way to do that is to try out different combinations of Hyperparameters so we can increase our model's accuracy. So, in this article we are going to explore Hyperparameter Optimization using one of the most popular libraries available, Hyperopt.

Our focus will be on the most practical aspects of Hyperparameter Optimization, namely: how do we produce code that allows us to do this? I will assume that you are familiar with Neural Networks and how to train them, so I will only provide very basic definitions when needed. I won’t bother to go over all the math behind this process; there are many online articles that explain it way better than I can 😂

Hyperparameters

Hyperparameters are parameters that control how the Machine Learning model will learn. Think of them as the top level parameters a Data Scientist needs to choose before training the model in order to maximize the performance of said model. Performance, in this case, means maximizing or minimizing a metric. A couple of the most common metrics we focus on are:

  • Accuracy: we try to maximize accuracy so the model can predict the desired outcome as accurately as possible.
  • Loss: we try to minimize loss. Loss is the penalty for a bad prediction so the lower the loss, the better our model is in predicting the desired outcomes.

We restrict our exploration to a Neural Network since that’s the model we have created for this exercise.

Neural Network Hyperparameters

Let’s start with just a basic view of a neural network, as shown in Figure 1 below.

Figure 1. Sample Neural Network Architecture

Basically, anything that we choose before training could be thought of as a Hyperparameter. However, a few of them are the most common, and those are the ones we will focus on optimizing. For a neural network those are:

  • Number of Neurons per hidden layer: usually a neural network will have one or more hidden layers and each layer will have a number of neurons as shown in Figure 1 above.
  • Dropout Rate: The Dropout rate instructs the model to randomly drop a few nodes during training and is used to reduce overfitting, which is the problem we are trying to solve. The dropout rate is, in general, a small number between 0.2 and 0.8.
  • Activation Function: The activation function calculates the output of a node based on the node’s inputs and weights. There are many modern choices that we can use (ReLU, Sigmoid, Softmax, etc.).
  • Optimization Algorithm: The optimization algorithm is used to change the attributes of the neural network such as its weights and learning rate. In practice, you will usually select an optimizer and a learning rate. Some of the most common optimizers used are Adam, Adagrad, Adamax, Stochastic Gradient Descent, etc.
  • Learning Rate: The learning rate controls by how much the weights are updated during training. In practice, you choose a small number for the learning rate, somewhere between 0.0 and 1.0.
  • Epochs: how many epochs should we use to train our model? How many are too many? 100? 50?

Another Hyperparameter that we can optimize is the number of hidden layers the neural network can have. We are not optimizing that one in this iteration but will include it in a future post.
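To make these concrete, here is a minimal sketch of where each of those hyperparameters shows up when defining a network in Keras (this assumes tf.keras, which is what we use later in this post; the values are placeholders, not recommendations):

from tensorflow import keras
from tensorflow.keras import layers

# Placeholder values, just to show where each hyperparameter plugs in
neurons_layer_one, neurons_layer_two = 64, 32   # neurons per hidden layer
dropout_rate = 0.5                              # dropout rate
activation = 'relu'                             # activation function
learning_rate = 0.01                            # learning rate
n_epochs = 50                                   # epochs

model = keras.Sequential([
    layers.Dense(neurons_layer_one, activation=activation, input_shape=(124,)),
    layers.Dropout(dropout_rate),
    layers.Dense(neurons_layer_two, activation=activation),
    layers.Dropout(dropout_rate),
    layers.Dense(3, activation='softmax'),
])

# Optimization algorithm + learning rate
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy', metrics=['accuracy'])

# The number of epochs shows up when we actually train:
# model.fit(X_train, y_train, epochs=n_epochs)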

In all the cases above, we, as Data Scientists, can make educated choices as to what would be an initial value we can use for each Hyperparameter. There are many rules of thumb that we can start with. For example, a well-known rule of thumb for the number of neurons in a hidden layer is that it should be between the size of the input layer and the size of the output layer. To this I would say: thanks for nothing 😂. In my case, the input layer has 124 features and the output layer has 3 values (home team wins, away team wins or there’s a tie). So we are left with a value between 3 and 124. There are still a lot of values there.

Now let’s add those values plus the possible value combinations for our other hyperparameters, and we end up with this simplified version of our hyperparameter space:

Parameter Space = 122 (suggested number of neurons for the first hidden layer) x 122 (suggested number of neurons for the second hidden layer) x 9 (activation function choices: relu, sigmoid, softplus, softsign, tanh, selu, elu, exponential, softmax) x 7 (optimizer choices: SGD, Adam, RMSprop, Adagrad, Adamax, Nadam, Ftrl) x 91 (number of epochs, we assume a 10–100 universe in this case) x 7 (dropout rate values: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8) x 10 (learning rate values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) = 5,973,098,040 possible combinations!
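A quick sanity check of that arithmetic in Python:

# Rough size of the simplified hyperparameter space described above
neurons_layer_one = 122   # 3..124 inclusive
neurons_layer_two = 122
activations = 9           # relu, sigmoid, softplus, softsign, tanh, selu, elu, exponential, softmax
optimizers = 7            # SGD, Adam, RMSprop, Adagrad, Adamax, Nadam, Ftrl
epochs = 91               # 10..100 inclusive
dropout_rates = 7         # 0.2 .. 0.8 in steps of 0.1
learning_rates = 10       # 0.1 .. 1.0 in steps of 0.1

total = (neurons_layer_one * neurons_layer_two * activations * optimizers
         * epochs * dropout_rates * learning_rates)
print(f"{total:,}")       # 5,973,098,040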

Good luck trying to do that by hand! 😁 Of course, we can follow some rules of thumb to narrow down our universe… or we can write code that will find the best parameters for our use case, and that is what we are going to do here.

Neural Network Hyperparameter Optimization Choices

There is a long list of articles that you can find online about the top 3 choices we have and the math behind these algorithms. Since we concentrate on the practical aspects of getting code to work, we will just briefly mention the available choices and why we selected one over the others.

In general there are 3 choices for algorithms to use for hyperparameter tuning:

  • Grid Search: tries every combination of all the hyperparameters and their values, calculates the performance of each combination and selects the best one. This is a very time-consuming and expensive process since it will try every possible value combination (see the sketch after this list).
  • Randomized Search: Similar to Grid Search but instead of trying every possible combination, it tests a randomly selected sample of combinations.
  • Bayesian Optimization: the main difference here is that the previous 2 algorithms do not take into account the results of earlier trials. Bayesian Optimization does, by looking at the scores of the previous rounds, so it does not pick the next set of values at random but optimizes that choice. In practice, this means Bayesian Optimization will most likely find the best hyperparameter choices faster than the other 2 algorithms.
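To see why Grid Search gets expensive so quickly, here is a naive sketch of what it does under the hood. The grid here is deliberately tiny and hypothetical, and build_and_score is a stand-in for "train a model with these values and return its accuracy":

from itertools import product
import random

def build_and_score(params):
    # Stand-in for training and evaluating one model with these hyperparameters.
    # In the real case this would build, train and score the neural network.
    return random.random()

grid = {
    'neurons': [16, 32, 64],
    'dropout': [0.2, 0.5],
    'learning_rate': [0.001, 0.01, 0.1],
}

best_score, best_combo = float('-inf'), None
for combo in product(*grid.values()):      # every possible combination: 3 x 2 x 3 = 18 models
    params = dict(zip(grid.keys(), combo))
    score = build_and_score(params)
    if score > best_score:
        best_score, best_combo = score, params

print(best_combo, best_score)

With the roughly 6 billion combinations we estimated above, this loop is obviously a non-starter, which is why we need something smarter.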

I chose Bayesian Optimization and now we can take a look at the code necessary to make this happen for a Neural Network.

Bayesian Optimization Libraries and Hyperopt

There are a few Python library choices that implement Bayesian Optimization. In a future post, we will go over my experience using the 3 most popular ones for trying to find the best hyperparameters for my neural network. In this post, I will show you the code for Hyperopt, one of the 3 most popular choices.

Hyperopt Code

These are the steps we have to perform in order to use Hyperopt:

  • Install Hyperopt
! pip install hyperopt
  • Import all the necessary libraries
import hyperopt
from hyperopt import hp, fmin, tpe, Trials, space_eval, STATUS_OK
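The objective function in the next step also assumes the usual TensorFlow/Keras pieces are imported. This is my setup (assuming tf.keras); adjust it to match yours:

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adamax, Nadam, Ftrl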
  • Define an objective function: The objective function includes the initial version of the neural network as shown below.
def objective_fun(params):
    # Defining the initial version of the neural network.
    # hp.uniform returns floats, so the integer-valued parameters are cast with int().
    model = Sequential()
    model.add(layers.Flatten())
    model.add(layers.Dense(int(params['hiddenLayerOne']),
                           activation=params['activation'], input_dim=124))
    model.add(layers.Dropout(params['dropout']))
    model.add(layers.Dense(int(params['hiddenLayerTwo']),
                           activation=params['activation']))
    model.add(layers.Dropout(params['dropout']))
    model.add(layers.Dense(3, activation='softmax'))

    model.compile(optimizer=params['optimizer'](learning_rate=params['learning_rate']),
                  loss='categorical_crossentropy', metrics=['accuracy'])

    input_shape = X_train.shape
    model.build(input_shape)

    # Stop training early if the validation loss stops improving
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)

    model.fit(X_train, y_train,
              validation_data=(X_val, y_val),
              epochs=int(params['epochs']),
              callbacks=[es])

    # fmin minimizes, so we return the negative of the validation accuracy as the loss
    score, acc = model.evaluate(X_val, y_val, verbose=0)
    print('Validation accuracy:', acc)
    return {'loss': -acc,
            'status': STATUS_OK,
            'model': model,
            'params': params}
  • Define your parameter space:
param_space = {
    "activation": hp.choice("activation", ['relu', 'sigmoid', 'softplus', 'softsign', 'tanh',
                                           'selu', 'elu', 'exponential', 'softmax']),
    "optimizer": hp.choice("optimizer", [SGD, Adam, RMSprop, Adagrad, Adamax, Nadam, Ftrl]),
    "learning_rate": hp.uniform("learning_rate", 0.001, 1),
    "epochs": hp.uniform("epochs", 10, 100),
    "hiddenLayerOne": hp.uniform("hiddenLayerOne", 10, 100),
    "hiddenLayerTwo": hp.uniform("hiddenLayerTwo", 10, 100),
    "dropout": hp.choice("dropout", [0.1, 0.4, 0.6])
}
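One thing to keep in mind: hp.uniform samples floats, which is why the objective function above casts the neuron counts and the number of epochs to int. If you prefer, hyperopt’s hp.quniform can sample those parameters on an integer-spaced grid instead (the values still come back as floats, so the cast is still a good idea):

from hyperopt import hp

# Alternative sketch: sample integer-spaced values with hp.quniform(label, low, high, q)
int_space = {
    "epochs": hp.quniform("epochs", 10, 100, 1),
    "hiddenLayerOne": hp.quniform("hiddenLayerOne", 10, 100, 1),
    "hiddenLayerTwo": hp.quniform("hiddenLayerTwo", 10, 100, 1),
}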

A few things to note from the 2 code snippets above:

  • Our neural network has 2 hidden layers and the hyperparameters that Hyperopt will help us to optimize are: activation function, optimizer, learning_rate, number of epochs, number of neurons on each hidden layer and dropout rate.
  • Possible activation, optimizer and dropout values are each in an array so Hyperopt will use the values in the array for optimization.
  • The number of neurons for each hidden layer is between 10 and 100.
  • The number of epochs is also between 10 and 100.
  • The learning rate is a number between 0.001 and 1, so Hyperopt will try values in that range.
  • We are optimizing for accuracy, but since fmin minimizes its objective, the objective function returns the negative of the accuracy as its loss.
  • We will pass the parameters to the objective function later but notice how the neural network in the objective function expects parameters, i.e.,
model.add(layers.Dropout(params['dropout']))

where ‘dropout’ is the parameter that Hyperopt will pass to the objective function.

  • These parameters and their values are defined in param_space.
  • Now that we defined the objective function and the parameter space we initialize the Trials object:
trials = Trials()

The Trials object keeps a history of every iteration/trial tried and we can use it at the end for a variety of things. For now, we just initialize it as shown above.

  • Now we are ready to start the process. We kick it off with the following code snippet:
best_params = fmin(
    fn=objective_fun,
    space=param_space,
    algo=tpe.suggest,
    max_evals=200,
    trials=trials)

The fmin function takes the following parameters:

  • our objective function: we defined the objective function above.
  • our parameter space: which fmin will pass to our objective function.
  • algo: The algorithm to use for optimization. In this case we are using tpe.suggest. TPE stands for Tree-structured Parzen Estimator and uses Bayesian Optimization.
  • max_evals: the maximum number of models to try.
  • trials: the trials object will save statistics about all of the models we tried, and it can be accessed once the process is done (see the sketch below).
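Since our objective function stores the trained model in the dictionary it returns, the Trials object gives us quite a lot to look at once fmin finishes. A small sketch of the kind of inspection it allows:

# After fmin finishes, the Trials object holds the full optimization history
print(len(trials.trials))                    # number of trials actually run
print(trials.losses()[:5])                   # loss (here, -accuracy) of the first few trials
print(trials.best_trial['result']['loss'])   # best loss found

# Works because our objective function returns the model in its result dictionary
best_model = trials.best_trial['result']['model']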

Figure 3 below shows some of the output while fmin is running:

Figure 3. Partial output of running Hyperopt’s fmin function

Once it is done you should get output like the one shown in Figure 4 below.

Figure 4. Final output of running Hyperopt’s fmin function

The output is telling us it ran the 200 trials, each trial took around 3 seconds, and the best accuracy was 64%. Great, now what? Figure 5 below shows how to print the best Hyperparameter values Hyperopt found.

Figure 5. Printing the best Hyperparameter values Hyperopt found
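The lookup itself relies on hyperopt’s space_eval helper, which we imported at the beginning: fmin returns indices for hp.choice parameters, and space_eval resolves them against param_space into the actual values. A minimal version of that lookup:

from hyperopt import space_eval

# Resolve the raw values / choice indices returned by fmin into actual hyperparameter values
best_hyperparams = space_eval(param_space, best_params)
print(best_hyperparams)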

Excellent, now we have the best values Hyperopt found. The next question is: what do we do next? Well, looking online, a lot of the Hyperopt tutorials end right here. They tell you how to print the best Hyperparameters found and ‘use them’ in your neural network, but they don’t tell you how.

We aim to remedy that by using these parameters in the following way: we will re-create the neural network using these values and train it again using the training set. I’m sure there are other ways of doing it, but this was the fastest way in my case. So the code to create the ‘best’ version of our neural network is below:

final_model = Sequential()
final_model.add(layers.Flatten())
final_model.add(layers.Dense(23, activation='elu', input_dim=124))
final_model.add(layers.Dropout(0.4))
final_model.add(layers.Dense(13, activation='elu'))
final_model.add(layers.Dropout(0.4))
final_model.add(layers.Dense(3, activation='softmax'))

optimizer = Adam(learning_rate=0.0093)

final_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

Notice how the Neural Network uses the Hyperparameter values found by Hyperopt. Now we train it using our training set as shown below:

history = final_model.fit(X_train, y_train, epochs=87)

The tail end of the training output is shown in Figure 6 below and shows our model has 62% accuracy.

Figure 6. Final output after training our final model with the training set

We can plot accuracy and loss with the code shown below to see how they changed while the model was being trained.

import matplotlib.pyplot as plt

# accuracy history
plt.plot(history.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

#loss history
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

and the accuracy and loss graphs are shown in Figures 7 and 8 below.

Figure 7. Model Accuracy Chart
Figure 8. Model Loss Chart

Finally we can evaluate our model against the test set. If the model shows a big drop in accuracy we overfitted and have to find a way to reduce that. However, that should not be the case since we used Hyperparameter Optimization to avoid that. Evaluating against our test set produces the results shown in Figure 9 below.

Figure 9. Evaluating our Neural Network Accuracy using the test set
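The evaluation itself is a single Keras call (this assumes X_test and y_test hold our held-out test set):

# Evaluate the final model on data it has never seen during training or tuning
test_loss, test_acc = final_model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', test_loss, 'Test accuracy:', test_acc)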

Our test set accuracy (61%) is very similar to the accuracy we got when training the model with the training set (62%), so we can say we do not see overfitting in this case, yay! Just for comparison, in my previous iteration, without using Hyperparameter Optimization, our neural network model had 97% accuracy on the training set and 58% accuracy on the testing set, a huge drop and a clear indication of overfitting.

This step completes the process of using Hyperopt to optimize our neural network’s Hyperparameters. If we continue using this neural network the next step is to actually make predictions with this version of the model.

Next Steps regarding Accuracy

What else can we do to improve our accuracy? Well, a couple of things:

  • Hidden layers as a hyperparameter: Notice the number of hidden layers is fixed in our neural network. We can make it a hyperparameter and let Hyperopt find the optimal number of hidden layers.
  • Get more data: Our dataset ends in 2020 and we are in 2024. We are missing several seasons’ worth of results. Having these results would probably help us.

Conclusion

We looked at a full end-to-end case of how to use Hyperopt and Bayesian Optimization to find the best Hyperparameters for our neural network. Hyperopt makes the whole process much easier than doing this using some manual process, don’t you think? 😁

In a future post I will compare my experience using Hyperopt, BayesOpt and skopt for Hyperparameter Optimization. Also, in a future post I will show the whole process of using the English Premiership games results Kaggle dataset and Hyperparameter Optimization to predict if Arsenal will win the Premiership in the 2023–2024 season.

Thanks for reading and clap below, if you feel inclined to do so.

Feel free to connect with me on LinkedIn and reach out with any questions you might have.
