Jumping into hyperparameters

Andy Elmsley
The Sound of AI
5 min read · Apr 18, 2019

Welcome back, gradient descenders! Last time we introduced the backpropagation algorithm to train our neural networks. I strongly encourage you to read that post if you haven’t already, as this week we’ll pick up straight where we left off — by exploring probably the most sci-fi-sounding aspect of machine learning: hyperparameters.

Hyperparameters: almost as exciting as hyperspace?

A quick side note before we begin: from this week forward we’ll be publishing a new Code of AI post every two weeks instead of every week. This will hopefully give me enough time to deeply address some more complex concepts and models, and also give you enough time to digest them.

What are hyperparameters?

We left off last week with an exploration challenge — I asked you to tweak various parameters of our neural network and its training: the network size, the amount of training data and the learning rate. These parameters (and others) are generally referred to as hyperparameters. But what does that mean exactly, and what’s the difference between a hyperparameter and a normal parameter?

To understand this, let’s remember what we said in the last post about what we mean when we say that our machine learning model is learning: it’s a process of optimising a model by adjusting its parameters to minimise the error. With an ANN, the model is a bunch of neurons and connections, and the parameters are the connection weights. The weights are internal to the model: we don’t set them directly from the outside, and their values are determined by the training algorithm. These are called model parameters.

Hyperparameters are parameters that we do have ‘control’ over as machine learning programmers. Depending on the model you’re using, you’ll have a number of different hyperparameters to tweak, but for an MLP you only really have the ones mentioned above to play with.

Let’s now explore how these hyperparameters work together. For each one we’ll train ten different networks and use the average error as an indication of which one performed best. “Why train ten different networks?”, I hear you ask. Well, since each MLP is initialised with random weights, sometimes the network will take longer to find an optimal solution or, in some cases, never find a good solution at all. Running the training multiple times reduces the chance of this happening. (This idea is similar to a concept known as cross-validation which we’ll cover in a later Code of AI post.) For now, remember this: always train multiple models before you stick with a hyperparameter choice.
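That averaging step can be sketched as a small helper. Note that `mean_error`, `train_once` and the toy stand-in below are my placeholders, not the series’ actual code:

```python
import random

def mean_error(train_once, n_runs=10):
    """Train n_runs independently initialised models and return
    the average of their final errors."""
    errors = [train_once(seed) for seed in range(n_runs)]
    return sum(errors) / len(errors)

# Toy stand-in: 'training' that usually converges to a low error
# but occasionally gets stuck, like a randomly initialised MLP might.
def fake_train(seed):
    rng = random.Random(seed)
    return 0.01 if rng.random() < 0.8 else 0.25

print(mean_error(fake_train))
```

Because each run gets its own seed, the average smooths out the occasional unlucky initialisation rather than letting one bad run (or one lucky one) decide the hyperparameter comparison.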

Network size

The network size — the number of hidden layers and how many neurons there are in each layer — is a perfect example of a hyperparameter which has a direct impact on the model parameters. If a multilayer perceptron has one input, one output and one hidden layer with three nodes, we know that the network has six (1x3 + 3x1) trainable weights. However, if we add a second hidden layer of three neurons, we know that the network has 15 (1x3 + 3x3 + 3x1) trainable weights. As you add more nodes and more layers, the number of parameters can grow faster than you might expect!
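To make the counting concrete, here’s a tiny helper (the function name is mine, not from the series’ code) that tallies the connection weights between consecutive layers. It counts weights only, matching the tally above, and ignores any bias terms:

```python
def count_weights(layer_sizes):
    """Number of trainable connection weights in a fully connected
    MLP, ignoring biases: the product of each pair of adjacent
    layer sizes, summed."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(count_weights([1, 3, 1]))     # 1x3 + 3x1 = 6
print(count_weights([1, 3, 3, 1]))  # 1x3 + 3x3 + 3x1 = 15
```

Because each term is a product of adjacent layer widths, widening two neighbouring layers grows the weight count multiplicatively, which is why parameter counts balloon so quickly.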

In this next code snippet we’ll explore a number of different network sizes and report the error for each size.
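The original snippet isn’t embedded here, but a minimal sketch of the experiment might look like the following. The MLP implementation, the XOR-style toy dataset and the hyperparameter values are all assumptions on my part — the series’ own code may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, Ws, bs):
    """Forward pass; returns the list of layer activations."""
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts

def train_mlp(hidden_sizes, X, y, lr=0.5, epochs=3000, seed=0):
    """Train a sigmoid MLP with plain batch backpropagation and
    return the final mean squared error."""
    rng = np.random.default_rng(seed)
    sizes = [X.shape[1], *hidden_sizes, y.shape[1]]
    Ws = [rng.normal(0.0, 1.0, (a, b)) for a, b in zip(sizes, sizes[1:])]
    bs = [np.zeros((1, b)) for b in sizes[1:]]
    for _ in range(epochs):
        acts = forward(X, Ws, bs)
        delta = (acts[-1] - y) * acts[-1] * (1.0 - acts[-1])
        for i in range(len(Ws) - 1, -1, -1):
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0, keepdims=True)
            if i > 0:  # propagate the error before updating Ws[i]
                delta = (delta @ Ws[i].T) * acts[i] * (1.0 - acts[i])
            Ws[i] -= lr * grad_W
            bs[i] -= lr * grad_b
    return float(np.mean((forward(X, Ws, bs)[-1] - y) ** 2))

if __name__ == "__main__":
    # XOR as a stand-in toy dataset (an assumption)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    for hidden in [(3,), (3, 3), (8, 8)]:
        sizes = [2, *hidden, 1]
        n_weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
        errs = [train_mlp(hidden, X, y, seed=s) for s in range(10)]
        print(f"hidden={hidden}  weights={n_weights}  "
              f"avg error={np.mean(errs):.4f}")
```

Each network size is trained ten times with different seeds, and the averaged error is what gets compared across configurations.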

The results will look something like this. The first number refers to the hidden layers; the second is the number of model parameters and the last number is the average error.

As you can see, for our toy dataset the network size doesn’t really make too much of a difference. There is an important lesson here, though: sometimes a simpler architecture is better. Or, to put it another way, you should optimise the size of your network to be the smallest network possible that achieves your goal. You can see this in the final example, where the error is actually bigger for the largest network. Adding more layers and nodes increases the complexity of the network, which not only makes it more difficult to train but also increases the RAM and CPU requirements of the model. Is one more layer really worth that 0.001 improvement? You be the judge — but only if you’re 99.999% sure.

Amount of training data

One of the easiest hyperparameters to play with is the amount of data fed into the model during training. Simple problems like the often-used XOR function have complete datasets of every possible input and output pair, but most of the time this isn’t feasible. Instead, we have to work with a sample of data and hope that the ML model will generalise well after training.

The next snippet of code explores a range of dataset sizes and shows the results.
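That code isn’t embedded here either, so as an illustrative stand-in (the dataset, the model and the sizes below are my assumptions, not the series’ actual experiment), here’s a sweep over training-set sizes for a simple model trained by gradient descent, scored on held-out data:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples of a simple linear target -- a stand-in for
    the series' toy dataset."""
    X = rng.uniform(-1.0, 1.0, (n, 1))
    y = 0.5 * X + 0.3 + rng.normal(0.0, 0.05, (n, 1))
    return X, y

def train_linear(X, y, lr=0.1, epochs=500):
    """Fit y = w*x + b by gradient descent on the MSE."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = w * X + b - y
        w -= lr * 2.0 * float(np.mean(err * X))
        b -= lr * 2.0 * float(np.mean(err))
    return w, b

X_test, y_test = make_data(500)  # held-out data to measure generalisation
for n in [4, 16, 64, 256]:
    X_train, y_train = make_data(n)
    w, b = train_linear(X_train, y_train)
    test_mse = float(np.mean((w * X_test + b - y_test) ** 2))
    print(f"n={n:4d}  test MSE={test_mse:.4f}")
```

Measuring error on a held-out test set, rather than on the training data itself, is what reveals whether the extra data actually improved generalisation.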

Even in this simple example, the trend is clear: more data = better results. I’ve lost count of the number of times I’ve told a client, “We need more data”, or read in a research paper’s conclusion that “collecting more data may yield better results” (I think I’m guilty of that one myself).

Adding more data helps the model learn in two ways. First, it helps avoid the problem of overfitting, which is where the model doesn’t generalise well to new inputs — we’ll talk more about overfitting in the next Code of AI post. Second, it gives the model more backpropagation runs to do in one epoch, since every training example triggers a weight update. This means that the more data you add, the more training you’re giving the model per epoch.

Learning rate

The learning rate is the one and only hyperparameter for our implementation of backpropagation gradient descent. There are more complex gradient descent trainers and optimisers available that have more parameters to play with, but learning rate is a simple input that controls how much to follow the gradient and adjust the weights when learning. The general rule of thumb is that bigger learning rates will learn faster, but may step over the optimal points, whereas lower learning rates will learn more slowly but may get stuck in local optima.

ML guru Andrew Ng has a great rule of thumb for how to go about optimising hyperparameters such as learning rate, called the rule of 3. Essentially, you pick a number to start with (usually 1 or 0.1 for learning rate) and then try multiplying it or dividing it by three, rounding where it makes sense, until you find the optimum. This is the approach we’ve taken in the code below:
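Since the embedded code isn’t reproduced here, the sketch below applies the same divide-by-three search to a deliberately simple stand-in problem — minimising a one-dimensional quadratic — so the effect of the learning rate is easy to see. The function and the starting values are illustrative assumptions:

```python
def loss_after_training(lr, steps=100):
    """Minimise f(w) = (w - 3)^2 from w = 0 by gradient descent
    and report the final loss -- a stand-in for a full training run."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - 3.0)  # gradient of f is 2*(w - 3)
    return (w - 3.0) ** 2

# Rule-of-3 search: start at 1 and keep dividing by three
lr = 1.0
for _ in range(6):
    print(f"lr={lr:.4f}  final loss={loss_after_training(lr):.6f}")
    lr /= 3.0
```

On this toy problem, lr = 1 overshoots the minimum and oscillates forever, values around 1/3 converge quickly, and much smaller rates leave the optimiser still far from the minimum after 100 steps — the same too-big/too-small trade-off described above.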

The learning rate is an interesting hyperparameter to explore because it can greatly affect your model’s performance. Here we can see that the optimal learning rate seems to be around 0.3, but this will depend strongly on the network size and the complexity of the problem. Now, if only there was a learning rate hyperparameter for human brains…

Hyperparameters optimised

That’s where we’ll leave it for this week. We’ve introduced the concept of hyperparameters, and explored three hyperparameters associated with our MLPs and gradient descent training.

Next time, we’ll finally be able to train a model on some real-world data. We’ll also discuss some of the perils of overfitting and how to mitigate them with cross-validation.

As always, you can find the source code for all the above examples on our GitHub.

To begin your AI-coding training at day one, go here. And give us a follow to receive updates on our latest posts.

Founder & CTO @melodrivemusic. AI video game music platform. Tech leader, programmer, musician, generative artist and speaker.