Grow Neural Nets — Part 1: Intro to Hyperparameters

Start building your own neural net and evolve it step by step into advanced machine learning models

Grow neural networks step by step

In my previous post, Learning Neural Nets, I introduced how to start learning neural nets, one of the most important concepts in modern machine learning. Once you start understanding how neural nets work, you will start to wonder how advanced machine learning models such as Recurrent Neural Nets (RNNs), Convolutional Neural Nets (CNNs), or Generative Adversarial Networks (GANs) are built.

Imagine building your own advanced machine learning models from a minimal neural net. Isn't that cool? The purpose of this "Grow Neural Networks" series is to guide you through the journey of evolving that seed neural net into more sophisticated models.

In this particular entry, I would like to introduce the hyperparameters commonly used in neural networks and how to add them.

Excited? Let's start!

Hyperparameters vs Parameters

Before diving into hyperparameters in depth, it is necessary to distinguish hyperparameters from parameters. In short, hyperparameters are fixed before training to tune the training model itself; parameters are adjusted dynamically during training, within the model.

In neural net training, the weights of the synapses are the parameters. Remember, during training, what we are updating are the weights of the synapses. More generally, machine learning is the process of adjusting parameters, given sample inputs and outputs, so that the resulting model can predict an output from a new input.

What we are going to take a closer look at is hyperparameters: the variables of a machine learning model that we tune before training. Hyperparameters impact the efficiency of the training process and the accuracy of the resulting model. Understanding hyperparameters allows you to build more complex, customised neural nets and unleash their further potential.
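
To make the distinction concrete, here is a minimal sketch of a one-weight linear model trained with gradient descent; the names learning_rate, n_epochs, and w are purely illustrative, not from any particular library:

    import numpy as np

    # Hyperparameters: fixed before training starts
    learning_rate = 0.1
    n_epochs = 500

    # Toy data: y is roughly 2 * x
    x = np.linspace(0, 1, 50)
    y = 2 * x + np.random.normal(0, 0.05, 50)

    # Parameter: adjusted by the training loop itself
    w = 0.0
    for _ in range(n_epochs):
        pred = w * x
        grad = np.mean(2 * (pred - y) * x)   # gradient of the mean squared error w.r.t. w
        w -= learning_rate * grad            # the parameter moves; the hyperparameters never do

    print(w)  # drifts toward 2 as training proceeds

The weight w is learned from the data; learning_rate and n_epochs are knobs we chose before the loop ever ran.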

Now, what are hyperparameters and how do they work?

First Look: Hyperparameters in Action

Getting a feel for things is an effective way to assimilate new concepts. So, before diving into the trenches of machine learning theory, I would like you to play around with TensorFlow's Neural Network Playground and have fun with it.

TensorFlow's Playground

Yeah! I understand your excitement from just watching what happens while a neural net learns in this beautiful GUI. With the right tuning, it can solve complex problems (did you try solving the spiral?). Now, I would love you to take a closer look at what you can tune here. These are all fairly interesting and effective ideas for tuning neural nets, and your wisdom and creativity come into play with them. Let's dive in!

Table of Contents of Hyperparameters

In a neural net model, tuning the hyperparameters makes a significant difference in performance and accuracy. Understanding them deeply is the key to becoming a neural nets wizard. Moving quickly for now, I am going to briefly introduce each hyperparameter we saw in the simulator. We are going to implement each of them later, so do not worry if you do not understand everything at this point.

I separated the hyperparameters into three categories:

  1. Neural Nets Structure: the most intuitive hyperparams, the ones we can directly observe in the model. Generally speaking, more layers and nodes make the model more powerful, but of course over-complexity has its drawbacks.
  2. Numeric Operations: not obvious, but the definitive secret sauce of sophisticated neural nets. You have plenty of hyperparams to choose from in this category. They may be hard to grasp at the beginning, but it is worth sticking with them.
  3. Dataset Preconditioning: the ground truth of any machine learning process is the data. Without preparing the data well, the model cannot succeed.

Neural Nets Structure

  • Choice/Tweak of Features : X1, X2, X1², X2², X1X2, sin(X1), sin(X2)…

A feature is "an individual measurable property of a phenomenon being observed". We can think of X1 as a collection of features containing n individual features, distinct from those in X2. We can apply mathematical operations before feeding the features in, like the derived features X1², X2², X1X2, sin(X1), and sin(X2), to make the model more powerful. Normalize all the features? Reduce the number of features? Generate new features from the ones currently available? The degree of freedom in this hyperparameter is large, and so is its impact.
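
As a rough sketch of this kind of feature tweaking (plain NumPy, with x1 and x2 as hypothetical raw input columns):

    import numpy as np

    x1 = np.random.uniform(-1, 1, 200)   # hypothetical raw features
    x2 = np.random.uniform(-1, 1, 200)

    # Derived features, like the ones offered in the Playground
    features = np.column_stack([
        x1, x2,
        x1 ** 2, x2 ** 2,
        x1 * x2,
        np.sin(x1), np.sin(x2),
    ])

    # Optional preconditioning: normalize each feature column
    features = (features - features.mean(axis=0)) / features.std(axis=0)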

  • Number of hidden layers : 0…

A hidden layer is an intermediate set of neurons produced by applying weights and a non-linear transformation (activation function) to the neurons in the previous layer. If your data is simple enough that you can separate it by drawing a straight line, you won't need any hidden layer. Beyond that, it is said that one hidden layer is sufficient for the large majority of conventional problems. However, we need more of them when we move on to the more complex problems we are going to explore.

  • Number of nodes in each hidden layer : 1…

This is the number of nodes in each hidden layer. Within a layer, each node holds a different intermediate computational value, since each is mapped with a different weight configuration. According to Jeff Heaton, author of "Introduction to Neural Networks for Java", 'the optimal size of the hidden layer is usually between the size of the input and size of the output layers'. This only applies to simpler, conventional neural net models, but it is good guidance to consider most of the time.
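
A minimal sketch of how these two structural choices define the shape of the model (plain NumPy; layer_sizes is an illustrative name, and the weights here are only randomly initialized, not trained):

    import numpy as np

    n_features = 7                       # e.g. the 7 features from the sketch above
    layer_sizes = [n_features, 8, 8, 1]  # 2 hidden layers with 8 nodes each, 1 output node

    # One weight matrix and bias vector per connection between consecutive layers
    rng = np.random.default_rng(0)
    weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(b) for b in layer_sizes[1:]]

    def forward(x):
        for W, b in zip(weights, biases):
            x = np.tanh(x @ W + b)       # each layer: weights plus a non-linear activation
        return x

Adding a hidden layer means adding an entry to layer_sizes; adding nodes means making an entry bigger.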

Numeric Operations

  • Learning Rate : 0.00001, 0.0001, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10…

The step size of the adjustments to the weights/parameters in each iteration. Though its impact differs between machine learning architectures, it should be tuned to find the value that gives a faster training process and a more accurate result. A higher value gives faster learning but may cause the model to fail to converge. A lower value gives a higher chance of converging but may cause the model to learn too slowly and even get stuck in a local minimum/maximum. Does the step size have to be fixed? No, we may vary it over time; that is one of the available techniques.
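
A minimal sketch of both points, reusing the same kind of toy data as before: a plain gradient-descent update plus one illustrative way to shrink the step size over time (this particular decay schedule is just one possible choice):

    import numpy as np

    x = np.linspace(0, 1, 50)
    y = 2 * x                                # toy data: y is exactly 2 * x

    initial_lr, decay = 0.5, 0.01
    w = 0.0
    for step in range(200):
        lr = initial_lr / (1 + decay * step)  # learning rate shrinks as training progresses
        grad = np.mean(2 * (w * x - y) * x)   # gradient of the mean squared error
        w -= lr * grad                        # bigger lr = bigger jumps; smaller = slower but steadier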

  • Activation function : ReLU, Tanh, Sigmoid, Linear…

Different activation functions change how the weighted values passing through the synapses are transformed, for better training. Generally speaking, ReLU is the primary choice (it avoids vanishing gradients and is efficient in deep neural nets without pre-training). Leaky ReLU or Maxout come second, and Tanh is the third choice. Sigmoid and linear on their own cannot be used when the output should be multi-class. Activation functions are a really interesting topic when we put them under the microscope.
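
These functions are simple to write down; here is a quick NumPy sketch of the ones listed above (plus Leaky ReLU):

    import numpy as np

    def linear(z):
        return z

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return np.tanh(z)

    def relu(z):
        return np.maximum(0.0, z)             # zero for negative inputs, identity otherwise

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)  # small slope instead of zero for negative inputs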

  • Regularization : L1, L2…

Regularization controls the capacity of a neural net to prevent overfitting, where a model becomes too optimized for the training data set and loses the generality to predict accurately on new data. Real-world data sampling is constrained: there is no perfect data set, and often it contains ill-posed variance. Regularization injects our preferences about the weights on top of the available data in order to mitigate these real-world sampling limits. Or, when we simply suspect the model is too powerful, we can introduce regularization as a counterforce against it.

  • Regularization rate : 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10…

The strength of the regularization applied, together with the chosen type of regularization. Of course, if we set this too high, the learner will not be able to learn enough and its predictions will suffer.
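
A minimal sketch of how the regularization type and rate enter the picture, as a penalty added to the loss (reg_rate stands in for the hyperparameter above; the function names are illustrative):

    import numpy as np

    def l2_penalty(weights, reg_rate):
        # Sum of squared weights, scaled by the regularization rate
        return reg_rate * sum(np.sum(W ** 2) for W in weights)

    def l1_penalty(weights, reg_rate):
        # Sum of absolute weights instead; tends to push weights exactly to zero
        return reg_rate * sum(np.sum(np.abs(W)) for W in weights)

    # total loss = how wrong the predictions are + how big the weights are allowed to get
    # total_loss = data_loss + l2_penalty(weights, reg_rate)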

  • Problem type : classification or regression

In a classification problem, the output is supposed to be a discrete value such as a label (0 or 1; red, green, or blue; cat or dog, etc.). In a regression problem, the output is supposed to be a continuous value (-0.1234, -0.0012, 1.2345, 12345, etc.).
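
In practice, the problem type mostly changes which loss function the training minimizes; a rough sketch:

    import numpy as np

    def mse_loss(pred, target):
        # Regression: penalize the squared distance between continuous values
        return np.mean((pred - target) ** 2)

    def cross_entropy_loss(class_probs, true_class):
        # Classification: penalize a low predicted probability for the true discrete label
        return -np.log(class_probs[true_class])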

Dataset Preconditioning

  • Ratio of training to test : 10%, 20%, … , 90%

This item is not directly related to the training process itself. In short, the test set is not used to train the model but to validate its generalization after training. As suggested in the regularization section, it is important to check whether the model generalizes well enough to new inputs, and this ratio determines how thorough that validation can be.
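
A tiny sketch of holding out a test set (plain NumPy, with an 80%/20% split as an example ratio and a made-up toy dataset):

    import numpy as np

    X = np.random.uniform(-1, 1, (500, 2))   # hypothetical dataset: 500 samples, 2 features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a toy label

    rng = np.random.default_rng(0)
    indices = rng.permutation(len(X))        # shuffle before splitting
    split = int(0.8 * len(X))                # ratio of training to test = 80% / 20%
    X_train, y_train = X[indices[:split]], y[indices[:split]]
    X_test, y_test = X[indices[split:]], y[indices[split:]]
    # train only on (X_train, y_train); check generalization on (X_test, y_test)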

  • Noise : 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50…

Oftentimes, to prevent overfitting, noise is injected into the training process. This is a set of examples added to the original training data set, and it makes it more difficult for the neural net to learn. The more noise added, the stronger the protection against overfitting. However, if you make the data too noisy, training may fail to build a model that can predict accurately.
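
One common way to inject noise is to perturb the inputs with small random values; a minimal sketch (Gaussian noise is just one illustrative choice here, not necessarily what the Playground does internally):

    import numpy as np

    rng = np.random.default_rng(0)
    X_clean = np.linspace(-1, 1, 100).reshape(-1, 1)  # hypothetical clean inputs

    noise_level = 0.1                                 # the hyperparameter: 0 means no noise
    X_noisy = X_clean + rng.normal(0, noise_level, X_clean.shape)
    # higher noise_level -> harder to memorize exact points, but too much drowns the signal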

  • Batch size : 1, 2, 3, 4, 5, …. 10, … , 20, … , 30

Batch size defines how many samples are propagated through the network before the weights are updated. The smaller the batch size, the less memory is required. But if it is set too small, the gradient estimate becomes less accurate.
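
A minimal sketch of iterating over mini-batches (plain NumPy; the actual update step is left as a comment):

    import numpy as np

    X = np.random.uniform(-1, 1, (500, 2))   # hypothetical training inputs
    batch_size = 32

    indices = np.random.permutation(len(X))  # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = X[indices[start:start + batch_size]]
        # compute gradients on this batch only, then update the weights
        # smaller batches: less memory per step, but noisier gradient estimates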

Hyperparameters as a set of powerful tools

Each hyperparameter is there to make machine learning, and in particular the neural net learning above, work better. In practice, these hyperparameters are mutually dependent, and different combinations produce very different results. We cannot know the best set of tuning values without understanding how they affect each other. However, for the sake of learning, studying one hyperparameter at a time will help you become a wizard of neural nets faster.

From now on, I will introduce these hyperparameters one by one. Eventually, we will realize we are actually learning and implementing more advanced neural net models like RNNs, CNNs, and GANs!

Stay tuned!

Any questions or thoughts? Reach out to me :)

I am always happy to answer any questions or hear feedback on my entries. I have also only just started learning neural nets/machine learning, and I get to know them more and better through these interactions.