DL: Hyperparameter Tuning for Neural Networks
Part 2 of Deep Learning Specialization
May 24, 2019
1. Network Hyperparameters
Model complexity
1.1 Number of hidden layers & hidden units per layer
1.2 Activation function of hidden layers
- Sigmoid
- Tanh
- ReLU
- Leaky ReLU
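A minimal NumPy sketch of these activations (the Leaky ReLU slope of 0.01 is an assumed default, not fixed by the notes above):

```python
import numpy as np

def sigmoid(z):
    # Squashes values into (0, 1); saturates for large |z|
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Zero-centered output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # max(0, z): cheap to compute, no saturation for z > 0
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for z < 0 keeps gradients from dying
    return np.where(z > 0, z, alpha * z)
```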
2.2 Number of Epochs
3. Initializing Hyperparameters
3.1 Input normalization
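A small sketch of input normalization, assuming examples are stored as rows and the training-set mean and variance are reused for the test set so both stay on the same scale:

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    # Compute statistics on the training set only
    mu = X_train.mean(axis=0)
    sigma2 = X_train.var(axis=0)
    # Apply the same mu / sigma2 to both sets so they remain comparable
    X_train_norm = (X_train - mu) / np.sqrt(sigma2 + eps)
    X_test_norm = (X_test - mu) / np.sqrt(sigma2 + eps)
    return X_train_norm, X_test_norm
```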
3.3 W, b initialization
- Zero initialization
b = 0 is OK
W = 0 is bad (does not break the model’s symmetry)
- Random initialization
Random W is OK (breaks the model’s symmetry)
random * large constant (c > 1) = bad (exploding gradients)
random * small constant (0 < c < 1) = good
- Xavier initialization (for Tanh activation)
- He initialization (for ReLU activation)
Reference — https://www.kdnuggets.com/2018/06/deep-learning-best-practices-weight-initialization.html
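A sketch of the initialization schemes above in NumPy; the layer-size list and the helper name are illustrative, not from the notes:

```python
import numpy as np

def initialize_parameters(layer_dims, method="he"):
    # layer_dims, e.g. [n_x, n_h1, n_h2, n_y]
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        if method == "zeros":        # bad: every unit in a layer learns the same thing
            W = np.zeros((layer_dims[l], fan_in))
        elif method == "xavier":     # suited to tanh activations
            W = np.random.randn(layer_dims[l], fan_in) * np.sqrt(1.0 / fan_in)
        elif method == "he":         # suited to ReLU activations
            W = np.random.randn(layer_dims[l], fan_in) * np.sqrt(2.0 / fan_in)
        else:                        # plain random * small constant
            W = np.random.randn(layer_dims[l], fan_in) * 0.01
        params["W" + str(l)] = W
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))  # zeros are fine for b
    return params
```

Scaling by the fan-in keeps the initial weights smaller as layers get wider, which helps the activations and gradients stay in a reasonable range early in training.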
4. Regularizing Hyperparameters
Reduce overfitting
4.1 L2 regularization
lambda = regularization parameter
Larger lambda
- pushes W closer to 0
- less complex model
Smaller lambda
- lets W stay farther from 0
- more complex model
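A sketch of how the L2 penalty enters the cost, assuming a cross-entropy cost has already been computed and `m` is the number of training examples:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # weights: list of W matrices, m: number of training examples
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty
```

The corresponding extra (lambd / m) * W term in the gradient is what pulls the weights toward 0 as lambda grows.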
4.2 Dropout
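Dropout randomly disables hidden units during training so the network cannot rely on any single unit. A minimal sketch of inverted dropout (the keep_prob value is an assumed example):

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8):
    # Inverted dropout: zero out units at random, then scale up the survivors
    mask = np.random.rand(*A.shape) < keep_prob
    A_dropped = (A * mask) / keep_prob  # keeps the expected activation unchanged
    return A_dropped, mask              # the mask is reused in the backward pass
```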
5. Tuning Techniques
5.1 Grid search vs. Random search
Grid search
- Based on the assumption that every hyperparameter improves your model equally.
- High computational cost (too many possible hyperparameter sets to test before finding the best one).
Random search
- Based on the assumption that only a few hyperparameters significantly improve your model.
- Lower computational cost (test a limited number of random hyperparameter sets, identify the significant hyperparameters, then search those more thoroughly).
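A toy sketch contrasting the two strategies; the value ranges and the commented-out `train_and_evaluate` call are placeholders, not part of the notes:

```python
import itertools
import random

learning_rates = [0.0001, 0.001, 0.01, 0.1]
hidden_units = [16, 32, 64, 128]

# Grid search: try every combination (4 x 4 = 16 training runs)
grid_candidates = list(itertools.product(learning_rates, hidden_units))

# Random search: sample a fixed budget of combinations (here 5 runs),
# drawing the learning rate on a log scale
random_candidates = [
    (10 ** random.uniform(-4, -1), random.choice(hidden_units))
    for _ in range(5)
]

# best = max(candidates, key=lambda hp: train_and_evaluate(*hp))  # placeholder
```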