10 Hyperparameters to keep an eye on for your LSTM model — and other tips

Kuldeep Chowdhury
Published in Geek Culture
May 24, 2021
Figure: Hyperparameter tuning, grid search vs random search

Deep learning has proved to be a fast-evolving subset of machine learning. It aims to identify patterns and make real-world predictions by loosely mimicking the human brain. Models based on such neural network topologies have applications in virtually every industry. Of all the steps involved, perhaps the most important is training the model so that it makes robust predictions on new test data. It is therefore essential to choose a model’s hyperparameters (parameters whose values control the learning process) so that training is effective in terms of both time and fit, that is, so the model learns the training data neither too well (overfitting) nor too poorly (underfitting).

This article focuses on the LSTM (long short-term memory) network, a kind of recurrent neural network (RNN) capable of learning long-term dependencies in data. Recurrent neural networks are a class of neural networks that deal with temporal data. An LSTM has a control flow similar to that of a plain recurrent neural network: it processes the data step by step, passing information on as it propagates forward. The difference lies in the operations within the LSTM cell, which allow it to keep or forget information. These operations also help preserve the error signal as it is backpropagated through time and layers. Because of its feedback connections, an LSTM can process not only single data points but also entire sequences of data. This article addresses the hyperparameters of an LSTM model that matter most for performance, and the values commonly used as best practice.

Before we get into tuning the most relevant hyperparameters for LSTM, it is worth noting that there are ways to let your system find the hyperparameters for you by using optimization tools. These methods are useful for bypassing the more manual process of identifying good hyperparameters and tuning them. In Python, some such tools are:

Keras Tuner — https://blog.tensorflow.org/2020/01/hyperparameter-tuning-with-keras-tuner.html

Bayesian Optimization — https://github.com/fmfn/BayesianOptimization

Grid search — https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

It should be kept in mind that many such hyperparameters are volatile, in the sense that different values (or even the same values on different runs) may yield different results. So make sure you always compare models and their performance while tweaking these hyperparameters to get the best results.
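
To make the difference between grid search and random search concrete, here is a minimal, library-free sketch. The search space, its names and its values are hypothetical and chosen only for illustration; the tools listed above offer far more than this.

```python
import itertools
import random

# Hypothetical search space; the names and values are illustrative only.
search_space = {
    "units": [16, 32, 64],
    "dropout": [0.2, 0.3, 0.5],
    "learning_rate": [0.001, 0.01, 0.1],
}

def grid_search(space):
    """Return every combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

def random_search(space, n_trials, seed=0):
    """Return n_trials configurations, each sampled independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]

grid_configs = grid_search(search_space)
random_configs = random_search(search_space, n_trials=5)

print(len(grid_configs))    # 27 combinations (3 * 3 * 3)
print(len(random_configs))  # 5 sampled configurations
```

Grid search grows multiplicatively with every hyperparameter added, while random search lets you fix the number of trials up front. Each configuration would then be used to train and evaluate one model.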

Relevant Hyperparameters to tune:

1. NUMBER OF NODES AND HIDDEN LAYERS

The layers between the input and output layers are called hidden layers. They are a large part of why deep learning networks are termed a “black box” and are often criticized for not being transparent, with predictions that humans cannot trace. There is no definitive number of nodes (hidden neurons) or hidden layers one should use, so, depending on the individual problem, a trial-and-error approach (believe it or not) tends to give the best results.

As a general rule of thumb, one hidden layer works for most simple problems and two layers for reasonably complex ones. Also, while more nodes within a layer (combined with regularization techniques) can increase accuracy, too few nodes may cause underfitting.

2. NUMBER OF UNITS IN A DENSE LAYER

Method: model.add(Dense(10, …))

A dense layer is the most frequently used layer. Each of its neurons receives input from all neurons in the previous layer, hence “densely connected”. Dense layers can improve overall accuracy, and 5–10 units (nodes) per layer is a good base. The output shape of the final dense layer is determined by the number of neurons (units) specified.
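
As a rough illustration of how the number of units sets the output shape, the following plain-Python sketch computes one dense layer by hand. The helper `dense_forward` and the input size of 4 are assumptions made for this example; they are not part of Keras.

```python
import random

def dense_forward(inputs, weights, biases):
    """Compute one dense layer: every unit sees every input value."""
    return [
        sum(x * w for x, w in zip(inputs, unit_weights)) + b
        for unit_weights, b in zip(weights, biases)
    ]

rng = random.Random(0)
n_inputs, n_units = 4, 10  # 10 units, matching Dense(10, ...)

# One weight per (unit, input) pair plus one bias per unit.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_units)]
biases = [0.0] * n_units

outputs = dense_forward([1.0, 2.0, 3.0, 4.0], weights, biases)
n_params = n_inputs * n_units + n_units

print(len(outputs))  # 10: one output value per unit
print(n_params)      # 50 trainable parameters
```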

3. DROPOUT

Method: model.add(LSTM(…, dropout=0.5))

Every LSTM layer should be accompanied by a dropout layer. Such a layer helps avoid overfitting during training by randomly bypassing selected neurons, thereby reducing sensitivity to the specific weights of individual neurons. While dropout layers can be used with input layers, they shouldn’t be used with output layers, as that may mess up the model’s output and the calculation of the error. Adding complexity (more nodes in the dense layers, or more dense layers) risks overfitting and poor validation accuracy, but this can be addressed by adding dropout.

A good starting point is 20%, and the dropout value should be kept small (up to 50%). The 20% value is widely used as a compromise between preventing overfitting and retaining model accuracy.
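
To show what a dropout rate of 20% does to a layer’s activations, here is a small sketch of so-called inverted dropout in plain Python. It is an illustration under simplifying assumptions (a flat list of activations, a hypothetical `dropout` helper), not the exact routine Keras runs internally.

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Scale the survivors by 1/keep so the expected activation stays unchanged.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 10000

dropped = dropout(activations, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print(round(zero_fraction, 2))    # close to 0.2
print(round(mean_activation, 2))  # close to 1.0
```

At prediction time (`training=False`) the values pass through untouched, which is why dropout only affects the training phase.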

4. WEIGHT INITIALIZATION

Ideally, the weight initialization scheme should be chosen according to the activation function in use. More commonly, however, initial weight values are drawn from a uniform distribution. Setting all weights to 0.0 does not work, because the optimization algorithm relies on asymmetry in the error gradient to begin searching effectively. Different sets of initial weights give different starting points for the optimization process, potentially leading to different final weights with different performance characteristics. Weights should therefore be initialized randomly to small numbers (an expectation of the stochastic optimization algorithm, stochastic gradient descent) to harness randomness in the search process.
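
One common way to draw small random weights from a uniform distribution is the Glorot (Xavier) uniform scheme, which samples from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The sketch below is only an illustration of that scheme; the layer sizes are made up, and other schemes may suit other activation functions better.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)
weights = glorot_uniform(fan_in=64, fan_out=32, rng=rng)
limit = math.sqrt(6.0 / (64 + 32))

flat = [w for row in weights for w in row]
print(limit)                               # 0.25
print(all(abs(w) <= limit for w in flat))  # True: all weights are small
print(len(set(flat)) > 1)                  # True: weights differ, so symmetry is broken
```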

5. DECAY RATE

Weight decay can be added to the weight update rule so that the weights decay exponentially toward zero if no other update is scheduled. After each update, the weights are multiplied by a factor slightly less than 1, which prevents them from growing too large. This acts as a form of regularization in the network.

The default value of 0.97 should be enough to start off.
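
The exponential shrinkage described above can be seen in a few lines. This sketch assumes the 0.97 value is used directly as the multiplicative factor and that no gradient update is applied; both are simplifications made for illustration.

```python
def apply_decay(weight, decay_factor, n_steps):
    """Multiply the weight by decay_factor once per step, with no gradient update."""
    history = [weight]
    for _ in range(n_steps):
        weight *= decay_factor
        history.append(weight)
    return history

history = apply_decay(weight=1.0, decay_factor=0.97, n_steps=100)

print(round(history[1], 4))    # 0.97
print(round(history[100], 4))  # 0.97 ** 100, roughly 0.0476
```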

6. ACTIVATION FUNCTION

Activation functions define the output of a node given its inputs, in the simplest case whether the node is ON or OFF. They introduce non-linearity into models, allowing deep learning models to learn non-linear prediction boundaries. Technically, activation functions can be included in the dense layers, but splitting them into separate layers makes it possible to retrieve the output of the dense layer before the activation is applied.

Again, the choice of activation function depends on the application, though the rectifier (ReLU) activation function is the most popular. Specific situations call for specific functions. For example, sigmoid activation is used in the output layer for binary predictions, and softmax is used for multi-class predictions (softmax gives you the ability to interpret the outputs as probabilities).

Method: One approach is to write a user-defined function that returns the output of a specific activation function. For example, here is a sigmoid activation function:

import numpy as np

def sigmoid(x):
    # Map any real input (scalar or array) into the range (0, 1).
    return 1 / (1 + np.exp(-x))

Sigmoid (log-sigmoid) and hyperbolic tangent are some of the more popular activation functions adopted in LSTM blocks.
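
Softmax, mentioned above for multi-class predictions, can be written in the same user-defined style. The following is a small standard-library sketch; the input scores are made up for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing without changing the result.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```

The largest score gets the largest probability, and the outputs always add up to 1, which is what allows them to be read as class probabilities.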

7. LEARNING RATE

This hyperparameter defines how quickly the network updates its parameters. Setting a higher learning rate accelerates learning, but the model may fail to converge (convergence being the state during training where the loss settles to within an error range around its final value), or may even diverge. Conversely, a lower rate slows learning down drastically, as the steps towards the minimum of the loss function are tiny, but allows the model to converge smoothly.

Usually a decaying learning rate is preferred. The learning rate is used in the training phase and takes a small positive value, mostly between 0.0 and 0.1.
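
The trade-off between slow convergence and divergence can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², a stand-in chosen purely for illustration and much simpler than a real LSTM loss surface.

```python
def gradient_descent(learning_rate, n_steps=20, start=1.0):
    """Minimize f(x) = x**2 (gradient 2*x) and return the final x."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * 2 * x
    return x

small = gradient_descent(learning_rate=0.01)  # slow but steady progress
good = gradient_descent(learning_rate=0.1)    # converges towards 0
large = gradient_descent(learning_rate=1.1)   # overshoots and diverges

print(abs(good) < abs(small) < 1.0)  # True
print(abs(large) > 1.0)              # True
```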

8. MOMENTUM

Integrating momentum into RNNs and LSTMs has been the subject of research. Momentum accumulates the gradients of past steps to determine the direction to move in, instead of using only the gradient of the current step to guide the search.

Typically, the value is between 0.5 and 0.9.
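
A short sketch of the classical momentum update may help show how past gradients accumulate. The function name and the constant gradient are assumptions for the example; real optimizers apply the same idea to every weight.

```python
def sgd_momentum(gradients, learning_rate=0.1, momentum=0.9):
    """Apply classical momentum updates and return the sequence of steps taken."""
    velocity = 0.0
    steps = []
    for grad in gradients:
        # The velocity blends the past direction with the current gradient.
        velocity = momentum * velocity - learning_rate * grad
        steps.append(velocity)
    return steps

# A constant gradient of 1.0: the step size builds up as past gradients accumulate.
steps = sgd_momentum([1.0] * 5)

print([round(s, 4) for s in steps])  # [-0.1, -0.19, -0.271, -0.3439, -0.4095]
```

With momentum set to 0.0 every step would be exactly -0.1, so the growing steps here come entirely from the accumulated history.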

9. NUMBER OF EPOCHS

This hyperparameter sets how many complete passes over the dataset are run. While in theory it can be set to any integer between one and infinity, it should be increased only until the validation accuracy starts to decrease even as the training accuracy keeps increasing (a sign of overfitting).

A pro move is to use early stopping: specify a large number of training epochs, then stop training once the model’s performance on the validation dataset stops improving by a pre-set threshold.
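
The early stopping logic can be sketched without any framework. The parameter names `patience` and `min_delta` mirror those of the Keras `EarlyStopping` callback, but the function itself and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.001):
    """Return the 1-based epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > min_delta:
            # Improvement beyond the threshold: record it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses that improve, then plateau and rise.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.53, 0.60]

print(early_stopping_epoch(val_losses))  # 7
```

Training stops at epoch 7 because the best loss (0.50 at epoch 4) was followed by three epochs without sufficient improvement.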

10. BATCH SIZE

This hyperparameter defines the number of samples to work on before the internal parameters of the model are updated. Large sizes make large gradient steps compared to smaller ones for the same number of samples “seen”.

A widely accepted default value for batch size is 32. For experimentation, you can try multiples of 32, such as 64, 128 and 256.
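
To see how batch size changes the number of parameter updates per epoch, here is a minimal sketch. The dataset of 1000 samples is a stand-in chosen for the example.

```python
import math

def iterate_batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(1000))  # stand-in for 1000 training samples

updates_per_epoch = {}
for batch_size in (32, 64, 128, 256):
    n_updates = sum(1 for _ in iterate_batches(samples, batch_size))
    # One update per batch, so the count equals ceil(n_samples / batch_size).
    assert n_updates == math.ceil(len(samples) / batch_size)
    updates_per_epoch[batch_size] = n_updates

print(updates_per_epoch)  # {32: 32, 64: 16, 128: 8, 256: 4}
```

Doubling the batch size halves the number of updates per epoch, which is one reason larger batches are often paired with a different learning rate.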

Some other tips:

Apart from tuning the hyperparameters, here are some tips for training your LSTM or RNN model.

1. OPTIMIZATION SETUP

· Adaptive learning rate: To better handle the complex training dynamics of recurrent neural networks (which plain gradient descent may not address), adaptive optimizers such as Adam are recommended.

· Gradient clipping: Spikes in the gradient can mess up the parameters during training. This can be prevented by first plotting the gradient norm (to see its usual range) and then scaling down gradients that exceed this range.

· Normalizing the loss: Add up the loss terms along the sequence and divide by the maximum sequence length. This averages the loss across the batch and in turn makes it easier to reuse hyperparameters between experiments.

· Truncated backpropagation: Any recurrent network may struggle to learn long sequences because of vanishing and noisy gradients. Even though the LSTM is specifically designed to address the vanishing gradient problem, some practitioners recommend training on overlapping chunks of around 200 steps instead, gradually increasing the chunk length during training.
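
The gradient clipping tip above can be sketched as clipping by L2 norm. The threshold of 5.0 and the gradient values are assumptions for the example; in practice the threshold would come from the observed range of the gradient norm.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

spike = [30.0, 40.0]   # L2 norm of 50
normal = [0.3, 0.4]    # L2 norm of 0.5

clipped = clip_by_norm(spike, max_norm=5.0)
print([round(g, 6) for g in clipped])      # [3.0, 4.0]
print(clip_by_norm(normal, max_norm=5.0))  # [0.3, 0.4]
```

Clipping rescales the whole vector, so the direction of the update is kept and only its length is limited.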

2. NETWORK STRUCTURE

· Gated Recurrent unit: GRU is an alternative cell design that uses fewer parameters and computes faster compared to LSTM.

· Layer normalization: Another way to speed up learning and improve final performance is by adding layer normalization to all the linear mappings of the recurrent network.

· Feed-forward layers: Pre-processing the input with feed-forward layers lets your model project the data into a space with simpler temporal dynamics. This helps increase performance.

3. MODEL PARAMETERS

· Learned initial state: Initializing the hidden state as zeros can cause large loss terms in the first few time steps, so that the model focuses less on the actual sequence. Training the initial state as a variable can improve performance.

· Bias due to forget gate: Recurrent networks can take a while to learn to remember information from the last time step. This can be improved by initializing the bias for LSTM’s forget gate to 1, enabling it to remember more by default. Similarly, for GRUs, the bias needs to be initialized to -1.

· Regularization: Regularization methods such as dropout are well known to address model overfitting.
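
The forget gate tip above can be illustrated by building the bias vector by hand. This sketch assumes the four gate biases are stored one after another in the order input, forget, cell, output; that layout and the helper name are assumptions for the example, so check your framework’s convention before relying on it.

```python
def init_lstm_bias(units, forget_bias=1.0):
    """Build a flat bias vector of length 4 * units, zero except the forget gate."""
    gate_order = ["input", "forget", "cell", "output"]  # assumed layout
    bias = []
    for gate in gate_order:
        value = forget_bias if gate == "forget" else 0.0
        bias.extend([value] * units)
    return bias

bias = init_lstm_bias(units=3)
print(bias)
# [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With a bias of 1 and no other input, the sigmoid forget gate starts at about 0.73 instead of 0.5, so the cell keeps more of its previous state by default.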

Final thoughts:

Open-source libraries such as Keras have freed us from writing complex code to build complex deep learning algorithms, and every day more research is being conducted to make modelling more robust. While these tips on tuning hyperparameters in your LSTM model may be useful, you will still have to make some choices along the way, such as choosing the right activation function. It is important to remember that not all results tell an unbiased story. For example, the smallest improvements in loss can end up making a big difference in the perceived quality of the model. If the training loss does not improve for multiple epochs, it is better to just stop the training; otherwise the evaluation loss will start increasing. In the end, the best results come from evaluating outcomes after testing various configurations.

References:

1. Time Series — LSTM Model. (https://www.tutorialspoint.com/time_series/time_series_lstm_model.htm#:~:text=It%20is%20special%20kind%20of,layers%20interacting%20with%20each%20other.)

2. Illustrated Guide to LSTMs and GRUs. (https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

3. MomentumRNN: Integrating Momentum into Recurrent Neural Networks. (https://arxiv.org/abs/2006.06919#:~:text=We%20study%20the%20momentum%20long,%2Dthe%2Dart%20orthogonal%20RNNs)

4. Keras — Dense Layer. (https://www.tutorialspoint.com/keras/keras_dense_layer.htm)

5. A comparative performance analysis of different activation functions in LSTM networks for classification. (https://link.springer.com/article/10.1007/s00521-017-3210-6#:~:text=The%20most%20popular%20activation%20functions,functions%20have%20been%20successfully%20applied.)

6. Adam: A method for stochastic optimization. (https://arxiv.org/pdf/1412.6980.pdf)

7. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. (https://arxiv.org/pdf/1406.1078.pdf)

8. Layer Normalization. (https://arxiv.org/pdf/1607.06450.pdf)

9. Tips for Training Recurrent Neural Networks. (https://danijar.com/tips-for-training-recurrent-neural-networks/)
