Dropout for Regularizing Deep Neural Networks
A deep neural network is prone to overfitting its training dataset, which makes its predictions on new data unreliable. Other techniques exist to reduce this problem, such as training an ensemble of models, but they require additional computation and the maintenance of multiple models. Randomly dropping out some nodes is a much cheaper way to reduce overfitting in a single network.
Overfitting Problem
A model can learn the statistical noise in the training dataset, which leads to poor performance when the model is evaluated on test (new) data: the error on test data is higher than on training data. One way to reduce this gap is to train an ensemble, for example by fitting a small collection of networks on different folds of the data (as in k-fold cross-validation or bagging) for enough epochs each, and combining their results to see the impact.
However, this 'ensemble' technique has several restrictions depending on our purpose, chiefly the cost of training and maintaining every model. In that case the Dropout function is very useful: it gives a similar regularizing effect within a single network by ignoring unnecessary nodes.
Randomly Drop Nodes
The Dropout function randomly drops nodes during training, which has the effect of temporarily treating the layer as if it had a different set of nodes and a different connectivity to the previous layer. There are some downsides to this: it makes the training process noisy, and it forces nodes whose incoming connections were dropped (or nullified) to compensate in the later layers of the network. Dropout can be applied to the actual feature set (the input layer) as well as before or after any hidden layer.
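As a sketch of this placement (the layer sizes and dropout rates here are illustrative, not taken from the text), dropout layers can sit on the inputs as well as between hidden layers in a Keras Sequential model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dropout(0.2, input_shape=(20,)),   # drop 20% of the raw input features
    Dense(64, activation='relu'),
    Dropout(0.5),                      # drop 50% of this hidden layer's outputs
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```

At inference time Keras disables the Dropout layers automatically, so repeated predictions on the same input are deterministic.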
Methods of Dropping Out
Dropping out nodes introduces a new hyperparameter: the probability p at which a layer's outputs are dropped (or, equivalently, the inverse probability 1 - p at which they are kept). Because dropping nodes lowers the expected magnitude of a layer's activations, some rescaling is needed so that training-time and test-time behaviour match. The common approach, called 'inverse dropout', performs this rescaling during training: the surviving activations are multiplied by 1 / (1 - p). As a result, it does not require any weight modification at test time.
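The rescaling described above can be sketched in NumPy (the helper name `inverted_dropout` is ours, not a library function): each activation is kept with probability 1 - p and the survivors are scaled by 1 / (1 - p), so the expected activation is unchanged and the test-time forward pass needs no modification.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    """Apply inverted dropout to activations `a` with drop probability `p`."""
    keep = rng.random(a.shape) >= p      # keep each unit with probability 1 - p
    return a * keep / (1.0 - p)          # rescale the survivors by 1/(1-p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = inverted_dropout(a, p=0.25, rng=rng)

# Roughly 25% of units are zeroed and the survivors become 1/0.75,
# so the mean stays close to the original activation value of 1.0.
print(dropped.mean())
```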
Both the PyTorch and Keras libraries implement the operation this way: test outputs are left unchanged, and during training the inputs are dropped at the specified rate. For example, a dropout rate of 25% states that 25% of a layer's inputs are dropped on a random basis.
Here, a constraint is imposed on the weights of each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 5. In Keras this is done using the kernel_constraint argument on the Dense class when constructing the layers.
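What the max-norm constraint does can be sketched in NumPy (the helper `apply_max_norm` is illustrative): after a weight update, any column of the weight matrix, i.e. the incoming weights of one hidden unit, whose L2 norm exceeds the limit is rescaled back down to it.

```python
import numpy as np

def apply_max_norm(W, max_value=5.0):
    """Rescale each column of W so its L2 norm is at most `max_value`,
    mimicking a max-norm kernel constraint applied along axis 0."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_value / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[6.0, 0.5],
              [8.0, 0.5]])               # column norms: 10.0 and ~0.71
W_clipped = apply_max_norm(W, max_value=5.0)
print(np.linalg.norm(W_clipped, axis=0))  # first column shrunk to norm 5
```

In Keras the equivalent is `Dense(..., kernel_constraint=max_norm(5))`, with `max_norm` imported from `keras.constraints`.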
SGD (Stochastic Gradient Descent) is an optimization algorithm: it estimates the error gradient for the model's current state during training, then updates the weights of each node in a layer using back-propagation.
In our problem, SGD has two main hyperparameters:
- Learning Rate
- Momentum
Both may seem similar, but they have different effects.
The learning rate controls how the optimization converges. A small learning rate means a higher number of epochs is needed to train the model, because each step makes only a small change to the weights; a large learning rate means the model changes its weights faster at each step. If the learning rate is too small, your model might get stuck, making little progress in a particular direction.
Momentum can accelerate training and also acts as another form of regularization. Consider a model with 200 columns (features): raising the momentum from 20% (low) to 90% (high) can, as the output shows, bring the training accuracy closer to the test accuracy. Higher momentum smooths the weight updates, discarding a certain amount of per-batch information about the features, which can decrease overfitting.
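A toy comparison (a single quadratic loss and illustrative hyperparameters, not from the text) shows the accelerating effect of momentum on plain gradient-descent updates:

```python
def sgd_minimize(grad, w0, lr=0.01, momentum=0.0, steps=100):
    """Gradient-descent loop with a classical momentum (velocity) term."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)   # velocity accumulates past gradients
        w = w + v
    return w

grad = lambda w: 2.0 * w                  # gradient of f(w) = w**2

w_plain = sgd_minimize(grad, w0=5.0, lr=0.01, momentum=0.0)
w_mom = sgd_minimize(grad, w0=5.0, lr=0.01, momentum=0.9)

# After the same number of steps, the momentum run ends much closer
# to the minimum at w = 0 than plain gradient descent does.
print(abs(w_plain), abs(w_mom))
```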
sgd = SGD(lr=0.1, momentum=0.9)
Keras also provides the LearningRateScheduler callback, which allows you to specify a function that is called each epoch in order to adjust the learning rate.
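A sketch of such a schedule (the `step_decay` name and its constants are illustrative): the schedule function itself is plain Python, and the commented lines show how it would be passed to Keras.

```python
import math

def step_decay(epoch):
    """Halve the learning rate every 10 epochs, starting from 0.1."""
    initial_lr = 0.1
    drop = 0.5
    epochs_per_drop = 10
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))

# Hooked into Keras training like:
#   from keras.callbacks import LearningRateScheduler
#   model.fit(X, y, epochs=30, callbacks=[LearningRateScheduler(step_decay)])
print(step_decay(0), step_decay(10), step_decay(25))
```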
Selecting Important Features from the Data Using a Random Forest Classifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
# Fit a random forest and keep only the features it considers important
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sel.fit(X_train, y_train)
print(sel.get_support())
# Inspect one of the fitted trees in the forest
tree_small = sel.estimator_.estimators_[27]
# Get numerical feature importances as a list of (variable, importance) tuples
importances = tree_small.feature_importances_
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print out the features and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
Some Observations:
- Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units=26, kernel_initializer='glorot_uniform', activation='relu'))
    classifier.add(Dense(units=25, kernel_initializer='glorot_uniform', activation='relu'))
    classifier.add(Dense(units=25, kernel_initializer='glorot_uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='glorot_uniform', activation='sigmoid'))
    # Use the SGD optimizer configured with learning rate and momentum
    sgd = SGD(lr=0.1, momentum=0.9)
    classifier.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn=build_classifier, batch_size=20, epochs=3)
accuracies_train = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=1)
print('Accuracy Mean:' , accuracies_train.mean())
print('Accuracy STD:', accuracies_train.std())
- With H hidden units, each of which can be dropped, we have 2^H possible models. In the testing phase the entire network is used, and (in the original, non-inverted formulation) each activation is reduced by a factor p, the probability that a unit was retained during training.
- We can conclude that, as dropout increases, validation accuracy rises and loss falls at first, before the trend starts to reverse.
