Learning from learning curves
After trying out some XGBoost models on the Kaggle Bosch problem, I thought I’d give neural networks a go. It would be nice to get some insight into how the network learns, so that I can improve it for my task. To that end, I plotted several learning curves of a neural network, to see what’s happening under the hood.
Data is related to the Kaggle Bosch challenge.
Original data: numeric, date-related (numeric) and categorical features collected for each part along an assembly line; the objective is to predict whether a part is rejected or not (binary classification).
Feature Transformation: Categorical features are one-hot encoded, and some extra features have been created.
The feature transformation pipeline is: 1) transform categorical features and add the additional features, 2) normalize the data, 3) run feature selection to remove poor features.
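The post doesn’t include code for this pipeline, so here is a minimal sketch of steps 2 and 3 with scikit-learn; the specific transformers and threshold are my assumptions, not the author’s actual choices:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Toy numeric matrix standing in for the post-encoding feature set.
rng = np.random.RandomState(0)
X = rng.randn(100, 6)
X[:, 5] = 1.0  # a constant (uninformative) column the selector should drop

pipeline = Pipeline([
    # Step 1 (one-hot encoding + feature creation) happens upstream;
    # here we start from the already-numeric matrix.
    ('normalize', StandardScaler()),       # step 2: normalize
    ('select', VarianceThreshold(1e-8)),   # step 3: drop weak features
])
X_out = pipeline.fit_transform(X)
print(X_out.shape)  # the constant column is removed
```

Wrapping the steps in a `Pipeline` also makes it harder to accidentally fit the scaler or selector on validation data.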
This is an imbalanced class problem. To keep the imbalance from skewing the evaluation, I use Stratified K-Fold (K=3) to cross-validate the model, which preserves the class ratio in every fold.
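As a quick illustration of what stratification buys us, assuming scikit-learn’s `StratifiedKFold` (the post doesn’t name the library), on a hypothetical imbalanced toy set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the real (imbalanced) data: 100 samples, ~10% positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.array([1] * 10 + [0] * 90)

# Stratified K-Fold keeps the +ve/-ve ratio roughly equal in every fold,
# so no fold ends up with almost no positive examples.
skf = StratifiedKFold(n_splits=3)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold holds close to the global 10% positive rate.
    print(y[val_idx].mean())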
The data contains about 2000 features after feature creation/transformation. A lot, I know, but I want to see if a model can learn the signal from my original + hand-coded features. (Later, I will try out a smaller feature set.)
The metric used to assess the model is MCC (Matthews Correlation Coefficient): https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
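MCC uses all four cells of the confusion matrix, which makes it far more informative than plain accuracy on heavily imbalanced data. A quick check with scikit-learn’s implementation, on hypothetical labels:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels on an imbalanced problem: a classifier that always
# predicts the majority class gets 80% accuracy here, but an MCC of 0,
# i.e. no better than chance.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.0

# A perfect classifier scores 1.0.
print(matthews_corrcoef(y_true, y_true))
```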
Learning the signal with neural networks
In all graphs,
the top graph depicts the loss, over epochs, for training and validation sets for all CV folds. (Validation loss is plotted with a dashed line)
the second graph depicts the MCC (the accuracy metric used here) for the training & validation sets.
I start off with a simple model, a multi-layer perceptron, with
one hidden layer, linear activation on hidden units, and 200 hidden units, trained over 50 epochs.
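The post doesn’t show model code; here is a rough equivalent using scikit-learn’s `MLPClassifier`, where `activation='identity'` gives linear hidden units (the author’s actual framework and hyperparameters may differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the real ~2000-feature matrix.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One hidden layer of 200 units with linear ('identity') activation,
# trained for 50 epochs (max_iter), mirroring the first model above.
model = MLPClassifier(hidden_layer_sizes=(200,),
                      activation='identity',
                      max_iter=50,
                      random_state=0)
model.fit(X, y)
print(model.n_layers_)  # 3: input, one hidden, output
```

Note that with a linear activation the whole network collapses to a linear model, however many hidden units it has, which is why high bias is the expected outcome here.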
As you can see, this model exhibits high bias: the training & validation sets show similar loss and accuracy curves, and both seem to plateau. So we should be able to improve on this model by changing some of its parameters.
The second model is a slight modification of the first: I use a non-linearity instead of the linear activation function.
one hidden layer, ReLU activation on hidden units, and 200 hidden units, trained over 50 epochs.
As you can see, adding the non-linearity has led to overfitting; 200 units may be too many.
So let’s see what happens when the number of hidden units is reduced.
one hidden layer, ReLU activation on hidden units, and 13 hidden units, trained over 50 epochs.
This looks a lot better: even though there is some bias, the loss has reduced (and accuracy has improved slightly). One fold, however, is not performing as well as the other two. I’m not sure why, since all three folds have the same ratio of +ve/-ve classes, but let’s see if shuffling the dataset fixes this behaviour.
Looks like the shuffling helped. Now all three folds have similar results.
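If the folds come from scikit-learn, shuffling is a one-flag change. A toy comparison (the data here is hypothetical):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in: 100 samples in stored (unshuffled) order.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

# Without shuffle=True, folds are cut from contiguous runs of the data,
# so any ordering in the dataset (e.g. parts logged in time order)
# carries over into the folds. shuffle=True breaks that ordering.
plain = StratifiedKFold(n_splits=3)
shuffled = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_plain = next(iter(plain.split(X, y)))[1]
fold_shuffled = next(iter(shuffled.split(X, y)))[1]
print(fold_plain[:5], fold_shuffled[:5])
```

Both splitters keep the class ratio per fold; only the assignment of individual samples to folds changes.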
Would my model improve if I trained it longer? There is already evidence that the validation error is starting to go up, but let’s confirm this with a plot.
one hidden layer, ReLU activation on hidden units, and 13 hidden units, trained over 150 epochs.
Well that was no good.
What if we tried to de-correlate the inputs using some technique like PCA?
one hidden layer, ReLU activation on hidden units, and 10 hidden units, trained over 50 epochs on PCA transformed data
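A sketch of that experiment with scikit-learn’s `PCA` (the component count and toy data are my assumptions; in the real setup the PCA should be fitted on the training folds only, to avoid leakage):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the real feature matrix.
rng = np.random.RandomState(0)
X = rng.randn(300, 50)
y = (X[:, 0] > 0).astype(int)

# PCA rotates the data onto orthogonal (de-correlated) axes and here
# also reduces it to the top 10 components by explained variance.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

model = MLPClassifier(hidden_layer_sizes=(10,), activation='relu',
                      max_iter=50, random_state=0)
model.fit(X_pca, y)
print(X_pca.shape)
```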
I’m not entirely sure why this made things so much worse. It’s possible that the true function is non-linear, so projecting the inputs onto a small number of linear principal components discards signal and adds bias to the model.
[UPDATE] While a single hidden layer with a non-linearity should already be powerful enough to learn the function, I added one more hidden layer with a larger number of units, along with a higher dropout rate for regularization.
two hidden layers, with dropout=0.6, ReLU activation on hidden units, and 300 hidden units, trained over 60 epochs
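Since dropout=0.6 is quite aggressive, the mechanics are worth spelling out. A minimal numpy sketch of inverted dropout (an illustration of the technique, not the author’s actual training code):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training, scaling survivors by 1/(1-rate) so the expected activation
    is unchanged; at test time the layer is the identity."""
    if not training:
        return activations
    keep = rng.binomial(1, 1.0 - rate, size=activations.shape)
    return activations * keep / (1.0 - rate)

rng = np.random.RandomState(0)
h = np.ones((4, 300))                  # a batch of 300-unit hidden activations
h_dropped = dropout(h, rate=0.6, rng=rng)
print(h_dropped.mean())                # close to 1.0 in expectation
```

With rate 0.6, only about 40% of hidden units are active on any training step, which forces the larger 300-unit layers to learn redundant, more robust features.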
This is looking much better. Validation error has reduced, and the accuracy is also much higher.
It’s clear that my best model is still quite poor in terms of accuracy (and worse on the test data/public leaderboard). One of the checks recommended in Stanford’s CS231n course is to “make sure you can achieve zero cost” on a small subset of data. I have not been able to achieve this with the current model/features.
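That sanity check is easy to run in isolation: take a handful of samples and verify the model can drive training error to (near) zero. A sketch with scikit-learn on hypothetical, cleanly separable toy data (the author’s framework may differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# A tiny, cleanly separable subset: if a model can't reach ~zero training
# error even here, the capacity or optimization setup is suspect.
rng = np.random.RandomState(0)
X_small = rng.randn(20, 10)
y_small = (X_small[:, 0] > 0).astype(int)

model = MLPClassifier(hidden_layer_sizes=(64,), activation='relu',
                      max_iter=2000, random_state=0)
model.fit(X_small, y_small)
print(model.score(X_small, y_small))  # expect ~1.0 on a set this small
```

If the real features fail this test where the toy data passes, that points at the features (or labels) rather than the optimizer.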
I think I might have to go back to the feature-engineering drawing board. It’s possible the features themselves are not strong enough for my model to learn from.
I’d like to hear your thoughts on what other steps might be used to tune the model (with the same feature set). Maybe weight decay or other kinds of regularization would work. I’ll add to the results as I perform the tests.