Optimizing Neural Networks — Where to Start?
Developing intuitions through building and tuning a neural network using Keras in Google Colab
A neural network has many parameters and hyperparameters (all referred to as "parameters" from here on) to tune, so where should we start?
In Professor Andrew Ng’s Deep Learning Specialization courses, he gives the following guideline:
- Start with learning rate;
- Then try number of hidden units, mini-batch size and momentum term;
- Lastly, tune number of layers and learning rate decay.
These are great tips. But to make them part of our skill set, we need intuition :) To build that intuition, I created a customizable neural network class in Python and ran a series of experiments to verify the ideas. Let’s see!
Setting up the Environment
We’ll use Google Colab for this project, so most of the libraries are already installed. Since we’ll be training neural networks, it’s important to use a GPU to speed up training.
To enable the GPU, just go to “Runtime” in the dropdown menu and select “Change runtime type”. You can then verify the setup by hovering your mouse over “CONNECTED” in the top right corner:
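If you’d rather confirm from code, a quick check (assuming the TensorFlow backend that Colab ships with) is:
import tensorflow as tf
# Prints a device string like '/device:GPU:0' when a GPU runtime is
# active, and an empty string otherwise
print(tf.test.gpu_device_name())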
Getting the Data
In this project, we’ll use the Pima Indians Diabetes Dataset, for these reasons:
- The dataset is challenging: the top accuracy results are only around 77%, which leaves us plenty of room for model tuning;
- The dataset is small with only 768 rows and 9 columns. This allows us to train faster and thus makes it possible to do 10-fold cross-validation for a better representation of model performance.
Although we could download the dataset manually, for reproducibility we’ll download it through Kaggle’s API. First, create an API token by visiting the “My Account” page on Kaggle; this downloads a kaggle.json file to your computer.
Next, we need to upload this credential file to Colab:
from google.colab import files
files.upload()
Then we can install the Kaggle API, save the credential file in the “.kaggle” directory, and restrict its permissions (the Kaggle CLI warns if the key is readable by other users):
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
Now we can download the dataset:
!kaggle datasets download -d uciml/pima-indians-diabetes-database
The dataset will be downloaded to your current working directory, which is the “content” folder in Colab. Since files get deleted every time you restart your Colab session, it’s a good idea to save them to your Google Drive. Just mount the drive with the code below and save files there:
from google.colab import drive
drive.mount('/content/gdrive')
Once it’s mounted, you’ll be able to load data directly from Google Drive with the “/content/gdrive” path. Mounting your Google Drive will also come in handy later when you need to save plot files.
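With the file in place, we can load it with pandas and split out the features and the target. The file and column names below match the Kaggle dataset as published at the time of writing (the archive unzips to diabetes.csv with the label in an Outcome column); the seed value is an arbitrary choice of mine for reproducibility:
import pandas as pd
# Unzip first if needed: !unzip -o pima-indians-diabetes-database.zip
df = pd.read_csv("diabetes.csv")
X = df.drop(columns="Outcome").values  # the 8 feature columns
y = df["Outcome"].values               # 1 = diabetic, 0 = not
seed = 42  # arbitrary; any fixed value keeps the CV folds reproducible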
A Baseline Model with XGBoost
XGBoost is known as the go-to algorithm for tabular data thanks to its high accuracy and efficiency. Let’s give it a try!
from time import time
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

t1 = time()
clf = xgb.XGBClassifier()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
scores = cross_val_score(clf, X, y, cv=cv)
t2 = time()
t = t2 - t1
print("Mean Accuracy: {:.2%}, Standard Deviation: {:.2%}".format(scores.mean(), scores.std()))
print("Time taken: {:.2f} seconds".format(t))
We get 74.88% accuracy, and it took only 0.35 seconds! If we standardize the features and test again, we get 76.31%! That is already very close to the state-of-the-art accuracy on this dataset.
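Standardizing is a one-liner with scikit-learn; this is roughly how the X_std used later on might be produced:
from sklearn.preprocessing import StandardScaler
# Rescale each feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)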
Creating the Model
To test different models, we need the ability to create models on the fly. We also need to evaluate each model and report results. Both needs point to object-oriented programming, so I created a class for testing, sketched below. I’ll explain the technical details of this and the following section in a separate post.
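The full class is in the project repo; here is a minimal sketch of just the model-building piece, assuming a Keras Sequential stack. The name build_model and the hard-coded Adam optimizer are my simplifications (the real class also switches on the optimizer parameter); the parameter names mirror the defaults listed in the baseline section below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(input_dim, num_layers, num_units, activation,
                activation_out, loss, initializer, learning_rate, metrics):
    """Build and compile a fully connected binary classifier."""
    model = Sequential()
    # First hidden layer also declares the input dimension
    model.add(Dense(num_units, input_dim=input_dim, activation=activation,
                    kernel_initializer=initializer))
    # Remaining hidden layers
    for _ in range(num_layers - 1):
        model.add(Dense(num_units, activation=activation,
                        kernel_initializer=initializer))
    # Single-unit sigmoid output for binary classification
    model.add(Dense(1, activation=activation_out,
                    kernel_initializer=initializer))
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss=loss, metrics=metrics)
    return model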
Automating the Tests
Since we need to test many different combinations of parameters and save the results, it’s important to automate the test process. Again, let me show rather than tell, since the details will be explained in a later post:
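The real run_test lives in the repo as well; roughly, it performs a timed, stratified 10-fold cross-validation and prints a summary line like the ones you’ll see below. A simplified sketch under those assumptions, reusing build_model and the seed defined earlier:
from time import time
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_test(X, y, param_dict):
    """Cross-validate a model built from param_dict; return mean accuracy."""
    t1 = time()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in cv.split(X, y):
        keys = ("input_dim", "num_layers", "num_units", "activation",
                "activation_out", "loss", "initializer", "learning_rate",
                "metrics")
        model = build_model(**{k: param_dict[k] for k in keys})
        model.fit(X[train_idx], y[train_idx], epochs=param_dict["epochs"],
                  batch_size=param_dict["batch_size"], verbose=0)
        # evaluate() returns [loss, accuracy] given metrics=['accuracy']
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(acc)
    print("Finished cross-validation. Took {:.1f} minutes. "
          "Mean Accuracy: {:.2%}, Standard Deviation: {:.2%}".format(
              (time() - t1) / 60, np.mean(accuracies), np.std(accuracies)))
    return np.mean(accuracies)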
Baseline Neural Network Model
Let’s start with a baseline model with the following default parameters:
- input_dim=8
- num_layers=2
- num_units=8
- activation='relu'
- activation_out='sigmoid'
- loss='binary_crossentropy'
- initializer='random_uniform'
- optimizer='adam'
- learning_rate=0.001
- metrics=['accuracy']
- epochs=10
- batch_size=4
- one_hot=False
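Packed into a dictionary by a small helper (a sketch; the actual get_defaults is in the repo):
def get_defaults():
    """Return a fresh copy of the baseline parameter dictionary."""
    return {
        "input_dim": 8, "num_layers": 2, "num_units": 8,
        "activation": "relu", "activation_out": "sigmoid",
        "loss": "binary_crossentropy", "initializer": "random_uniform",
        "optimizer": "adam", "learning_rate": 0.001,
        "metrics": ["accuracy"], "epochs": 10, "batch_size": 4,
        "one_hot": False,
    }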
If we run:
param_dict_defaults, param_dict = get_defaults(), get_defaults()
accuracy_baseline = run_test(X=X, y=y, param_dict=param_dict_defaults)
We’ll get:
Finished cross-validation. Took 1.5 minutes. Mean Accuracy: 71.61%, Standard Deviation: 2.92%
It’s not bad, but definitely far from the top result of 77.7%.
Importance of Different Parameters
To understand each parameter’s impact on model tuning, let’s adjust one parameter at a time while keeping all the others constant (unlike an exhaustive search such as GridSearchCV in sklearn, which tries every combination). A sketch of the loop follows; running the tests provides us with the following results:
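Conceptually, the one-at-a-time sweep looks like this (the candidate values here are illustrative; the full grids are in the repo):
candidates = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "num_layers": [1, 2, 4, 8],
    "num_units": [4, 8, 16, 32],
}
results = {}
for param, values in candidates.items():
    for value in values:
        param_dict = get_defaults()  # reset all other parameters to defaults
        param_dict[param] = value    # vary exactly one parameter
        results[(param, value)] = run_test(X=X, y=y, param_dict=param_dict)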
First, it’s interesting to note that some parameters not mentioned in the tuning guideline above can be important factors, e.g. optimizer and epochs.
Second, learning rate is indeed among the most impactful parameters.
Third, for this specific experiment (including parameter choices), it seems that number of layers is more important than number of hidden units. This is contrary to the above guideline.
Below is the tuning trend, which we can use to find promising ranges to tune within.
It’s important to note that the tests here are only meant to provide some intuition and shouldn’t be taken as formal rules, for at least two reasons: one, the various parameters and their candidate values are not necessarily comparable; two, there is innate randomness in neural networks, so results such as the plots above could change.
Interactions between parameter values almost certainly matter, i.e. 40 epochs may yield worse accuracy when paired with a learning rate other than 0.001 (e.g. 0.1). Nevertheless, we’ll try a naive approach here: combine the independently tuned best parameter values and train a single model, which gives us:
Finished cross-validation. Took 49.3 minutes. Mean Accuracy: 78.00%, Standard Deviation: 4.59%
Wow, that’s a brutal 50 minutes! We can’t complain about the result, though, since it matches the state of the art. It seems the naive approach does work.
Parameter Tuning
Now that we’ve seen the relative importance of the parameters, it’s time to tune the model. As learning rate is the most important one, let’s tackle it first. We’ll use the following code to generate 6 random learning rate values between 0.0001 and 0.01, since this is the most promising region according to the tuning trend visualization above.
import numpy as np

# Three samples in (1e-4, 1e-3] and three in (1e-3, 1e-2]
bases = np.repeat(10, 3)
exponents_1 = -(np.random.rand(3) + 3)
exponents_2 = -(np.random.rand(3) + 2)
learning_rate = np.power(bases, exponents_1).tolist() + np.power(bases, exponents_2).tolist()
After running the test, we got:
which points us to 0.0006716184352348816 as the best learning rate. Let’s use this and continue with batch size, also with 6 options, since we definitely want to trust Prof. Ng’s guideline that batch size is among the second most important parameters :)
batch_size = [2 ** e for e in range(6)]
Although batch size 2 yields higher accuracy, its time cost significantly outweighs the benefit, so we’ll go with a batch size of 16.
After updating the batch size value in our parameters dictionary, we can proceed to tune the number of epochs. Since training and testing time grows with the number of epochs, it’s better to tune this parameter at a later stage to avoid long runs.
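The candidate values below are my illustration, roughly doubling upward from the default of 10 (the exact grid is in the repo):
# Hypothetical candidates; the sweep found 200 to be best
epochs = [10, 25, 50, 100, 200, 400]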
Running the test gives us 200 as the best number of epochs. Next, let’s build the final model, with standardized features:
run_test(X=X_std, y=y, param_dict=param_dict)
That gives us:
Finished cross-validation. Took 8.3 minutes. Mean Accuracy: 78.53%, Standard Deviation: 3.64%
Absolutely great result! The time taken is not too bad, although it’s 1422 times more than XGBoost 😂
Now, what if we skip parameter tuning entirely and just standardize the features?
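In other words, run the untouched defaults dictionary from earlier on the standardized features, something like:
run_test(X=X_std, y=y, param_dict=param_dict_defaults)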
Finished cross-validation. Took 1.7 minutes. Mean Accuracy: 76.95%, Standard Deviation: 2.88%
So it seems that the gain from parameter tuning is somewhat marginal, while standardization, i.e. making the features have zero mean and unit variance, is huge for neural network model tuning.
Summary
- Learning rate is the most important parameter to tune since it can yield big performance improvements while not negatively affecting training time.
- Smaller batch sizes may provide better results, but they are also more time-consuming! Similarly, training for more epochs generally helps accuracy, but the time cost is also high.
- Optimizer can be an important parameter to tune.
- Deeper and wider neural networks may not always be helpful.
- Feature standardization can greatly improve model performance and is an easy win compared with parameter tuning.
- Neural networks are great, but they are not for everything. As shown above, training and tuning a neural network model can take hundreds or thousands of times longer than a non-neural model! Neural networks are best suited for use cases such as computer vision and natural language processing.
You can find the complete code in my project repo on GitHub. Do give it a try and see what results you can get!
Thank you for reading! Is there anything that I can improve on? Kindly let me know below. We all get better by learning from each other!