Using Google Colab GPU VM + Drive as persistent storage for long Deep Learning training runs
Not everyone has a deep learning rig or plenty of cloud credits for hardware-accelerated computing. Google Colaboratory is a service that provides a Tesla K80 GPU runtime for free, but training deep neural networks from scratch can be a pain given its current limitations. So, this post is about using your Google Drive to store your dataset (in "upload only once" mode) and to save checkpoints, so you can resume a long training run whenever the Colab instance gets disconnected.
[Update] Now you can also get an Nvidia T4 or Nvidia P100 GPU. Run !nvidia-smi to see the GPU details.
Why use Google Drive?
- Google Colab provides a maximum GPU runtime of roughly 8-12 hours at a time. It may get disconnected earlier than this if it detects inactivity, or when there is heavy load.
- Drive acts as persistent storage for the Colab virtual machine, so you won't lose your trained weights if the VM gets disconnected from the runtime.
- You can upload your dataset once and use it hassle-free whenever you reconnect to a new runtime.
How do I use Google Drive with Google Colab?
To mount your Drive in a Colab runtime, run the two lines of code below. They print a link where you can obtain an authorization code; copy that code into the input prompt, press Enter,
and your Drive will be mounted in the current Colab session.
from google.colab import drive
drive.mount('/content/gdrive')
Note: you will have to do these steps every time you restart your Colab notebook runtime, or if it gets disconnected.
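Since you have to re-mount after every restart, a quick sanity check can save confusion. This is a small sketch (not part of the Colab API) that simply tests whether the default mount point exists as a directory:

```python
import os

def drive_is_mounted(mount_point="/content/gdrive"):
    """Return True if the Drive mount point exists as a directory.

    /content/gdrive is the mount point used throughout this post;
    it only exists after drive.mount() has run in the session.
    """
    return os.path.isdir(mount_point)
```

In a fresh runtime this returns False; after running drive.mount() it returns True.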
How do I access my files from Google Drive in Google Colab?
All the files and folders of your GDrive are accessible from a folder called "My Drive". After you have mounted your Drive (as mentioned above), they will be accessible from the path /content/gdrive/My Drive/<folder/file>.
Example: a file named sample.csv inside a folder called Sample in your Drive will be at /content/gdrive/My Drive/Sample/sample.csv
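Because the mount root contains a space ("My Drive"), hand-typed paths are easy to get wrong. A small helper like the hypothetical one below (using only the standard library) builds such paths reliably:

```python
import os

# "My Drive" (with the space) is the fixed root Colab exposes after mounting.
GDRIVE_ROOT = "/content/gdrive/My Drive"

def gdrive_path(*parts):
    """Build an absolute path to a file or folder inside the mounted Drive."""
    return os.path.join(GDRIVE_ROOT, *parts)

print(gdrive_path("Sample", "sample.csv"))
# /content/gdrive/My Drive/Sample/sample.csv
```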
How do I store and load datasets from Google Drive?
- It is more efficient to upload your dataset as a zipped file (especially if it contains a lot of images) and unzip it directly in Colab than to upload the unzipped folder.
- Once the Drive is mounted, you can unzip the file with the bash command unzip (prefixed with an exclamation mark to run it in a notebook cell).
Example: to unzip sample.zip, which is in the folder Sample, run !unzip -qq '/content/gdrive/My Drive/Sample/sample.zip'
- The unzipped files will be accessible from the /content/ directory.
Example: if the above sample.zip contained a file called train.csv, you would access it at '/content/sample/train.csv'.
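If you prefer staying in Python rather than shelling out, the same extraction can be sketched with the standard library's zipfile module (the destination default below assumes the Colab working directory /content):

```python
import zipfile

def extract_zip(zip_path, dest_dir="/content"):
    """Extract a zip archive (e.g. one stored on your Drive) into dest_dir."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)

# Usage in Colab (path assumes the example above):
# extract_zip('/content/gdrive/My Drive/Sample/sample.zip')
```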
How do I use Colab for long training times/runs?
You can do this by saving "checkpoints" to your Drive using "callbacks". Below is a demonstration using the Keras library.
from keras.callbacks import ModelCheckpoint

filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
filepath: the path to a folder called "MyCNN" in your Drive, where each checkpoint is stored as a separate file whose name contains the epoch number and that epoch's validation accuracy. Each file holds the weights of your neural network at that epoch.
ModelCheckpoint: a class in keras.callbacks used to create a checkpoint. With the arguments above, it monitors the validation accuracy of each epoch and, whenever a higher validation accuracy than the last checkpoint is reached, saves that epoch's weights as a new checkpoint.
callbacks_list: stores the ModelCheckpoint object as one of the callbacks to be passed later to model.fit or model.fit_generator (as shown in the snippet below). You can add any other callbacks you want to use to this list.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test),
callbacks=callbacks_list)
For more details on Keras callbacks and checkpoints, visit the Keras Docs.
Resuming the training once the instance is disconnected
- Mount your Drive in the new runtime.
- Create and compile the previous model.
- Run model.load_weights with the checkpoint you want to resume from. In the following example, the 47th epoch had reached a new best validation accuracy of 90.5%:
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')
- Then, for example, if you want to fit the model for another 13 epochs (i.e., 60 epochs in total), pass 47 and 60 as the initial_epoch and epochs arguments of fit or fit_generator respectively.
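Rather than reading the epoch number off the filename by hand, you can recover the most recent checkpoint programmatically. This is a sketch (the helper name is my own, not a Keras API) that assumes the filename pattern used above, epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5:

```python
import glob
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return (path, epoch) for the checkpoint file with the highest epoch
    number in ckpt_dir, or None if no matching .hdf5 file is found.

    Assumes filenames like: epochs:047-val_acc:0.905.hdf5
    """
    best = None
    for path in glob.glob(os.path.join(ckpt_dir, "*.hdf5")):
        match = re.search(r"epochs:(\d+)", os.path.basename(path))
        if match:
            epoch = int(match.group(1))
            if best is None or epoch > best[1]:
                best = (path, epoch)
    return best

# In Colab, after mounting and rebuilding the model:
# path, epoch = latest_checkpoint('/content/gdrive/My Drive/MyCNN')
# model.load_weights(path)
# then pass initial_epoch=epoch to fit or fit_generator
```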
Hope this helps your training on Colab without the worry of losing all trained weights! 🙂
Thank you Pavan Rao for the edits.