Using Google Colab GPU VM + Drive as persistent storage for long Deep Learning training runs

Prajwal Prashanth
4 min read · Mar 15, 2019


Not everyone has a deep learning rig or plenty of cloud credits for hardware-accelerated computing. Google Colaboratory is a service that provides a Tesla K80 GPU runtime for free, but training deep neural networks from scratch can be a pain with the limitations it currently has. So, this post is about using your Google Drive to store your dataset (in “upload only once” mode) and to save checkpoints, so you can resume a long training run whenever the Colab instance gets disconnected.

[Update] You can now also get an Nvidia T4 or an Nvidia P100 GPU. Run !nvidia-smi to see the GPU details.

Why use Google Drive?

  • Google Colab provides a GPU runtime for at most 8–12 hours at a time, and it may disconnect earlier than that if it detects inactivity or is under heavy load.
  • Drive acts as persistent storage for the Colab virtual machine, so you won’t lose your trained weights if you get disconnected from the runtime.
  • You can upload your dataset once and use it hassle-free whenever you reconnect to a new runtime.

How do I use Google Drive with Google Colab?

To mount your Drive to a Colab runtime, run the two lines of code below, which will prompt you for an authorization code. They print a link where you can obtain that code. Copy it into the input prompt, press Enter, and you will have successfully mounted your Drive in the current Colab session.

from google.colab import drive
drive.mount('/content/gdrive')

Note: you will have to do these steps every time you restart your Colab notebook runtime, or if it gets disconnected.

How do I access my files from Google Drive in Google Colab?

All the files and folders of your Drive are accessible from a folder called “My Drive”. After you have mounted your Drive (as mentioned above), they will be accessible from the path /content/gdrive/My Drive/<folder/file>.

Example : A folder called Sample in your Drive containing a file named sample.csv will be at /content/gdrive/My Drive/Sample/sample.csv
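As a quick sanity check after mounting, you can list that folder from Python. This is a minimal sketch, assuming the Sample folder from the example above actually exists in your Drive:

import os

# List the contents of the example "Sample" folder on the mounted Drive
print(os.listdir('/content/gdrive/My Drive/Sample'))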

How do I store and load datasets from Google Drive?

  • It is more efficient to upload a zipped file of your dataset (especially if it contains a lot of images) and unzip it directly in Colab than to upload the unzipped folder.
  • Once the Drive is mounted, you can unzip the file with the bash command unzip (prefixed with an exclamation mark to run it in a notebook cell).

Example : To unzip sample.zip which is in the folder Sample, run
!unzip -qq '/content/gdrive/My Drive/Sample/sample.zip'

  • The unzipped files will be accessible from the /content/ directory.

Example : If the above sample.zip file had a file called train.csv, you would access it from '/content/sample/train.csv'.
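For instance, a minimal sketch of loading that CSV with pandas (assuming the zip really extracted to a sample/ folder containing train.csv, as in the example above):

import pandas as pd

# Read the extracted CSV from the Colab VM's local disk
train_df = pd.read_csv('/content/sample/train.csv')
print(train_df.shape)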

How do I use Colab for long training times/runs?

You can do this by saving “checkpoints” to your Drive using “callbacks”. Below is a demonstration using the Keras library.

from keras.callbacks import ModelCheckpoint

filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
  • filepath: is the path pattern, inside a folder called “MyCNN” in your Drive, under which each checkpoint is saved as a separate file named after the epoch number and that epoch’s validation accuracy. These files contain the weights of your neural network architecture at that epoch.
  • ModelCheckpoint: is a class in keras.callbacks used to create a checkpoint. With the arguments passed in the snippet above, it monitors the validation accuracy of each epoch and saves the weights as a new checkpoint whenever the validation accuracy exceeds the best one seen so far.
  • callbacks_list: simply stores the ModelCheckpoint object as one of the callbacks to be passed later when calling model.fit or model.fit_generator (as shown in the snippet below). You can add any other callbacks you want to use to this list.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)

For more details on Keras callbacks and checkpoints, visit the Keras Docs.

Resuming the training once the instance is disconnected

  • Mount your Drive in the new runtime.
  • Create and compile the previous model.
  • Run model.load_weights as shown in the following example, where the checkpoint saved at the 47th epoch had reached a new best validation accuracy of 90.5%:
    model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')
  • Then, for example, if you want to fit the model for another 13 epochs (i.e., 60 epochs in total), pass initial_epoch=47 and epochs=60 to fit or fit_generator, as in the sketch below.
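Putting these steps together, here is a minimal sketch of a resumed run. It assumes model, datagen, x_train/y_train, x_test/y_test, and callbacks_list have been recreated exactly as in the original session, and the checkpoint filename is the hypothetical example above.

# Restore the weights saved at epoch 47 (recreate and compile the model first)
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')

# Resume at epoch 47 and train up to epoch 60, reusing the same checkpoint
# callback so any new best epochs keep getting saved to Drive
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    initial_epoch=47,
                    epochs=60,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)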

Hope this helps you train on Colab without worrying about losing all your trained weights! 🙂

Thank you Pavan Rao for the edits.
