For those trying to build competency in the emerging field of machine learning, a common obstacle and expense comes from the hardware requirements for GPU-accelerated training, without which the iterative nature of hyperparameter selection can be time intensive for all but small datasets or shallow networks. Users have traditionally had to choose between buying dedicated hardware, such as an Nvidia GPU for their PC, or working in a virtual terminal via services such as AWS EC2 or Google Cloud, incurring either a fixed upfront hardware expense or hourly usage charges for virtual machines.
Colaboratory is a Jupyter notebook environment that runs in the cloud from Google Drive, with the added benefit of the simultaneous multi-user collaboration features found in Google Docs or Sheets. The service offers a Runtime option to activate GPU acceleration (reportedly an Nvidia Tesla K80) for uninterrupted training intervals of up to 12 hours.
This post will quickly highlight a few of the basics for data imports and exports to help a user get started. Note that Colaboratory does offer an official reference notebook with import and export functions (which is the inspiration for the methods here); however, in my attempts to implement them I found a few additional details helpful in practice, which I'll include here and in a public companion notebook on Colaboratory: [link]. Along the way I'll address data imports and exports both for a local drive and for files saved in Google Drive.
A few quick asterisks: the code embeddings here are GitHub gists, which are a neat resource for writers on Medium. Alternatively, if you're trying to implement, you can pull the code from the linked Colaboratory notebook. Also note that I am neither a professional in the field nor, to be honest, a very accomplished hobbyist, so don't expect anything too elaborate; the hope is that by sharing a few points I can contribute to others just getting started, and who knows, perhaps earn a few karma points from like Buddha or something.
1. Assumptions
- Notebook run in Python 3.6
- Data imports are Kaggle datasets in .csv format
- User has an active Google Drive account
2. Data Operations for Local Drive
I'll first address data imports and exports of files stored on a local drive. I expect that for larger datasets there will be benefits to storing them in the cloud, such as on the Google Drive platform (avoiding upload network lag, etc.); however, for the kind of beginner Kaggle problems I have been tackling lately, simply uploading from a local drive works just fine.
Uploading dataset from Local Drive
Downloading a Keras Model
Uploading a Keras Model
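A sketch of the reverse direction, assuming the model was previously saved in .h5 format (`'my_model.h5'` is a placeholder for your file's name). Writing the returned bytes to the VM's disk first avoids relying on any side effects of the upload helper:

```python
from google.colab import files  # Colab-only helper
from keras.models import load_model

# Browser file-picker; returns a dict of {filename: bytes}
uploaded = files.upload()

# Write the raw bytes to the VM's disk, then load as a Keras model
with open('my_model.h5', 'wb') as f:
    f.write(uploaded['my_model.h5'])
model = load_model('my_model.h5')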
Downloading a dataframe to a .csv file (e.g. for a Kaggle submission)
3. Data Operations for Google Drive
The value of the Colaboratory platform extends well beyond just the GPU access: it also has a novel feature for simultaneous multi-user shared editing of notebooks with live updates, similar to what you may have experienced with Google Docs. As such, I would expect that using files stored on a Google Drive account is beneficial for shared notebooks, allowing all parties to access datasets stored on a common platform.
Now Google Drive does have an API for interacting with files, and I’m sure there are very polished ways to automate file selection and query file metadata — however I’m not writing a thesis here so this approach will resort to ‘hacks’, fair warning.
Note that the Colaboratory reference notebook offers two approaches for interfacing with files on Drive: one using PyDrive and one using the Drive REST API. I'm demonstrating the latter here, not based on any evaluation of relative merits or features, but simply because it was the first method I got to work. It could be that PyDrive is the superior approach; if anyone feels either way on this point, feel free to let me know in the comments.
For the following approach you will need the "id" of the Google Drive file you wish to download (this is Drive metadata, distinct from the file name). I'm sure there are much cleaner ways to do this, but a quick hack is to use the embedded API explorer on this Google API page (REST v2): simply click the EXECUTE button near the bottom of the page (without inputting or editing any of the fields), and it will output the text of a Python dictionary containing metadata for every file saved in the Google Drive account you select. Select all of the output (i.e. ctrl-A), paste it into a word processor, and do a quick ctrl-F string search for the name of the file; a few rows above it you'll find the "id" string. (If any domain experts reading this want to suggest a cleaner way to look up a file id using the API, I'd be happy to update this tutorial.)
Uploading dataset from Google Drive
I'm not sure if it's a quirk of Colaboratory or just how io is supposed to work (probably the latter), but some experimentation revealed that the downloaded io.BytesIO object does not remain in memory outside of the cell it is created in; fortunately we can immediately convert it to a dataframe, which is what is done here.
Uploading a Keras Model from Google Drive
So this is where I ran into some roadblocks. I found that I could use a comparable approach to the above to load a Keras model saved in .h5 format from Google Drive into the Colaboratory notebook, but when I went to convert the byte-string object, that's where I got stuck. I have an active Stack Overflow query if anyone has any suggestions.
Well, now that you've got your hands on a GPU, there are no excuses. You've got the hardware to accelerate training, you've got the Kaggle community's kernels demonstrating all kinds of winning solutions, and unlike the practitioners of just a few short years ago, you've got recent publications demonstrating practical fundamentals in accessible book form. All that's left is the drive. Have fun.
Books that were referenced here or otherwise inspired this post:
Deep Learning With Python — Francois Chollet
Hands-On Machine Learning with Scikit-Learn and TensorFlow — Aurélien Géron
(As an Amazon Associate I earn from qualifying purchases.)