Configuring Google Colab Like A Pro

How to Do Research Quality Machine Learning on a Budget

Photo by Damir Spanic on Unsplash

For the longest time, I avoided learning to use Google Colab. It was so close to what I knew in Jupyter Notebook, but I’d seen forum thread after forum thread about the quirks of Colab and how people had to hack it to make it work. I didn’t want that and was happy to pay $0.50/hr for a preemptible GCP instance, until GCP became unusable due to constant preemption. I considered a few other services, but eventually settled on Colab, feeling my skills had grown enough over the past year that I could figure out how to make it work. I did and now I’ll share what I learned and did to make it possible to do Automated Speech Recognition research on a Colab instance.

What this guide is and isn’t: This guide is for advanced features and hacks, like remoting in to a Colab session with VSCode, that allow you to make Colab usable for larger projects. I won’t cover basics like installing libraries or uploading files as there are already great resources.

Link to Colab Notebook with Code

Table of Contents:

  1. Make sure you don’t get disconnected
  2. Mount your drive for fast, responsible access to your datasets
  3. Use wget to download datasets to your drive
  4. Use Gdown to grab publicly available Google Drive files
  5. The best way to connect your Github
  6. Remote in through VSCode using SSH and ngrok
  7. How to Forward Ports from Colab to your computer
  8. Run Tensorboard in Colab or in the browser
  9. Run a Jupyter Notebook server on Colab and access it locally
  10. Use fastprogress when your code will take a while
  11. Set up a Telegram bot to update you during setup and training
  12. Some paid addons worth considering
  13. Addendum and Extras

Tip 1: Make sure you don’t get disconnected

%%javascript
function ClickConnect(){
  console.log("Working");
  document.querySelector("colab-toolbar-button#connect").click();
}
setInterval(ClickConnect, 60000);

It has been an open secret that you can avoid getting disconnected on Colab by opening the console and entering JavaScript to click the reconnect button for you. It gets very old pressing Ctrl-Shift-I, finding this snippet, and pasting it in every time you start a new session, but Colab gives you the ability to run JavaScript from a cell using the %%javascript magic. Add this cell before your training loop and run it when you plan to do a long training run to avoid getting disconnected mid-training.

There are several variations of this “disconnect protect” code floating around the internet. If this one doesn’t work for you, check the Addendum section for more.

Note: Please use this responsibly. Getting booted from Colab is very annoying, but it is done to make resources available for others when you’re not actively using them. For some reason, Colab still kicks you off during training runs of several hours, when you’re not actually inactive, so use this only to circumvent that case.

Tip 2: Mount your drive for fast, responsible access to your datasets

from google.colab import drive
drive.mount('/content/drive')

As you know, Colab deletes any files you’ve downloaded or created when you end a session. The best option is to use GitHub to store your code (details below), and Google Drive to store datasets, logs, and anything else that would normally reside on your filesystem but wouldn’t be tracked by git.

When you run the code above, you will need to click a link and follow a process that takes about 30 seconds. When it completes, all of your Drive files will be available under ‘/content/drive’ on your Colab instance, which lets you structure your projects the same way you would on a cloud server. I keep my pytorch-lightning logs on Drive and view them using Tensorboard inside my notebook; see Tip 8 for more.
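Here’s a rough sketch of what “structure your projects like a cloud server” can look like in practice. The helper name project_dirs and the data/logs/checkpoints folder names are my own placeholders, not anything Colab requires, and the root path assumes the default mount point used above.

```python
from pathlib import Path

# Default location of the "Colab Notebooks" folder after drive.mount('/content/drive').
DRIVE_ROOT = Path("/content/drive/My Drive/Colab Notebooks")


def project_dirs(project_name, root=DRIVE_ROOT):
    """Create (if needed) and return the Drive folders for one project."""
    base = Path(root) / project_name
    # Hypothetical layout: rename these folders to suit your own project.
    dirs = {name: base / name for name in ("data", "logs", "checkpoints")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Example (run after mounting Drive):
# dirs = project_dirs("speech_recognition")
# log_dir = dirs["logs"]  # point your logger here
```

Because everything lives under one dictionary, the rest of the notebook never has to hard-code a Drive path.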

Tip 3: Use wget to download datasets to your drive

! wget -c -P '/content/drive/My Drive/Colab Notebooks/data/' http://www.openslr.org/resources/12/train-clean-100.tar.gz

Uploads from your computer to Google Drive can be incredibly slow, especially when dealing with multiple GBs of data. Download speeds are much faster, so take advantage with the command ! wget -c -P save_path url. The -c flag resumes a partial download and -P sets the directory to save into. Saving to Drive means you download the data only once, saving you time and saving bandwidth for the generous owners of publicly hosted datasets.
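If you’d rather do the “download only once” check from Python, something like the sketch below works. download_once is a hypothetical helper of mine, and note one difference from wget -c: it skips a file that already exists rather than resuming a partial one, so an interrupted download would need to be deleted by hand.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_once(url, save_dir, fetch=urlretrieve):
    """Download url into save_dir unless a file with that name is already there."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL path.
    target = save_dir / Path(urlparse(url).path).name
    if target.exists():
        print(f"Already downloaded: {target}")
    else:
        fetch(url, str(target))
    return target


# Example (run after mounting Drive):
# download_once(
#     "http://www.openslr.org/resources/12/train-clean-100.tar.gz",
#     "/content/drive/My Drive/Colab Notebooks/data/",
# )
```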

For details about how to unzip and untar data in Colab and Drive, see the addendum. A general note is, due to weird Google Drive quota issues, you are better off copying the archive to colab and decompressing it there than you are decompressing the archive while it is hosted on your drive, which brings us to our next tip:

Tip 4: Use Gdown to grab publicly available Google Drive files

## Command Line
# note, your file_id can be found in the shareable link of the file
! pip install gdown -q
! gdown --id <file_id>
## In Python
import gdown
url = 'https://drive.google.com/uc?id=<file_id>'
output = 'my_archive.tar'
gdown.download(url, output, quiet=False)

Gdown is a nice library for downloading large files from drive to colab.

If the archive you’re decompressing is under 10 GB and you’re not doing it multiple times per day, then this isn’t strictly necessary, but I’d suggest it as a best practice unless your dataset is private.

Thanks to Sheik Mohamed Imran for the tip

Note: Google recently changed the format of shareable links, see my post here for details on how to reformat the URL to make this work.
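As a rough illustration of that reformatting (this is my own sketch, not the code from the linked post), the helper below pulls the file id out of a shareable link and rebuilds the uc?id= URL used above. It assumes the link has either the /file/d/<file_id>/ shape or the older ?id=<file_id> shape; drive_download_url is a made-up name.

```python
import re


def drive_download_url(share_link):
    """Turn a Google Drive shareable link into the uc?id= form used above."""
    # Newer links: https://drive.google.com/file/d/<file_id>/view?usp=sharing
    # Older links: https://drive.google.com/open?id=<file_id>
    match = re.search(r"/file/d/([\w-]+)", share_link) or re.search(
        r"[?&]id=([\w-]+)", share_link
    )
    if match is None:
        raise ValueError(f"No file id found in: {share_link}")
    return f"https://drive.google.com/uc?id={match.group(1)}"


# Example:
# url = drive_download_url("https://drive.google.com/file/d/<file_id>/view?usp=sharing")
# gdown.download(url, "my_archive.tar", quiet=False)
```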

Tip 5: The best way to connect your Github

import os
from getpass import getpass
import urllib.parse
user = 'rbracco'
# GitHub no longer accepts account passwords for Git over HTTPS; enter a personal access token here
password = getpass('Password: ')
repo_name = 'fastai2_audio'
# your password is converted into url format
password = urllib.parse.quote(password)
cmd_string = 'git clone https://{0}:{1}@github.com/{0}/{2}.git'.format(user, password, repo_name)
os.system(cmd_string)
cmd_string, password = "", "" # removing the password from the variable
# Bad password fails silently so make sure the repo was copied
assert os.path.exists(f"/content/{repo_name}"), "Incorrect Password or Repo Not Found, please try again"

Credit: Vinoj John Hosan https://stackoverflow.com/a/57539179/5042053

This cell is a bit long, but like the others here, you can copy and paste it in the top of your main notebook, and run it when you boot up. This will allow you to grab both public and private repos without leaving your password exposed in the notebook.
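To see why the URL-encoding step matters, here is a small standalone sketch. build_clone_url is a hypothetical helper, and unlike the cell above it passes safe="" so that a “/” in the secret is encoded too, which is a choice of mine rather than part of the original snippet.

```python
from urllib.parse import quote


def build_clone_url(user, secret, repo_name):
    """Build an HTTPS clone URL with the credentials percent-encoded."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return "https://{0}:{1}@github.com/{0}/{2}.git".format(
        quote(user, safe=""), quote(secret, safe=""), repo_name
    )


# Dummy secret, only to show the encoding.
print(build_clone_url("rbracco", "p@ss/word", "fastai2_audio"))
# prints https://rbracco:p%40ss%2Fword@github.com/rbracco/fastai2_audio.git
```

Without the encoding, the “@” in the secret would be read as the end of the credentials and the clone would fail.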

Bonus Tip: Don’t forget to tell Git who you are. Add this cell so you don’t have to answer every time you commit during a new session!

!git config --global user.email "<YOUR EMAIL>"
!git config --global user.name "<YOUR NAME>"

Tip 6: Remote in through VSCode using SSH and ngrok

!pip install colab_ssh --upgrade
from colab_ssh import launch_ssh
# in a separate cell in case you need to restart the service
ngrokToken = '<YOUR NGROK API TOKEN HERE>'
launch_ssh(ngrokToken, password="password")

As someone who is scared of devops, I promise you this trick is way easier than it sounds. The value of terminal access, not having to re-enter your GitHub password, and being able to edit .py files locally inside of VSCode makes it well worth the trouble. Once it’s set up, it feels like everything is running on your own computer.

Here’s all you need to do:

  1. Visit the ngrok homepage, create an account, and copy your free api token into the cell above.
  2. Execute the above cell
  3. Copy the output and add it to your .ssh/config (if you don’t know how to do this, see the addendum)

There is a catch here. ngrok runs in the background, so if you invoke Google Colab’s stop-cell or interrupt execution, you will kill the ngrok process and get disconnected from VSCode, requiring you to relaunch and update your SSH config (takes about 60 seconds). If you’re willing to pay $10/month, you can avoid this with a static TCP address, which comes with ngrok pro. See the addendum for more.

Tip 7: How to Forward Ports from Colab to your computer

Sometimes you’d like to run something on the Colab server, like Jupyter, a flask app, or Tensorboard, that uses a browser. Colab doesn’t have a browser, so the solution is to forward the port so that you can open it up in your local browser. This does require an SSH connection, so you will need to follow tip 6 first. Once you’ve connected VSCode you can forward a port in two ways:

  1. Just for this session. In VSCode bring up the Command Palette with Ctrl-Shift-P and type ‘Forward a Port’, then type the number of the port you would like to forward (generally 5000 for flask, 3000 for react, 8888 for jupyter, 6006 for tensorboard)
  2. Every time you connect. In VSCode bring up the Command Palette with Ctrl-Shift-P and type ‘Remote-SSH: Connect to Host’, then ‘Configure SSH Hosts’, then select your SSH config file. Add the line LocalForward 127.0.0.1:<PORT> 127.0.0.1:<PORT> under the entry you use to connect to Colab, substituting <PORT> with the port to be forwarded (e.g. 8888).

Now you can open http://localhost:8888/ in your browser and whatever is running on port 8888 on your Colab will be there.
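If it helps to see the whole config entry in one place, the sketch below prints one with a LocalForward line per port. ssh_config_entry is a hypothetical helper; the host alias comes from colab_ssh, while the HostName and Port values are placeholders for whatever launch_ssh prints in your session, and User root is my assumption based on Colab sessions running as root.

```python
def ssh_config_entry(host_alias, hostname, port, forward_ports=(), user="root"):
    """Return an ssh_config block with one LocalForward line per port."""
    lines = [
        f"Host {host_alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    Port {port}",
    ]
    for p in forward_ports:
        lines.append(f"    LocalForward 127.0.0.1:{p} 127.0.0.1:{p}")
    return "\n".join(lines)


# Placeholder address and port: use the values printed by launch_ssh.
print(ssh_config_entry("google_colab_ssh", "<NGROK_HOST>", "<NGROK_PORT>", [8888, 6006]))
```

Paste the printed block into your SSH config, replacing any older google_colab_ssh entry.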

Tip 8: Run Tensorboard in Colab or in the browser

Start tensorboard by creating and executing a cell containing %load_ext tensorboard and then %tensorboard --logdir /PATH/TO/LOGS. This will create a special notebook cell output that is running an interactive tensorboard. If tensorboard isn’t updating in real time, you can reload the extension with %reload_ext tensorboard.

If you’re using VSCode to connect you can also use tensorboard in the browser by forwarding the ports as described in Tip 7. Here are the exact steps for tensorboard:

  1. Add LocalForward 127.0.0.1:6006 127.0.0.1:6006 in your SSH config under “Host google_colab_ssh” and it will forward port 6006 (tensorboard) from Colab to your localhost.
  2. Run tensorboard --logdir /PATH/TO/LOGS from the VSCode terminal that is remotely connected to your Colab instance
  3. Open up localhost:6006 in your local browser and tensorboard is there!
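If the page doesn’t load, a quick way to tell whether the forward itself is the problem is to check the port from your local machine. This is just a sketch using the standard library; port_is_open is a name I made up.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Run locally once VSCode is connected and the forward is active:
# port_is_open(6006)
```

A False result usually means either the forward isn’t set up or tensorboard isn’t running on the Colab side yet.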

Tip 9: Run a Jupyter Notebook server on Colab and access it locally

Colab is amazing and does a fantastic job of building on Jupyter, but sometimes you need to go back to the original. I help maintain fastai_audio, a project that, like all things fastai, is developed and documented entirely in Jupyter Notebooks (thank you Jeremy Howard and Sylvain Gugger).

As it requires a GPU, I wanted to use Colab to contribute to the project but I couldn’t figure out how to do it until I found this thread on the fastai forums, and saw the trick @TheZachMueller used to make it work.

There are two ways you can do this:

A. If you’re already using colab_ssh and vscode from tip 6

  1. Run this command in your Colab notebook to start Jupyter Notebook in the background running on port 8888

! nohup jupyter notebook --no-browser --allow-root --ip=0.0.0.0 &

  2. Forward port 8888 as described in Tip 7
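If you prefer to start the server from Python rather than with nohup, a rough equivalent is sketched below. launch_in_background and the log path are my own placeholders, and I’ve added an explicit --port=8888 to the flags from the command above.

```python
import subprocess


def launch_in_background(cmd, log_path):
    """Start cmd in its own session, appending its output to log_path."""
    with open(log_path, "ab") as log_file:
        # start_new_session runs the child in a new session, similar in
        # spirit to what `nohup ... &` does from the shell.
        return subprocess.Popen(
            cmd, stdout=log_file, stderr=subprocess.STDOUT, start_new_session=True
        )


JUPYTER_CMD = [
    "jupyter", "notebook", "--no-browser", "--allow-root",
    "--ip=0.0.0.0", "--port=8888",
]

# In Colab:
# proc = launch_in_background(JUPYTER_CMD, "/content/jupyter.log")
```

The returned process object lets you call proc.terminate() later, and the log file is where to look for the login token.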

B. If you don’t want to use vscode, but just want jupyter available via the browser.

Follow the black magic described in the fastai forum post.

Tip 10: Use fastprogress when your code will take a while

! pip install fastprogress
from fastprogress import master_bar, progress_bar

Fastprogress is a clean, well-designed progress bar library brought to you by the fastai family. Install and import it along with the other cells at the top of your notebook, and whenever you need to track a loop that will take a long time, it’s as simple as wrapping the sequence you’re iterating over with progress_bar(). That sounds a lot fancier than it actually is. If you’re not sure exactly what it means, check the example below.

#before
for item in my_list:
    process(item)
#after
for item in progress_bar(my_list):
    process(item)

Fastprogress in action: See more at https://github.com/fastai/fastprogress

Tip 11: Set up a Telegram bot to update you during setup and training

I wrote a very brief guide to make a telegram bot to send yourself messages from Python, and a followup guide that shows you how to integrate it as a machine learning callback. The second guide outlines the few tweaks needed to get it running in colab with minimal setup friction.

Every morning when I run the code to pull my github repo, download and extract my dataset, install libraries, and so on, I go outside on the porch and drink my coffee and get a telegram message a few minutes later when my workstation is ready.
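For a sense of how little code this takes, here is a minimal sketch that calls the Telegram Bot API’s sendMessage method with only the standard library. It is not the code from my guides; the function names are made up for this example, and the token and chat id are placeholders you get when you create a bot.

```python
import json
from urllib.request import Request, urlopen


def build_telegram_request(token, chat_id, text):
    """Build a Bot API sendMessage request without sending it."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return Request(url, data=payload, headers={"Content-Type": "application/json"})


def notify(token, chat_id, text):
    """Send text to chat_id and return the HTTP status code."""
    with urlopen(build_telegram_request(token, chat_id, text), timeout=10) as resp:
        return resp.status


# At the end of your setup cells:
# notify("<BOT_TOKEN>", "<CHAT_ID>", "Workstation is ready")
```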

Some paid addons worth considering:

  1. Colab Pro — $9.99/month — available in US only :( — This gets you access to faster, higher-memory GPUs as well as higher usage limits, and less frequent disconnection. If you can afford it, it’s well worth the price.
  2. Extra Google Drive Storage — $29.99/yr (200GB) or $99.99/yr (2TB). I pay for 200GB and it has been fine for Speech Recognition research, as long as I’m conservative (my datasets are ~1000hrs of audio and ~75GB)
  3. Ngrok Static TCP Address — $99/year or $9.99/monthly. This is quite pricey for what it is, and won’t be necessary for most people. The main advantage it gives is that your tcp address doesn’t change every time you run colab_ssh and you don’t have to update your SSH_Config

Addendum

Sometimes things get messy. Here are the extra tweaks you might need to get things working.

Tip 1: Make sure you don’t get disconnected

There are a few variations of the “disconnect protect code”. If the one linked above doesn’t work for you, try the one from this Stack Overflow post that was kindly recommended to me on Twitter by Vijayabhaskar J.

Tip 3: Use wget to download datasets to your drive

How to extract .tar.gz archives from drive

If you download a gzip file to drive, you will only need to unzip it once and the extracted tar file will remain in your drive. You can un-gzip it with

! gunzip <PATH_TO_GZ_FILE>/<ARCHIVE_NAME>.tar.gz
# example from my own colab
from pathlib import Path
base_path = Path('/content/drive/My Drive/Colab Notebooks')
! gunzip "{base_path}/data/train-clean-360.tar.gz"

The .gz file will be replaced with the .tar, and you will then have to extract that to your Colab as part of your setup process every day. If the data is public, you should use gdown as explained in Tip 4. You can then extract it locally using

! tar -xf <PATH_TO_ARCHIVE> -C <FOLDER_TO_EXTRACT_TO>

If the data isn’t public, skip the gdown step and point the tar command at the path to your archive inside of your mounted google drive. I include an example from my own colab below to show what it looks like.

! tar -xf <PATH_TO_ARCHIVE> -C <FOLDER_TO_EXTRACT_TO>
# example from my own colab
! mkdir -p /content/spanish/data/librispeech
! tar -xf "/content/drive/My Drive/Colab Notebooks/data/train-clean-100.tar" -C /content/spanish/data/librispeech

This takes about 30s per GB of data in my experience. To me this is acceptable as I batch all the setup code together at the start of my notebook and then go make coffee while it’s running.
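If you rerun your setup cells within the same session, it’s nice to skip the extraction when it has already been done. The sketch below does that with a marker file; extract_once and the .extracted marker are conventions I made up, and it should only be pointed at archives you trust.

```python
import tarfile
from pathlib import Path


def extract_once(archive_path, dest_dir):
    """Extract a .tar or .tar.gz into dest_dir, skipping work already done."""
    dest_dir = Path(dest_dir)
    marker = dest_dir / ".extracted"
    if marker.exists():
        return dest_dir
    dest_dir.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression, so .tar and .tar.gz both work.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(dest_dir)
    # Only written after a successful extraction.
    marker.touch()
    return dest_dir


# Example:
# extract_once(
#     "/content/drive/My Drive/Colab Notebooks/data/train-clean-100.tar",
#     "/content/spanish/data/librispeech",
# )
```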

Tip 4: Use Gdown to grab publicly available Google Drive files

This usually works for large files, but there is one additional trick you can use if it either isn’t working or you’re getting a very unhelpful “input/output error” on your drive (meaning you’ve hit a transfer quota).

The solution to this requires publicly sharing the dataset on google drive so that anyone with the link can download it, and then copying the file_id from that link and running this ridiculous command:

file_id = '<YOUR_FILE_ID>'  # e.g. '1-76_xBMDJ1Cd65FnUOs3SK7gpiO6pnrS'
file_name = '<ARCHIVE_FILE_NAME>'  # e.g. 'train-clean-360.tar'
! wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id={file_id}' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id={file_id}" -O {file_name} && rm -rf /tmp/cookies.txt
! tar -xf /content/<ARCHIVE_FILE_NAME> -C <PATH_TO_EXTRACT_TO>

An amazing amount of credit is due to Anjan Chandra Paudel who figured out this was possible and published a guide.

Also, if you are having issues with transferring from Google Drive, you can check the logs with ! cat /root/.config/Google/DriveFS/Logs/drive_fs.txt

Tip 5: The best way to connect your Github

You can also use colab_ssh to connect to github, but it requires you to store a github oauth token in plaintext in the notebook. This means you could accidentally allow someone access if you share the notebook but forget to remove the token. While you can limit the scope of oauth tokens, I decided I’d rather manually enter my password for ultimate safety.

Tip 6: Remote in through VSCode using SSH and ngrok

Steps to add ngrok output to sshconfig:

  1. Press Ctrl-Shift-P in VSCode to open the command palette and type ‘Remote-SSH: Connect to Host’ and press enter.
  2. Select “Configure SSH Hosts”
  3. Paste the output on a new line in this file and save it
  4. Repeat step 1 and from the dropdown list select “google_colab_ssh”
  5. You should now be connected to the remote machine. Note you will likely be in /root, but your code will be in /content, so press ctrl-k ctrl-o (open folder) and select /content/<folder-where-your-code-is>

Setting up a static TCP address with ngrok pro:

  1. Register for an ngrok pro account for $10/monthly or $99 annually.
  2. From the ngrok dashboard select “Endpoints->TCP Addresses”
  3. Click “Reserve an address”
  4. Alter your launch_ssh code to include a remote_addr, e.g. launch_ssh(ngrokToken, password="password", remote_addr=<YOUR_NGROK_TCP_ADDRESS>)
  5. Even better, put it inside a function called relaunch_ssh() so anytime you have to interrupt your code’s execution, you can quickly type/execute relaunch_ssh() in a new cell to reconnect.
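One way to write that relaunch_ssh() helper is sketched below. make_relaunch_ssh is a name I made up; it just binds the arguments once so that reconnecting is a single call, and it takes the launch function as a parameter so the sketch doesn’t depend on colab_ssh being installed.

```python
from functools import partial


def make_relaunch_ssh(launch_fn, token, password, remote_addr):
    """Bind the launch arguments once so relaunching is a zero-argument call."""
    return partial(launch_fn, token, password=password, remote_addr=remote_addr)


# In Colab, after `from colab_ssh import launch_ssh`:
# relaunch_ssh = make_relaunch_ssh(
#     launch_ssh, ngrokToken, "password", "<YOUR_NGROK_TCP_ADDRESS>"
# )
# relaunch_ssh()
```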
