How To Rocket Up Your DL Training: Connect a Local Runtime to a Tesla V100 Deep Learning Instance on GCP

Zack Pashkin
The School of AI (official)
3 min read · Jul 10, 2019

If you want to speed up ML/DL model training in your Colab notebook, here is a simple guide on how to do that:

Colab for this tutorial

Average speedup over a 1080 Ti

1. Go to your GCP console

2. Launch Cloud Shell

Export your image family, zone (where you have GPU quota), and instance name (any name you like) like this:

If you want PyTorch:

export IMAGE_FAMILY="pytorch-latest-cu92"
export ZONE="europe-west4-a"
export INSTANCE_NAME="deeplearning"

For TensorFlow:

export IMAGE_FAMILY="tf-latest-gpu"
export ZONE="europe-west4-a"
export INSTANCE_NAME="deeplearning"

Then create a Compute Engine instance from a Deep Learning VM image:

gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --accelerator=type=nvidia-tesla-v100,count=1 \
  --metadata='install-nvidia-driver=True' \
  --preemptible

Wait ~5 min. The output should look like this:

Created [https://www.googleapis.com/compute/v1/projects/kagglevaluepredictionchallenge/zones/europe-west4-a/instances/deeplearning].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
deeplearning europe-west4-a n1-standard-1 true 10.164.0.2 35.204.194.200 RUNNING
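If you prefer scripting the setup, the create command above can also be assembled programmatically. This is just a sketch: the helper name and its defaults are mine for illustration, not part of any gcloud SDK.

```python
# Sketch: build the `gcloud compute instances create` command from the
# same parameters exported in Cloud Shell above. The helper and its
# defaults are illustrative, not part of the gcloud tooling.

def build_create_command(instance_name, zone, image_family,
                         gpu_type="nvidia-tesla-v100", gpu_count=1,
                         preemptible=True):
    """Return the gcloud command as a list of argv tokens."""
    cmd = [
        "gcloud", "compute", "instances", "create", instance_name,
        "--zone=" + zone,
        "--image-family=" + image_family,
        "--image-project=deeplearning-platform-release",
        "--maintenance-policy=TERMINATE",
        "--accelerator=type={},count={}".format(gpu_type, gpu_count),
        "--metadata=install-nvidia-driver=True",
    ]
    if preemptible:
        cmd.append("--preemptible")
    return cmd

cmd = build_create_command("deeplearning", "europe-west4-a",
                           "pytorch-latest-cu92")
print(" \\\n  ".join(cmd))
```

Passing the list to `subprocess.run` avoids shell-quoting surprises if you automate this.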

And you will see a green running instance in the console.
Go to the next step.

3. Launch a local terminal

SSH to your compute instance (sometimes you have to wait a couple of minutes before it becomes available over SSH):

gcloud compute ssh --zone europe-west4-a deeplearning -- -L 8888:localhost:8888
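The `-L 8888:localhost:8888` part forwards your local port 8888 to the same port on the instance, so Jupyter running there appears at localhost on your machine. A quick way to check the tunnel from Python (a hypothetical helper using only the standard library):

```python
# Sketch: check whether something is listening on the forwarded port.
# Returns True if a TCP connection succeeds, False otherwise.
import socket

def tunnel_up(host="localhost", port=8888, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tunnel_up())  # True once the tunnel and Jupyter are both running
```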

4. In your instance

Make sure this is installed:

pip install --upgrade "jupyter_http_over_ws>=0.0.1a3" && \
  jupyter serverextension enable --py jupyter_http_over_ws

Launch Jupyter and open the link it prints:

jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888 \
  --NotebookApp.port_retries=0

5. In Colab

Click Connect → Connect to local runtime and enter port 8888.

6. Make sure your GPU is available and ready to go

For a PyTorch instance:

```
# check your GPU is available
# (PyTorch instance)

import torch

# set device to GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3, 1), 'GB')

# initialize the GPU
torch.cuda.init()
```

For a TensorFlow instance:

```
# check your GPU is available
# (TensorFlow instance)
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
```

**NOTES**

**Compute instance.** Make sure to stop or delete it when your job is done. Don't leave it running!

Use preemptible instances to save money.

A better idea is not to use the instance for prototyping, to avoid extra charges: prototype in Colab with the free hosted GPU instead, and use the instance to speed up the actual training.
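To see why this matters, here is a rough cost comparison. The hourly rates below are illustrative assumptions, not current GCP prices; check the GCP pricing page for real numbers.

```python
# Rough cost sketch with assumed hourly rates -- check the GCP pricing
# page for real numbers before budgeting anything.
ON_DEMAND_V100_PER_HR = 2.48    # assumed rate, USD
PREEMPTIBLE_V100_PER_HR = 0.74  # assumed rate, USD

def training_cost(hours, rate_per_hr):
    """Total cost of a training run, rounded to cents."""
    return round(hours * rate_per_hr, 2)

hours = 10
on_demand = training_cost(hours, ON_DEMAND_V100_PER_HR)
preemptible = training_cost(hours, PREEMPTIBLE_V100_PER_HR)
print("on-demand:   $", on_demand)
print("preemptible: $", preemptible)
print("savings:     $", round(on_demand - preemptible, 2))
```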

**Quotas.** Request quota for preemptible GPUs if you haven't yet. Go to manage quotas and request an increase. It takes about a day to get a response.

**Errors.** If you see error 255 when trying to SSH (step 3), just wait a bit; sometimes it takes time for the instance to get an IP address.
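A simple retry loop papers over this. In the sketch below, `run_ssh` is only a stand-in for whatever actually invokes the `gcloud compute ssh` command:

```python
# Sketch: retry a flaky command with a short pause between attempts.
# `run_ssh` is a placeholder, not a real gcloud API.
import time

def retry(fn, attempts=5, delay=1.0):
    """Call fn() until it succeeds or attempts run out."""
    for i in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# demo with a stand-in that fails twice, then succeeds
calls = {"n": 0}
def run_ssh():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("error 255")
    return "connected"

print(retry(run_ssh, delay=0.01))  # connected
```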

**Other ports.** You can use other ports (8889, 8890, etc.); just change them in steps 3 and 4.
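If 8888 is already taken on your machine, you can also let the OS pick a free port for you (a sketch using only the standard library):

```python
# Sketch: ask the OS for a free TCP port to use in steps 3 and 4
# instead of 8888.
import socket

def free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))  # port 0 = let the OS choose
        return s.getsockname()[1]

print("use port", free_port())
```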

For additional info, check this Google Drive folder: screenshots and video instructions

Try instances with more GPUs:

  • nvidia-tesla-v100 (count=1 or 8)
  • nvidia-tesla-p100 (count=1, 2, or 4)
  • nvidia-tesla-p4 (count=1, 2, or 4)
  • nvidia-tesla-k80 (count=1, 2, 4, or 8)
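The valid type/count combinations above can be captured in a small lookup that builds the `--accelerator` flag for you (illustrative, based only on the list in this post; actual availability varies by zone):

```python
# Valid accelerator counts per GPU type, per the list above
# (illustrative; check gcloud for current zone-by-zone availability).
VALID_COUNTS = {
    "nvidia-tesla-v100": {1, 8},
    "nvidia-tesla-p100": {1, 2, 4},
    "nvidia-tesla-p4":   {1, 2, 4},
    "nvidia-tesla-k80":  {1, 2, 4, 8},
}

def accelerator_flag(gpu_type, count):
    """Return the --accelerator flag, or raise if the combo is invalid."""
    if count not in VALID_COUNTS.get(gpu_type, set()):
        raise ValueError("invalid combination: {} x{}".format(gpu_type, count))
    return "--accelerator=type={},count={}".format(gpu_type, count)

print(accelerator_flag("nvidia-tesla-p100", 2))
```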

That’s all, this should work!

If you have any questions or issues, contact @Zack on the wizards slack or at kaisenaiko@gmail.com. Thanks, and happy deep learning!

Here is Colab for this tutorial

Here is Github

Docs
