Deploy NVIDIA NGC on Google Cloud config lab — run your AI workloads
Let’s get right into it. We are at the beginning of the age of AI. In this post I want to walk through deploying NVIDIA NGC on Google Cloud. You can jump to the end to see a rough-cut demo video, or go to YouTube >> https://youtu.be/Xe-IHdZc8A4
P.S. This lab uses a global application load balancer and an unmanaged instance group to connect to the VM.
Alternatively, you can provision a VM with a public IP address and connect directly using that IP, as shown in the NVIDIA documentation https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-gcp/index.html, but this lab takes a different approach.
It’s always fun to lab things up, so for this test I use:
# 1 — NVIDIA NGC from Google Cloud marketplace, running in one project.
Prerequisites:
- You have a project created
- You have permissions to deploy compute, firewall rules, Cloud NAT, and load balancing resources in your environment.
- You have access to the Marketplace (note: this lab will incur costs)
Set up — Create project
# 1 — Create a project. Note you can also use an existing project. (Optional: skip if you already have your test project and network.)
Open Cloud Shell and configure as follows.
P.S. Replace YOUR-PROJECT-ID with the ID of your project.
gcloud config list project
gcloud config set project YOUR-PROJECT-ID
projectid=YOUR-PROJECT-ID
networkid=nvidia-network
echo $projectid
echo $networkid
Create VPC
gcloud compute networks create $networkid --project=$projectid \
--subnet-mode=custom \
--mtu=1460 \
--bgp-routing-mode=global
Create custom subnet in us-central1
gcloud compute networks subnets create nvidia-subnet \
--project=$projectid --range=10.0.150.0/24 \
--stack-type=IPV4_ONLY --network=$networkid \
--region=us-central1
Default firewall rules
gcloud compute firewall-rules create nvidia-network-allow-custom \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows connection from any source to any instance on the network using custom protocols." \
--direction=INGRESS --priority=65534 \
--source-ranges=10.0.0.0/16 \
--action=ALLOW \
--rules=all
gcloud compute firewall-rules create nvidia-network-allow-icmp \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows ICMP connections from any source to any instance on the network." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW \
--rules=icmp
gcloud compute firewall-rules create nvidia-network-allow-http \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows http connections from any source to any instance on the network using ports 80 and 8080." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW --rules=tcp:80,tcp:8080
gcloud compute firewall-rules create nvidia-network-allow-ssh \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows TCP connections from any source to any instance on the network using port 22." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 --action=ALLOW \
--rules=tcp:22
gcloud compute firewall-rules create nvidia-network-allow-digits \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows http connections from any source to any instance on the network using port 8888." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW --rules=tcp:8888
gcloud compute firewall-rules create nvidia-network-allow-https \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows TCP connections from any source to any instance on the network using port 443." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 --action=ALLOW \
--rules=tcp:443
gcloud compute firewall-rules create nvidia-allow-health-check \
--project=$projectid \
--network=$networkid \
--action=allow \
--direction=ingress \
--source-ranges=130.211.0.0/22,35.191.0.0/16 \
--target-tags=nvidia-vms \
--rules=tcp:80
Create NAT gateway
gcloud compute routers create nv-outbound-nat \
--network $networkid \
--region us-central1
gcloud compute routers nats create nv-outbound-nat-gw \
--router-region us-central1 \
--router nv-outbound-nat \
--nat-all-subnet-ip-ranges \
--auto-allocate-nat-external-ips
Create unmanaged instance group
Create an unmanaged instance group with the named port http; we will add the VM to it later on.
gcloud compute instance-groups unmanaged create uig-vms \
--zone=us-central1-a
gcloud compute instance-groups set-named-ports uig-vms \
--named-ports http:80 \
--zone us-central1-a
Deploy NVIDIA appliance.
The first thing you need to do is create an SSH key and add it to the Metadata section of Google Cloud. If you don’t do this, you will not be able to SSH into the appliance.
- Create an SSH key in Cloud Shell. When prompted, add a passphrase (I used something simple like sshnvidia).
ssh-keygen -t rsa -f ~/.ssh/ndc_gcp -C "nvidia-user"
- Open Cloud Shell in editor mode and navigate to the /home/username/.ssh/ folder; you will see the newly created files whose names start with ndc. Open the public key file (.pub) and copy its content.
- Navigate to Compute Engine > Metadata, choose SSH Keys, click Edit, then select Add Item.
- Paste the key content there and save.
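If you prefer the CLI, the console steps above can be sketched with gcloud as well (this assumes the key pair from the previous step and no existing project-wide keys):

```shell
# Build the metadata entry in the expected USERNAME:KEY format,
# assuming the key pair was generated as ~/.ssh/ndc_gcp in the previous step.
echo "nvidia-user:$(cat ~/.ssh/ndc_gcp.pub)" > /tmp/ssh-keys.txt

# Add it as project-wide SSH keys metadata.
# Caution: this overwrites any existing project-wide ssh-keys value,
# so merge with your current keys first if you already have some.
gcloud compute project-info add-metadata \
  --metadata-from-file ssh-keys=/tmp/ssh-keys.txt
```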
Install NVIDIA from marketplace
- Go to the Marketplace, search for NVIDIA, and select the NVIDIA GPU-Optimized VMI option; it will walk you through the deployment.
- Ensure the correct project and network (nvidia-network) are selected.
- Deploy.
- Next we will add the deployed VM to the unmanaged instance group and create a global application load balancer.
Add vm to unmanaged instance group and add network tag to vm
gcloud compute instance-groups unmanaged add-instances uig-vms \
--zone=us-central1-a \
--instances=nvidia-ngc-test-vm
gcloud compute instances add-tags nvidia-ngc-test-vm \
--zone us-central1-a \
--tags nvidia-vms
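To confirm both commands took effect, you can list the group membership and the instance’s tags:

```shell
# List instances in the unmanaged group; the VM should appear here.
gcloud compute instance-groups unmanaged list-instances uig-vms \
  --zone=us-central1-a

# Print the VM's network tags; expect nvidia-vms in the output.
gcloud compute instances describe nvidia-ngc-test-vm \
  --zone=us-central1-a --format="get(tags.items)"
```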
Create Application load balancer
IP reservation
gcloud compute addresses create nvidia-lb-ip --ip-version=IPv4 \
--network-tier=PREMIUM --global
gcloud compute addresses describe nvidia-lb-ip --format="get(address)" \
--global
Health-checks
P.S. For the health check to work in this setup, it has to be TCP.
gcloud compute health-checks create tcp nvid-hc \
--port=80
Backend
gcloud compute backend-services create nvidia-backend-service \
--load-balancing-scheme=EXTERNAL \
--protocol=HTTP \
--port-name=http \
--health-checks=nvid-hc \
--global
gcloud compute backend-services add-backend nvidia-backend-service \
--instance-group=uig-vms \
--instance-group-zone=us-central1-a \
--global
URL map
gcloud compute url-maps create nvidia-lb \
--default-service=nvidia-backend-service
Target-proxy
gcloud compute target-http-proxies create nvidia-lb-proxy \
--url-map=nvidia-lb
Forwarding rule
gcloud compute forwarding-rules create nvidia-forwarding-rule \
--load-balancing-scheme=EXTERNAL \
--address=nvidia-lb-ip \
--global \
--target-http-proxy=nvidia-lb-proxy \
--ports=80
Connect to NVIDIA VMI and set up
Navigate to the console and go to VM instances. Look for the VM you created, called nvidia-ngc-test-vm, select its name, and in the VM details select SSH.
- Inside the VM, if you are prompted to install drivers, type Y to install the NVIDIA drivers.
- Exit out of the SSH session and log in again so that you can run docker commands without using sudo.
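If docker still demands sudo after logging back in, the usual fix (an assumption about this image’s configuration, not something the NVIDIA docs guarantee) is to add your user to the docker group:

```shell
# Add the current user to the docker group so docker runs without sudo.
sudo usermod -aG docker "$USER"
# Group membership is evaluated at login; exit the SSH session,
# reconnect, and then verify:
docker ps
```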
Run the NVIDIA PyTorch container
docker run --gpus all --rm -it -p 80:80 \
--shm-size=1g --ulimit memlock=-1 \
--name=nvidia-ai \
nvcr.io/nvidia/pytorch:23.06-py3 \
jupyter lab --allow-root --ip='*' --port=80
In the VM window it will first download the images and then deploy. When completed, you will see the token; copy it somewhere. It will look like this: ?token=
Connect to PyTorch via public IP
Since you launched the container, you can connect to it from your browser to get the GUI. For this stage you will need the load balancer IP you created and the token value you copied.
The syntax will be http://public-lb-ip-address/?token=
Get load balancer IP
gcloud compute addresses describe nvidia-lb-ip --format="get(address)" \
--global
In your browser, type http://<load-balancer-ip>/?token=<token>
Example: http://1.2.4.5/?token=abcde12345
You should see the Jupyter Lab interface.
From here you can use PyTorch to run your training models on your NVIDIA VMI in Google Cloud.
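Before opening the browser, you can sanity-check the path through the load balancer from Cloud Shell. This is a rough probe and assumes the forwarding rule has finished propagating (which can take a few minutes):

```shell
# Grab the reserved load balancer IP.
LB_IP=$(gcloud compute addresses describe nvidia-lb-ip \
  --format="get(address)" --global)

# Probe the Jupyter endpoint; without a token, Jupyter typically
# answers with an HTTP 302 redirect to its login page.
curl -sI "http://$LB_IP/" | head -n 1
```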
Read more about:
NVIDIA NGC catalog — https://www.nvidia.com/en-us/gpu-cloud/
Deployment guide — https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-gcp/index.html
Clean up
Delete all created elements, then go to Deployments and delete the NVIDIA VMI deployment.
Elements to delete:
- Load balancer
- Backend, frontend, health check
- Cloud NAT, Cloud Router
- Unmanaged instance group
- Deployment
- You can also delete the VPC
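The cleanup can also be scripted. A sketch of the equivalent gcloud deletes, in reverse order of creation (resource names match those used earlier in this lab; the Marketplace deployment itself still has to be removed from the Deployments page, and the firewall rules, subnet, and VPC go last):

```shell
# Tear down the load balancer chain first, front to back.
gcloud compute forwarding-rules delete nvidia-forwarding-rule --global --quiet
gcloud compute target-http-proxies delete nvidia-lb-proxy --quiet
gcloud compute url-maps delete nvidia-lb --quiet
gcloud compute backend-services delete nvidia-backend-service --global --quiet
gcloud compute health-checks delete nvid-hc --quiet
gcloud compute addresses delete nvidia-lb-ip --global --quiet

# Then NAT, router, and the instance group.
gcloud compute routers nats delete nv-outbound-nat-gw \
  --router=nv-outbound-nat --router-region=us-central1 --quiet
gcloud compute routers delete nv-outbound-nat --region=us-central1 --quiet
gcloud compute instance-groups unmanaged delete uig-vms \
  --zone=us-central1-a --quiet
```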
I’ll be in touch