Deploy NVIDIA NGC on Google Cloud config lab — run your AI workloads
Let’s get right into it. We are at the beginning of the age of AI. In this post I want to walk through deploying NVIDIA NGC on Google Cloud. You can jump to the end to see a rough-cut demo video, or go to YouTube >> https://youtu.be/Xe-IHdZc8A4
P.S. This lab uses a global application load balancer and an unmanaged instance group to connect to the VM.
Alternatively, you can provision a VM with a public IP address and connect directly using that IP, as shown in the NVIDIA documentation https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-gcp/index.html, but this lab takes a different approach.
It’s always fun to lab things up, so for this test I use:
# 1 — NVIDIA NGC from Google Cloud marketplace, running in one project.
Prerequisites:
- You have a project created
- You have permissions to deploy compute, firewall rules, Cloud NAT, and load balancing resources in your environment.
- You have access to the Marketplace (note: this lab will incur costs)
Set up — Create project
# 1 — Create a project. Note you can also use an existing project. (Optional: skip if you already have your test project and network.)
Open Cloud Shell and configure as follows.
P.S. Replace YOUR-PROJECT-ID with the ID of your project.
gcloud config list project
gcloud config set project YOUR-PROJECT-ID
projectid=YOUR-PROJECT-ID
networkid=nvidia-network
echo $projectid
echo $networkid
Create VPC
gcloud compute networks create $networkid --project=$projectid \
--subnet-mode=custom \
--mtu=1460 \
--bgp-routing-mode=global
Create custom subnet in us-central1
gcloud compute networks subnets create nvidia-subnet \
--project=$projectid --range=10.0.150.0/24 \
--stack-type=IPV4_ONLY --network=$networkid \
--region=us-central1
Default firewall rules
gcloud compute firewall-rules create nvidia-network-allow-custom \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows connection from any source to any instance on the network using custom protocols." \
--direction=INGRESS --priority=65534 \
--source-ranges=10.0.0.0/16 \
--action=ALLOW \
--rules=all
gcloud compute firewall-rules create nvidia-network-allow-icmp \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows ICMP connections from any source to any instance on the network." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW \
--rules=icmp
gcloud compute firewall-rules create nvidia-network-allow-http \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows http connections from any source to any instance on the network using ports 80 and 8080." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW --rules=tcp:80,tcp:8080
gcloud compute firewall-rules create nvidia-network-allow-ssh \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows TCP connections from any source to any instance on the network using port 22." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 --action=ALLOW \
--rules=tcp:22
gcloud compute firewall-rules create nvidia-network-allow-digits \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows http connections from any source to any instance on the network using port 8888." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 \
--action=ALLOW --rules=tcp:8888
gcloud compute firewall-rules create nvidia-network-allow-https \
--project=$projectid \
--network=projects/$projectid/global/networks/$networkid \
--description="Allows TCP connections from any source to any instance on the network using port 443." \
--direction=INGRESS --priority=65534 \
--source-ranges=0.0.0.0/0 --action=ALLOW \
--rules=tcp:443
gcloud compute firewall-rules create nvidia-allow-health-check \
--project=$projectid \
--network=$networkid \
--action=allow \
--direction=ingress \
--source-ranges=130.211.0.0/22,35.191.0.0/16 \
--target-tags=nvidia-vms \
--rules=tcp:80
Create NAT gateway
gcloud compute routers create nv-outbound-nat \
--network $networkid \
--region us-central1
gcloud compute routers nats create nv-outbound-nat-gw \
--router-region us-central1 \
--router nv-outbound-nat \
--nat-all-subnet-ip-ranges \
--auto-allocate-nat-external-ips
Create unmanaged instance group
Create an unmanaged instance group with the named port http; we will add the VM to it later on.
gcloud compute instance-groups unmanaged create uig-vms \
--zone=us-central1-a
gcloud compute instance-groups set-named-ports uig-vms \
--named-ports http:80 \
--zone us-central1-a
Deploy NVIDIA appliance.
The first thing you need to do is create an SSH key and add it to the Metadata section of Google Cloud. If you don’t do this, you will not be able to SSH into the appliance.
- Create an SSH key in Cloud Shell. When prompted, add a passphrase (I used something simple like sshnvidia).
ssh-keygen -t rsa -f ~/.ssh/ndc_gcp -C "nvidia-user"
- Open Cloud Shell in editor mode and navigate to the /home/username/.ssh/ folder; you will see the newly created files whose names start with ndc. Open the public key file (.pub) and copy its content.
- Navigate to Compute Engine > Metadata, choose SSH Keys, click Edit, then select Add Item.
- Paste the key content there and save.
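If you prefer the CLI, the console steps above can be sketched with gcloud as well (this assumes the key pair from the previous step and no existing project-wide keys):

```shell
# Build the metadata entry in the expected USERNAME:KEY format,
# assuming the key pair was generated as ~/.ssh/ndc_gcp in the previous step.
echo "nvidia-user:$(cat ~/.ssh/ndc_gcp.pub)" > /tmp/ssh-keys.txt

# Add it as project-wide SSH keys metadata.
# Caution: this overwrites any existing project-wide ssh-keys value,
# so merge with your current keys first if you already have some.
gcloud compute project-info add-metadata \
  --metadata-from-file ssh-keys=/tmp/ssh-keys.txt
```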
Install NVIDIA from marketplace
- Go to the Marketplace, search for NVIDIA, and select the NVIDIA GPU-Optimized VMI option; it will walk you through the deployment.
- Ensure the correct project and network (nvidia-network) are selected.
- Deploy.
- Next we will add the deployed VM to the unmanaged instance group and create a global application load balancer.
Add vm to unmanaged instance group and add network tag to vm
gcloud compute instance-groups unmanaged add-instances uig-vms \
--zone=us-central1-a \
--instances=nvidia-ngc-test-vm
gcloud compute instances add-tags nvidia-ngc-test-vm \
--zone us-central1-a \
--tags nvidia-vms
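To confirm both commands took effect, you can list the group membership and the instance’s tags:

```shell
# List instances in the unmanaged group; the VM should appear here.
gcloud compute instance-groups unmanaged list-instances uig-vms \
  --zone=us-central1-a

# Print the VM's network tags; expect nvidia-vms in the output.
gcloud compute instances describe nvidia-ngc-test-vm \
  --zone=us-central1-a --format="get(tags.items)"
```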
Create Application load balancer
IP reservation
gcloud compute addresses create nvidia-lb-ip --ip-version=IPv4 \
--network-tier=PREMIUM --global
gcloud compute addresses describe nvidia-lb-ip --format="get(address)" \
--global
Health-checks
P.S. For the health check to work in this setup, it has to be TCP.
gcloud compute health-checks create tcp nvid-hc \
--port=80
Backend
gcloud compute backend-services create nvidia-backend-service \
--load-balancing-scheme=EXTERNAL \
--protocol=HTTP \
--port-name=http \
--health-checks=nvid-hc \
--global
gcloud compute backend-services add-backend nvidia-backend-service \
--instance-group=uig-vms \
--instance-group-zone=us-central1-a \
--global
URL map
gcloud compute url-maps create nvidia-lb \
--default-service=nvidia-backend-service
Target-proxy
gcloud compute target-http-proxies create nvidia-lb-proxy \
--url-map=nvidia-lb
Forwarding rule
gcloud compute forwarding-rules create nvidia-forwarding-rule \
--load-balancing-scheme=EXTERNAL \
--address=nvidia-lb-ip \
--global \
--target-http-proxy=nvidia-lb-proxy \
--ports=80
Connect to NVIDIA VMI and set up
Navigate to the console and go to VM instances. Look for the VM you created, called nvidia-ngc-test-vm, select its name, and in the VM details select SSH.
- Inside the VM, if you are prompted to install drivers, type Y to install the NVIDIA drivers.
- Exit out of the SSH session and log in again so that you can run docker commands without using sudo.
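If docker still demands sudo after logging back in, the usual fix (an assumption about this image’s configuration, not something the NVIDIA docs guarantee) is to add your user to the docker group:

```shell
# Add the current user to the docker group so docker runs without sudo.
sudo usermod -aG docker "$USER"
# Group membership is evaluated at login; exit the SSH session,
# reconnect, and then verify:
docker ps
```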
Run the NVIDIA PyTorch container
docker run --gpus all --rm -it -p 80:80 \
--shm-size=1g --ulimit memlock=-1 \
--name=nvidia-ai \
nvcr.io/nvidia/pytorch:23.06-py3 \
jupyter lab --allow-root --ip='*' --port=80
In the VM window it will first download the images and then deploy. When completed, you will see the token; copy it somewhere. It will look like this: ?token=
Connect to PyTorch via public IP
Since you launched the container, you can connect to it from your browser to get the GUI. For this stage you will need the load balancer IP you created and the token value you copied.
The syntax will be http://public-lb-ip-address/?token=
Get load balancer IP
gcloud compute addresses describe nvidia-lb-ip --format="get(address)" \
--global
In your browser, type http://<load-balancer-ip>/?token=<token>
Example: http://1.2.4.5/?token=abcde12345
You should see the Jupyter Lab interface.
From here you can use PyTorch to run your training models on your NVIDIA VMI in Google Cloud.
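Before opening the browser, you can sanity-check the path through the load balancer from Cloud Shell. This is a rough probe and assumes the forwarding rule has finished propagating (which can take a few minutes):

```shell
# Grab the reserved load balancer IP.
LB_IP=$(gcloud compute addresses describe nvidia-lb-ip \
  --format="get(address)" --global)

# Probe the Jupyter endpoint; without a token, Jupyter typically
# answers with an HTTP 302 redirect to its login page.
curl -sI "http://$LB_IP/" | head -n 1
```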
Read more about:
NVIDIA NGC catalog — https://www.nvidia.com/en-us/gpu-cloud/
Deployment guide — https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-gcp/index.html
Clean up
Delete all created elements, then go to Deployments and delete the NVIDIA VMI deployment.
Elements to delete:
- Load balancer
- Backend, frontend, health check
- Cloud NAT, Cloud Router
- Unmanaged instance group
- Deployment
- You can also delete the VPC
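The cleanup can also be scripted. A sketch of the equivalent gcloud deletes, in reverse order of creation (resource names match those used earlier in this lab; the Marketplace deployment itself still has to be removed from the Deployments page, and the firewall rules, subnet, and VPC go last):

```shell
# Tear down the load balancer chain first, front to back.
gcloud compute forwarding-rules delete nvidia-forwarding-rule --global --quiet
gcloud compute target-http-proxies delete nvidia-lb-proxy --quiet
gcloud compute url-maps delete nvidia-lb --quiet
gcloud compute backend-services delete nvidia-backend-service --global --quiet
gcloud compute health-checks delete nvid-hc --quiet
gcloud compute addresses delete nvidia-lb-ip --global --quiet

# Then NAT, router, and the instance group.
gcloud compute routers nats delete nv-outbound-nat-gw \
  --router=nv-outbound-nat --router-region=us-central1 --quiet
gcloud compute routers delete nv-outbound-nat --region=us-central1 --quiet
gcloud compute instance-groups unmanaged delete uig-vms \
  --zone=us-central1-a --quiet
```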
I’ll be in touch