Azure Compute GPU vs CPU DCGAN
Azure offers many, many functional areas. One thing that is quite awesome, though, is the ability to spin up a machine and shut it down on demand.
That gives you the flexibility to work with many different types of machines. Combined with compute targets and driver scripts that point to different machines, you can send each job to the machine best suited for it.
It also gives you the ability to run up the cost really, really quickly. :)
See https://azure.microsoft.com/en-us/pricing/details/machine-learning/ for a list of the current shapes and costs. The most expensive of these is about $3/hour. That really doesn’t sound like much, and if you’re used to playing with small data sets like MNIST, it isn’t. Training MNIST to near 100% accuracy using well-known techniques would cost $0.24 or so.
Imagine, though, that you’re working with a data set 10,000,000 times bigger than MNIST’s paltry 179MB. At that point, the gap between 500 and 700 epochs a day makes a huge difference.
In previous tutorials, we have worked with the STANDARD_D2_V2 configuration. This configuration, as described by Microsoft, is a general purpose compute:
- D-series VMs are designed to run applications that demand higher compute power and temporary disk performance. D-series VMs provide faster processors, a higher memory-to-core ratio, and a solid-state drive (SSD) for the temporary disk. For details, see the announcement on the Azure blog, New D-Series Virtual Machine Sizes.
Let’s work on getting a GPU enabled instance, and see how much faster our script goes there.
All code in this tutorial can be found at: https://dev.azure.com/allangraves/_git/Public%20Azure%20ML
We’ll use a GPU shape from the list at https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu: the STANDARD_NC6.
It is described as a 6-core system with 56GB RAM and 1 Nvidia K80 GPU. This seems like a pretty good entry-level setup. For a comparison of Nvidia GPUs, see: https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
To see a list of all the VM sizes, I’ve created a script, 07-azure-list-vmsizes.py. You can pass the --gpus argument to highlight only those with GPUs.
agraves@LAPTOP-I5LSJI5R:~/gitrepos/Public Azure ML$ python 07-azure-list-vmsizes.py --gpus
Only showing GPU enabled instances
{'name': 'Standard_NC6s_v3', 'vCPUs': 6, 'gpus': 1, 'memoryGB': 112.0, 'maxResourceVolumeMB': 344064}
{'name': 'Standard_NC12s_v3', 'vCPUs': 12, 'gpus': 2, 'memoryGB': 224.0, 'maxResourceVolumeMB': 688128}
{'name': 'Standard_NC24rs_v3', 'vCPUs': 24, 'gpus': 4, 'memoryGB': 448.0, 'maxResourceVolumeMB': 1376256}
{'name': 'Standard_NC24s_v3', 'vCPUs': 24, 'gpus': 4, 'memoryGB': 448.0, 'maxResourceVolumeMB': 1376256}
{'name': 'Standard_NV6', 'vCPUs': 6, 'gpus': 1, 'memoryGB': 56.0, 'maxResourceVolumeMB': 389120}
{'name': 'Standard_NV12', 'vCPUs': 12, 'gpus': 2, 'memoryGB': 112.0, 'maxResourceVolumeMB': 696320}
{'name': 'Standard_NV24', 'vCPUs': 24, 'gpus': 4, 'memoryGB': 224.0, 'maxResourceVolumeMB': 1474560}
{'name': 'Standard_NC6', 'vCPUs': 6, 'gpus': 1, 'memoryGB': 56.0, 'maxResourceVolumeMB': 389120}
{'name': 'Standard_NC12', 'vCPUs': 12, 'gpus': 2, 'memoryGB': 112.0, 'maxResourceVolumeMB': 696320}
{'name': 'Standard_NC24', 'vCPUs': 24, 'gpus': 4, 'memoryGB': 224.0, 'maxResourceVolumeMB': 1474560}
{'name': 'Standard_NC24r', 'vCPUs': 24, 'gpus': 4, 'memoryGB': 224.0, 'maxResourceVolumeMB': 1474560}
{'name': 'Standard_NV12s_v3', 'vCPUs': 12, 'gpus': 1, 'memoryGB': 112.0, 'maxResourceVolumeMB': 344064}
{'name': 'Standard_NV24s_v3', 'vCPUs': 24, 'gpus': 2, 'memoryGB': 224.0, 'maxResourceVolumeMB': 688128}
{'name': 'Standard_NV48s_v3', 'vCPUs': 48, 'gpus': 4, 'memoryGB': 448.0, 'maxResourceVolumeMB': 1376256}
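For reference, here is a minimal sketch of what such a listing script might look like using the SDK’s AmlCompute.supported_vmsizes() call. The real 07-azure-list-vmsizes.py is in the repo and may differ in the details.

# Sketch: list the VM sizes available to the workspace's region, optionally GPU-only.
import argparse
from azureml.core import Workspace
from azureml.core.compute import AmlCompute

parser = argparse.ArgumentParser()
parser.add_argument('--gpus', action='store_true', help="only show GPU enabled instances")
args = parser.parse_args()

ws = Workspace.from_config()

if args.gpus:
    print("Only showing GPU enabled instances")

for size in AmlCompute.supported_vmsizes(workspace=ws):
    # Skip CPU-only sizes when --gpus was requested.
    if args.gpus and not size.get('gpus'):
        continue
    print({k: size.get(k) for k in ('name', 'vCPUs', 'gpus', 'memoryGB', 'maxResourceVolumeMB')})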
There are other ways to list the sizes, including a nifty PowerShell cmdlet. (https://docs.microsoft.com/en-us/powershell/module/az.compute/get-azvmsize?view=azps-5.1.0)
However, if we are using a free Azure account, the first thing we need to do is increase our quota. The account we have been using doesn’t have quota for any of these sizes. :shrug:
Overall, this was pretty painless.
I followed the directions at https://docs.microsoft.com/en-us/azure/azure-portal/supportability/regional-quota-requests and submitted a request to increase my quota. I got a response back anywhere from 60 minutes to 8 hours later, depending on when I submitted the request.
There is a “vCPUs” service; we don’t want that one. Instead, for the service, select Machine Learning Service. (I did the other service first… and now have more quota there as well! :) )
Click Solutions at the bottom.
Click “Request Details”
Select your resource types, location, and vCPU (in multiples of the vCPU for the resource).

Depending on your location, you may see additional, or even different, resource types. Whatever type you are going to use in your new compute cluster needs to be bumped here, in multiples of the number of vCPUs that the size uses. For instance, to get a 2-node cluster of NC6 machines, we need 12 vCPUs: 6 per machine.
Once we have the new quota, I’d love to say that I just ran 02-create-compute-cuda.py and it worked. Unfortunately, I’m missing something, and I haven’t yet figured out what. Running the script returns an error saying I don’t have enough vCPU quota:
# Provisioning errors: [{'error': {'code': 'InvalidPropertyValue', 'message': 'The specified subscription has a total vCPU quota of 4 and cannot accomodate for at least 1 requested managed compute node which maps to 6 vCPUs', 'details': []}}]
However, creating the compute cluster through the portal worked just fine. I’m not sure what the difference was, and I haven’t spent more time on it.
Update: 2 days later, I ran the same script, and it worked. I’m wondering if there were some quota updates that had to make their way through the API layer, and the endpoint my script was hitting hadn’t caught up yet. I dunno. ;)
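For reference, the heart of 02-create-compute-cuda.py looks roughly like this. Treat it as a sketch: the node counts and other provisioning parameters here are assumptions, not necessarily what the repo version uses.

# Sketch: provision an NC6-based AmlCompute cluster named 'nc6-gpu-cluster'.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

print("Creating a compute cluster")
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                       min_nodes=0,
                                                       max_nodes=2)
cluster = ComputeTarget.create(ws, 'nc6-gpu-cluster', compute_config)
cluster.wait_for_completion(show_output=True)

With the quota in place, running it succeeds: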
agraves@U18.04:~/gitrepos/Public Azure ML$ python 02-create-compute-cuda.py
Creating a compute cluster
Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
In order to set up our new environment, we’ll need to change the base image we are using. Remember that in the AzureML world, we run a docker container on top of the underlying hardware. This container can be registered in a container registry, much like any other docker image, so we can refer to it by name. I’m choosing not to do that in this tutorial, but in a large corporation it’s not a bad starting point. You could easily provide preconfigured options for your users to just grab and go with, allowing them to run on an environment that you know works, with all dependencies configured. Individual departments can build specialized images by starting with your base image, and those can be pushed back up into the master images eventually, if there’s a larger need.
Let’s see what container environments Microsoft helpfully provides for us by browsing https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments. You’ll notice a little disclaimer at the top of this page: the list is current as of September 2020, but there’s always a way to get the most current list, namely the Python SDK!
We’ll create a new file to list the environments, so we can always have the most updated info. I will leave creating an Azure Function to automatically update and provide a REST api which can be queried by your Apple watch as an exercise to the reader.
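A rough sketch of what 09-azure-list-environments.py might contain is below; the real script lives in the repo, and the exact handling of the --env option here is an assumption.

# Sketch: list curated environments, or dump the details of one of them.
import argparse
from azureml.core import Environment, Workspace

parser = argparse.ArgumentParser()
parser.add_argument('--env', action='store', help="show details for a single environment")
args = parser.parse_args()

ws = Workspace.from_config()

if args.env:
    env = Environment.get(workspace=ws, name=args.env)
    print("Base Docker:", env.docker.base_image)
    print(env.python.conda_dependencies.serialize_to_string())
else:
    for name in Environment.list(workspace=ws):
        print("Name", name)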
Running our script gets us:
agraves@LAPTOP-I5LSJI5R:~/gitrepos/Public Azure ML$ python 09-azure-list-environments.py | grep -i pytorch
Name AzureML-PyTorch-1.2-CPU
Name AzureML-PyTorch-1.1-CPU
Name AzureML-PyTorch-1.0-GPU
Name AzureML-PyTorch-1.0-CPU
Name AzureML-PyTorch-1.2-GPU
Name AzureML-PyTorch-1.1-GPU
Name AzureML-PyTorch-1.3-GPU
Name AzureML-PyTorch-1.3-CPU
Name AzureML-PyTorch-1.4-GPU
Name AzureML-PyTorch-1.4-CPU
Name AzureML-PyTorch-1.5-CPU
Name AzureML-PyTorch-1.5-GPU
Name AzureML-Designer-PyTorch
Name AzureML-Designer-PyTorch-Train
Name AzureML-PyTorch-1.6-CPU
Name AzureML-PyTorch-1.6-GPU
If we go ahead and use the --env parameter to the script, we get more info on a single environment:
agraves@LAPTOP-I5LSJI5R:~/gitrepos/Public Azure ML$ python 09-azure-list-environments.py --env AzureML-PyTorch-1.6-GPU
Base Docker: mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04:20201112.v1
channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
- azureml-core==1.18.0.post1
- azureml-defaults==1.18.0
- azureml-telemetry==1.18.0
- azureml-train-restclients-hyperdrive==1.18.0
- azureml-train-core==1.18.0
- cmake==3.18.2
- torch==1.6.0
- torchvision==0.5.0
- mkl==2018.0.3
- horovod==0.20.0
- tensorboard==1.14.0
- future==0.17.1
name: azureml_9d2a515d5c77954f2d0562cc5eb8a1fc
We can see above that nothing in the listed conda dependencies is specific to CUDA; what makes this the GPU variant is the base image, which is an OpenMPI and CUDA build of Ubuntu 18.04.
Rather than trying to set up CUDA and get the packages right ourselves, let’s just go ahead and use that as our base image environment. An environment specifies a docker image and associated conda dependencies that define the system. In this case, we’ll just take the defaults. Later, we may want to modify the environment to add extra dependencies.
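If we did need extra packages later, one approach (a sketch only, not something we use in this tutorial) would be to clone the curated environment and add pip packages to its conda dependencies:

# Sketch: clone the curated environment and add an extra pip package.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()

# Clone so we don't modify the shared curated environment.
base_env = Environment.get(workspace=ws, name='AzureML-PyTorch-1.6-GPU')
custom_env = base_env.clone('pytorch-1.6-gpu-custom')

# 'matplotlib' is just an example of an extra dependency we might want.
custom_env.python.conda_dependencies.add_pip_package('matplotlib')

For now, though, we’ll stick with the stock curated environment.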
To do this, we’ll create a new script, 03-run-hello-cuda.py.
This script will look just like our old 03-run-hello.py script. We’re going to modify 2 lines:
curated_env_name = 'AzureML-PyTorch-1.6-GPU'
env = Environment.get(workspace=ws, name=curated_env_name)
This sets the base docker image to the PyTorch 1.6 GPU variant we identified as looking promising. It will pull the base docker image from where the curated environment specifies it — in this case, the Microsoft Container Registry — mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04:20201112.v1. Hopefully this has all the CUDA things we need!
The last thing we need to do for our driver script is to tell it to run on our new cluster config:
config = ScriptRunConfig(source_directory='./src', script='hello-cuda.py', compute_target='nc6-gpu-cluster')
The only real change here is the new ‘nc6-gpu-cluster’ identifier which corresponds to the new compute cluster we created.
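Putting the pieces together, 03-run-hello-cuda.py ends up looking roughly like this. This is a sketch based on the snippets above; the experiment name is a placeholder, not necessarily what the repo version uses.

# Sketch: driver script that submits hello-cuda.py to the GPU cluster.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()

# Point at the training script and the GPU cluster we just created.
config = ScriptRunConfig(source_directory='./src',
                         script='hello-cuda.py',
                         compute_target='nc6-gpu-cluster')

# Use the curated PyTorch 1.6 GPU environment as-is.
curated_env_name = 'AzureML-PyTorch-1.6-GPU'
env = Environment.get(workspace=ws, name=curated_env_name)
config.run_config.environment = env

# 'hello-cuda-experiment' is a placeholder experiment name.
run = Experiment(workspace=ws, name='hello-cuda-experiment').submit(config)
print(run.get_portal_url())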
Lastly, we need to make our new “training” script — hello-cuda.py. This won’t do much, but it will verify that the environment is working just fine.
Our hello-cuda.py is going to be pretty simple, just print out some info that verifies cuda is working.
import torch
import os

# Number of workers for dataloader
# set to the number of cpus we are going to have access to.
workers = len(os.sched_getaffinity(0))

# Number of GPUs available. Use 0 for CPU mode.
# Since we can run this in multiple scenarios, we want
# to set this dynamically - we might suddenly have more GPUs
# or 0 GPUs.
ngpu = torch.cuda.device_count()

print("CUDA GPU? " + str(torch.cuda.is_available()))
print("NGPU: " + str(ngpu))
print("Workers: " + str(workers))
Please see my previous articles for why’s and hows — but essentially, this just sets up torch and gets the number of CPU threads available. (https://allangraves.medium.com/gans-on-the-azure-ml-sea-part-1-e3af65061900)
Let’s go ahead and run this!
After looking at our training script output in 70_driver_log.txt, we see the following:
After variable expansion, calling script [ hello-cuda.py ] with arguments: []
CUDA GPU? True
NGPU: 1
Workers: 6
Starting the daemon thread to refresh tokens in background for process with pid = 98
And that folks, is what they call a WIN! This environment looks good to go!
With that said, let’s set up our script to do performance timings, so we can compare GPU vs CPU.
We’ll be working with the file ‘06-dcgan_azure-cuda.py’, which is a copy of the file we created in https://allangraves.medium.com/gans-on-the-azure-ml-sea-part-2-4e426d45e7cf.
The first thing we need to do here is to change the environment to run on our new PyTorch container:
curated_env_name = 'AzureML-PyTorch-1.6-GPU'
env = Environment.get(workspace=ws, name=curated_env_name)
config.run_config.environment = env
As we saw with our hello-cuda.py run, this allows us to run on a precreated environment, the AzureML-PyTorch GPU environment, with all the libraries we need.
The second thing we need to do is change the ‘compute_target’ to be the nc6-gpu-cluster we set up. Otherwise, this runs on the CPU only D2 cluster. We also need to adjust the script we are running.
script='train-dcgan_azure-cuda.py',
compute_target='nc6-gpu-cluster',
The last thing we need to do is add the ability to select a compute target, for instance the D2 cluster or the GPU cluster.
parser.add_argument('--target', action='store', default="nc6-gpu-cluster", help="the compute target to run against")
This will let us use the same driver script to run against either the D2 or NC6 machines we have, and compare the two in terms of cost per training run.
Next, we need to set up our training script to be both CUDA and non-CUDA aware, so we can run the exact same script in both scenarios.
We’ll also remove extraneous processing, so we are only measuring the training itself, not printing, creating images, etc.
When doing performance training, it is extremely important to keep 2 things in mind:
- Change the minimum number of things for each training run. Change too many variables, and you won’t know which one had the effect you saw.
- Ensure you are measuring the thing you think you are measuring. All too often it is easy to measure the impact of something you didn’t intend to measure. For example, if we started the timing higher in this script, we could be measuring the time it takes to load the GPU with data, which is higher than loading the CPU with data. You might want to know that, you might not.
For us, we want to see the impact a GPU makes on training. That means we’ll only vary the GPU vs CPU component. We’ll also make sure the number of batches and epochs is exactly the same, and we’ll take out the calls to write out images, log, etc. That way, the only thing we measure in our timing loop is the actual training.
So, let’s crack open train-dcgan_azure-cuda.py and get cracking!
First, let’s add some timing around our training loop. We’ll put this code right at the top of our for loop which loops the number of epochs. This way we don’t accidentally time the model setup, data loading, etc. We are only timing the training portion of this.
That’s not to say timing those things is bad. Perhaps loading your model takes longer on the GPU? Again, performance tuning is as much an art as a science, and finding the bottleneck in your code often requires thinking outside the box. Flame graphs are awesome for this sort of thing. (Check out cProfile and flameprof: https://pypi.org/project/flameprof/)
Anyway, the start of our loop will get this call:
start_time = time.perf_counter()
This simply stores the number of seconds since an unspecified point in time. (I’m not kidding: https://docs.python.org/3/library/time.html states that the reference point is undefined.) It includes time spent sleeping, so if our process is swapped out, the timer still keeps incrementing relative to the reference point. For us, that’s fine; we want to track the total run time of the loop, including time spent waiting for other parts (i.e. the GPU) to finish running. Lastly, this number is only meaningful as a difference, i.e. how many seconds elapse between two calls to perf_counter.
So, in light of that, we need to put a stop time somewhere. We’ll add that at the bottom of our loop.
And lastly, we’ll print the difference.
stop_time = time.perf_counter()
print(f"Finished in {stop_time - start_time:0.6f} seconds")
Format strings (denoted by the ‘f’ in front of the string) are available in Python 3.6 and later. Ours tells Python to print the difference between the stop and start times to six decimal places.
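To make the placement concrete, here is a minimal sketch showing where the two calls go relative to the epoch loop. The num_epochs and dataloader values are placeholders standing in for the real DCGAN training loop.

import time

num_epochs = 5            # placeholder; the real value comes from the script's arguments
dataloader = [None] * 10  # placeholder for the DCGAN image DataLoader

start_time = time.perf_counter()
for epoch in range(num_epochs):
    for i, data in enumerate(dataloader):
        pass              # discriminator and generator updates happen here
stop_time = time.perf_counter()

print(f"Finished in {stop_time - start_time:0.6f} seconds")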
Cool! The next thing we need to do is to add some arguments to the script.
parser.add_argument('--cpu', action='store_true', help="Force the training to happen on the CPU")
parser.add_argument('--timing', action='store_true', help="Time the training loop, with no output or logging")
Here, I’ve added 2 new options:
- --cpu : this option forces the script to run on the CPU. We’ll use this when running the driver script.
- --timing : this option will force the script to run in a mode with no logging, so that we are only measuring the impact of CPU vs GPU on our training epochs.
To turn off the GPU:
# Note - if run with the --cpu option, turn off the GPU entirely by setting it to 0.
if not args.cpu:
    ngpu = torch.cuda.device_count()
else:
    ngpu = 0
Pretty easy!
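If the training script follows the standard PyTorch DCGAN tutorial (an assumption about its internals), the device is then chosen from ngpu, so forcing ngpu to 0 is enough to push everything onto the CPU:

# With ngpu forced to 0, this falls back to the CPU even on a GPU machine.
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")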
To turn off the logging, we’ll just wrap calls to ‘vutils.save_image’, ‘print’, and ‘run.log’ in a check for args.timing.
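The wrapping looks something like this. Variable names such as errD, errG, and fake follow the standard DCGAN tutorial and are assumptions about the script, not exact excerpts from it.

# Only produce images, logs, and prints when we are NOT doing a pure timing run.
if not args.timing:
    print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f'
          % (epoch, num_epochs, i, len(dataloader), errD.item(), errG.item()))
    run.log('Loss_D', errD.item())
    run.log('Loss_G', errG.item())
    vutils.save_image(fake, './outputs/fake_samples_epoch_%03d.png' % epoch, normalize=True)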
I’ve also added these 2 new arguments to the 06-dcgan_azure-cuda.py file, so that we can easily pass them through.
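On the driver side, that pass-through might look roughly like this. The dataset mount and the exact argument handling are assumptions based on the earlier tutorial; 'dataset' stands for the image dataset registered there.

# Sketch: forward --timing and --cpu from the driver to the training script.
args = parser.parse_args()

script_args = ['--data_path', dataset.as_mount()]
if args.timing:
    script_args.append('--timing')
if args.cpu:
    script_args.append('--cpu')

config = ScriptRunConfig(source_directory='./src',
                         script='train-dcgan_azure-cuda.py',
                         compute_target=args.target,   # d2-cpu-cluster or nc6-gpu-cluster
                         arguments=script_args)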
Now, let’s send 3 runs to Azure! (One NC6 — CPU, One NC6 — GPU, and one D2 — CPU.)
agraves@U18.04:~/gitrepos/Public Azure ML$ python 06-dcgan_azure-cuda.py --timing
TutorialWorkspace eastus2 TutorialResourceGroup
Calling ScriptRunConfig with Arguments:
--data_path
<azureml.data.dataset_consumption_config.DatasetConsumptionConfig object at 0x7f658bdb4f28>
--timing
https://ml.azure.com/experiments/cuda-experiment/runs/cuda-experiment_1606390894_e41c31e5?wsid=/subscriptions/c14a37bd-a658-463c-9d44-9a9326fe5fbe/resourcegroups/TutorialResourceGroup/workspaces/TutorialWorkspace
agraves@U18.04:~/gitrepos/Public Azure ML$ python 06-dcgan_azure-cuda.py --timing --cpu
TutorialWorkspace eastus2 TutorialResourceGroup
Calling ScriptRunConfig with Arguments:
--data_path
<azureml.data.dataset_consumption_config.DatasetConsumptionConfig object at 0x7f189df8c7b8>
--timing
--cpu
https://ml.azure.com/experiments/cuda-experiment/runs/cuda-experiment_1606390928_95f6ca45?wsid=/subscriptions/c14a37bd-a658-463c-9d44-9a9326fe5fbe/resourcegroups/TutorialResourceGroup/workspaces/TutorialWorkspace
agraves@U18.04:~/gitrepos/Public Azure ML$ python 06-dcgan_azure-cuda.py --target d2-cpu-cluster --timing --cpu
TutorialWorkspace eastus2 TutorialResourceGroup
Calling ScriptRunConfig with Arguments:
--data_path
<azureml.data.dataset_consumption_config.DatasetConsumptionConfig object at 0x7fab05761518>
--timing
--cpu
https://ml.azure.com/experiments/cuda-experiment/runs/cuda-experiment_1606407710_a1296b65?wsid=/subscriptions/c14a37bd-a658-463c-9d44-9a9326fe5fbe/resourcegroups/TutorialResourceGroup/workspaces/TutorialWorkspace
While we wait, let’s talk about D2 vs NC6.
Looking at: https://azure.microsoft.com/en-us/pricing/details/machine-learning/, we can see:
- A standard D2_V2 is $0.146/hour. These have no GPU and are CPU only, being optimized for CPU and memory work like serving up web pages: 2 vCPUs, 7 GB RAM.
- A standard NC6 is $0.90/hour, and has 1 Nvidia K80 GPU, 6 cores, and 56 GB RAM.
This makes the NC6 about 6x as expensive as the D2, so we need to be about 6x as fast to break even.
D2 CPUs are: Intel® Xeon® Platinum 8272CL (Cascade Lake), Intel® Xeon® 8171M 2.1GHz (Skylake), Intel® Xeon® E5-2673 v4 2.3 GHz (Broadwell), or Intel® Xeon® E5-2673 v3 2.4 GHz (Haswell) processors with Intel Turbo Boost Technology 2.0.
I was unable to find an exact comparison, but this article highlights the issue pretty well: https://www.electronicdesign.com/industrial-automation/article/21121636/should-you-send-a-cpu-to-do-a-gpus-job
What does “better” mean? Are we talking about the performance per watt? Performance per dollar? Total amount of dollars we spent to get this configuration? Is money an object?
For instance, if we packed an 8 socket board full of Intel’s latest CPUs with the Deep Learning Inference engine, it would probably show pretty well against the K100 in a quad configuration. But… that’s most likely 100x the cost. And will burn a lot more electricity, meaning more ongoing cost.
The only question we can really answer right now is whether it is more cost effective to use a D2 or an NC6 for this particular workload. We can draw some general inferences, such as GPUs generally being faster than CPUs for this kind of work, but those may not hold true in every scenario.
Back over on Azure, we can see both of our runs correctly passed the timing and cpu arguments:
After variable expansion, calling script [ train-dcgan_azure-cuda.py ] with arguments: ['--data_path', '/tmp/tmpzwgr0tsp', '--timing']
After variable expansion, calling script [ train-dcgan_azure-cuda.py ] with arguments: ['--data_path', '/tmp/tmpk_hiwq9s', '--timing', '--cpu']
- NC6 GPU run: Finished in 1109.929453 seconds
- NC6 CPU run: Finished in 5477.218974 seconds
- D2 CPU run: Finished in 9591.198831 seconds
The difference for just the training portion on the NC6 was almost 5x. That is on the same machine, the STANDARD_NC6, with the only change being GPU vs CPU. That’s pretty insane: the same hardware is nearly 5x faster on the GPU. The D2 run takes quite a bit longer, almost 3 hours, and that’s going to be trouble, since this is a relatively small data set.
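A quick back-of-the-envelope calculation using the hourly prices above and the timings we just measured:

# Cost of the timed training portion for each run, at the listed hourly rates.
runs = {
    'NC6 GPU': (1109.929453, 0.90),
    'NC6 CPU': (5477.218974, 0.90),
    'D2 CPU':  (9591.198831, 0.146),
}
for name, (seconds, dollars_per_hour) in runs.items():
    print(f"{name}: ${seconds / 3600 * dollars_per_hour:.2f}")
# Prints roughly: NC6 GPU: $0.28, NC6 CPU: $1.37, D2 CPU: $0.39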
In terms of dollars per run, though, the NC6 GPU is the winner, with the D2 right behind. The extra speed of the NC6 really helps it, even though its hourly rate is higher.

As your data set grows, your net grows, or the backprop passes and other parameters change, this gap will shift as well. The number of calculations will grow, and the GPU run will come out even further ahead.
Lastly, there’s the time factor: you might happily pay an extra $0.30 to have your run finish in 20 minutes instead of 1.5 hours.
So there are lots of ways to look at things; it really depends on what “best” means for you, and what your budget is.
And that wraps up today’s tutorial — stay tuned for downloading nets, and running them on android devices locally, as well as deploying using Microsoft Functions!
As a bonus, I’ve included a new file — 08-azure-list-usages.py, which can list your quota and currently running machines. Enjoy!
Links
Code: https://dev.azure.com/allangraves/_git/Public%20Azure%20ML
Python:
- https://pypi.org/project/flameprof/
- https://docs.python.org/3/library/time.html
- https://realpython.com/python-f-strings/
- https://docs.python.org/3/library/argparse.html
Azure:
- VM Sizes with GPUs: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
- A way to install CUDA if using Compute Instances: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup
- NVidia Docs for CUDA — not really applicable to this article, but a free bonus: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
- How to Make a Quota Request: https://docs.microsoft.com/en-us/azure/azure-portal/supportability/regional-quota-requests
- Azure subscription limits for various types of subscriptions: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits
- Using Azure Powershell to check and list quotas: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/quotas
- A list of the compute targets and sizes: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target#supported-vm-series-and-sizes
- Machine Pricing! https://azure.microsoft.com/en-us/pricing/details/machine-learning/
- More on Azure Powershell: https://docs.microsoft.com/en-us/powershell/module/az.compute/get-azvmsize?view=azps-5.1.0
- Deploying a CUDA setup using various methods. The same method can be used for training, not just inference: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-inferencing-gpus
- The containers used by Azure: https://github.com/Azure/AzureML-Containers
- A list of the curated environments — but always use Python to get the latest! https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments
- The API reference for Conda in Python: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py#serialize-to-string--
- The docker API reference for Python: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.dockersection?view=azure-ml-py
- An MS tutorial on training PyTorch: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch
- Working with compute instances and GPUs, but applicable to Compute Clusters: https://docs.microsoft.com/en-us/azure/container-instances/container-instances-gpu
- Azure ML location list: https://azure.microsoft.com/en-us/global-infrastructure/geographies/
Machine Learning: