Comparing AI Platform Machine Types using YouTube-8M

Warrick
Google Cloud - Community
8 min read · May 19, 2020

When training a neural net model, time is of the essence. This is why different machine configurations including GPUs, TPUs and multiple servers are utilized.

I’ve been exploring the YouTube-8M project for the last couple of months, and there are previous posts about the project, the video dataset, the algorithms and how to run them in the cloud. For this post, I trained the two algorithms from the getting started code on different AI Platform standard machine configurations to see how they compared. AI Platform provides a number of scale tiers, which are established configurations of different machine types and numbers of machines for running jobs.

The post covers how to run the different scale tiers and compares the time and cost of both algorithms, frame-level logistic regression and Deep Bag of Frames, on the different tiers and machine types.

Command Line Flags

The YouTube-8M project includes a yaml file in the getting started code repo that sets the configuration for this project when it is run on AI Platform. You can modify this file each time to change the tiers, or you can do something easier: pass in flags that change the tiers (--scale-tier) and machine types (--master-machine-type).

There are a couple of different formats for passing in the scale tier:

--scale-tier='BASIC_TPU'
--scale-tier=basic-tpu

When passing in flags for the scale tier, you must specify --runtime-version; otherwise, you will get an error:

“Runtime version must be provided when the master Docker image URI is empty.”

Use the following for this project.

--runtime-version=1.14

Scale Tiers

I experimented running the 2 different algorithms, Frame-level Logistic and Deep Bag of Frames (DBoF), in the starter code on the following scale tiers.

  • STANDARD_1: 1 master with 8 vCPUs | 8 GB Memory, 4 workers with 8 vCPUs | 8 GB Memory and 3 parameter servers with 4 vCPUs | 15 GB Memory
  • PREMIUM_1: 1 master with 16 vCPUs | 14.4 GB Memory, 4 workers with 16 vCPUs | 14.4 GB Memory and 3 parameter servers with 8 vCPUs | 52 GB Memory
  • BASIC_GPU: 1 worker with 1 K80 GPU | 8 vCPUs | 30 GB Memory
  • BASIC_TPU: 1 master with 4 vCPUs | 15 GB Memory and 8 TPU v2 cores
  • CUSTOM: More info below

Example code when passing in the scale tier flags:

# Submit the Frame-level training job on the BASIC_TPU scale tier
JOB_NAME=yt8m_train_frame_$(date +%Y%m%d_%H%M%S)
gcloud --verbosity=debug ai-platform jobs submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$OUTPUT_BUCKET \
--scale-tier=basic-tpu --runtime-version=1.14 --region=us-central1 \
-- --train_data_pattern="$TRAIN_BUCKET/train*.tfrecord" \
--frame_features --model=FrameLevelLogisticModel \
--feature_names="rgb,audio" --feature_sizes="1024,128" \
--train_dir=$OUTPUT_BUCKET/$JOB_NAME --start_new_model
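
Once a job is submitted, you can follow it from the command line and pull the numbers used for the comparisons later in this post; the duration and consumed ML units show up in the job description (I believe under trainingOutput, but treat that field name as a pointer rather than a guarantee).

# Stream the training logs for the job
gcloud ai-platform jobs stream-logs $JOB_NAME

# Check the job state, duration and consumed ML units
gcloud ai-platform jobs describe $JOB_NAME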

CPU Only

Note, the previous post on how to run this project on AI Platform covered running it on the equivalent of BASIC_GPU. STANDARD_1 and PREMIUM_1 did not have enough memory to run the models. You can consider a custom setup that lets you increase master and worker memory (a sketch follows below). Also, in previous posts I ran these models on a single server with only CPUs and at least 30 GB of memory, and I was able to run Frame-level but not DBoF. You may be able to find a multi-worker configuration that runs DBoF with only CPUs, but in the end it makes sense to use TPUs and GPUs when training a neural net.
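
As a rough sketch of what that custom CPU-only setup could look like, the configuration below uses the project’s yaml config file with AI Platform’s legacy machine type names; large_model and complex_model_l are my picks for illustration, and I did not verify this exact combination has enough memory for DBoF.

# custom_cpu.yaml -- hypothetical CPU-only custom tier with more memory
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: complex_model_l
  workerCount: 4
  parameterServerType: large_model
  parameterServerCount: 3

You would pass it to the training command with --config=custom_cpu.yaml in place of the --scale-tier flag.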

GPUs

AI Platform gives you access to 3 different types of GPUs for machine configurations. The standard scale tiers only include K80s, and you have to use the custom tier to utilize the others. These are the different types of GPUs in increasing order of performance.

  • K80
  • P100
  • V100

The model complexity and amount of data will factor into how much performance gains you will get as you add these different types of GPUs.
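
If you want to see which GPU types Compute Engine offers in the region you plan to train in, you can list the accelerator types; this is a standard gcloud compute command rather than anything specific to AI Platform.

# List the GPU accelerator types available in us-central1 zones
gcloud compute accelerator-types list | grep us-central1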

TPU Access

The TPU scale tier is a quick way to get access to TPUs for your project. When requesting TPUs, make sure to specify a region that actually has them; otherwise, you can get the following error:

“RESOURCE_EXHAUSTED: No zone in region us-west1 has accelerators of all requested types.”

When I left the region off, the job defaulted to us-west1, and looking at the list of TPU types and zones, it appears AI Platform TPUs are based in the us-central1 region. A list of TPU types and zones can be found in the Cloud TPU documentation.
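
In practice that means pinning the job to a region that has TPUs with --region=us-central1, as in the earlier example. If your gcloud installation includes the compute tpus command group, you can also list the locations that offer them; I’m assuming the command below is available in your gcloud version.

# List the zones where Cloud TPUs are available
gcloud compute tpus locations list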

Custom Tiers

The custom tier is exactly what it sounds like: it provides flexibility in machine configuration.

When using the custom scale tier, you must pass in the master machine type.

--master-machine-type=[MACHINE TYPE OPTIONS]

Machine type options that I used.

  • complex-model-m-gpu: 4 K80 GPUs | 8 vCPUs | 30 GB Memory
  • complex-model-l-gpu: 8 K80 GPUs | 16 vCPUs | 60 GB Memory
  • standard-p100: 1 P100 GPU | 8 vCPUs | 30 GB Memory
  • complex-model-m-p100: 4 P100 GPUs | 16 vCPUs | 60 GB Memory
  • standard-v100: 1 V100 GPU | 8 vCPUs | 30 GB Memory
  • large-model-v100: 1 V100 GPU | 16 vCPUs | 52 GB Memory

Example code when passing in the custom tier and machine type flags.

# Submit the Frame-level training job on a custom tier with a single P100
JOB_NAME=yt8m_train_frame_$(date +%Y%m%d_%H%M%S)
gcloud --verbosity=debug ai-platform jobs submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$OUTPUT_BUCKET \
--scale-tier=custom --master-machine-type=standard_p100 \
--runtime-version=1.14 \
-- --train_data_pattern="$TRAIN_BUCKET/train*.tfrecord" \
--frame_features --model=FrameLevelLogisticModel \
--feature_names="rgb,audio" --feature_sizes="1024,128" \
--train_dir=$OUTPUT_BUCKET/$JOB_NAME --start_new_model

Note that there is room for more fine-grained control over machine types and configurations, as well as the ability to add workers (see the sketch below); more information can be found in the link at the top of the post. I stuck to some standard configurations for this demonstration and did not add workers beyond what was provided, so most examples ran on a single machine, and all of the custom examples include GPUs.
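
For example, adding workers to a custom tier could look like the sketch below, with the extra flags appended to the full training command shown earlier. The worker and parameter server flag names assume a gcloud version that supports them, and the machine types and counts are arbitrary choices for illustration.

# Hypothetical flags for a custom tier with a GPU master plus GPU workers
--scale-tier=custom \
--master-machine-type=complex_model_m_gpu \
--worker-machine-type=complex_model_m_gpu \
--worker-count=2 \
--parameter-server-machine-type=large_model \
--parameter-server-count=1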

For more detail on the configurations, check out the docs.

Quotas

Below are the quotas that were automatically set up for my project.

  • 16 TPU_V2
  • 16 TPU_V3
  • 2 P4
  • 2 V100
  • 40 K80
  • 40 P100

I found this out from the error I got when I tried to experiment with a couple of machine types that have more than 2 V100s. If your job requires you to scale up beyond these quotas, then you need to make a quota increase request. If I get access to more than 2 V100s in the near future, I’ll run those other machine types and add the details below.

Performance Comparison

Now for the fun part: comparing the performance of the two different algorithms across all of these configurations. What is nice about AI Platform, or any managed service, is that you can spin up all of these jobs at once and they run in parallel.

A reminder that the price for predefined scale tiers is $.49 per hour per training unit, which is the base price. To get the cost of a job, multiply the base price by the consumed ML units. Consumed ML units (MLU) are equivalent to training units with the job duration factored in.

The time and cost may vary slightly if you try this on your own, but they should be within a few minutes (and a few cents) of what is provided below.
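
As a quick worked example of that formula, here is the calculation for the BASIC_GPU Frame-level run listed below:

# cost = consumed ML units * $0.49 base price per training unit
awk 'BEGIN { printf "$%.2f\n", 21.85 * 0.49 }'   # prints $10.71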

Frame-level Logistic

Each bullet point provides scale tier/machine type, total training time and total cost.

  • BASIC_GPU: 13 hr 9 min and 21.85 MLU * $.49 = $10.71
  • BASIC_TPU: 20 hr 51 min and 198.58 MLU * $.49 = $97.31
  • complex-model-m-gpu: 11 hr 30 min and 59.27 MLU * $.49 = $29.04
  • complex-model-l-gpu: 11 hr 7 min and 114.72 MLU * $.49 = $56.21
  • standard-p100: 12 hr 45 min and 47.39 MLU * $.49 = $23.22
  • complex-model-m-p100: 11 hr 57 min and 159.61 MLU * $.49 = $78.21
  • standard-v100: 12 hr 19 min and 71.39 MLU * $.49 = $34.98
  • large-model-v100: 13 hr 17 min and 79.28 MLU * $.49 = $38.85

The basic GPU configuration is half the cost of the next lowest-cost option, standard-p100, but it takes almost 30 minutes longer to run. This is a great example of weighing what your time is worth. Can you wait 30 more minutes for training to complete? Probably. Still, when training, you will most likely need to run these machines multiple times to tune and experiment with the model. That extra time can add up, and it may be more cost effective (especially considering deadlines) to use a machine that can shave off some time.

Deep Bag of Frames (DBoF)

Each bullet point provides scale tier/machine type, total training time and total cost.

  • BASIC_GPU: 1 day 10 hr and 56.26 MLU * $.49 = $27.57
  • BASIC_TPU: ran out of memory and exited with non-zero status
  • complex-model-m-gpu: 1 day 4 hr and 145.95 MLU * $.49 = $71.52
  • complex-model-l-gpu: 1 day 3 hr and 286.36 MLU * $.49 = $140.32
  • standard-p100: 16 hr 59 min and 63.14 MLU * $.49 = $30.94
  • complex-model-m-p100: 16 hr 22 min and 219.11 MLU * $.49 = $107.36
  • standard-v100: 16 hr 24 min and 92.8 MLU * $.49 = $45.47
  • large-model-v100: 15 hr 23 min and 91.76 MLU * $.49 = $44.96

For this model, it’s clear the V100 GPUs were slightly faster and can be almost as cost effective as the P100, which was nearly as fast and the cheapest option when factoring in time. Also, when adding more GPUs, as in the complex models, the cost increased significantly but the time did not improve as much. In this use case a single GPU does the job, but there are other situations, based on model, data and time constraints, where multiple GPUs are needed.

Also, if you want to experiment with TPUs, create a custom tier that uses them and give the master machine more memory than the BASIC_TPU configuration provides (at least 30 GB); a sketch of what that could look like follows below.
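
As a rough sketch of that idea (not a configuration I ran), the legacy way to request a TPU in a custom tier is to set the worker type to cloud_tpu in the yaml config file; the master type below is an assumption picked for its larger memory.

# custom_tpu.yaml -- hypothetical custom tier pairing a higher-memory master with a TPU worker
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m   # assumed ~30 GB, more than the BASIC_TPU master's 15 GB
  workerType: cloud_tpu
  workerCount: 1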

Overall, for both models, the V100 GPUs show strong performance, but the single P100 is the best option when considering the time and cost trade-offs.

Wrap up

This post focused on reviewing how different AI Platform scale tiers and machine types performed with the YouTube-8M example algorithms. Scaling the number of GPUs, TPUs and servers depends on the complexity of your model, the amount of data and how much time you have to get the job done. Those factors will help you determine what to use, and using several servers or GPUs is not necessarily faster. It’s important to take time to understand your requirements before you spin up the platform.

What the post showed was that for this specific project and the two algorithms the code provides, the custom tier using standard-p100 was the best option considering time and cost. If you explore other models, a different configuration may suit your needs better. Also, I did not exhaustively explore all the ways you can customize these configurations so there may be a better option. I challenge you to look for it and let me know if you find it.
