Why your personal Deep Learning Computer can be faster than AWS and GCP

Task composition impacts performance, and consumer-grade hardware offers the best value

Jeff Chen
Mission.org
7 min read · Apr 29, 2019


Updated 7/15/2019

A personal Deep Learning Computer with 4 GPUs — 2080 Ti, 2 x 1080 Ti, and Titan RTX.

Whether you’re building your own Deep Learning Computer or renting one from the cloud, it’s important to know what drives performance. We take a deep dive to identify the key performance drivers and show you how to run your own benchmarks.

This is part 3 of 3 in the Deep Learning Computer Series. Part 1 is ‘Why building is 10x cheaper than renting from AWS’ and Part 2 is ‘How to build the perfect one’. See new photos and updates: follow me on Medium, Instagram, and Twitter! Leave thoughts and questions in the comments below.

Your personal computer is faster than AWS/GCP if the workload is CPU or IO bound

The $6,000 V100 hosted on AWS performed anywhere from 0.86x (underperforming) to 3.08x faster than the $700 1080 Ti, as shown in the benchmark results below. The huge price difference between the two illustrates that a higher price does not always mean better performance.

AWS K80 and V100 Cloud GPUs, a Titan, and a 2080 Ti are benchmarked against a 1080 Ti on my personal Deep Learning Computer. Four tasks are benchmarked, and the $6,000 AWS V100 has the best overall performance, but it underperforms the $700 1080 Ti on Image Segmentation, which is CPU bound. AWS V100 is the p3.2xlarge. AWS K80 is the p2.xlarge.

Task composition drives how much speedup you get: if the task is GPU bound and can use the new Tensor Cores, as in ResNet-50 FP16, there is a substantial speedup of 3.08x. However, if the task is CPU bound, as in Image Segmentation, and does not use FP16 (left unoptimized by design for this demonstration), then the top-of-the-line V100 hosted in the cloud underperforms the much cheaper $700 1080 Ti running on your personal Deep Learning Computer.
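For context on what “using FP16” looks like in practice, here is a minimal sketch of enabling mixed-precision training with TensorFlow 2’s Keras API. This is an illustration, not what the benchmark scripts below do (they use the older --use_fp16 flag), and the toy model is made up for the example.

import tensorflow as tf

# Keep weights in FP32 but run compute in FP16, which lets the Tensor Cores
# handle the matrix math (requires TensorFlow 2.4+).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(1024,)),
    # Keep the final softmax in FP32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")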

The alternative to the V100 on AWS is the K80 GPU, which is built on the Kepler architecture. It’s two generations behind (Kepler → Pascal → Turing), and as expected it performed poorly on all tasks, running at somewhere between 25% and 50% of the 1080 Ti’s speed.

Google Cloud Platform (GCP) performed very close to AWS, showing these two cloud players are neck and neck in their infrastructure. Results are here.

Here is a brief description of the tasks, and the raw benchmark data is here:

  • ResNet-50 FP32: trains the ResNet-50 CNN model on synthetic data with 32-bit precision. This task is GPU bound and cannot use Tensor Cores. Code here.
  • ResNet-50 FP16: like ResNet-50 FP32, but trains with 16-bit precision and uses the Tensor Cores. This task is also GPU bound.
  • Sentiment Analysis: trains a small 1-dimensional CNN with 32-bit precision on a small 16MB dataset. This code is highly optimized and will use as much GPU compute and memory as possible. Code here.
  • Image Segmentation: uses 32-bit precision, a large 10GB compressed dataset, and does cropping before training an FCN, so this task is both GPU bound and CPU bound. (Yes, it would make sense to optimize the code so pre-processing happens before training, but I’m deliberately demonstrating the effect of workload mix on performance.) Code here.
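One way to see for yourself whether a task is GPU bound or CPU bound is to watch GPU utilization while it trains. A minimal sketch, assuming nvidia-smi is on your PATH (the query flags are standard nvidia-smi options):

import subprocess
import time

# Sample GPU utilization once per second for 30 seconds. A GPU-bound task
# sits near 100%, while a CPU- or IO-bound task leaves the GPU mostly idle.
for _ in range(30):
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader",
    ])
    print(out.decode().strip())
    time.sleep(1)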

Consumer-grade 2080 Ti and 1080 Ti GPUs offer best bang for the buck

Looking at the benchmark results in the table below, it’s clear that the consumer-grade 2080 Ti and 1080 Ti GPUs offer the best value. For example, the 2080 Ti offers at least 50% of the V100’s performance at 25% of its price. This is why builders predominantly use these GPUs, as seen here, and why they are almost always out of stock. This is also why building your own Deep Learning Computer is 10x cheaper than AWS.
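A rough way to see the value gap, using only the relative ratios quoted above (at least 50% of the V100’s performance at 25% of its price):

# Relative performance and price of the 2080 Ti, normalized to the V100 = 1.0.
# These are the rough ratios quoted above, not exact benchmark numbers.
relative_performance = 0.50   # "at least 50% of the performance"
relative_price = 0.25         # "at 25% of the price"

perf_per_dollar_vs_v100 = relative_performance / relative_price
print(f"2080 Ti performance per dollar: {perf_per_dollar_vs_v100:.1f}x the V100")
# -> 2.0x, and that is before counting the cloud hosting markup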

Datacenters like AWS and Google Cloud are forced to buy the Tesla V100 because NVIDIA contractually prohibits the use of Titan and GeForce (1080 Ti and 2080 Ti) cards in datacenters, and that makes cloud GPU time really expensive. Researchers and hobbyists who only train models occasionally will still opt to rent in the cloud because there is no upfront cost. And some researchers who train very large models prefer the Titan RTX for its massive 24GB of GPU memory.

Benchmark data presented in numerical form. Value for money goes down significantly for more expensive GPUs.

Tensor Cores drive performance, but not the 10x that the TFLOP numbers indicate

GPUs are often rated in TFLOPs, which stands for Tera FLoating point OPerations per second. One TFLOP means the GPU can perform one trillion (1,000,000,000,000) additions, subtractions, multiplications, or divisions on floating-point numbers per second.
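To make that number concrete, here is a back-of-the-envelope calculation: multiplying two 10,000 x 10,000 matrices takes roughly 2 x 10,000^3 = 2 trillion floating-point operations, so a GPU rated at around 11 TFLOPs could, in theory, finish it in a fraction of a second. This is a theoretical lower bound that ignores memory bandwidth and the other bottlenecks discussed below.

# Theoretical best-case time for a large matrix multiply on an ~11 TFLOP GPU.
# Real-world time is longer because memory IO is ignored here.
n = 10_000
flops_needed = 2 * n**3            # ~2e12 operations for an n x n matmul
tflops_rating = 11                 # roughly a 1080 Ti
seconds = flops_needed / (tflops_rating * 1e12)
print(f"{seconds:.2f} s")          # ~0.18 s, theory only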

Tensor Cores more than 10x the number of TFLOPs, but this does not translate to 10x faster benchmarks

The biggest difference between the 1080 Ti and the 2080 Ti is the inclusion of Tensor Cores, which 10x the number of TFLOPs for 16-bit numbers (TFLOP-FP16) from 11 to 110, as shown below. But as you’ve seen in the benchmarks above, training tasks do not run ten times faster. Let’s take a closer look at why.

The newer GPUs all have Tensor Cores (the V100 on the Volta architecture, the Titan RTX and 2080 Ti on Turing), which 10x their TFLOPs. Total Cores* is the number of CUDA-core equivalents, computed by multiplying Tensor Cores by 64 and adding the result to the CUDA cores.

Tensor Cores compute faster by doing 64 computations at a time, but only work for FP16

Each Tensor Core can multiply two 4x4 matrices together in one clock cycle, which is 64 multiply operations. That makes it 64x faster than a ‘normal’ CUDA core, which does only one multiply per clock cycle. But Tensor Cores only work with 16-bit numbers, so they only apply to FP16 training. Each Tensor Core is also about eight times larger in surface area than a CUDA Core, which is why there are hundreds of Tensor Cores compared to thousands of CUDA Cores.
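The 64 comes straight from the shape of the matrices: a naive 4x4 by 4x4 multiply needs 4 x 4 x 4 = 64 multiplies (each paired with an add). A quick sketch that counts them:

# Count the multiplies in a naive 4x4 matrix multiplication.
N = 4
A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]
multiplies = 0
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]
            multiplies += 1
print(multiplies)  # 64 -- what a single Tensor Core does per clock cycle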

If you multiply the number of Tensor Cores by 64 and add the number of CUDA Cores, you get 39,168 for the 2080 Ti vs. 3,584 for the 1080 Ti. This roughly 10x difference is the driver behind the massive TFLOP increase shown in the table above.
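Plugging in the published core counts (4,352 CUDA cores and 544 Tensor Cores for the 2080 Ti; 3,584 CUDA cores and no Tensor Cores for the 1080 Ti) reproduces those numbers:

# "CUDA core equivalents": each Tensor Core counts as 64 CUDA cores.
def total_cores(cuda_cores, tensor_cores):
    return cuda_cores + tensor_cores * 64

print(total_cores(4352, 544))  # 2080 Ti -> 39168
print(total_cores(3584, 0))    # 1080 Ti -> 3584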

Actual performance gain is limited to 2–3x because Memory IO and the CPU become bottlenecks

The overall speedup you get from accelerating only one part of a task depends on 1) what portion of the total time that part takes up and 2) how much you accelerate it. Imagine trying to have a baby faster by accelerating the sex: whether you take two hours or two minutes to finish, the baby still takes nine months to develop. You’ve only sped up the first part, not the second. This overarching principle is known as Amdahl’s Law.

In the context of Deep Learning, let’s look at a simple example where a task has two parts (a GPU compute part and a Memory IO part) and each part takes an equal 50ms before any acceleration. Even when GPU compute is accelerated by 10x, the overall speedup is only 1.8x, as shown below.

A simplified example of why accelerating GPU performance by 10x does not lead to a 10x speedup in overall performance. Time spent in Mem IO Time, the non-accelerated part, starts to dominate.
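Amdahl’s Law makes the 1.8x figure easy to reproduce. A minimal sketch of the same two-part example:

# Two-part task: 50 ms of GPU compute + 50 ms of memory IO.
gpu_ms, mem_io_ms = 50, 50
gpu_speedup = 10  # Tensor Cores accelerate only the GPU compute part

before = gpu_ms + mem_io_ms                  # 100 ms
after = gpu_ms / gpu_speedup + mem_io_ms     # 5 + 50 = 55 ms
print(f"Overall speedup: {before / after:.2f}x")  # ~1.82x, not 10x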

A Deep Learning task has many more than 2 parts: there is Disk IO, CPU processing, computer RAM transfer to GPU Memory, GPU Memory transfer to CUDA/Tensor Cores, and GPU compute. So speeding up GPU compute alone by 10x TFLOPs does not lead to 10x overall performance.

Expect faster Memory Bandwidth in future GPUs

One important heuristic in the results is the importance of GPU Memory Bandwidth. The AWS V100 has just ~10% more Total Cores and 33% less GPU Memory than the Titan RTX, yet performs noticeably faster on the ResNet-50 FP16 task (3.08x vs. 2.33x). This is driven by its massive 900 GB/s memory bandwidth versus 672 GB/s on the Titan RTX. Naturally, expect future GPUs to have faster memory bandwidth to accelerate Deep Learning. It would also be interesting to see the effect of adding CPU cores inside the GPU, which would bypass the slower PCI-e lanes.
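A quick back-of-the-envelope check using the numbers quoted above shows how closely the performance gap tracks the bandwidth gap rather than the core count:

# Ratios derived from the spec and benchmark numbers quoted above.
bandwidth_ratio = 900 / 672         # V100 vs. Titan RTX memory bandwidth
resnet_fp16_ratio = 3.08 / 2.33     # V100 vs. Titan RTX speedup over the 1080 Ti
print(f"{bandwidth_ratio:.2f}x bandwidth, {resnet_fp16_ratio:.2f}x ResNet-50 FP16")
# -> roughly 1.34x vs 1.32x: the performance gap is close to the bandwidth gap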

Easily run benchmarks on your own

Benchmarking can be done in just a few minutes with existing code bases. It’s pretty fun to try for yourself and here are instructions for the ones I ran. Have a great time!

ResNet-50

#Get the code base
git clone https://github.com/tensorflow/benchmarks.git
#It must be run with the tf_nightly build.
#I use conda so I created a new conda environment. Requires CUDA 10.
conda create -n tf_nightly python=3.6.4 anaconda
conda activate tf_nightly
pip install tf-nightly-gpu absl-py
cd benchmarks/scripts/tf_cnn_benchmarks
#CUDA_VISIBLE_DEVICES=0 makes the benchmark use a particular
#GPU on your computer
time CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --num_gpus=1 --model=resnet50 --variable_update=parameter_server
#Change the batch size with the --batch_size flag
time CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --num_gpus=1 --model=resnet50 --variable_update=parameter_server --batch_size=128
#Switch between FP16 and FP32 with the --use_fp16 flag
time CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --num_gpus=1 --model=resnet50 --variable_update=parameter_server --batch_size=64 --use_fp16

Sentiment Analysis

Assumes you have the conda environment and pip setup as in the above ResNet-50 instructions.

#Get the code base
git clone https://github.com/tensorflow/models.git
pip install keras
cd models/research/sentiment_analysis
python sentiment_main.py

Image Segmentation

Assumes you have the conda environment and pip setup as in the above ResNet-50 instructions.

Download the Cityscapes dataset: you will want these two files (register for an account if you don’t have one): gtFine_trainvaltest.zip (241MB) and leftImg8bit_trainvaltest.zip (11GB).

#Get the code base
git clone https://github.com/Edarke/229-Final.git
cd 229-Final
mkdir data
cd data
#Move the zip files into this directory,
#use path that's appropriate for you
mv ../../gtFine_trainvaltest.zip .
mv ../../leftImg8bit_trainvaltest.zip .
#Answer Yes, when it asks you whether you want to replace files
unzip gtFine_trainvaltest.zip
unzip leftImg8bit_trainvaltest.zip
pip install tqdm numpy
#itertools ships with Python's standard library, so no install is needed
time CUDA_VISIBLE_DEVICES=0 python main.py --epochs 5 --no-early-stop --batch-size 32

See new photos and updates: Follow me on Medium, Instagram and Twitter!

FAQ

Will you help me build a Deep Learning Computer?
Happy to help with questions via comments or email. I also run www.HomebrewAIClub.com; some of our members may be interested in helping.

More FAQs are located at the bottom of the page here.

Thank you to my friends Evan Darke, Nick Guo, Kelly Iriye, Kevin Chen and Hanoz Bhathena for reading drafts of this.

My current project: automatica.xyz


Jeff Chen
Mission.org

AI engineer and company builder. Founded Joyride (acquired by Google). Current projects: thisisjeffchen.com