Building a 50 Teraflops AMD Vega Deep Learning Box for Under $3K
In 2002, the fastest supercomputer in the world (the NEC Earth Simulator) was capable of 35.86 teraflops.
You can now get that kind of computational power (that can fit under your desk) for the price of a first class airline ticket to Europe.
Now that AMD has released a new breed of CPU (Ryzen) and GPU (Vega), it is high time that somebody came up with an article that shows how to build a Deep Learning box using mostly AMD components. The box that we are assembling will be capable of 50 teraflops (fp16, or half precision).
Here are the parts we gathered to build this little monster. I purchased them all through Newegg.
2 — AMD Radeon Vega Frontier Edition 16GB GPU $2,000
1 — G.SKILL 32GB (2 x 16GB) DDR4 Memory $200.99
1 — Crucial 525GB SATA III 3-D SSD $159.99
1 — EVGA Plus 1000W Gold Power Supply $119.99
1 — MSI X370 AM4 AMD Motherboard $129.99
1 — AMD RYZEN 7 1700 8-Core 3.0 GHz (3.7 GHz Turbo) Processor $314.99
1 — TOSHIBA 5TB 7200 RPM SATA Hard Drive $146.99
1 — Rosewill ATX Mid Tower Case $49
Total: $3,122 or a mind-boggling $62 per teraflop!
My mistake, I went over budget! It would have been under $3,000 if I had dropped either the SSD or the HDD. If you are willing to pay a little more for additional hardware, I would add more CPU memory (up to 64GB) and use an NVMe-based SSD. That would add about $600 to the total. For comparison with a desktop solution, Nvidia’s pre-assembled “DIGITS Devbox” with 4 Titan X GPUs and 64GB of RAM goes for $15,000.
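The budget arithmetic above is easy to check. The following is a minimal sketch that uses only the prices from the parts list; the dictionary keys and variable names are illustrative, and the 50 teraflops figure is the peak fp16 number quoted for the two-GPU box.

```python
# Prices in USD, copied from the parts list above.
parts = {
    "2x Radeon Vega Frontier Edition 16GB": 2000.00,
    "G.SKILL 32GB DDR4": 200.99,
    "Crucial 525GB SSD": 159.99,
    "EVGA 1000W power supply": 119.99,
    "MSI X370 AM4 motherboard": 129.99,
    "Ryzen 7 1700": 314.99,
    "Toshiba 5TB hard drive": 146.99,
    "Rosewill mid tower case": 49.00,
}

FP16_TERAFLOPS = 50  # peak fp16 figure quoted for the whole box

total = sum(parts.values())
cost_per_teraflop = total / FP16_TERAFLOPS

# Dropping either storage device brings the build under $3,000.
total_without_ssd = total - parts["Crucial 525GB SSD"]
total_without_hdd = total - parts["Toshiba 5TB hard drive"]

print(f"Total:                  ${total:,.2f}")
print(f"Cost per fp16 teraflop: ${cost_per_teraflop:.2f}")
print(f"Total without SSD:      ${total_without_ssd:,.2f}")
print(f"Total without HDD:      ${total_without_hdd:,.2f}")
```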
Another perspective is to compare it with a single Nvidia Tesla P100 (just the GPU card), which has 16GB and is capable of 18.7 teraflops (fp16) but costs $12,599 from Dell. If you are thinking of side-stepping ‘professional grade’ hardware for a consumer grade Nvidia Titan X or GTX 1080 Ti (like the DIGITS Devbox), then it is worth knowing that half-precision (fp16) throughput is only 0.17 teraflops for a Titan X and comparable for a GTX 1080 Ti. That’s because fp16, as well as fp64 (double precision), is severely throttled on Nvidia consumer cards.
Clearly you are getting a ridiculous amount of fp16 compute for the buck with a Vega solution! Unlike conventional scientific computation, Deep Learning does not require high precision, so we can ignore double-precision floating point performance numbers when we select Deep Learning hardware. Nvidia charges a premium for the double precision capability of its Tesla cards. Scientific computing absolutely required double precision and thus paid a premium for this capability. However, with the discovery that Deep Learning workloads don’t need high precision and instead favor a lot more computation, Nvidia is now also charging a premium for half-precision! Kind of like giving you half the cake for twice the price. That’s the benefit of being a virtual monopoly!
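To put a number on that claim, here is a small sketch that compares dollars per fp16 teraflop using only the figures quoted in this article. Keep in mind that the Tesla P100 price covers the GPU card alone while the Vega figure covers the entire box; the function and variable names are illustrative.

```python
def dollars_per_teraflop(price_usd: float, teraflops: float) -> float:
    """Price divided by peak throughput."""
    return price_usd / teraflops

# Figures quoted in the article (peak fp16 numbers).
vega_box = dollars_per_teraflop(3122, 50)        # complete two-GPU box
tesla_p100 = dollars_per_teraflop(12599, 18.7)   # GPU card only

ratio = tesla_p100 / vega_box

print(f"Vega box:   ${vega_box:,.2f} per fp16 teraflop")
print(f"Tesla P100: ${tesla_p100:,.2f} per fp16 teraflop")
print(f"The P100 costs about {ratio:.1f}x more per fp16 teraflop")
```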
Perhaps AMD Vega can take a piece of the action with better fp16 performance:
We can also make comparisons with cloud-based GPUs. A p2.xlarge, which uses K80 GPUs, is priced today at $0.90 per hour. Its specifications list 2,496 CUDA cores with 12GB of memory. If we do the math, this comes out to about 4.3 teraflops per GPU instance (single precision). That is 5.8 times less than the AMD box we are assembling. In terms of money, renting enough p2.xlarge instances to match the box’s throughput would cover the cost of this AMD box in about 24 to 25 days of continuous use! (To be fair, we are comparing oranges with oranges; that is, both numbers are for single precision compute.)
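The break-even estimate can be reproduced with the sketch below. It assumes the box delivers about 25 teraflops in single precision (the value implied by the 5.8x figure, i.e. half of the fp16 rate) and that the cloud instances run around the clock; the constant names are illustrative.

```python
P2_XLARGE_USD_PER_HOUR = 0.90
P2_XLARGE_FP32_TERAFLOPS = 4.3
BOX_FP32_TERAFLOPS = 25.0   # assumption: half of the 50 teraflops fp16 figure
BOX_COST_USD = 3122.0

# Number of p2.xlarge instances needed to match the box's fp32 throughput.
instances_needed = BOX_FP32_TERAFLOPS / P2_XLARGE_FP32_TERAFLOPS

# Hourly bill for that many instances, and the time until it equals the box cost.
hourly_cost = instances_needed * P2_XLARGE_USD_PER_HOUR
break_even_hours = BOX_COST_USD / hourly_cost
break_even_days = break_even_hours / 24

print(f"Equivalent p2.xlarge instances: {instances_needed:.1f}")
print(f"Equivalent hourly cost:         ${hourly_cost:.2f}")
print(f"Break-even after:               {break_even_days:.1f} days")
```

With a budget of exactly $3,000 instead of $3,122, the same calculation lands just under 24 days.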
There is a question about Vega’s double precision performance. The MI25 documentation shows it at 768 gigaflops. In comparison, the latest Xeon Phi (Knights Landing) is at 3.46 teraflops double precision and the Nvidia P100 is at 4.7 teraflops. A price analysis by Microway puts Nvidia P100 solutions in the range of $1,500 per teraflop. So clearly, Vega is optimized for fp32 and fp16 workloads, implying that it is indeed designed to be a Deep Learning engine rather than a more traditional scientific computing engine.
There are of course a lot of other considerations that need to be taken into account here. We did not add the cost of assembly, installation, networking, power and maintenance. There are also considerations to be made if the need arises to distribute work across more than one machine (which it’s always best to avoid). Finally, there is the issue of which Deep Learning frameworks are compatible with the AMD box. Still, it is hard to ignore the massive potential cost savings of this approach.
The AMD Ryzen CPU has an interesting feature in that it can handle 24 PCIe lanes; in comparison, an Intel desktop CPU has only 16 lanes. This is an important consideration because you need to be able to feed the attached GPUs with data fast enough while they are training. Each GPU can occupy 16 lanes, so perhaps a more optimal solution is to wait for AMD’s Threadripper (on store shelves in early August), which supports 64 lanes! Threadripper is definitely something to check out for a box with four Vega GPUs (meaning 100 teraflops). I will likely explore this new CPU to see how it performs with multiple GPUs and NVMe devices. In addition, Vega has a uniquely designed memory architecture that lets it address 512TB of memory. Coupling Vega to NVMe devices may lead to unprecedented performance improvements.
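The lane budget above can be sketched as a simple integer division. This is a deliberately simplified model: it uses the lane counts quoted in the paragraph and ignores lanes that a real platform reserves for the chipset, NVMe storage or networking, as well as the option of running GPUs at x8.

```python
LANES_PER_GPU = 16  # each GPU can occupy a full x16 link

# CPU PCIe lane counts as quoted in the text.
platforms = {
    "Intel desktop CPU": 16,
    "AMD Ryzen": 24,
    "AMD Threadripper": 64,
}

def gpus_at_full_width(total_lanes: int, lanes_per_gpu: int = LANES_PER_GPU) -> int:
    """How many GPUs can each get a full-width link from the CPU's lanes."""
    return total_lanes // lanes_per_gpu

for name, lanes in platforms.items():
    count = gpus_at_full_width(lanes)
    spare = lanes - count * LANES_PER_GPU
    print(f"{name}: {count} GPU(s) at x16, {spare} lane(s) left over")
```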
Okay, let’s get assembling! Here are the unboxed parts:
All one needs is a screwdriver and mounting screws. With about the same amount of brain power and time it takes to assemble Ikea furniture, you will have a final assembly that looks like this:
It isn’t one of those gorgeous water-cooled and color-coded assemblies, but this should be more than adequate. (Note: AMD sells a water-cooled version of Vega.) To my surprise, the LEDs were horribly uncoordinated in color: I have white, red, blue and yellow. I guess you have to pay a premium for style.
Next, we’ll install Ubuntu 16.04 and the AMD ROCm software.
To do this, go to www.ubuntu.com and download an ISO image that you will write to a USB flash drive. You can do all of this using UNetbootin. Once you are done downloading and flashing, insert the flash drive into a USB port of the new desktop. It will boot Ubuntu and guide you through installing the operating system onto the attached HDD or SSD.
Once Ubuntu is installed, you need to install the drivers and software for the Vega cards. You can find instructions for that here: installing ROCm. (Note: ROCm updates the kernel, and there is currently a code synchronization issue that requires a 4.10 kernel for ROCm.)
You only need to install the AMD ROCm Kernel Fusion Driver on the host. Everything else sits in userspace and is most conveniently set up using Docker images. Hopefully, we can build a richer set of Docker images so that it will be more convenient to run Deep Learning experiments.
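Before moving on to Docker, it can save some debugging time to confirm that the kernel driver has exposed its device node. The sketch below only checks for /dev/kfd, the same path that is passed to the container in the Docker command that follows; the function name is hypothetical, and a passing check does not guarantee that the rest of the ROCm stack works.

```python
import os
import stat

KFD_DEVICE = "/dev/kfd"  # device node passed to Docker via --device

def kfd_status(path: str = KFD_DEVICE) -> str:
    """Return a short human-readable status for the KFD device node."""
    if not os.path.exists(path):
        return "missing"
    if not stat.S_ISCHR(os.stat(path).st_mode):
        return "not a character device"
    if not os.access(path, os.R_OK | os.W_OK):
        return "present, but not readable/writable by this user"
    return "ok"

print(f"{KFD_DEVICE}: {kfd_status()}")
```

If the status is "missing", the driver is most likely not installed or not loaded under the running kernel.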
Once you have set up Docker, you can start exploring Deep Learning on AMD hardware. We have made our Docker image public on Docker Hub (https://hub.docker.com/r/intuitionfabric/hip-caffe/), which we hope will let you start experimenting quickly. So, by typing the following command:
sudo docker run -it --device="/dev/kfd" --rm intuitionfabric/hip-caffe:latest
you can use it to run the Caffe examples (see: https://github.com/ROCmSoftwarePlatform/hipCaffe ). For instance, you can run the CIFAR-10 example:
./build/tools/caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt
Here is a snapshot of the training:
I0716 10:01:03.631714 3367 caffe.cpp:251] Starting Optimization
I0716 10:01:03.631747 3367 solver.cpp:279] Solving CIFAR10_quick
I0716 10:01:03.631762 3367 solver.cpp:280] Learning Rate Policy: fixed
I0716 10:01:03.632768 3367 solver.cpp:337] Iteration 0, Testing net (#0)
I0716 10:01:06.965248 3367 blocking_queue.cpp:50] Data layer prefetch queue empty
I0716 10:01:07.114781 3367 solver.cpp:404] Test net output #0: accuracy = 0.1067
I0716 10:01:07.114820 3367 solver.cpp:404] Test net output #1: loss = 2.30263 (* 1 = 2.30263 loss)
I0716 10:01:12.147825 3367 solver.cpp:228] Iteration 0, loss = 2.30306
I0716 10:01:12.147881 3367 solver.cpp:244] Train net output #0: loss = 2.30306 (* 1 = 2.30306 loss)
I0716 10:01:12.147897 3367 sgd_solver.cpp:106] Iteration 0, lr = 0.001
.... (4000 iterations later)
I0716 10:01:32.410472 3367 solver.cpp:464] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_4000.caffemodel.h5
I0716 10:01:32.412582 3367 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_4000.solverstate.h5
I0716 10:01:32.419283 3367 solver.cpp:317] Iteration 4000, loss = 1.0674
I0716 10:01:32.419322 3367 solver.cpp:337] Iteration 4000, Testing net (#0)
I0716 10:01:32.634677 3367 solver.cpp:404] Test net output #0: accuracy = 0.6252
I0716 10:01:32.634712 3367 solver.cpp:404] Test net output #1: loss = 1.12303 (* 1 = 1.12303 loss)
I0716 10:01:32.634743 3367 solver.cpp:322] Optimization Done.
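The timestamps in this log are enough to estimate training throughput. The sketch below parses the two "Iteration N, loss" lines copied from the snapshot above and divides the iteration count by the elapsed time. The regular expression is an assumption based only on the log lines shown here, and the calculation ignores a run that crosses midnight.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the training log above.
LOG = """\
I0716 10:01:12.147825 3367 solver.cpp:228] Iteration 0, loss = 2.30306
I0716 10:01:32.419283 3367 solver.cpp:317] Iteration 4000, loss = 1.0674
"""

# Pattern inferred from the lines above: severity+date, time, thread id,
# source location, then the "Iteration N, loss = X" message.
LINE_RE = re.compile(
    r"^I\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] "
    r"Iteration (?P<iteration>\d+), loss = (?P<loss>[\d.]+)$"
)

def parse_iterations(log_text: str):
    """Return (iteration, timestamp, loss) tuples for matching log lines."""
    records = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match["time"], "%H:%M:%S.%f")
            records.append((int(match["iteration"]), timestamp, float(match["loss"])))
    return records

records = parse_iterations(LOG)
first_iter, first_time, first_loss = records[0]
last_iter, last_time, last_loss = records[-1]

elapsed_seconds = (last_time - first_time).total_seconds()
iterations_per_second = (last_iter - first_iter) / elapsed_seconds

print(f"{last_iter - first_iter} iterations in {elapsed_seconds:.2f} s")
print(f"About {iterations_per_second:.0f} iterations per second")
print(f"Loss went from {first_loss} to {last_loss}")
```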
So let’s get some numbers to see how well this box performs!
I ran the benchmarks found here: https://github.com/soumith/convnet-benchmarks and here is a chart of the results:
I don’t know the specific hardware that was used in the other benchmarks; however, this comparison does show that the performance improvement is quite significant compared to the alternatives. One thing to observe is that the speedup is most impressive with a complex network like GoogleNet, as compared to a simpler one like VGG. This is a reflection of the amount of hand-tuning that AMD has done on the MIOpen library.
AMD still has a lot of catching up to do on the Deep Learning software front. The ROCm software stack is still a work in progress. However, it is clear that, for folks on a budget, there are opportunities here that can be exploited. If AMD can get more aggressive with porting frameworks, particularly with an emphasis on fp16, then this could be a major win for the competitiveness of this alternative platform. Not many Deep Learning frameworks support fp16, but if you consider the 2x speedup, then there is a real incentive to fill this vacuum.
For researchers who perform experiments on low precision arithmetic or compressed networks, this is an opportunity to get an fp16-enabled system at an extremely low cost. AMD’s Vega solution appeals to the idea that emergent intelligence arises from having more simple computations performed at a massive scale. The fp16 capability (along with int8) and the large 16GB of HBM2 memory can be an advantage for certain kinds of innovative architectures.
Consider also that Vega can directly access NVMe devices and can address 512TB of memory through its new memory architecture. This can be huge in the context of the newer 3D XPoint technology that Micron and Intel have created. The ability to access terabytes of memory is unique among GPU devices and has the potential to change the entire dynamics of Deep Learning architectures.
Looking to the future, Nvidia’s Volta architecture has specialized Deep Learning GPUs (see: Google’s AI Processor’s (TPU) Heart Throbbing Inspiration), something that is nowhere in AMD’s roadmap. Systolic arrays take up expensive silicon real estate and will likely be limited to specialized Deep Learning hardware like Google’s TPU and Fujitsu’s DLU. It is unlikely, though, that these capabilities will ever be available in Nvidia consumer graphics cards. Intel has yet to deliver any Deep Learning hardware, so it’s anybody’s guess as to how they can shake up the market.
Therefore, it may be safe to assume that, at a hobbyist level, AMD GPUs can become extremely cost-effective Deep Learning hardware for the next couple of years. AMD needs to do a lot of work to get Deep Learning software ready for this platform; however, there are plenty of opportunities to try unique alternative approaches. One extremely intriguing capability of AMD Vega is that the compute engine can do 512 8-bit operations per clock! That means you can perform roughly 50 trillion operations per second on a single Vega card. Think of the possibilities here in the area of genetic (aka evolutionary) algorithms.
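The 50 trillion figure can be reconstructed with back-of-the-envelope arithmetic. The sketch below rests on assumptions that are not stated in this article: that the 512 operations per clock apply to each compute unit, that a Vega Frontier Edition card has 64 compute units, and that its peak engine clock is about 1.6 GHz. Sustained clocks are lower in practice, so treat the result as a peak estimate.

```python
OPS_PER_CLOCK_PER_CU = 512   # 8-bit operations per clock (figure from the text)
COMPUTE_UNITS = 64           # assumption: compute units on a Vega Frontier Edition
PEAK_CLOCK_HZ = 1.6e9        # assumption: approximate peak engine clock

ops_per_second = OPS_PER_CLOCK_PER_CU * COMPUTE_UNITS * PEAK_CLOCK_HZ
tera_ops_per_card = ops_per_second / 1e12
tera_ops_per_box = 2 * tera_ops_per_card  # the box in this article has two cards

print(f"Peak 8-bit throughput per card: {tera_ops_per_card:.1f} teraops")
print(f"Peak 8-bit throughput per box:  {tera_ops_per_box:.1f} teraops")
```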
One thing to always remember: Deep Learning is an experimental science, and the more folks that have 50 teraflops (or 100 teraops) of computing power underneath their desks, the higher the likelihood that we make some accidental but impressive discoveries!
Update: A new TVM backend has been created that supports NNVM and ONNX supported frameworks.