Building a 270* Teraflops Deep Learning Box for Under $10,000
Look what I got for Christmas!!! If you don’t recognize it, it’s two Titan V cards from Nvidia. A single Titan V has a systolic array unit, dubbed a TensorCore, that is capable of 110 teraflops peak performance. In addition, it includes a conventional GPU that’s capable of 25 teraflops at half-precision. That means we are talking about roughly 135 teraflops (half-precision) per card, for a grand total of 270 teraflops in a box with two of these cards in it. We don’t even have to count the relatively minuscule extra flops that a multi-core CPU provides. (* Editor’s note: I will have to update these theoretical numbers when I dig up more details.)
Each Titan V costs $3,000 plus taxes. That leaves around $3,700 of wiggle room to come up with a decent box to host these cards. I recently built a 50 teraflops box for under $3,000, which comes out to about 16.7 gigaflops per dollar. This new box should give you a mind-boggling 27 gigaflops per dollar. Just to make a comparison, a late model Intel i7 8700K cranks out 217 gigaflops. An i7 8700K costs $400, so the math comes out to about 0.54 gigaflops per dollar. Granted, these numbers are theoretical rather than empirical, but it is still a massive difference!! (BTW, I could have placed these Titans in the same box as my 50 teraflops box and it would cost around $7,300, equating to roughly 37 gigaflops per dollar.)
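For anyone who wants to check the arithmetic, here is a small sketch in Python. It only uses the theoretical peak figures and prices quoted above; the function and variable names are mine and purely illustrative, and nothing here is a measured benchmark.

```python
# Peak throughput and price figures quoted in the post (theoretical, not measured).
TENSOR_CORE_TFLOPS = 110   # per Titan V, TensorCore peak
GPU_FP16_TFLOPS = 25       # per Titan V, conventional half-precision
NUM_CARDS = 2


def gflops_per_dollar(teraflops, dollars):
    """Convert teraflops to gigaflops and divide by the price in dollars."""
    return teraflops * 1000 / dollars


# Two cards, each contributing TensorCore plus conventional fp16 throughput.
box_tflops = NUM_CARDS * (TENSOR_CORE_TFLOPS + GPU_FP16_TFLOPS)

# (teraflops, price in dollars) for each system mentioned in the post.
systems = {
    "2x Titan V box ($10,000 budget)": (box_tflops, 10_000),
    "earlier 50 teraflops box": (50, 3_000),
    "Intel i7 8700K": (0.217, 400),
}

for name, (tflops, price) in systems.items():
    print(f"{name}: {gflops_per_dollar(tflops, price):.2f} gigaflops per dollar")
```

Swap in your own prices (taxes, a different case, a cheaper CPU) to see how quickly the ratio moves.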
In June of 2007, IBM’s Blue Gene/P, capable of 445 teraflops (double precision), was installed at Argonne National Laboratory. Two years earlier, Blue Gene/L was the fastest supercomputer in the world at 280 teraflops. Back in 2007, IBM was charging $1.3M per rack. 20 of these racks get you to around 280 teraflops (and will set you back $26 million). You might be saying: “Well, hold on now, you are comparing double precision with half-precision, which isn’t fair.” Honestly, I don’t care, because Deep Learning workloads really don’t need much in the way of higher precision.
The Blue Gene/P monster of a machine looked like this back in 2007 (just 10 years ago):
Now imagine all this computational horsepower sitting quietly (water cooled) underneath your desk, all in a single box and costing 2,600 times less (i.e. $26,000,000 versus $10,000). This doesn’t even include the cost of power. There’s no need for a battalion of folks to install and maintain this monstrosity. There’s no need to wear a shirt, tie, slacks and shoes to work on this! Think about how potentially world dominating this can be ;-).
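The cost comparison is easy to reproduce. The sketch below simply multiplies out the rack price and rack count quoted above and divides by the $10,000 budget; the variable names are illustrative, and power and staffing costs are deliberately left out, as in the text.

```python
RACK_PRICE_USD = 1_300_000   # IBM's 2007 price per rack, as quoted above
RACKS = 20                   # racks needed for roughly 280 teraflops
BOX_BUDGET_USD = 10_000      # budget for the two-Titan-V box

# Total hardware cost of the Blue Gene configuration.
blue_gene_cost = RACK_PRICE_USD * RACKS

# How many times cheaper the desktop box is.
cost_ratio = blue_gene_cost / BOX_BUDGET_USD

print(f"Blue Gene racks: ${blue_gene_cost:,}")
print(f"Cost ratio: {cost_ratio:,.0f}x")
```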
There are two architectural developments that got you this massive increase in flops in a very short time: (1) the use of fp16, which needs less silicon than comparable fp32 or fp64 multiply-add accumulators, and (2) systolic arrays that get you 110 teraflops from the same amount of silicon that used to get you around 10 teraflops. In 2016 you could get less than 10 teraflops per GPU chip; fast-forward to 2017 and it’s a quantum leap to 135 teraflops with a V100 GPU. I don’t expect 2018 to yield this kind of leap in capability. The low-hanging fruit has already been picked and only needs to be exploited by software.
The next big leap may be the kind of architecture GraphCore is touting. Here are some intriguing benchmarks from GraphCore. If I were to gaze into my crystal ball, Google is going to stun the world again with a new kind of architecture in silicon. Better Deep Learning algorithms are feeding back into more capable silicon. This is what Elon Musk has called “double exponential growth”. Deep Learning progress is moving at breakneck speed!
This kind of comparison in terms of size and cost gives a visceral feel for the kind of changes that are coming. How many businesses are still running their operations the same way they did 10 years ago? This kind of exponential change in compute capability has got to mean a massive change in how we run our daily operations. It’s likely that 99.99% of the people out there don’t even realize what’s happening! When I mention “Deep Learning” to people, most folks’ eyes glaze over. Don’t even mention the term “Intuition Machine”; it sounds like an oxymoron.
I’m waiting for a water cooled AMD Threadripper box custom built by a reputable vendor of workstation class desktops. This will give me the opportunity to kick the tires on this kind of an intuition machine!
Here’s the AMD Threadripper-based machine with two Nvidia Titan Vs:
The motherboard is an ASUS Prime X399-A, reviewed here by AnandTech. This board was selected mainly because it is equipped with four PCIe 3.0 x16 slots, which should provide sufficient bandwidth for the CPU to feed data to the Titan V GPUs. Furthermore, the motherboard supports NVMe U.2 (with up to four PCIe 3.0 lanes) for faster SSD-based storage access than SATA III. Note that the diagram below shows the Threadripper chipset supporting 3 NVMe devices in a RAID configuration. The main CPU is a Ryzen Threadripper 1900X (8 cores, 3.8GHz). This configuration has sufficient I/O for the workloads required for Deep Learning, as the graphic below shows:
The primary reason for an AMD-based motherboard and not an Intel one is the I/O support. A Threadripper offers more PCIe lanes than the Core i7-7820X (60+4 versus 28). Two Nvidia Titan Vs require 32 lanes, which immediately exceeds the capability of a late model Core i7. This Threadripper motherboard is more than capable of supporting 4 Titan V cards!
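The lane budget can be sketched in a few lines of Python. This assumes every card gets a full x16 link and takes the lane counts straight from the paragraph above; reading “60+4” as 60 usable lanes plus 4 reserved for the chipset is my interpretation, and the helper names are hypothetical.

```python
LANES_PER_GPU = 16  # assumption: each card gets a full PCIe 3.0 x16 link

# CPU lane counts as quoted in the post.
CPU_LANES = {
    "Threadripper": 60,    # assumption: the "+4" is reserved for the chipset
    "Core i7-7820X": 28,
}


def lanes_needed(num_gpus, lanes_per_gpu=LANES_PER_GPU):
    """Total PCIe lanes requested by num_gpus cards."""
    return num_gpus * lanes_per_gpu


def fits(num_gpus, cpu):
    """True if the CPU's lane count covers num_gpus cards at full width."""
    return lanes_needed(num_gpus) <= CPU_LANES[cpu]


for cpu in CPU_LANES:
    print(f"2 cards need {lanes_needed(2)} lanes; fits on {cpu}: {fits(2, cpu)}")
```

This counts only the GPUs; NVMe drives and other devices draw from the same pool, so a real build has a little less headroom than the raw numbers suggest.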
This is a water cooled CPU so it’s cool as a cucumber and extremely quiet:
I employ a P600 Nvidia graphics card to avoid a graphics load on the Titan Vs. The Titan Vs aren’t water cooled, but I hope to see that in the future. My hardware supplier has their cable management down to a science. All in all, this is not only a very capable machine, but one that is absolutely beautiful.
For comparison purposes, one can look at Nvidia’s DGX Station, which is listed at $49,900. The DGX Station has 4 Tesla V100 cards and an Intel Xeon E5-2698 v4 2.2GHz (20-core), and is rated at 500 teraflops. The E5-2698 has only 40 PCIe lanes, so it really can’t support the theoretical maximum bandwidth of 4 GPUs at x16 each (64 lanes). Our system, however, can be upgraded by adding 2 more Titan Vs and the 16-core 1950X at 3.4GHz. We are not going to shell out $49,900 to run a bake-off, but you can look at the theoretical numbers and see that we have a compelling solution. Our system is thus arguably quite competitive with the best Deep Learning system out there.
Now it’s time to run this machine on our Deep Learning software stack. Please stay tuned!
Email firstname.lastname@example.org if you are interested in acquiring a similar or upgraded machine.
You might be wondering, can I mine cryptocurrency with this? Yes, of course, that’s when Intuition Fabric comes out!