Deep Learning: a massively scaled parallel approach using hybrid supercomputers
Machine learning is just about everywhere and has almost become a brand in itself. Practically every smartphone ships with an assistant like Siri or "Okay Google". A growing number of traditional domains use machine learning, from medicine and education to industrial production lines. "The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World", a book by Pedro Domingos (2015), uncovers the striking ubiquity of this technology. From the book you will learn that machine learning is by now practically a commodity in many industries and, as you know, it has been used for many years by companies like Google and Amazon. In my view, it became extremely popular and prominent a few years ago, with the deep learning hype. Deep learning dominated not only IT journals and blogs but also the wider media, TV, and so on. This hype and popularity are also related to the evolution of computing technologies, which made deep learning methods usable for real, practical needs (even though these methods have existed since the 1970s). Cloud technologies and GPU-based computation provided the scalability and compute power deep learning requires, and boosted its popularity.

But today we want to talk about an approach that is not as popular or well known as the cloud or GPUs: supercomputing. We will demonstrate why it is important and what it can provide for "regular" machine learning tasks.
Not everyone knows it, but supercomputers have been around for quite some time. In fact, the first computer was a supercomputer, and all the other machines developed from the 1950s through the 1970s were supercomputers too. What does 'super' actually mean? One definition states that a supercomputer can do what other computers cannot. By this definition, every early computer was unique, with its own architecture, built for special tasks. The first standards came with Cray and IBM mainframes, but those were still supercomputers. It is no secret that nowadays a laptop or even a smartphone generally possesses far more compute capacity than any of those old machines. Yet while those machines were unique and "super" decades ago, laptops today are just a commodity, like any goods we buy in a supermarket.
So, what does the supercomputer of today look like? Have they gone extinct? Of course not. In fact, the trend may very well be reversing.
These machines typically consist of high-performance compute nodes in a shared-resource cluster. Modern supercomputers host hundreds or thousands of nodes connected by a network that is unbelievably fast by common standards. Some people may argue that a supercomputer can be built in the cloud, or that one can be created from a couple of servers in a rack. This is not true. That would just be a regular computing cluster, virtual or bare metal. Such "regular" clusters can often be seen in medium-sized and large companies, used for data processing, distributed applications, or company-specific computations. Even what you see in Google or Amazon data centers are not supercomputers, although they certainly look impressive (thousands of square meters of server racks).
True supercomputers are quite expensive machines intended for tasks like mathematical modelling or specialized high-performance computations that cannot be handled by other, conventional computing approaches. Such tasks require hundreds of compute devices that exchange huge amounts of information between nodes during the actual computation. This information is encapsulated into messages sent over the network. The size of each individual message is typically small, but their number is extremely high. The network, despite all the engineering effort, is always much slower than a computer's local memory and much slower than data exchange within a local CPU. This is a fundamental problem related to the well-known "memory wall" (Wm. A. Wulf and Sally A. McKee, "Hitting the memory wall: implications of the obvious", ACM, 1995). This is why the network interconnect in a supercomputer has to be as fast as possible, not only in terms of throughput but also in terms of latency. The network has to be as non-blocking as possible and ideally allow data exchange in peer-to-peer mode. Have you ever seen a network switch with thousands of ports? Probably not, because such switches don't exist. But there are several approaches for implementing networks with a minimal blocking rate, for example the "fat tree" topology. Accordingly, supercomputers usually use interconnects such as Mellanox InfiniBand, Intel Omni-Path, and their analogs. This usually is the main difference and advantage of a supercomputer compared to a regular cluster.
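To see why a fat tree avoids the need for one giant switch: a standard three-tier fat tree built entirely from k-port switches can connect k³/4 hosts with full bisection bandwidth. A minimal sketch of that arithmetic (illustrative only, not a description of our cluster's actual topology):

```python
def fat_tree_hosts(k: int) -> int:
    """Hosts supported by a three-tier fat tree built from k-port switches.

    Each of the k pods has k/2 edge switches, each serving k/2 hosts,
    so hosts = k * (k/2) * (k/2) = k**3 / 4.  k must be even.
    """
    assert k % 2 == 0, "fat tree requires an even port count"
    return k ** 3 // 4

# Even modest commodity switches reach "thousands of ports" in aggregate:
for k in (16, 32, 48):
    print(f"{k}-port switches -> {fat_tree_hosts(k)} hosts")
```

With 48-port switches this already yields 27,648 hosts, which is why such topologies, rather than a single enormous switch, underlie supercomputer interconnects.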
The TOP500 list (http://top500.org) ranks the world's most powerful supercomputers. You can see that many of them have hybrid architectures, which means they are not just standard x86 servers but also contain devices with GPU or FPGA architectures.
This is very good news for machine learning, because GPUs are very efficient for it; almost all popular frameworks, such as TensorFlow, Caffe, and Theano, support GPU-based computation.
So using one node of a supercomputer with two GPUs is probably not such a bad idea. One can do that. But then there is still no difference between this approach and using a regular server with GPUs. At the same time, we want a supercomputer to become helpful when it is not possible to get quick results on one server and we need to scale. So, is that applicable to deep learning tasks? We believe it is, because:
- Training takes a lot of time. Deep learning often requires huge amounts of training data, and it can take weeks to train a model on a single node. For example, MITIE, an NLP tool that uses machine learning, takes quite a lot of time to build a classification model on several hundred records; one can only imagine the time needed for thousands or hundreds of thousands of records.
- Deep learning workloads often cannot fit into local memory, so it is necessary to distribute the data across compute nodes.
One more time: these challenges cannot be solved with typical cloud computing approaches because of the hyper-intensive message exchange during the training process. Supercomputers help us here: using MPI (Message Passing Interface), it is possible to implement very high-performance computations with real-time data exchange between processing units. Let's take a deeper look at how this can work.
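The MPI primitive that matters most for synchronous training is the collective all-reduce: every rank contributes a local value and every rank receives the same reduced result. A real program would call mpi4py's `comm.Allreduce(local, result, op=MPI.SUM)` under `mpirun`; the plain-NumPy sketch below only simulates that semantics so the idea is visible without an MPI installation:

```python
import numpy as np

def allreduce_mean(local_grads):
    """Simulate MPI_Allreduce(SUM) followed by division by the world size:
    every rank ends up holding the average of all ranks' gradients."""
    total = np.sum(local_grads, axis=0)   # the SUM reduction across ranks
    return total / len(local_grads)       # each rank divides by world size

# Four "ranks", each holding a gradient computed on its own data shard.
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]
print(allreduce_mean(grads))  # every rank would now hold [4. 5.]
```

On a real interconnect this reduction is implemented with tree or ring algorithms inside the MPI library, which is exactly where low latency pays off.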
After quite some research into the machine learning tools available for supercomputers, we decided to implement distributed learning ourselves. We took a language modeling task as a reference implementation (the description is available at https://www.tensorflow.org/versions/master/tutorials/recurrent/#recurrent-neural-networks). It is based on recurrent neural networks with long short-term memory (LSTM) cells. We used the PTB dataset (https://catalog.ldc.upenn.edu/ldc99t42). In this paper we demonstrate how distributed learning performs on this example. Modifications of this code are applicable to natural language processing and understanding products, including reasoning systems, conversational chatbot interfaces, and intelligent search systems.
To implement distributed machine learning on a supercomputer, we split our dataset into subsets and designed and implemented a parallel gradient calculation mechanism that runs on all nodes in the cluster. We evaluated both CPU and GPU compute nodes. After each step, gradients were averaged across the cluster and all model replicas were synchronously updated. To realize this we used collective MPI operations. Together with scientists from St. Petersburg Polytechnic University, we performed experiments on a hybrid supercomputer with a total compute capacity of more than 1.3 PFlops (1.3 × 10^15 floating-point operations per second). We evaluated two architectures:
- A standard cluster consisting of nodes with 28 compute cores and 64 GB of RAM
- A cluster with the same configuration as above, plus two NVIDIA Tesla K40 GPUs per node.
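The synchronous scheme described above, split the data into shards, compute local gradients in parallel, average them across the cluster, apply the same update everywhere, can be sketched with a toy model. Here plain linear least squares stands in for the LSTM, and a Python loop stands in for the MPI ranks; this is an illustration of the mechanism, not our actual TensorFlow code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 3))          # synthetic "training data"
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                          # targets with a known solution

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one node's data shard."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

n_nodes, lr = 4, 0.1
# Step 1: split the dataset into one shard per node.
shards = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))

w = np.zeros(3)                         # every replica starts identical
for step in range(200):
    # Step 2: each node computes its gradient (in parallel on a real cluster).
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # Step 3: average across the cluster (one Allreduce) and update everywhere.
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 3))  # converges toward [ 1.  -2.   0.5]
```

Because every replica applies the identical averaged gradient, the models stay bit-for-bit synchronized without any parameter server, which is what makes collective MPI operations a natural fit.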
Our software is built on top of TensorFlow; the architecture is shown in the diagram below:
In the diagram, SLURM is the scheduler that processes compute jobs, and storage is provided by the Lustre shared network filesystem, which is also part of the supercomputer.
During performance tests we evaluated several configurations, with and without GPUs. For the comparison we used the metric 'words per second' (wps). It is specific to this task, but it works well for evaluating the scalability of computations on the same dataset and software.
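For context, wps is simply the number of words the trainer consumes per wall-clock second: batch size times unrolled sequence length, divided by the step time. The figures below are illustrative placeholders, not our measured values:

```python
def words_per_second(batch_size: int, num_steps: int, step_time_s: float) -> float:
    """Throughput of one training step: batch_size sequences, each num_steps
    words long, processed in step_time_s seconds of wall-clock time."""
    return batch_size * num_steps / step_time_s

# Hypothetical example: 20 sequences of 35 words per step, 3.5 s per step.
print(words_per_second(batch_size=20, num_steps=35, step_time_s=3.5))  # 200.0
```

Because the dataset and model are fixed across all runs, comparing wps directly compares how fast each configuration moves through the same work.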
The computation logs show the following:
From this data we extracted the following metrics:
This chart demonstrates that as we increase the number of nodes, the speed on each particular node decreases, which means that network exchange takes significant time. But the performance of the entire cluster still increases, reaching almost 3000 wps on a 32-node cluster. That is about a 15x speedup compared to one node.
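The scaling numbers above can be checked with back-of-the-envelope arithmetic. The ~200 wps single-node figure here is implied by the 3000 wps / 15x relation, a rounded illustration rather than an exact measurement:

```python
single_node_wps = 200.0   # implied by the ~15x figure; illustrative value
cluster_wps = 3000.0      # aggregate throughput observed on 32 nodes
nodes = 32

speedup = cluster_wps / single_node_wps          # actual speedup vs. one node
efficiency = speedup / nodes                     # fraction of ideal linear scaling
print(f"speedup {speedup:.0f}x, parallel efficiency {efficiency:.0%}")
```

A parallel efficiency below 50% is the quantitative face of the per-node slowdown: the remainder is spent in communication, which is exactly the cost the fast interconnect is there to minimize.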
That turns what used to be a two-week computation job into a one-day job!
Now let's have a look at the results of the computations using GPUs:
The performance using GPUs on one node is amazing: more than 500 wps. It also decreases in cluster mode because of the network bottleneck. But look: 4 nodes provide more than 1000 wps, which is quite a lot compared to CPU nodes. Unfortunately, GPU nodes don't scale as well because of memory restrictions: an NVIDIA Tesla K40 has only 12 GB of memory per GPU, and scalability is limited by this amount (more nodes in the cluster require more memory for operations inside each node).
The performance of such computations can be increased further by tweaking the batch size, tuning parameters, and aligning the data. But that is a different story; here we focused only on scalability.
In this overview we demonstrated how important and practical supercomputers can be for machine learning. Using general-purpose supercomputers, it is possible to speed up training, which is very important for rapid experimentation and delivery of machine-learning-based solutions. From the experiments we saw that a GPU cluster provides more performance per node, but its scalability is limited compared to CPU-only computations. So both approaches have pros and cons for different challenges and tasks.
It is important to mention that there are several ways to make deep learning faster:
- Software optimization (frameworks, algorithms and methods)
- Parameter tweaking (batch sizes, data structure optimization)
- System software stack configuration (different versions of MPI, usage of Intel MKL, etc.)
Another very interesting point, and a topic for another paper, is computation architectures. Intel promotes Xeon Phi systems with the MIC (Many Integrated Core) architecture as a GPU killer. NVIDIA builds specialized deep learning devices and specialized deep learning supercomputers like the DGX-1. Both companies propose special devices for deep learning that deliver more operations at lower precision, which is enough for deep learning but not for mathematical modelling. Who will win this battle? We don't know, but we will keep exploring both worlds, evaluating the outcomes, and applying the best methods to solve scientific and engineering challenges for us and our customers.