Limited-Memory Accelerators: Exciting News for GPU-Based and Distributed Machine Learning

Here at Neuromation, we are building a distributed platform where mining farms will be able to switch from mining cryptocurrencies, a pointless mathematical exercise designed simply to prove that you have spent computational resources, to what we call knowledge mining: useful computing, especially such computationally intensive tasks as training large neural networks.

The mining farms have huge computational power, but modern neural networks and especially datasets can be so large that they do not fit on a single GPU. That’s why we are always following the news about distributed learning. And it appears we have got some pretty exciting news.

At the NIPS 2017 conference, currently being held in Long Beach, CA, Swiss researchers from IBM Zurich and EPFL presented a paper on “efficient use of limited-memory accelerators”. What is a “limited-memory accelerator”, you might ask? It’s your GPU!

This somewhat generic figure from the paper shows “Unit A”, which has a lot of memory but limited computational power, and “Unit B”, which has much less memory but is much better at computation. This is exactly how a desktop computer with a CPU and lots of RAM relates to its video card, which has much less memory on board (say, 6GB instead of 32GB of RAM) but can train neural networks much faster.

In the paper, Dünner et al. present a framework where “Unit A” stores the entire dataset in its memory and chooses, in a smart way, which subset of the dataset to present to “Unit B” right now. Different points in a dataset have different utility for training: some points contain more information, uncover new important features, and so on. It turns out that the resulting algorithm can converge much faster than just randomly sampling points or their coordinates (which is the current standard approach). Judging by the experiments in the paper, the speedup is about 10x.
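To make the division of labor concrete, here is a minimal sketch in plain Python. It is not the algorithm from the paper: the model (ridge regression), the function names (`make_data`, `objective`, `train`), and above all the selection rule are assumptions chosen for illustration. As a stand-in for the paper's selection criterion, “Unit A” here simply picks the coordinates with the largest gradient magnitude, while “Unit B” runs exact coordinate descent on that small block only. Passing `select="random"` gives the random-sampling baseline for comparison.

```python
import random


def make_data(n=40, d=20, seed=0):
    """Hypothetical toy data: y depends on only a few of the d features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    w_true = [3.0 if j % 5 == 0 else 0.0 for j in range(d)]
    y = [sum(X[i][j] * w_true[j] for j in range(d)) + 0.1 * rng.gauss(0.0, 1.0)
         for i in range(n)]
    return X, y


def objective(X, y, w, lam):
    """Ridge objective: 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2."""
    res = [sum(xij * wj for xij, wj in zip(row, w)) - yi for row, yi in zip(X, y)]
    return 0.5 * sum(r * r for r in res) + 0.5 * lam * sum(v * v for v in w)


def train(X, y, lam, block_size, rounds, inner_passes, select="greedy", seed=0):
    """Block coordinate descent with a pluggable block selection rule."""
    if select not in ("greedy", "random"):
        raise ValueError("select must be 'greedy' or 'random'")
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = [-yi for yi in y]  # residual = Xw - y, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    history = [objective(X, y, w, lam)]

    for _ in range(rounds):
        # "Unit A": sees all coordinates and decides which block to ship.
        if select == "greedy":
            # Assumed stand-in score: absolute partial derivative per coordinate.
            score = [abs(sum(X[i][j] * residual[i] for i in range(n)) + lam * w[j])
                     for j in range(d)]
            block = sorted(range(d), key=lambda j: score[j], reverse=True)[:block_size]
        else:
            block = rng.sample(range(d), block_size)

        # "Unit B": only touches the columns in the block, several passes.
        for _ in range(inner_passes):
            for j in block:
                g = sum(X[i][j] * residual[i] for i in range(n)) + lam * w[j]
                step = g / (col_sq[j] + lam)  # exact minimizer along coordinate j
                w[j] -= step
                for i in range(n):
                    residual[i] -= step * X[i][j]

        history.append(objective(X, y, w, lam))
    return w, history


if __name__ == "__main__":
    X, y = make_data()
    for mode in ("greedy", "random"):
        _, hist = train(X, y, lam=0.1, block_size=5, rounds=20,
                        inner_passes=3, select=mode)
        print(mode, [round(h, 3) for h in hist[:6]])
```

Running the script prints the first few objective values for both selection modes, so the two strategies can be compared side by side on the toy problem. How large the gap is depends on the data and on the scoring rule, so this toy should not be read as reproducing the paper's numbers.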

The scope of this work is both a strength and a weakness. On the positive side, Dünner et al. present their results for a general form of block coordinate descent, a setting broad enough to cover many classical machine learning models. But, on the negative side, their results as stated in the paper apply only to convex generalized linear models, which means that they cover models like lasso regression and support vector machines (SVM) but not, unfortunately, deep neural networks. The whole point of having a deep neural network is to move beyond convex objective functions, and, alas, convexity is a pretty central assumption throughout all the theorems of Dünner et al. (2017).
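For readers who want to see what “convex generalized linear model” means in symbols, here is one common way to write such objectives. The notation is an assumption made for illustration and is not quoted from the paper: A is the data matrix, α is the vector of model parameters, f is a smooth convex function, and each g_i is a convex function of a single coordinate.

```latex
\min_{\alpha \in \mathbb{R}^n} \; f(A\alpha) + \sum_{i=1}^{n} g_i(\alpha_i)

% Lasso regression as an instance, with targets y and regularization weight \lambda > 0:
f(v) = \tfrac{1}{2}\,\lVert v - y \rVert_2^2, \qquad g_i(\alpha_i) = \lambda\,\lvert \alpha_i \rvert
```

The sum over separate g_i terms is what makes coordinate-wise updates natural here, and the convexity of f and g_i is exactly the assumption that deep neural networks break.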

Still, being able to train SVMs 10x faster on large-scale training sets on the order of tens of gigabytes is a great result which will no doubt be useful, for example, for many business applications. Hence, we hope that implementations of similar algorithms will find their way into the Neuromation platform.

Although the interaction between CPU and GPU on a desktop computer is the central example of these “Unit A” and “Unit B”, it is easy to imagine other scenarios. For example, “Unit B” can be one of the customized ASICs for training deep neural networks that Bitmain has recently announced. But, as I said, adapting this idea to modern deep neural networks will require more research insights. Let’s get to work…

Sergey Nikolenko,
Chief Research Officer at