High Accuracy & Faster Deep Learning with High Resolution Images & Large Models
[I published this story in December 2017 on LinkedIn. Moving to Medium to bring all my blogs here]
Deep learning has had a profound impact on our ability to build highly accurate AI models. In the field of computer vision, we have gone from a 26% error rate with classical machine learning models in 2011 to around 3% error rates with deep learning. As a result, machines can now see as well as humans on many vision tasks.
Large Data Sets Cause the Model Size to Explode
Most research papers and consumer use cases tend to use low-resolution images for training deep learning models, often as small as 256x256 pixels. In fact, many resize the ImageNet data set images down to this resolution.
Several enterprise use cases, however, require the use of high resolution images. For example, when working with medical images, resizing them to lower resolution can mean that what was a cancer lesion now becomes a dot on a smaller image.
The challenge in keeping the images large is that the deep learning model size explodes. If you are operating on an image that is 2,000 by 2,000 pixels for example, each layer can have millions of nodes.
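A quick back-of-envelope calculation shows why. The sketch below estimates the activation memory of a single convolutional layer; the 64-channel layer width and float32 storage are illustrative assumptions, not figures from this post:

```python
# Rough activation-memory estimate for one convolutional layer.
# Assumptions (hypothetical): 64 output channels, stride 1,
# float32 activations (4 bytes per value).
def activation_bytes(height, width, channels, bytes_per_value=4):
    return height * width * channels * bytes_per_value

small = activation_bytes(256, 256, 64)    # typical low-res input
large = activation_bytes(2000, 2000, 64)  # high-res input
print(f"256x256:   {small / 1e6:.0f} MB")    # ~17 MB
print(f"2000x2000: {large / 1e9:.2f} GB")    # ~1 GB for ONE layer
print(f"ratio: {large / small:.0f}x")        # ~61x larger
```

One such layer alone approaches a gigabyte, and a deep network has many layers plus weights and gradients, so the full model quickly outgrows GPU memory.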
PCIe Bottleneck for Large Models
When the model becomes very large because of the size of the input data, it no longer fits entirely in GPU memory. The model and data must instead be kept in the system memory attached to the CPU and moved to the GPU a piece at a time. The PCI-Express (PCIe) connection between the CPU and GPU, however, becomes a bottleneck for this traffic. Data scientists must either accept very long training times or compromise: reduce the size of the images (losing accuracy) or tile the image into patches, which makes it hard to train a model on the full input data set.
Large Model Support (LMS) in Power9 with Volta GPUs
The new POWER9 CPU-based IBM AC922 Power System overcomes this limitation for large models. The POWER9 processor embeds the high-speed next-generation NVIDIA NVLink interface directly in the processor chip, enabling direct communication between the POWER9 CPU and the NVIDIA Volta-based Tesla V100 GPUs at 150 gigabytes per second each.
High-Level Diagram of CPU-GPU Connections in the Power9-based IBM AC922 Power System with a 4-GPU configuration
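The bandwidth difference translates directly into transfer time. A back-of-envelope comparison (the 300 GB of traffic per pass is a hypothetical figure; PCIe 3.0 x16 peaks near 16 GB/s, versus the 150 GB/s CPU-GPU NVLink on the AC922):

```python
# Illustrative transfer-time comparison between the two CPU-GPU links.
def transfer_seconds(gigabytes, gb_per_s):
    return gigabytes / gb_per_s

data_gb = 300  # hypothetical: model pieces + data streamed per pass
pcie = transfer_seconds(data_gb, 16)      # PCIe 3.0 x16, ~16 GB/s peak
nvlink = transfer_seconds(data_gb, 150)   # AC922 CPU-GPU NVLink
print(f"PCIe:   {pcie:.2f} s per pass")   # 18.75 s
print(f"NVLink: {nvlink:.2f} s per pass") # 2.00 s
```

With roughly 9x the bandwidth, the time spent shuttling model pieces shrinks enough that the GPU is no longer starved for data.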
We utilized this CPU-GPU NVLink connection to build a module called “Large Model Support” (LMS) into our PowerAI deep learning enterprise software distribution. LMS keeps the model and data in the system memory attached to the POWER9 CPU and moves pieces of the model to the GPU as they are needed for computation. Only a layer, or a few layers, needs to reside on the GPU at any time, and the super-fast NVLink connection eliminates the communication bottleneck.
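The scheme above can be sketched in a few lines. This is a hypothetical simulation of the idea, not the PowerAI implementation: all layer "weights" live in host memory, and only the currently active layer is staged onto the (simulated) accelerator for computation.

```python
class OffloadedModel:
    """Minimal sketch of layer-at-a-time offloading: every layer is
    kept in host memory, and each is "copied" to the device only for
    the duration of its computation."""

    def __init__(self, layers):
        self.host_layers = layers  # all layers resident in host RAM

    def forward(self, x):
        for name, fn in self.host_layers:
            # In a real system this staging copy rides the CPU-GPU
            # link (PCIe, or NVLink on the AC922); here it is a no-op.
            device_fn = fn          # "copy layer to device"
            x = device_fn(x)        # compute on device
            # The layer's weights can now be evicted; only the
            # activations flow on to the next layer.
        return x

# Two toy "layers" standing in for real network layers.
model = OffloadedModel([
    ("scale", lambda v: v * 2),
    ("shift", lambda v: v + 1),
])
print(model.forward(10))  # → 21
```

Peak device memory in this scheme is bounded by one layer plus its activations rather than the whole network, which is what makes high-resolution inputs feasible; the cost is the staging traffic, which the fast NVLink connection absorbs.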
We implemented large model support (LMS) in the Chainer and Caffe deep learning frameworks and saw speed-ups of 3.7–3.8x over an Intel x86-based server, as shown in the charts below. The comparison is between an AC922 server with four NVIDIA Tesla V100 GPUs and a server with two Intel Xeon E5-2640 v4 CPUs and four NVIDIA Tesla V100 GPUs. We ran 1000 iterations of an enlarged GoogleNet model on an enlarged ImageNet data set (2240x2240).
As these results show, the large model support (LMS) feature gives a considerable saving in training time when dealing with large data sets and enables us to use large images without reducing image resolution. Also, although we used images to demonstrate the idea, the same feature is applicable to any data set where each input data point is large.
For more recent results and technical details, read this blog on large model support in TensorFlow for 3D image segmentation.
Learn More about PowerAI
PowerAI is a software suite based on open-source AI frameworks like TensorFlow, PyTorch, etc. Learn more in these blogs:
- Distributed Deep Learning (DDL): Scaling TensorFlow and Caffe to 256 GPUs
- SnapML: GPU-Accelerated Machine Learning (Logistic & Linear regression and SVMs)
- PowerAI Vision: Auto-Deep Learning for Videos and Images
Details on the benchmark results presented in this blog:
Hardware Server Setup
- IBM: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4x Tesla V100 GPUs, Pegas 1.0
- Competitive: 2x Intel Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 1024 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04
Software for Chainer
- IBM internal measurements running 1000 iterations of an enlarged GoogleNet model on an enlarged ImageNet data set (2240x2240).
- Software: Chainer v3 with LMS/out-of-core support, CUDA 9 / cuDNN 7, with patches found at GitHub (patch 1 and patch 2).
Software for Caffe
- IBM internal measurements running 1000 iterations of an enlarged GoogleNet model (mini-batch size=5) on an enlarged ImageNet data set (2240x2240).
- Software: IBM Caffe with LMS; source code at GitHub (full details available).