dKeras: Make Keras up to 30x faster with a few lines of code

Stephen
Nov 25, 2019 · 7 min read

Introduction

Reducing server costs through faster inference and cheaper hardware is a primary concern for industrial deep learning applications. This article will explain why deep learning is making a transition from the workstation to the data center and how dKeras, a distributed framework built on Ray and Keras, can be used to drastically improve performance and reduce cost through distributed inference. The first section of the article is a general analysis of distributed deep learning. The technical details of how to apply Ray and dKeras are found in the second half. The following graph shows how dKeras can speed up Keras by up to 30x on a single machine.

Distributed Deep Learning

Distributed deep learning refers to both distributed training and inference. This section will focus mainly on the technical details of distributed inference, but will touch on some industrial examples of distributed training.

Distributed inference algorithms fall into two groups, model parallel algorithms and data parallel algorithms. In the model parallel approach, a single model is partitioned across multiple computational devices and nodes. Though there are many ways to partition a model across nodes, in a simple example, the outputs of one node’s model partition become the inputs to the next node’s model partition. This technique targets large models that cannot fit on one node, perhaps due to memory constraints. In contrast, smaller models typically leverage data parallelism. In this case, each node contains one or more copies of the model, and requests are split into batches with each batch being sent to a different model copy. The outputs of all the copies are then aggregated and returned.
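To make the data parallel pattern concrete, here is a minimal sketch using Ray actors. This is a hand-rolled toy, not dKeras's implementation; the choice of MobileNet, four workers, and the batch-splitting logic are illustrative assumptions.

import numpy as np
import ray
from tensorflow.keras.applications import MobileNet

ray.init()

@ray.remote
class InferenceWorker:
    # Each worker process holds its own copy of the model (data parallelism).
    def __init__(self):
        self.model = MobileNet(weights=None)  # random weights, for illustration only

    def predict(self, batch):
        return self.model.predict(batch)

# Split the incoming requests into batches, send one batch to each worker,
# then gather and aggregate the outputs.
workers = [InferenceWorker.remote() for _ in range(4)]
data = np.random.uniform(-1, 1, (64, 224, 224, 3)).astype("float32")
batches = np.array_split(data, len(workers))
futures = [w.predict.remote(b) for w, b in zip(workers, batches)]
predictions = np.concatenate(ray.get(futures))  # shape: (64, 1000)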

These techniques can be combined for applications that require large data throughput as well as large model sizes. An example might be a multi-model data processing pipeline, or a hosting server that applies data parallelism to a super-network too large to fit in a single accelerator.

Certain fields of study require some form of distributed training, such as active learning, federated learning, and natural language processing (NLP). Active learning can use distributed training through user-query systems that often run in the browser. Federated learning is similar: because of privacy regulations, user data cannot be sent to a central server, so training must stay distributed across devices. Language models are the best example of the importance of distributed training, as they can be much larger than computer vision models. BERT-Large has roughly 340 million parameters and underscores the importance of data center AI training: NVIDIA trained it with 1,472 V100 GPUs across 92 DGX-2H nodes in a DGX SuperPOD, and the largest language models at the time of writing reach 8.3 billion parameters. This shows another trend in AI computing, where processors are clustered into a single product which is then connected to even more of these clusters, such as DGX SuperPODs or TPU Pods. This race for computation is becoming a necessity: estimates suggest that the compute used to train state-of-the-art models doubles every 3.4 months, while Moore's Law has slowed in recent years, so large-scale distribution will only grow more important (image credit: https://openai.com/blog/ai-and-compute/)

Constraints with Single Machine Systems

The limited memory of a GPU limits feasible model sizes. With the increased popularity of mobile applications, a new generation of smaller models has been proposed, such as SqueezeNet and MobileNet, that sacrifice accuracy for speed. But there has also been an increase in the size of some of the new state-of-the-art models, such as FixResNeXt-101 32x48d (829M parameters), GPIPE (557M parameters), and SENet-154 (146M parameters). Storing just the single-precision floating-point weights of these models would require roughly 3.3 GB, 2.2 GB, and 0.6 GB, respectively. This doesn't take into account the batch of inputs and activations required for the forward pass, nor the gradients for the backward pass. While these models may still fit on a single GPU, they leave little room to take advantage of distributed inference or training. The relationship between model size and ImageNet accuracy can be clearly seen in the following graph:

Ball chart reporting Top-1 and Top-5 accuracy vs. computational complexity. Top-1 and Top-5 accuracy (using only the center crop) is plotted against the floating-point operations (FLOPs) required for a single forward pass; the size of each ball corresponds to the model complexity. (Credit: https://arxiv.org/pdf/1810.00736.pdf)
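As a quick check on the memory figures quoted above, the arithmetic is simply the parameter count times four bytes per single-precision weight (parameter counts as listed earlier):

# Approximate memory needed just to hold float32 weights (4 bytes per parameter).
def weights_gb(n_params):
    return n_params * 4 / 1e9

for name, params in [("FixResNeXt-101 32x48d", 829e6),
                     ("GPIPE", 557e6),
                     ("SENet-154", 146e6)]:
    print(f"{name}: ~{weights_gb(params):.1f} GB")
# Prints roughly 3.3 GB, 2.2 GB, and 0.6 GB respectively.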

Memory size can also be seen to affect inference performance in the following graph, which compares model accuracy to frames per second (FPS) on an Nvidia Titan Xp GPU (Credit: https://arxiv.org/pdf/1810.00736.pdf):

With multi-model pipelines, the problem is exacerbated by the number of required GPUs, especially once the workload spans multiple compute nodes and data must be transferred between nodes and then on to each node's GPUs. Typically this requires a combination of customized hardware and fine-tuned software; on ordinary computing systems, it can add considerable cost and development time.

dKeras and Ray

Ray addresses these issues with an easy-to-use hardware allocation API that allows for GPU partitioning and makes distributing work across GPUs straightforward. With an emphasis on real-time execution, Ray simplifies the development process while still providing good performance, in part by using shared memory instead of copying data between processes. Ray also drastically reduces development time by providing a fast way to create complicated hierarchical execution graphs across multiple nodes. However, while this solves the issue from a technical perspective, further cost improvements can be made by using CPU-only solutions.
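To illustrate the hardware allocation API mentioned above, the sketch below asks Ray for a quarter of a GPU per task so that four tasks can share one device. It assumes a machine with at least one GPU; how the application partitions work across those slices is up to the developer.

import ray

ray.init()  # on a cluster, ray.init(address="auto") connects to the head node

@ray.remote(num_gpus=0.25)  # four of these tasks can run on a single GPU
def gpu_task():
    # Ray sets CUDA_VISIBLE_DEVICES so each task only sees its assigned GPU.
    return ray.get_gpu_ids()

print(ray.get([gpu_task.remote() for _ in range(4)]))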

While GPUs have traditionally been used for inference, a rising trend with the advent of CPU-optimized deep learning frameworks has been to transition from accelerators to distributed CPU inference. This has taken place in part due to advancements in CPU software acceleration for deep learning over the past few years. A recent example of successful framework optimization is an Intel Xeon Platinum 9282 CPU beating the ResNet-50 inference record held by an Nvidia V100 GPU, reaching 7,878 FPS versus the GPU's 7,844 FPS.

To make distributed inference easier for developers, we built dKeras, a Ray-based framework that wraps Keras, the popular deep learning framework that runs on top of TensorFlow, CNTK, Theano, or PlaidML. By using the same familiar API as Keras, this new framework makes the transition between serial and distributed deep learning systems much easier, instead of requiring large distributed frameworks that often demand major system redesigns. Paired with Ray, this allows developers to create test systems on their laptops or workstations and then run on the cloud by changing only a few lines of code.

Installation:

pip install git+https://github.com/dkeras-project/dkeras

Example:

https://gist.github.com/gndctrl2mjrtm/095111f0dc510ef668e7ada09c3c1573
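The gist linked above contains the full example; for reference, basic usage looks roughly like the sketch below. This follows the project's README at the time of writing, so keyword arguments such as n_workers may differ between versions.

import numpy as np
from tensorflow.keras.applications import ResNet50
from dkeras import dKeras

# Wrap a standard Keras model; dKeras starts Ray worker processes behind the scenes.
model = dKeras(ResNet50, n_workers=4)

data = np.random.uniform(-1, 1, (100, 224, 224, 3)).astype("float32")
preds = model.predict(data)  # the same predict() call as plain Keras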

With CPUs, data parallelism is an effective way to greatly increase inference performance, and CPU systems can also hold much larger models in memory: a CPU-based server can hold terabytes of DDR4 RAM, orders of magnitude more than a GPU. Bottlenecks to consider are the amount of data shared between workers, the transfer speed, multi-core utilization, and the implementation of the communication between workers and the parameter server. Since multiple workers request data from a parameter server, access conflicts can degrade performance. As a result, inference performance scales sublinearly with the number of workers, as shown in the graph below:

A drawback to data parallel inference is that in order to initialize the worker processes, a copy of the model has to be set up on each process. While these live in separate processes, there is still a one-time cost per worker, since each worker must get the weights from the shared object store, create the model, and load the weights. With dKeras, the setup cost versus the number of workers is shown in this graph:
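The initialization pattern described above can be sketched roughly as follows: the weights go into Ray's shared object store once, and each worker then builds its own model and loads them. This is a simplified illustration of the idea, not dKeras's internal code.

import ray
from tensorflow.keras.applications import MobileNet

ray.init()

# Place the weights in the shared object store once; every worker reads the
# same object instead of receiving its own copy from the driver.
weights_ref = ray.put(MobileNet(weights=None).get_weights())

@ray.remote
class Worker:
    def __init__(self, weights):
        # One-time cost per worker: build the model, then load the shared weights.
        self.model = MobileNet(weights=None)
        self.model.set_weights(weights)

workers = [Worker.remote(weights_ref) for _ in range(4)]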

However, after paying the initialization cost, we see a significant increase in CPU performance. In the tests shown in the following graph, we observe improvements of up to 30x without any changes to the model or the underlying software. Typically, CPU optimization requires building the framework against CPU-optimized libraries such as Intel MKL-DNN, targeting instruction sets such as AVX-512, or optimizing the model itself through layer-precision changes, but dKeras speeds up inference with only a few additional lines of Python.

Original Keras FPS vs. dKeras FPS using default TensorFlow 1.14.0 on Xeon 8280

However, this improvement was measured on a large multi-core processor where a single model used only a small fraction of the available compute. This level of speedup should not be expected on significantly smaller processors.

dKeras is currently at a very early stage and only supports data parallel inference. Future plans for the library include model parallel inference, distributed training, automatic hardware allocation, distributed GANs, and many more applications of distributed deep learning. This emerging field of machine learning has already had a major impact on medical, scientific, and industrial systems. With Ray, it is simple to build powerful distributed deep learning applications.
