
Horizontally Fused Training Array (HFTA): An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

Note 1: SOTA stands for State-Of-The-Art.

Note 2: The paper improves hardware utilization at the algorithmic level. For details of the model-fusion algorithm, see the appendices of the paper. This summary only conveys the main idea together with the necessary background.


Deep Learning (DL) has become ubiquitous, and in recent years the cost of training new models has risen sharply. The authors of HFTA [1] analyze the hardware efficiency achieved by typical DL training jobs. Their study reveals that single-accelerator (single-GPU) training jobs can dominate cluster-wide resource consumption (46.2% of the cluster-wide total GPU hours) when launched repetitively, typically for hyperparameter tuning. However, these jobs severely underutilize the hardware. The authors further observe that (1) the models in such jobs often have the same types of operators with the same shapes, and (2) the inter-model horizontal fusion of such operators is mathematically equivalent to other, already well-optimized operators. They therefore propose HFTA, a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply, down to the operators, and then trains them simultaneously on a shared accelerator to improve hardware utilization. They evaluate HFTA on six DL models trained on state-of-the-art accelerators (GPUs and TPUs). The results show up to 15.1X higher training throughput than the standard practice of running each job on a separate accelerator.

Background and Motivation


It is worth noting that the amount of compute used for training DL models doubles every 3.4 months, outpacing Moore’s law [ref].


Neural (Model) Architecture Search (NAS)

One crucial aspect of the progress of neural networks is their architecture, which is defined by a set of hyperparameters. The architectures in use today were mostly developed manually by human experts through trial and error, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated Neural Architecture Search (NAS) methods. A survey paper on NAS provides an overview of existing work in this field of research.

Convergence Stability Testing

This method trains the same model many times with different random seeds to verify that the final accuracy is stable across runs.

Improving Hardware Utilization for DL Training Jobs is not Easy!

Improving utilization is hard for two reasons. First, DL researchers and practitioners often lack the systems and architecture expertise needed to optimize their training workloads. The only naive option available to them is to increase the mini-batch size. However, large mini-batch sizes can cause problems such as increased Time-To-Accuracy (TTA) [2], training instability in GANs [3,4], a generalization gap [5], and diminishing returns due to the mini-batch scaling limit [6]. Second, accelerators keep evolving toward more compute power, with more specialized compute units and larger memory capacity and bandwidth, which makes it increasingly hard for a single training job to keep them fully utilized.

Hardware Sharing

The hardware-based sharing solutions applicable to DL training jobs are the Multi-Process Service (MPS) and Multi-Instance GPU (MIG).

Multi-Process Service (MPS): MPS allows CUDA kernels from different processes to potentially run concurrently on the same GPU via a hardware feature called Hyper-Q.

Nvidia Hyper-Q Technology

Multi-Instance GPU (MIG): This capability is available on the recent A100 GPUs, which ship, for example, in the DGX A100 machine. That machine consists of 8 A100 GPUs, a pair of 64-core AMD server chips, 1TB of RAM, and 15TB of NVMe storage. MIG partitions a single GPU into multiple (up to 7) isolated GPU instances (GIs), and each job runs on a single GI.

The authors of HFTA mention the following downsides of the MPS and MIG mechanisms:

  1. Both MPS and MIG duplicate the runtime overhead among kernels from different training jobs, including kernel launches [7], GEMM setup and teardown, and memory format conversions.
  2. Both MPS and MIG require running training jobs as separate processes, which duplicates the GPU memory overhead reserved by the DL framework stack and leads to a higher overall GPU memory footprint.
  3. MIG’s partitioning can be too coarse for many training workloads, even at its finest granularity.

An Nvidia A100 (40GB) GPU consists of 8 x 5GB memory slices and 7 compute slices, where each compute slice groups several streaming multiprocessors (SMs).

The following example shows how a 5GB memory slice is combined with 1 compute slice to create a 1g.5gb GPU Instance (GI).
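The slice arithmetic above, and the coarse-granularity downside mentioned by the authors, can be sketched in a few lines of Python. This is only an illustration: the `GpuInstance` class, the `wasted_compute_fraction` helper, and the job that needs 5% of the GPU's compute are all hypothetical and are not taken from the paper or from Nvidia's tooling.

```python
from dataclasses import dataclass

# Slice counts for an A100 (40GB), as described in the text above.
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8
MEMORY_SLICE_GB = 5


@dataclass(frozen=True)
class GpuInstance:
    """Hypothetical model of a MIG GPU instance (GI)."""
    compute_slices: int
    memory_slices: int

    @property
    def name(self) -> str:
        # Profile naming: <compute slices>g.<memory in GB>gb
        return f"{self.compute_slices}g.{self.memory_slices * MEMORY_SLICE_GB}gb"


def wasted_compute_fraction(job_demand: float, gi: GpuInstance) -> float:
    """Fraction of the GI's compute left idle.

    `job_demand` is the share of the whole GPU's compute the job can use
    (an assumed, illustrative quantity).
    """
    gi_share = gi.compute_slices / TOTAL_COMPUTE_SLICES
    return max(0.0, 1.0 - job_demand / gi_share)


# The finest-grained instance: 1 compute slice + 1 memory slice.
smallest = GpuInstance(compute_slices=1, memory_slices=1)
print(smallest.name)  # 1g.5gb

# A hypothetical small job that can only use 5% of the GPU's compute
# still leaves most of even the smallest GI idle.
wasted = wasted_compute_fraction(0.05, smallest)
print(round(wasted, 2))  # 0.65
```

Under these assumed numbers, even the finest partition (one seventh of the GPU) is far larger than what the job can use, which is the kind of mismatch the third downside refers to.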

The Proposed Mechanism: HFTA

Based on two key observations, the researchers [1] propose HFTA to address the underutilization challenge.

(1) When launched repetitively, the models used across these jobs often have the same types of operators with the same shapes.

(2) Horizontally fusing the same types of operators with the same shapes often results in other mathematically equivalent operators that already exist in many SOTA DL models and thus have been optimized in most DL framework stacks on different accelerators.
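Observation (1) can be made concrete with a small sketch that groups jobs by the types and shapes of their operators. The job descriptions, the `operator_signature` helper, and the `group_fusible` function are hypothetical and only illustrate the idea; this is not how HFTA itself decides what to fuse.

```python
from collections import defaultdict

# Hypothetical job descriptions: each repetitive launch differs only in
# hyperparameters (learning rate, seed), not in operator types or shapes.
jobs = [
    {"name": "job0", "lr": 0.1, "seed": 0,
     "ops": [("Conv2d", (64, 3, 11, 11)), ("Linear", (10, 4096))]},
    {"name": "job1", "lr": 0.01, "seed": 1,
     "ops": [("Conv2d", (64, 3, 11, 11)), ("Linear", (10, 4096))]},
    {"name": "job2", "lr": 0.1, "seed": 0,
     "ops": [("Conv2d", (32, 3, 3, 3)), ("Linear", (10, 2048))]},
]


def operator_signature(job):
    """Ordered (operator type, weight shape) pairs; hyperparameters are ignored."""
    return tuple(job["ops"])


def group_fusible(jobs):
    """Group job names whose models share the same operator types and shapes."""
    groups = defaultdict(list)
    for job in jobs:
        groups[operator_signature(job)].append(job["name"])
    return list(groups.values())


fusible_groups = group_fusible(jobs)
print(fusible_groups)  # [['job0', 'job1'], ['job2']]
```

Here `job0` and `job1` differ only in learning rate and seed, so their operators line up one to one, while `job2` has different shapes and ends up in its own group.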

The following figure shows the primary idea of the HFTA. The first operators in both models are Conv2d of the same shape; the horizontal fusion of many Conv2d operators is mathematically equivalent to a grouped Conv2d which is already used in ResNeXt [8] and MobileNets [9] models and supported by cuDNN on Nvidia GPUs and XLA on TPUs [ref].


As depicted above, many training jobs can be fused into a single job. This does not require implementing any new device-specific operator from scratch, which would be both time-consuming and error-prone. Moreover, the approach generalizes to any hardware backend that the DL framework already supports. Note that horizontal fusion can be applied to both single-accelerator and distributed training. However, manually rewriting existing training workloads into fused ones can be challenging for DL researchers and practitioners, so the authors developed a new DL framework extension library called HFTA. They chose PyTorch for its user-friendliness and growing popularity within the ML community. Using the library requires changing only a few lines of code. The following figure shows how to enable HFTA for AlexNet: the model definition stays the same, with only a few extra lines of code to update PyTorch’s operator classes.
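The idea of swapping operator classes while leaving the model definition untouched can be sketched in plain PyTorch. Note that `FusedConv2d` and `make_features` are hypothetical names introduced for illustration and are not HFTA's actual API; the sketch covers only the first two convolution layers of an AlexNet-style model and assumes every fused model receives an input of the same shape.

```python
import functools

import torch
import torch.nn as nn


class FusedConv2d(nn.Conv2d):
    """Hypothetical stand-in: B same-shaped Conv2d layers stored as one grouped Conv2d."""

    def __init__(self, num_models, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__(num_models * in_channels, num_models * out_channels,
                         kernel_size, groups=num_models, **kwargs)


def make_features(conv_cls):
    # The model definition is written once; only the operator class changes.
    return nn.Sequential(
        conv_cls(3, 64, kernel_size=11, stride=4, padding=2),
        nn.ReLU(inplace=True),  # elementwise, so it works unchanged on fused tensors
        conv_cls(64, 192, kernel_size=5, padding=2),
        nn.ReLU(inplace=True),
    )


B = 4  # number of models trained together (assumed value)
single = make_features(nn.Conv2d)
fused = make_features(functools.partial(FusedConv2d, B))

x = torch.randn(2, 3, 64, 64)
fused_x = torch.cat([x] * B, dim=1)  # one input per model, stacked along channels

out_single = single(x)
out_fused = fused(fused_x)
print(out_single.shape, out_fused.shape)
```

The fused output carries `B` times as many channels as the single-model output, with one contiguous block of channels per model. Operators other than convolution would need their own fused counterparts, which this sketch does not cover.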


The following figure illustrates that the fusion of convolution operators is equivalent to a grouped convolution, obtained by concatenating (1) the inputs along the channel dimension, and (2) the weights (filters) and biases along the output-channel dimension.
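This equivalence can be checked numerically with PyTorch's built-in grouped convolution. The sketch below is a minimal check under assumed tensor sizes (`B`, `N`, `C_in`, `C_out`, `K`, `H`, `W` are arbitrary illustrative values), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B = 3                          # number of models fused together
N, C_in, C_out, K = 4, 8, 16, 3
H = W = 10

# Independent inputs, weights, and biases: one set per model.
inputs = [torch.randn(N, C_in, H, W, dtype=torch.float64) for _ in range(B)]
weights = [torch.randn(C_out, C_in, K, K, dtype=torch.float64) for _ in range(B)]
biases = [torch.randn(C_out, dtype=torch.float64) for _ in range(B)]

# Reference: run each model's Conv2d separately.
separate = [F.conv2d(x, w, b, padding=1)
            for x, w, b in zip(inputs, weights, biases)]

# Fused: concatenate inputs along the channel dimension, and weights and
# biases along the output-channel dimension, then run one grouped conv.
fused_input = torch.cat(inputs, dim=1)     # [N, B*C_in, H, W]
fused_weight = torch.cat(weights, dim=0)   # [B*C_out, C_in, K, K]
fused_bias = torch.cat(biases, dim=0)      # [B*C_out]
fused_out = F.conv2d(fused_input, fused_weight, fused_bias,
                     padding=1, groups=B)  # [N, B*C_out, H, W]

# Split the fused output back into one tensor per model and compare.
per_model = torch.chunk(fused_out, B, dim=1)
matches = all(torch.allclose(a, b) for a, b in zip(per_model, separate))
print(tuple(fused_out.shape), matches)
```

With `groups=B`, each group of `C_out` output channels only sees its own group of `C_in` input channels, so the models stay mathematically independent even though they share one kernel launch.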



In this work, the authors learn from a GPU cluster usage analysis that repetitive single-accelerator training jobs (usually for hyperparameter tuning) dominate cluster-wide hardware resource usage and can severely underutilize the hardware. They observe specific characteristics of these jobs that enable inter-model horizontal fusion. Hence, they propose the HFTA library, which horizontally fuses the models down to the operators and significantly improves hardware utilization by training many models simultaneously on the same accelerator. On six highly impactful DL models, HFTA achieves up to 15.13X higher training throughput than running each job on a separate accelerator, which is the common practice in hyperparameter tuning frameworks.


[1] Wang, Shang, et al. “Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models.” Proceedings of Machine Learning and Systems 3 (2021).

[2] A. Koliousis, P. Watcharapichat, M. Weidlich, L. Mai, P. Costa, and P. Pietzuch, “Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers”, Proc. VLDB Endow., vol. 12, no. 11, pp. 1399–1412, Jul 2019.

[3] A. Brock, J. Donahue, and K. Simonyan, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, CoRR, vol. abs/1809.11096, 2018.

[4] Open Question about GANs, accessed 26–11–2021,

[5] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, CoRR, vol. abs/1609.04836, 2016.

[6] C. J. Shallue, J. Lee, J. M. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the Effects of Data Parallelism on Neural Network Training”, CoRR, vol. abs/1811.03600, 2018.

[7] D. Lustig and M. Martonosi, “Reducing GPU offload latency via fine-grained CPU-GPU synchronization”, 2013 IEEE 19th International Symposium on High-Performance Computer Architecture (HPCA), 2013, pp. 354–365, DOI: 10.1109/HPCA.2013.6522332.

[8] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated Residual Transformations for Deep Neural Networks”, CoRR, vol. abs/1611.05431, 2016.

[9] A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, CoRR, vol. abs/1704.04861, 2017.


