Toward GPU Utilization Prediction for Cloud Deep Learning
Deep learning has made a significant impact across many fields of computing. Its workloads have high compute and memory requirements, and graphics processing units (GPUs) are the primary accelerators for running them. The current challenge is that GPUs on servers are underutilized, both because GPUs lack fine-grained sharing capability and because of the policies adopted by resource managers (job schedulers). To increase GPU utilization, some proposals provision these devices to several jobs at once. However, hardware resource interference can then degrade performance, so online profiling is a common way to monitor the system. This approach, in turn, reduces resource availability. The mechanism proposed by Yeung et al. uses a machine learning model to predict the GPU utilization of a deep learning model from information in the model’s computation graph. Their evaluations show GPU cluster utilization improvements of 61.5% and 47.1% compared with slot-based schedulers like Kubernetes and with an online profiling mechanism that profiles for one minute, respectively.
The proposed utilization estimator engine sits between the cluster queue and submission portal as shown in the following figure.
They experiment with different models and data sets to study the relationship between deep learning workload characteristics and GPU utilization. The following table shows the models they used and their configurations, including mini-batch sizes, hidden dimensions, and the number of layers.
The following figure shows how utilization for models changes.
The prediction engine iterates through the model and calculates the FLOPs for each operation based on its inputs, output shape, and parameters. For example, the FLOPs of a standard matrix multiplication are calculated as:
Input Shape * Output Shape * Batch Size
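As a minimal sketch of the formula above, the following Python function multiplies the three quantities. The function and parameter names are hypothetical and do not come from the paper. The sketch follows the formula as written; note that some tools count a multiply-accumulate as two FLOPs, which would double the result.

```python
def linear_flops(input_size: int, output_size: int, batch_size: int) -> int:
    """FLOPs estimate for a matrix multiplication, following the
    formula above: Input Shape * Output Shape * Batch Size.

    The name and signature are illustrative assumptions.
    """
    return input_size * output_size * batch_size


# A layer mapping 512 inputs to 256 outputs with a mini-batch of 32.
print(linear_flops(512, 256, 32))  # 512 * 256 * 32 = 4194304
```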
They model the LSTM cell as two linear layers, because an LSTM cell performs matrix multiplications between the cell weights and its inputs, namely the input embedding and the hidden state. Once the result is split into gates, the gate computation can be modeled as activations. The following figure shows the relation between FLOPs and GPU utilization.
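The LSTM approximation described above could be sketched as follows. This is an illustration under stated assumptions rather than the authors' exact accounting: it assumes a standard LSTM with four gates, so each linear layer projects to 4 × hidden size, and it charges one operation per gate element for the activations. All names are hypothetical.

```python
def linear_flops(input_size: int, output_size: int, batch_size: int) -> int:
    # Matrix multiplication cost: Input Shape * Output Shape * Batch Size.
    return input_size * output_size * batch_size


def lstm_cell_flops(input_size: int, hidden_size: int, batch_size: int) -> int:
    """Approximate one LSTM cell step as two linear layers plus activations."""
    # Assumption: a standard LSTM has four gates (input, forget, cell, output),
    # so both projections produce 4 * hidden_size values per sample.
    gate_width = 4 * hidden_size
    input_proj = linear_flops(input_size, gate_width, batch_size)
    hidden_proj = linear_flops(hidden_size, gate_width, batch_size)
    # Assumption: each gate element costs one operation to activate.
    activations = gate_width * batch_size
    return input_proj + hidden_proj + activations


print(lstm_cell_flops(128, 256, 32))  # 4194304 + 8388608 + 32768 = 12615680
```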
The following table lists all features that are used for model training:
In total, 81 samples are split 80%/20% into training and test sets. They evaluate several regression models and select the random forest because it yields the lowest Root Mean Square Log Error (RMSLE), 0.154.
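To make the evaluation setup concrete, here is a small sketch of an 80%/20% split and of the RMSLE metric using only the Python standard library. The helper names, the seed, and the rounding of the test-set size (rounded up here) are assumptions; the source does not state the exact counts, and the regression model itself is omitted.

```python
import math
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the test-set size is rounded up.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmsle(y_true, y_pred):
    """Root mean square log error: sqrt(mean((log(1+p) - log(1+t))^2))."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


train, test = train_test_split(range(81))
print(len(train), len(test))  # 64 17
```

Because RMSLE compares logarithms, it penalizes relative rather than absolute errors, which suits a target such as utilization that spans a wide range across workloads.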
The evaluations confirm that job completion time (JCT) increases, as was shown in earlier work. This is due to the overallocation of GPU resources. The following figure shows the correlation between JCT and GPU utilization. Nevertheless, in collocation scheduling, the utilization estimates can guide which jobs to collocate, increasing utilization while keeping the performance overhead manageable.
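One way utilization estimates could drive collocation is a first-fit-decreasing packing, sketched below. This is a hypothetical policy for illustration, not the scheduler evaluated in the paper. It assumes predicted utilizations are percentages that add up linearly on a shared GPU and that a GPU can be filled to a `capacity` of 100%. Interference between jobs, which is the cause of the JCT increase discussed above, is ignored.

```python
def collocate(predicted_util, capacity=100.0):
    """Pack jobs onto GPUs so that summed predicted utilization <= capacity.

    predicted_util maps a job name to its predicted GPU utilization (%).
    Returns a list of GPUs, each a dict with its jobs and total load.
    """
    gpus = []
    # Place the most demanding jobs first (first-fit decreasing).
    ordered = sorted(predicted_util.items(), key=lambda kv: kv[1], reverse=True)
    for job, util in ordered:
        for gpu in gpus:
            if gpu["load"] + util <= capacity:
                gpu["jobs"].append(job)
                gpu["load"] += util
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append({"jobs": [job], "load": util})
    return gpus


jobs = {"a": 70, "b": 40, "c": 30, "d": 55}
for gpu in collocate(jobs):
    print(gpu["jobs"], gpu["load"])
# ['a', 'c'] 100
# ['d', 'b'] 95
```

Lowering `capacity` below 100 would leave headroom on each GPU, trading some utilization for less interference.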
The approach could be extended in the following ways for better results:
- Enlarging the data set by adding more configurations and more model types, such as GANs and GNNs
- Generalizing it to other processors, such as FPGAs and other accelerators
- Considering deep learning compilers like TVM. As an example of why this matters, when executing a convolution, cuDNN selects the best algorithm for the layer configuration. Knowing about such decisions would increase the predictor’s accuracy.
- Considering distributed training
- More intelligent collocation scheduling policies
GPU provisioning is one of the common solutions to the GPU underutilization problem. The mechanism proposed by Yeung et al. estimates GPU utilization from high-level model information obtained from the computation graph. These estimates can be used to improve GPU cluster utilization by collocating jobs. However, performance degradation (increased job completion time) remains the accompanying challenge, because the approach relies on a very high-level utilization metric.
Yeung, Gingfung, et al. “Towards GPU utilization prediction for cloud deep learning.” 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 20). 2020.
“Random Forest Regression.” Link: https://bit.ly/37zUSj4. Accessed: 10-05-2022.
Xiao, Wencong, et al. “Gandiva: Introspective cluster scheduling for deep learning.” 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018.