One of the features that I am dealing with is GPU resource management. The requirement is that multiple jobs are scheduled, each job is a Python script that may require a number of GPUs, and the scheduler needs to distribute the GPUs evenly among these scripts. Kubernetes GPU scheduling will eventually be the solution, but the platform that I am working on is not ready to adopt it yet. So I came up with a very simple solution that is usable in the short term.
The idea is to use CUDA_VISIBLE_DEVICES to control GPU allocation. When a script starts, it queries for available GPUs and then sets the environment variable to acquire them. If there are not enough GPUs, the script simply fails and the scheduler schedules it again later, hopefully with enough free GPUs next time. The code is as follows:
On startup, the script calls auto_acquire_gpus with the number of GPUs it wants to use. Inside this function, I use gpustat, a Python wrapper of nvidia-smi, to query the number of GPUs and the list of processes that are using them. From that list, I know which GPUs are free (not allocated to any process), and then I set the environment accordingly.
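The query-and-select step can be sketched roughly as below. This is only a minimal sketch, not the original auto_acquire_gpus: the helper names `query_gpu_processes` and `select_free_gpus` are hypothetical, and it assumes that `gpustat.new_query()` returns an iterable of per-GPU entries exposing `index` and `processes`, which is how I understand the gpustat Python API.

```python
def query_gpu_processes():
    """Return {gpu_index: [pid, ...]} for every GPU on this machine.

    Hypothetical helper. Assumes gpustat is installed; it is imported
    lazily so the selection logic below can run on a machine without GPUs.
    """
    import gpustat

    stats = gpustat.new_query()
    return {
        # `processes` may be None when the driver cannot report them.
        gpu.index: [proc["pid"] for proc in (gpu.processes or [])]
        for gpu in stats
    }


def select_free_gpus(gpu_processes, num_gpus):
    """Pick `num_gpus` GPU indices that have no process attached.

    Hypothetical helper. Raises RuntimeError when there are not enough
    free GPUs, so the script fails and the scheduler can retry it later.
    """
    free = sorted(idx for idx, pids in gpu_processes.items() if not pids)
    if len(free) < num_gpus:
        raise RuntimeError(
            "requested %d GPUs but only %d are free" % (num_gpus, len(free))
        )
    return free[:num_gpus]
```

For example, `select_free_gpus({0: [4242], 1: [], 2: []}, 2)` returns `[1, 2]`, while asking for three GPUs from the same snapshot raises `RuntimeError`. Keeping the selection separate from the gpustat call makes it easy to check without real hardware.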
There are only two catches here:
- Set CUDA_DEVICE_ORDER to PCI_BUS_ID so that CUDA enumerates the devices in the same order that nvidia-smi reports.
- Run a simple TensorFlow operation so that the process really acquires the devices.
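The two catches can be sketched as follows. Again, this is a hedged sketch rather than the original code: `set_visible_gpus` and `touch_gpus` are hypothetical names, and the TensorFlow part assumes TensorFlow 2 running in eager mode (with TensorFlow 1.x the operation would need to run inside a session instead).

```python
import os


def set_visible_gpus(gpu_ids, env=os.environ):
    """Expose only `gpu_ids` to CUDA (hypothetical helper).

    PCI_BUS_ID makes CUDA number the devices the same way nvidia-smi does,
    so the indices picked from the query refer to the same physical GPUs.
    """
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)


def touch_gpus():
    """Run a trivial op on every visible GPU so the process shows up on it.

    Hypothetical helper, assuming TensorFlow 2 eager execution. TensorFlow
    is imported here, after the environment variables have been set.
    """
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    # Visible devices are renumbered from 0 inside the process.
    for i in range(len(gpus)):
        with tf.device("/GPU:%d" % i):
            tf.reduce_sum(tf.ones((1,)))
    return len(gpus)
```

The environment variables need to be set before CUDA is initialized in the process, which is why the sketch imports TensorFlow inside `touch_gpus` rather than at the top of the file. Passing a plain dict as `env` makes it possible to check the variables without touching the real environment.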
It works surprisingly well. There has been no problem so far. I'm still working on using Kubernetes for isolation, since this solution is prone to race conditions (two scripts can query at the same moment and pick the same free GPUs) and does not truly provide isolation. However, it is still an easy and cheap way to get something done. I hope it is useful for someone 😀