Configure Docker to Use Local GPU for Training ML Models
In today’s machine learning development, it is common to package the training application into a container, which is then deployed to compute infrastructure for training. Before distributing the container image, however, it is important to run a simple local test to make sure everything works correctly.
In this guide, I will explain how to configure your local machine to run a Docker container with access to your on-premise GPU devices. I will demonstrate the setup process on an Ubuntu 20.04 machine equipped with an Nvidia RTX 2060 (12 GB) GPU, CUDA version 11.8, and cuDNN version 8.6.0.
Here’s a step-by-step guide to achieving this:
Prerequisites
- Docker
- Nvidia driver
- CUDA Toolkit
- cuDNN
- NVIDIA Container Toolkit
Install Docker
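Follow the official documentation to install Docker Engine on Ubuntu. Docker Desktop also bundles Docker Engine together with the Docker CLI client, Docker Compose, and other tools for building and sharing containerized apps, but on Linux it runs containers inside a virtual machine, and Docker's documentation lists GPU support in Docker Desktop only for Windows with the WSL 2 backend. Because the NVIDIA Container Toolkit steps below configure the host's Docker Engine, install Docker Engine for this setup.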
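The NVIDIA Container Toolkit is configured against the Docker Engine daemon on the host, so it is worth checking which daemon your docker CLI actually talks to. The sketch below assumes that Docker Desktop reports its operating system as "Docker Desktop" in docker info, whereas Docker Engine reports the host distribution (for example Ubuntu 20.04); using_docker_desktop is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: detect whether the docker CLI is talking to Docker Desktop or to Docker Engine.
# Assumption: Docker Desktop reports "Docker Desktop" as its operating system in docker info.

# Hypothetical helper: succeeds (exit 0) when the active daemon is Docker Desktop.
using_docker_desktop() {
  local os
  os="$(docker info --format '{{.OperatingSystem}}' 2>/dev/null)" || return 1
  [[ "$os" == "Docker Desktop" ]]
}

if using_docker_desktop; then
  echo "Docker Desktop detected; switch to Docker Engine with: docker context use default"
else
  echo "docker CLI is not using Docker Desktop (or no daemon is reachable)"
fi
```

If Docker Desktop is detected, docker context use default points the CLI back at the host's Docker Engine.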
Install Nvidia Driver
Before installing the Nvidia driver, ensure that the driver version is compatible with the CUDA Toolkit you intend to install. You can check the official documentation for information on compatibility. To determine the required version of the CUDA Toolkit, refer to the machine learning framework you will be using. More details are provided in the CUDA Toolkit section below.
After that, go to the driver download page and enter your machine’s specifications. Click the “Search” button, then download the driver that the search returns.
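Once the driver is installed, nvidia-smi can report its version so you can compare it with the minimum listed in NVIDIA's CUDA compatibility table. The sketch below compares versions with sort -V. The threshold 520.61.05 is the Linux minimum that NVIDIA's release notes list for CUDA 11.8, but treat it as an assumption and look up the value for your own CUDA version; driver_meets_minimum is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: check that the installed NVIDIA driver is new enough for a CUDA release.
# MIN_DRIVER is an assumption (CUDA 11.8 on Linux); look up the value for your version.
MIN_DRIVER="520.61.05"

# Hypothetical helper: succeeds when version $1 >= version $2.
driver_meets_minimum() {
  local installed="$1" required="$2"
  # sort -V orders version strings; if the required version sorts first
  # (or both are equal), the installed version is at least the required one.
  [[ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)" == "$required" ]]
}

# Query the driver version only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if driver_meets_minimum "$installed" "$MIN_DRIVER"; then
    echo "Driver $installed is OK (>= $MIN_DRIVER)"
  else
    echo "Driver $installed is older than $MIN_DRIVER"
  fi
fi
```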
Install CUDA Toolkit
On an Ubuntu machine, it is advisable to install the necessary system packages, such as "build-essential", before proceeding with the CUDA Toolkit installation.
sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev
Before installing the CUDA Toolkit, double-check the versions of cuDNN and the CUDA Toolkit required by the machine learning framework you intend to use. This guide follows TensorFlow's tested build configurations; for the TensorFlow version used here, that means CUDA Toolkit 11.8 and cuDNN 8.6.
After verifying the version requirements, proceed to the download page and select your machine’s specifications. Once you have selected the appropriate settings, click the “deb(network)” button to obtain the script for installing the CUDA Toolkit.
At this point, you need to modify the installation script to specify the version of the CUDA Toolkit you want to install. Below is the original script you are likely to receive after specifying your machine’s specifications:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
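Below is an example of installing CUDA Toolkit 11.8. Change the last line by appending the version number to cuda, with a hyphen in place of the dot (cuda-11-8), which is how NVIDIA's repository names its versioned meta-packages. Note that cuda-11-8 installs the driver packages as well; if you already installed the driver in the previous step, cuda-toolkit-11-8 installs only the toolkit.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-11-8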
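The repository's package names use hyphens (cuda-11-8) while the install directory uses a dot (/usr/local/cuda-11.8), which is an easy place to slip up. The sketch below derives the package name from a version string and, if apt-cache is available, lists the builds of that package the configured repositories offer. cuda_meta_package is a hypothetical helper, and the cuda-<major>-<minor> naming is assumed from NVIDIA's repository.

```bash
#!/usr/bin/env bash
# Sketch: derive the apt meta-package name for a CUDA version such as "11.8".
# Assumes NVIDIA's "cuda-<major>-<minor>" naming for versioned meta-packages.
cuda_meta_package() {
  local version="$1"
  echo "cuda-${version//./-}"   # replace every dot with a hyphen
}

CUDA_VERSION="11.8"
pkg="$(cuda_meta_package "$CUDA_VERSION")"

# List the versions of the package available from the configured repositories.
if command -v apt-cache >/dev/null 2>&1; then
  apt-cache madison "$pkg" || true
fi
echo "Install with: sudo apt-get -y install $pkg"
```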
Afterward, set PATH and LD_LIBRARY_PATH to point to the CUDA Toolkit you just installed, which in this case lives in /usr/local/cuda-11.8. If you installed a different version, change the paths to match so that your system can locate and use the installed CUDA Toolkit correctly.
echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
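Running the two echo lines more than once appends duplicate entries to ~/.bashrc. A minimal sketch that writes the same two lines only if they are not already present, parameterized by the toolkit directory; append_once and add_cuda_paths are hypothetical helpers. Because the lines are identical to the ones written above, rerunning after the echo commands adds nothing.

```bash
#!/usr/bin/env bash
# Sketch: add the CUDA paths to a shell startup file without creating duplicates.
CUDA_HOME_DIR="/usr/local/cuda-11.8"   # adjust to the toolkit version you installed

# Hypothetical helper: append line $1 to file $2 unless an identical line is already there.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: write both exports to the given rc file (e.g. ~/.bashrc).
# The escaped \$PATH and \$LD_LIBRARY_PATH stay literal, so they expand when the shell starts.
add_cuda_paths() {
  local rcfile="$1"
  append_once "export PATH=${CUDA_HOME_DIR}/bin:\$PATH" "$rcfile"
  append_once "export LD_LIBRARY_PATH=${CUDA_HOME_DIR}/lib64:\$LD_LIBRARY_PATH" "$rcfile"
}

# Example usage in an interactive shell:
#   add_cuda_paths ~/.bashrc && source ~/.bashrc
```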
Once you have completed these steps and set up the environment variables, reboot the machine. Rebooting loads any driver packages the installation pulled in and ensures the new environment variables take effect (for the variables alone, opening a new shell or running source ~/.bashrc is enough). After the reboot, your machine should be ready to use the installed CUDA Toolkit and GPU for machine learning tasks.
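After the reboot, nvcc --version should report the toolkit you installed. Its output includes a line such as "Cuda compilation tools, release 11.8, V11.8.89"; the sketch below pulls the release number out of that line so a script can compare it with the version your framework expects. nvcc_release is a hypothetical helper, and the exact output format is an assumption based on recent nvcc releases.

```bash
#!/usr/bin/env bash
# Sketch: extract the CUDA release (e.g. "11.8") from `nvcc --version` output.
# Reads the output on stdin so it can be checked without a GPU.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n1
}

EXPECTED="11.8"
if command -v nvcc >/dev/null 2>&1; then
  found="$(nvcc --version | nvcc_release)"
  echo "nvcc reports CUDA $found (expected $EXPECTED)"
else
  echo "nvcc not on PATH; check the PATH export in ~/.bashrc"
fi
```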
Install cuDNN
The installation of cuDNN is relatively straightforward, involving copying specific files to the CUDA Toolkit’s include and lib64 directories. To download cuDNN, visit the cuDNN Archive page, and make sure the cuDNN version you download matches the one specified by your machine learning framework.
Download the cuDNN package as a tar file and extract its contents once the download is complete. After extraction, run the following commands to copy the necessary files into the appropriate CUDA Toolkit directories, making sure the destination paths point to your CUDA Toolkit installation directory.
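# Copy all cuDNN headers: cuDNN 8 splits its API across several cudnn*.h files that cudnn.h includes
sudo cp <extracted_cudnn_path>/include/cudnn*.h /usr/local/cuda-11.8/include
# Copy the libraries; -P preserves the symlinks. Recent cuDNN 8 tar archives keep them in lib/ rather than lib64/
sudo cp -P <extracted_cudnn_path>/lib64/libcudnn* /usr/local/cuda-11.8/lib64/
sudo chmod a+r /usr/local/cuda-11.8/include/cudnn*.h /usr/local/cuda-11.8/lib64/libcudnn*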
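To confirm which cuDNN version ended up in the toolkit directory, you can read the version macros from the headers. In cuDNN 8 these live in cudnn_version.h, so this check only works if that header was copied along with cudnn.h. The sketch assumes the /usr/local/cuda-11.8/include path used by the copy commands; cudnn_header_version is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: print the cuDNN version recorded in cudnn_version.h (cuDNN 8 header layout).
# The header path is an assumption matching the copy commands for CUDA 11.8.
CUDNN_HEADER="/usr/local/cuda-11.8/include/cudnn_version.h"

# Hypothetical helper: read the "#define CUDNN_<FIELD> <n>" lines and join them as major.minor.patch.
cudnn_header_version() {
  local header="$1"
  awk '
    $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
    $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
    $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
    END { if (major != "") printf "%s.%s.%s\n", major, minor, patch }
  ' "$header"
}

if [[ -r "$CUDNN_HEADER" ]]; then
  echo "cuDNN $(cudnn_header_version "$CUDNN_HEADER")"
else
  echo "Header not found: $CUDNN_HEADER"
fi
```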
Now, you have both CUDA Toolkit and cuDNN installed.
Install Nvidia Container Toolkit
To let your Docker containers use the on-premise GPU devices, you need to set up the Nvidia Container Toolkit. The commands below target Docker on Ubuntu; for other container engines, follow the official guide to install the toolkit.
First, add the NVIDIA Container Toolkit package repository and its GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Then install the NVIDIA Container Toolkit (this package also pulls in nvidia-container-toolkit-base, which provides the nvidia-ctk command):
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
To validate your installation, run the following command:
nvidia-ctk --version
Next, configure the Docker daemon to recognize the NVIDIA Container Runtime. The following command updates the Docker daemon configuration file (/etc/docker/daemon.json by default) for you, so there is no need to edit it by hand:
sudo nvidia-ctk runtime configure --runtime=docker
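The configure command registers a runtime named nvidia in the daemon configuration, typically as a "runtimes" entry whose "path" is nvidia-container-runtime. The sketch below is a rough text-based check under that assumption (a JSON-aware tool such as jq would be stricter); has_nvidia_runtime is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: check that the Docker daemon config registers the NVIDIA runtime.
# Assumes the default config path and the "nvidia-container-runtime" path
# written by `nvidia-ctk runtime configure --runtime=docker`.
DAEMON_JSON="/etc/docker/daemon.json"

# Hypothetical helper: succeeds if the file mentions an "nvidia" runtime whose
# executable is nvidia-container-runtime (plain text match, not a JSON parse).
has_nvidia_runtime() {
  local file="$1"
  [[ -r "$file" ]] &&
    grep -q '"nvidia"' "$file" &&
    grep -q 'nvidia-container-runtime' "$file"
}

if has_nvidia_runtime "$DAEMON_JSON"; then
  echo "NVIDIA runtime registered in $DAEMON_JSON"
else
  echo "NVIDIA runtime not found in $DAEMON_JSON; rerun nvidia-ctk runtime configure"
fi
```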
Finally, restart the Docker daemon by running the following command:
sudo systemctl restart docker
Test the Setup
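After completing the configuration of the Nvidia Container Toolkit and Docker, you can test your setup by running nvidia-smi inside a base CUDA container. If everything is configured correctly, the output shows the same GPU table that nvidia-smi prints on the host.
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi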
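To script this check, for example before launching a long training job, you can count the GPUs that nvidia-smi -L reports from inside the container. The helper name count_container_gpus is hypothetical, and matching lines that start with "GPU " assumes nvidia-smi's usual -L format of one "GPU <index>: <name> (UUID: ...)" line per device. Prefix docker with sudo if your user is not in the docker group.

```bash
#!/usr/bin/env bash
# Sketch: count the GPUs visible inside a CUDA container.
# Assumes nvidia-smi -L prints one "GPU <index>: ..." line per device.

# Hypothetical helper: $1 is the image to test, e.g. a CUDA base image.
count_container_gpus() {
  local image="$1"
  docker run --rm --runtime=nvidia --gpus all "$image" nvidia-smi -L 2>/dev/null \
    | grep -c '^GPU '
}

# Example usage (pulls the image on first run):
#   gpus="$(count_container_gpus nvidia/cuda:11.6.2-base-ubuntu20.04)"
#   [[ "$gpus" -ge 1 ]] && echo "Container sees $gpus GPU(s)"
```

On the single-GPU machine used in this guide, the count should be 1; a count of 0 points back at the driver or the container runtime configuration.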
Dockerfile
For reference, below is the Dockerfile used for the training image.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
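# No NVIDIA driver is installed in the image: the NVIDIA Container Toolkit mounts the
# host's driver libraries at run time, and a second copy in the image can cause version mismatches.
RUN apt-get update --yes --quiet && DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
software-properties-common \
build-essential apt-utils \
wget curl vim git ca-certificates kmod \
&& rm -rf /var/lib/apt/lists/*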
RUN add-apt-repository --yes ppa:deadsnakes/ppa && apt-get update --yes --quiet
RUN DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
python3.10 \
python3.10-dev \
python3.10-distutils \
python3.10-lib2to3 \
python3.10-gdbm \
python3.10-tk \
pip
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 999 \
&& update-alternatives --config python3 && ln -s /usr/bin/python3 /usr/bin/python
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
COPY requirements.txt /requirements.txt
COPY finetune.py /finetune.py
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install --no-cache-dir -r /requirements.txt
ENTRYPOINT [ "python3", "finetune.py" ]
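To build this image and run it against the local GPU, a sketch like the one below can be run from the directory containing the Dockerfile, requirements.txt, and finetune.py. The tag finetune-local:latest is a hypothetical name. Because the Dockerfile uses an exec-form ENTRYPOINT of python3 finetune.py, any arguments given after the image name in docker run are passed on to finetune.py.

```bash
#!/usr/bin/env bash
# Sketch: build the training image and run it on the local GPU.
# IMAGE_TAG is a hypothetical name; extra arguments are forwarded to finetune.py
# because the Dockerfile's ENTRYPOINT is ["python3", "finetune.py"].
IMAGE_TAG="finetune-local:latest"

build_image() {
  docker build -t "$IMAGE_TAG" .
}

run_training() {
  docker run --rm --gpus all "$IMAGE_TAG" "$@"
}

# Example usage, from the directory containing the Dockerfile:
#   build_image && run_training
```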