Building a GPU-enabled Kubernetes cluster for machine learning with NVidia Jetson Nano
As you may know, the Jetson Nano is a low-cost ($99) single-board computer intended for IoT-type use cases. Among many similar devices, its key selling point is a fully featured GPU that is compatible with the NVidia CUDA libraries. This might not seem like a big deal, but in practice it is: CUDA is the de facto standard for modern machine learning computation. Typically it is used with GeForce, Quadro or Tesla boards, or with high-end workstations and servers produced by NVidia, which are highly performant yet costly and power-hungry.
To cut a long story short: with a cheap, CUDA-equipped device at hand, we thought, let’s build our own machine learning cluster. These days, if you think “cluster” you typically think “Kubernetes”. Kubernetes, originally created by Google, is a very commonly used tool for managing distributed applications running on hundreds, thousands or maybe even hundreds of thousands of machines.
We were not aiming that far with our project. Our cluster is composed of four Jetson Nano machines. Below is a detailed guide on how we built and configured the working cluster. It is applicable to any number of Jetson Nanos, so if you have fewer or more than four, you’ll be fine. You should, however, have at least two of them to call it a real cluster.
Without further ado, here are the instructions.
What is needed?
You need to have:
- 4 x Jetson Nano boards from NVidia,
- 4 x high speed micro-SD card with at least 16 GB each (the faster these SD cards the better — you really should use fast cards),
- 4 x external power supplies — although Jetson Nano can be powered with a standard USB power supply, it is highly advised to use DC barrel PSU as described here: https://www.jetsonhacks.com/2019/04/10/jetson-nano-use-more-power/ — this will be really important if you’ll be running compute intensive tasks,
- a 1 Gbps Ethernet switch to connect the Jetsons (we used 5-port and 8-port desktop switches), plus of course UTP cables to wire everything,
- optionally, a nice case to fit the Nanos — we 3D printed a modified variant of this case: https://cults3d.com/en/3d-model/tool/jetson-nano-case.
In addition, we assume that all of your Nanos will be able to access the Internet to download additional software packages during the installation.
Preparing the SD cards is straightforward. To boot the Nanos you need the base operating system from NVidia: download NVidia Jetpack version 4.2.1 or higher. At the time of writing, the current version is 4.2.2, and you can get it for free here: https://developer.nvidia.com/embedded/jetpack. Instructions on how to write the downloaded image to the SD cards can be found here: https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#write (you’ll need a computer with an SD card slot to do it).
Now, the very important thing here is to be sure that you use version 4.2.1 or higher. Earlier versions do not have GPU support for Docker based containers, which is strictly required for our plan to work out.
From now on, we’ll assume that you have your SD cards ready, with a fresh setup of NVidia Jetpack ≥ 4.2.1, and that all your Nanos can boot up from these cards.
Be sure to check out the current “Getting Started” document from NVidia to get familiar with the machines and their system: https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit.
To interact with the Nanos, you can either use an external monitor and a keyboard or, more conveniently, a remote SSH connection. For the latter you need to figure out the IP addresses assigned by your local DHCP server. If unsure, plug in an external monitor and check the system settings. Jetpack is an Ubuntu-based system, so the initial setup should be straightforward.
Configuring the base system
These steps should be repeated on each of the Nanos:
- Disable the GUI mode, which is enabled by default and consumes resources:
sudo systemctl set-default multi-user.target
Keep in mind that by doing this, your Jetson Nano will boot to text mode only. If you need to, you can reverse this by resetting the default system mode to “graphical.target”.
- Make sure that the Nano is in its high-power (10W) mode:
sudo nvpmodel -m 0
This is typically the default setup, but it is better to check. The low-power (5W) mode decreases the computational performance of the board.
- Disable swap, because by default the kubelet refuses to start while swap is enabled (note that this command only lasts until the next reboot):
sudo swapoff -a
- Set the NVidia runtime as the default runtime in Docker. To do this, edit the /etc/docker/daemon.json file so that it looks like this:
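A minimal sketch of such a file, assuming the nvidia-container-runtime binary that ships with Jetpack is on the PATH. The local file name daemon.json and the helper install_daemon_json are illustrative choices of ours, so compare the result with the NVidia wiki page linked below before relying on it:

```shell
# Write the Docker daemon configuration to a local file first, so it can be
# reviewed before it replaces /etc/docker/daemon.json.
cat > daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Hypothetical helper: install the file and restart Docker so that the new
# default runtime takes effect. Call it on each Nano once the file looks right.
install_daemon_json() {
    sudo cp daemon.json /etc/docker/daemon.json
    sudo systemctl restart docker
}
```

Writing to a local file first keeps an existing /etc/docker/daemon.json untouched until you have checked the new contents.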
Among the settings presented here, this one is absolutely crucial. You will experience a lot of issues if you fail to set it correctly. For details on why this is needed see, for example, https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime. By changing the default runtime, you make sure that every Docker command and every Docker-based tool is allowed to access the GPU.
- Just to be sure, update your system to fresh versions of the installed packages:
sudo apt-get update
sudo apt-get dist-upgrade
- Add the current user to the docker group to use the docker command without sudo, following this guide: https://docs.docker.com/install/linux/linux-postinstall/. The required commands are as follows:
sudo groupadd docker
sudo usermod -aG docker $USER
- After all this, it is highly advised to reboot the system. Keep in mind that it should reboot into text mode only, which is good!
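After the reboot, the settings from this section can be verified in one go. The following is only a sketch: the function names are ours, and it assumes that docker info exposes the default runtime under the DefaultRuntime field. Because swapoff -a does not survive a reboot, the swap check is the one most likely to fail; on Jetpack images swap is, as far as we know, set up by the nvzramconfig service, so disabling that service (sudo systemctl disable nvzramconfig) is one commonly used fix.

```shell
# Succeed if swap is fully disabled, given a file in /proc/swaps format.
# The first line of that file is a header; every further line is an active swap area.
swap_is_off() {
    swaps_file="${1:-/proc/swaps}"
    [ "$(tail -n +2 "$swaps_file" | grep -c .)" -eq 0 ]
}

# Print a short summary of the settings changed in this section.
check_base_config() {
    echo "default target: $(systemctl get-default)"   # expected: multi-user.target
    echo "power mode (mode 0 is the 10W mode):"
    sudo nvpmodel -q
    if swap_is_off; then
        echo "swap:           off"
    else
        echo "swap:           still on"
    fi
    echo "docker runtime: $(docker info --format '{{.DefaultRuntime}}')"   # expected: nvidia
}
```

Running check_base_config on each Nano should report multi-user.target, power mode 0, swap off and the nvidia runtime.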
Test Docker GPU support
At this stage, we are ready to test if Docker runs correctly and supports GPU.
To make this easier, we created a dedicated Docker image with the “deviceQuery” tool from the CUDA SDK, which queries the GPU and presents its capabilities. The command to run it is simple:
docker run -it jitteam/devicequery ./deviceQuery
If your setup is correct, you should see an output similar to this:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3964 MBytes (4156932096 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
If you see “Result = PASS” at the end of the output, everything should be fine and you can proceed. If not, stop and debug the issue (or ask us questions in the comments)!
Setting up Kubernetes
Static IP addressing
To make the network setup easier, it is advisable to give the Jetsons static IP addresses before setting up Kubernetes. This is not strictly required, but it’ll make your life easier. You can achieve this in different ways: configure your DHCP server to assign fixed addresses based on MAC addresses, or manually configure the network on each of the boards. One option is to use the netplan utility, following the guide here: https://netplan.io/examples#using-dhcp-and-static-addressing. Before following the guide, make sure to install netplan via apt-get:
sudo apt-get install netplan.io
And after installing and creating the netplan configuration, remember to “apply” it with:
sudo netplan apply
If you don’t like netplan, you can use a more traditional approach described here: https://linux.m2osw.com/setup-static-network-jetson-tx2 (the guide is for Jetson TX2, but it will also work for Jetson Nano).
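For the netplan route, the configuration for one board can be sketched as below. This is an illustration under assumptions, not our exact setup: the helper name render_netplan, the file name 01-static.yaml, the interface name eth0 and all addresses are hypothetical placeholders that you need to adapt to your own network.

```shell
# Print a netplan configuration that pins one interface to a static IPv4 address.
# Arguments: interface name, address in CIDR form, gateway, DNS server.
render_netplan() {
    cat <<EOF
network:
  version: 2
  ethernets:
    $1:
      dhcp4: false
      addresses: [$2]
      gateway4: $3
      nameservers:
        addresses: [$4]
EOF
}

# Hypothetical values for the first board; use a different address on each Nano.
render_netplan eth0 192.168.1.101/24 192.168.1.1 192.168.1.1 > 01-static.yaml
```

Copying the generated file into /etc/netplan/ and then running sudo netplan apply activates it. Depending on whether NetworkManager still manages the interface on your system, you may also need to set a renderer in the file.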
Summing up, from now on we assume that we have our four Jetsons with the following static IP addresses (they may differ in your setup):
Now we are ready to install Kubernetes with all its dependencies. Run the following set of commands on every node, master and workers alike:
sudo apt-get install apt-transport-https -y
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni
Configuring master node
In its simplest setup, Kubernetes follows a master-worker architecture. We need to have one master node configured. In our case, this will be the jetson1 machine. So this step should only be executed on one of the Jetsons!
The cluster gets initialized with:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --kubernetes-version "1.15.2"
The output of this command is relatively complex, but we need to study it carefully. In our case it looked like this:
Note that at the bottom we have specific instructions on what to do next, to start using the cluster. The key part of the message is the “kubeadm join” command with the IP address, port and secret tokens (which will be different for your setup!). This is an essential command for us to use on other nodes (jetson2, jetson3 and jetson4).
So now, use the kubeadm join command on all your worker nodes! The command is exactly the same on each of these nodes.
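The general shape of the join command can be sketched as follows. The helper build_join_command is purely illustrative, and the address, token and hash in the example call are placeholders; the real values must be copied from the output of kubeadm init. The sketch assumes the default API server port 6443.

```shell
# Assemble the join command from its three variable parts:
# the master's IP address, the bootstrap token and the CA certificate hash.
build_join_command() {
    echo "sudo kubeadm join $1:6443 --token $2 --discovery-token-ca-cert-hash sha256:$3"
}

# Placeholder values only; they will not work on a real cluster.
build_join_command 192.168.1.101 abcdef.0123456789abcdef '<hash>'
```

If the original output is lost, or the token has expired (by default it is valid for 24 hours), running sudo kubeadm token create --print-join-command on the master prints a fresh join command.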
Complete your Kubernetes setup
Now on your master node (jetson1) you should be able to see the list of all the nodes of the cluster with:
kubectl get nodes
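If this returns an error message like “The connection to the server localhost:8080 was refused - did you specify the right host or port?”, kubectl cannot find the cluster credentials. Copy the admin configuration to your home directory and point kubectl at it:
sudo cp /etc/kubernetes/admin.conf $HOME/
sudo chown $(id -u):$(id -g) $HOME/admin.conf
export KUBECONFIG=$HOME/admin.conf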
… and retry. It should give a nice list of nodes. We can mark the new nodes as workers now:
kubectl label node jetson2 node-role.kubernetes.io/worker=worker
kubectl label node jetson3 node-role.kubernetes.io/worker=worker
kubectl label node jetson4 node-role.kubernetes.io/worker=worker
And now we should be able to see something like this:
… where of course the “AGE” column will be different in your case.
Running your first GPU-enabled Pod
Now we are ready to check whether a GPU-enabled Pod (the basic unit of deployment in Kubernetes) works. Create a file gpu-test.yaml with the following contents:
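A minimal sketch of such a manifest, written with a shell heredoc. The Pod name devicequery, the image and the command come from the rest of this guide; the container name and the restartPolicy setting are our assumptions, chosen because deviceQuery exits as soon as it has printed its report.

```shell
# Create gpu-test.yaml in the current directory.
cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  # Assumption: do not restart the container after deviceQuery has finished.
  restartPolicy: Never
  containers:
    - name: devicequery
      image: jitteam/devicequery
      command: ["./deviceQuery"]
EOF
```

No GPU resource is requested in the manifest, since access to the GPU comes from the NVidia default runtime configured in Docker earlier.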
As you can see we use the same Docker image as before to execute deviceQuery tool. Let’s submit it to the cluster for execution:
kubectl apply -f gpu-test.yaml
kubectl logs devicequery
The output should correspond to our earlier attempt to run “deviceQuery” in Docker, so we won’t copy it here again. Anyhow, look for “Result = PASS” at the end of the log.
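Right after kubectl apply the image may still be downloading, in which case kubectl logs has nothing to show yet. One way to handle this is to poll the Pod phase first, as in the sketch below. The helper wait_and_log is our own illustration, and it assumes the Pod is not restarted after it exits (restartPolicy set to Never or OnFailure); otherwise the phase never reaches Succeeded and the loop simply gives up after about five minutes.

```shell
# Wait until the given Pod has finished (or roughly 5 minutes have passed),
# then print its phase and its logs.
wait_and_log() {
    pod="$1"
    tries=0
    phase="Unknown"
    while [ "$tries" -lt 150 ]; do
        phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')"
        case "$phase" in
            Succeeded|Failed) break ;;
        esac
        tries=$((tries + 1))
        sleep 2
    done
    echo "Pod $pod is in phase: $phase"
    kubectl logs "$pod"
}
```

On the master node this would be called as wait_and_log devicequery.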
We are almost there! Now it is time to check if Tensorflow works well! For this let’s create another Pod. The YAML file tensorflow.yaml should look like this:
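A minimal sketch of this manifest, again written with a shell heredoc. The Pod name tf and the image are taken from the commands and description that follow; the container name and the sleep duration of 86400 seconds (one day) are assumptions on our side.

```shell
# Create tensorflow.yaml in the current directory.
cat > tensorflow.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: tf
spec:
  containers:
    - name: tf
      image: jitteam/jetson-nano-tf-gpu
      # Keep the container alive so that an interactive shell can be attached.
      # Assumption: one day (86400 seconds) is long enough for testing.
      command: ["sleep"]
      args: ["86400"]
EOF
```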
Note that the command here is quite strange. The Pod will run and … sleep! This is intended, as we are going to spin-up the Pod and then access it with an interactive session to check if things go smoothly.
The Docker image used here is our “jetson-nano-tf-gpu”. It is a small Docker image with a current version of GPU-enabled Tensorflow compiled for Jetson Nano. Note that since Jetson Nano is ARM64 based, standard Tensorflow Docker images WILL NOT WORK!
Our Docker image is published on Docker Hub (https://hub.docker.com/r/jitteam/jetson-nano-tf-gpu), but we can’t use Docker Hub’s build infrastructure to build it, since the image itself needs to be built on a Jetson Nano! For reference, you can see the underlying Dockerfile here: https://github.com/jit-team/jetson-nano/tree/master/docker/jetson-nano-tf-gpu. If you like, you can use it to create your own images.
Having our tensorflow.yaml ready, let us try to run it and access the Pod:
kubectl apply -f tensorflow.yaml
kubectl exec -it tf -- /bin/bash
The shell should spawn inside the running container, and from there we can verify if Tensorflow works and sees our GPU. Let’s execute:
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices());"
The output should look somewhat like this, listing a device of type GPU (such as /device:GPU:0) alongside the CPU:
This confirms that a Kubernetes-managed, Docker-hosted container with a fresh version of Tensorflow can indeed communicate with the GPU, which was our ultimate goal.
Having this working you are ready to go with your Tensorflow applications. Host your trained models or train new ones!
This completes the first part of our instructions. At this stage we have a very basic Kubernetes cluster with 3 GPU-enabled worker nodes and 1 master node, on which you can run machine learning workloads for inference and even training, using GPU-accelerated Tensorflow. Other popular machine learning frameworks should work as well (although building them from source may sometimes be needed).
What we still need to do is to start using the NVidia Device Plugin for Kubernetes (https://github.com/NVIDIA/k8s-device-plugin), which is highly recommended on high-load clusters as it can help to monitor the status of the available GPUs.
In our next tutorial we will show how to run popular Tensorflow benchmarks and models and also we will play around with Tensorflow Serving (https://www.tensorflow.org/tfx/guide/serving).