How to Cross Compile OpenCV and MXNET for NVIDIA Jetson (AArch64 CUDA)

Cyrus Behroozi
Published in Trueface
May 26, 2021 · 5 min read

A technical step-by-step guide.

NVIDIA Jetson Xavier NX

Today I’ll be discussing how to cross compile popular computer vision and machine learning frameworks such as OpenCV and MXNET for AArch64 CUDA targets. The most popular example of this target is the NVIDIA Jetson, a GPU-enabled embedded System on Module (SOM). Despite the NVIDIA Jetson being widely used, I’ve found that there isn’t clear or sufficient documentation for cross compiling for this target, especially for novice programmers who may require a step-by-step guide. Therefore, I have put together this guide to help fill in some of those knowledge gaps.

Before we dive into all things technical, let us begin with a quick recap of cross compiling.

What does it mean to cross compile?

Cross compilation is the act of compiling code for a platform (often known as the target) other than the one on which the compiler is running (often known as the host).

Why cross compile instead of native compile on the target?

The most common targets for cross compilation are lightweight embedded devices. Although it is often possible to native compile on the target platform, doing so will be extremely slow when compared to cross compiling on a more powerful x86 device — the smaller memory, less powerful CPUs, and slower disks of embedded devices are to blame.

As an example, I went ahead and native compiled MXNET with CUDA support on the NVIDIA Jetson Xavier NX (their most powerful Jetson product) using the max power settings. The native compilation took a whopping 18 hours to complete! However, using an x86 laptop, I was able to cross compile the same library for the same target in less than an hour.

What are some of the challenges with cross compiling?

There are a few nuances associated with cross compiling which can make it more challenging than native compilation. For starters, the appropriate toolchain (compiler + linker + librarian + any other tools) must be installed for the cross compilation. Things become more complicated when the executable or library that is to be cross compiled has additional dependencies. These dependency libraries must either be cross compiled as well or installed for the specific target if available using a tool such as Multiarch (more on Multiarch here). Sometimes the target platform filesystem must even be copied onto the host machine.
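As a quick illustration of the Multiarch route (the package name below is purely a placeholder), registering the foreign architecture and installing a target-architecture build of a dependency looks roughly like this:

    # Register arm64 as a foreign architecture (on Ubuntu you may also need
    # arm64 entries pointing at ports.ubuntu.com in your apt sources)
    sudo dpkg --add-architecture arm64
    sudo apt-get update
    # "libexample-dev" is a hypothetical package name used only for illustration
    sudo apt-get install -y libexample-dev:arm64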

TL;DR

If you want to skip the step-by-step guide and just view the final results, visit the following GitHub repository, which contains the complete Dockerfiles, build scripts, and toolchain files discussed below.

Let’s Dive into the Step-by-Step Guide

Many dependencies must be installed before the libraries can be cross compiled; therefore, it is best to work in a Docker container to avoid “dirtying” the local environment. The following steps give an overview of the Dockerfile which installs all the dependencies and build tools required for the build environment.

Creating the build environment Dockerfile

The first step is to choose the base Docker image. We will be using the nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 image, which comes with CUDA 10.2 and cuDNN 7 preinstalled. Note that we are using the devel image, which extends the runtime image by adding the CUDA compiler toolchain, x86 CUDA header files, and x86 CUDA static libraries.
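In the Dockerfile itself this is just the first line, shown here for completeness:

    # devel variant: nvcc plus the x86 CUDA headers and static libraries
    FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04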

Sounds like we are ready to go? Unfortunately not. Although the above image contains the CUDA compiler that we require (nvcc), we still need the AArch64 CUDA toolkit which contains the AArch64 CUDA headers and AArch64 CUDA libraries (which we will install later).

Next, we install a few of the build tools that will be used, including cmake, ccache, and aarch64-linux-gnu-g++ (installed through the crossbuild-essential-arm64 package). We also specify the toolchain file to be used for the cross compilation as an environment variable. The relevant portion of the Dockerfile and the contents of the toolchain file are shown below.
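Here is a sketch of that portion of the Dockerfile; the environment variable name and the toolchain file path are illustrative rather than taken from the original repository:

    RUN apt-get update && apt-get install -y \
        cmake \
        ccache \
        crossbuild-essential-arm64

    # Variable name and path are illustrative; the build scripts read this to locate the toolchain file
    COPY toolchain-aarch64-cuda.cmake /opt/toolchain-aarch64-cuda.cmake
    ENV CMAKE_TOOLCHAIN_FILE=/opt/toolchain-aarch64-cuda.cmake

The toolchain file itself looks roughly like the following (a representative reconstruction; the exact contents live in the repository linked above):

    # Describe the target system
    set(CMAKE_SYSTEM_NAME Linux)
    set(CMAKE_SYSTEM_PROCESSOR aarch64)

    # Cross compilers
    set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
    set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)

    # Instruct nvcc to compile host-side code with the cross compiler
    set(CUDA_HOST_COMPILER aarch64-linux-gnu-gcc)

    # Search the target sysroot for libraries and headers, but keep using host programs
    set(CMAKE_FIND_ROOT_PATH /usr/aarch64-linux-gnu)
    set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
    set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
    set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)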

As can be seen, we specify aarch64-linux-gnu-gcc as the cross compiler and also instruct cmake to use it as the CUDA host compiler.

As stated in their documentation, MXNET relies on a BLAS (Basic Linear Algebra Subprograms) library for numerical computations. Although several options are supported, we will be using OpenBLAS, which is well suited to embedded systems. We must therefore cross compile OpenBLAS by invoking the make command and specifying the cross compiler to be used.
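A minimal sketch of that step, assuming the cross compiler from the toolchain file above and an install prefix of /opt/openblas (both choices are illustrative):

    git clone https://github.com/xianyi/OpenBLAS.git
    cd OpenBLAS
    # TARGET=ARMV8 selects the ARMv8 kernels; HOSTCC is the native compiler used for build-time helper tools
    make -j"$(nproc)" TARGET=ARMV8 CC=aarch64-linux-gnu-gcc HOSTCC=gcc NOFORTRAN=1
    make TARGET=ARMV8 CC=aarch64-linux-gnu-gcc HOSTCC=gcc NOFORTRAN=1 PREFIX=/opt/openblas install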

In this step, which is arguably the most important, we install the cuda-cross-aarch64 and cuda-cross-aarch64-10-2 packages, which contain the AArch64 CUDA toolkit. You may notice that if you simply try running apt-get install -y cuda-cross-aarch64 on the base image, the package won't be found. For this reason, we must manually add the Debian packages using dpkg -i. Finally, we create a symbolic link so that libcublas.so is found at the expected location.
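A sketch of this step is shown below. The .deb file name and the library paths are placeholders; the real ones come from the CUDA 10.2 cross packages distributed with your JetPack release:

    # Placeholder .deb name; use the cross repository package from your JetPack download
    COPY cuda-repo-cross-aarch64-10-2-local_*.deb /tmp/
    RUN dpkg -i /tmp/cuda-repo-cross-aarch64-10-2-local_*.deb && \
        apt-get update && \
        apt-get install -y cuda-cross-aarch64 cuda-cross-aarch64-10-2

    # Symbolic link so libcublas.so is found where cmake expects it (exact paths vary by release)
    RUN ln -sf /usr/lib/aarch64-linux-gnu/libcublas.so \
        /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so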

At this point, our Dockerfile is ready and can be built into a Docker image. An instance of the image can then be run as a container, and we can proceed with cross compiling our libraries inside it.
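For example (the image and container names are arbitrary):

    docker build -t aarch64-cuda-cross .
    docker run -it --name cross-build aarch64-cuda-cross /bin/bash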

Cross Compiling OpenCV

The following build script can be used to cross compile OpenCV in the Docker container.
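A condensed sketch of such a script is shown here. The complete script, which the line numbers in the notes below refer to, lives in the repository linked above; version pins, the line-14 workaround, and install details are omitted:

    git clone https://github.com/opencv/opencv.git
    git clone https://github.com/opencv/opencv_contrib.git
    mkdir -p opencv/build && cd opencv/build

    # 7.2 is the compute capability of the Jetson Xavier NX / AGX Xavier
    cmake \
      -DCMAKE_TOOLCHAIN_FILE=../platforms/linux/aarch64-gnu.toolchain.cmake \
      -DOPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules \
      -DWITH_CUDA=ON \
      -DCUDA_ARCH_BIN=7.2 \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=/opt/opencv-aarch64 \
      ..
    make -j"$(nproc)"
    make install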

A few things to note:

  • In order to build OpenCV with CUDA support, the OpenCV contrib package must be installed. In the cmake command, the path to the contrib package must be specified: -DOPENCV_EXTRA_MODULES_PATH=....
  • On line 14, I had to apply a hack to resolve the issue discussed here.
  • The OpenCV provided toolchain file is used instead of the toolchain file that was added to the Docker image.
  • The -DCUDA_ARCH_BIN argument is used to specify the CUDA compute capability that will be supported (learn more about compute capability here).

Cross Compiling MXNET

The following build script can be used to cross compile MXNET:
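A condensed sketch of the MXNET build is below. The full script that the notes reference by line number is in the repository, and cmake option names can vary between MXNET versions, so treat these flags as indicative rather than exact:

    git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
    mkdir -p mxnet/build && cd mxnet/build

    # Point MXNET at the OpenBLAS we cross compiled earlier (path from the OpenBLAS step above)
    export OpenBLAS_HOME=/opt/openblas

    cmake \
      -DCMAKE_TOOLCHAIN_FILE=/opt/toolchain-aarch64-cuda.cmake \
      -DUSE_CUDA=ON \
      -DUSE_OPENCV=OFF \
      -DUSE_OPENMP=ON \
      -DUSE_LAPACK=OFF \
      -DUSE_CPP_PACKAGE=ON \
      ..
    make -j"$(nproc)"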

Here are a few things to note on the above:

  • As per the bug described in this issue, line 15 of mxnet/cpp-package/CMakeLists.txt must be removed.
  • Since OpWrapperGenerator.py is disabled as a result of the above fix, op.h must be pre-generated and copied to the destination directory before running the build.
  • The toolchain file which was copied into the Docker container is used.

If you have followed along this far, you should have successfully compiled MXNET and OpenCV for the AArch64 CUDA target. At this point, you can write your main application and compile and link against the cross compiled libraries.
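For example, a main application could be compiled with the same cross compiler and linked against the cross compiled libraries (all paths here are placeholders):

    aarch64-linux-gnu-g++ main.cpp \
      -I/opt/opencv-aarch64/include/opencv4 \
      -I/path/to/mxnet/include \
      -L/opt/opencv-aarch64/lib -lopencv_core -lopencv_imgproc \
      -L/path/to/mxnet/build -lmxnet \
      -o main_app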

Here at Trueface, we automate the compilation of our Trueface SDK as part of our CI pipeline. We apply many of the techniques explained above in order to compile for our supported target platforms which include x86, x86 CUDA, AArch64, AArch32, and AArch64 CUDA. To learn more about what we are up to, check us out or feel free to reach out.
