Optimizing AArch64 Edge Devices for Maximum ML Performance
For a deep learning engineer it is a comfortable task to train heavy models on cloud infrastructure such as Amazon AWS, which offers high-performance compute through EC2 and SageMaker along with features such as model hosting. But when it comes to deploying those same heavy models onto edge devices such as the Raspberry Pi 4 and ARMv8-based NVIDIA Jetson hardware, it becomes an arduous task. To overcome the performance problem on edge devices such as the Pi and the Jetson, I am sharing a few optimization techniques that can boost device performance and optimize deep learning models.
NVIDIA Jetson is the world’s leading embedded AI computing platform. Its high-performance, low-power computing for deep learning and computer vision makes it possible to build software-defined autonomous machines.
The Jetson platform includes small form-factor Jetson modules with GPU-accelerated parallel processing, the JetPack SDK with developer tools and comprehensive libraries for building AI applications, along with an ecosystem of partners with services and products that accelerate development.
The NVIDIA Jetson family has 4 types of devices -
- Jetson Nano
- Jetson TX1
- Jetson TX2
- Jetson AGX Xavier
It is always fun to deploy machine learning onto such powerful edge devices, which are capable of supporting almost all the deep learning frameworks, such as PyTorch, TensorFlow, MXNet (which is also officially adopted by Amazon AWS for services such as SageMaker), Caffe, etc. On top of that, NVIDIA has a powerful library called DeepStream, which comes in handy for problems such as object detection and object tracking.
Despite such powerful compute on Jetson hardware, it is still possible for performance to lag while running heavy deep learning algorithms in real time, such as YOLO, SSD, Faster R-CNN, etc.
It is important to keep the frame rate somewhere around 18–20 fps.
Being a deep learning engineer, I have had these issues in the past, but by utilising the Jetson machine's computational power in an efficient manner we can achieve a better frames-per-second rate with real-time object detection models.
Below are the settings we can tune in order to get maximum performance out of the AArch64 architecture.
1. Run the Jetson clocks
This should be the first step because the jetson_clocks script disables the DVFS governor and locks the clocks to their maximums as defined by the active nvpmodel power mode.
$ sudo jetson_clocks.sh
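On recent JetPack releases the script is installed simply as jetson_clocks (without the .sh suffix), and it can also print the current clock state, which is handy to verify that the clocks are actually pinned at their maximums:
$ sudo jetson_clocks --show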
2. Create SWAP
Swap space is a common aspect of computing today, regardless of operating system, and it is the second type of memory in modern Linux systems. Linux uses swap space to increase the amount of virtual memory available to a host; it can use one or more dedicated swap partitions or a swap file on a regular filesystem or logical volume. The primary function of swap space is to substitute disk space for RAM when physical RAM fills up and more space is needed.
$ fallocate -l 8G swapfile
$ sudo chmod 600 swapfile
$ sudo mkswap swapfile
$ sudo swapon swapfile
To check the RAM and swap usage:
$ free -m
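The swap file created above does not survive a reboot on its own. Assuming it was created at /swapfile (adjust the path to wherever your swap file actually lives), a standard fstab entry makes it permanent:
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab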
3. Tune the Energy profiles.
Nvidia offers the flexibility to change the CPU and GPU settings to adjust the performance and the power consumption of the Jetson TX2, and the nvpmodel tool offers some energy-performance profiles that are convenient and easy to switch.
Common Energy Profiles —
1. Max-Q
NVIDIA uses the term Max-Q to refer to maximum processing efficiency, so in this mode all components on the TX2 are configured for maximum efficiency. This configuration uses the values that give the best power-throughput tradeoff.
2. Max-P
Draws more power than Max-Q in order to increase the CPU clock frequencies. This mode increases performance at the expense of power consumption.
3. Max-N
According to the NVIDIA TX2 nvpmodel definition, in this mode the CPU and GPU clock frequencies are higher than in Max-P, sacrificing power consumption even more.
Comparison of the Energy Profiles
To switch to any of these profiles, use the command below. My personal preference is Max-N ;)
$ sudo nvpmodel -m <mode number for desired profile>
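To see which profile is currently active, query nvpmodel first; on the TX2, for example, Max-N is mode 0, but the mode numbering differs between Jetson models, so check the query output before switching:
$ sudo nvpmodel -q
$ sudo nvpmodel -m 0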
4. Accelerate deep learning models using the NVDLA Deep Learning Inference Compiler
Designing new custom hardware accelerators for deep learning is clearly popular, but achieving state-of-the-art performance and efficiency with a new design is a complex and challenging problem.
The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators. With its modular architecture, NVDLA is scalable, highly configurable, and designed to simplify integration and portability. The hardware supports a wide range of IoT devices.
NVDLA introduces a modular architecture designed to simplify configuration, integration and portability; it exposes the building blocks used to accelerate core Deep Learning inference operations. NVDLA hardware is comprised of the following components:
- Convolution Core — optimized high-performance convolution engine.
- Single Data Processor — single-point lookup engine for activation functions.
- Planar Data Processor — planar averaging engine for pooling.
- Channel Data Processor — multi-channel averaging engine for advanced normalization functions.
- Dedicated Memory and Data Reshape Engines — memory-to-memory transformation acceleration for tensor reshape and copy operations.
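As a rough sketch, the reference compiler from the nvdla/sw repository takes a Caffe network definition and weights and turns them into a loadable that the NVDLA runtime executes; the file names here are placeholders and the exact flags can differ between releases, so check the repository's README for your build:
$ ./nvdla_compiler --prototxt resnet50.prototxt --caffemodel resnet50.caffemodel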
5. Lastly, an alternative optimizing compiler to NVDLA is the TVM compiler.
TVM is an open deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between the productivity-focused deep learning frameworks, and the performance- or efficiency-oriented hardware backends. TVM provides the following main features:
- Compilation of deep learning models from Keras, MXNet, PyTorch, TensorFlow, CoreML, and DarkNet into minimal deployable modules on diverse hardware backends.
- Infrastructure to automatically generate and optimize tensor operators on more backends with better performance.
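As an illustration, recent TVM releases ship a tvmc command-line driver that can compile a model into a deployable module for a 64-bit ARM CPU; the ONNX file name below is a placeholder and the target string is one reasonable choice for an AArch64 Linux board, so adapt both to your own model and device:
$ tvmc compile --target "llvm -mtriple=aarch64-linux-gnu" --output resnet50_aarch64.tar resnet50.onnx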