NVIDIA Jetson Xavier NX: Unboxing + Review
We’re seeing how AI and machine learning are driving innovation across many industries. In particular, there has been a rise in products featuring on-device AI for improved performance, privacy, and other benefits.
Prototyping a physical product, especially one incorporating machine learning techniques, comes with many challenges.
This month, NVIDIA announced the Jetson Xavier NX, the newest addition to the Jetson platform, combining serious compute power with modularity and a small footprint in both power and size.
The board was designed to streamline the path to a minimum viable edge AI product. We got a chance to experiment with it, and these are our most exciting takeaways.
Impressive Performance
The Jetson Xavier NX pairs a GPU based on the NVIDIA Volta architecture, with 384 CUDA cores and 48 Tensor Cores, with 2 Deep Learning Accelerator (DLA) engines, delivering up to 21 Trillion Operations Per Second (TOPS) of deep learning performance within a 15 Watt power envelope.
We benchmarked a variety of models, maximizing the GPU and CPU resources on the device.
| Model Name | FPS |
| --- | ---: |
| inception_v4 | 306.6 |
| vgg19_N2 | 64.3 |
| super_resolution_bsd500 | 148.7 |
| unet-segmentation | 145.1 |
| pose_estimation | 236.4 |
| yolov3-tiny-416 | 544.7 |
| ResNet50_224x224 | 862.0 |
| ssd-mobilenet-v1 | 879.9 |
Nearly every model ran at hundreds of FPS, with an incredible ~880 FPS for ssd-mobilenet-v1! For context, an ssd-mobilenet-v1 model running in TensorFlow 1.12 alone on an NVIDIA GeForce GTX TITAN X card runs inference at ~33 FPS.
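If you want to gather numbers like these yourself, below is a minimal sketch built around TensorRT's trtexec tool. This is not the exact harness we used: the model file names are placeholders, and the throughput line it parses can differ between TensorRT versions, so treat it as a starting point.

```python
import re
import subprocess

# Placeholder ONNX models to benchmark; substitute your own files.
MODELS = ["inception_v4.onnx", "ssd-mobilenet-v1.onnx"]

def benchmark(model_path: str) -> float:
    """Run trtexec on one model and return the reported throughput."""
    # trtexec ships with TensorRT on JetPack (often at /usr/src/tensorrt/bin/trtexec).
    result = subprocess.run(
        ["trtexec", f"--onnx={model_path}", "--fp16"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: recent trtexec builds print a "Throughput: <n> qps" summary line.
    match = re.search(r"Throughput:\s*([\d.]+)", result.stdout)
    return float(match.group(1)) if match else float("nan")

if __name__ == "__main__":
    for model in MODELS:
        print(f"{model}: {benchmark(model):.1f} inferences/second")
```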
Running inference with large models like BERT on an embedded device once seemed out of reach: conventional wisdom said you needed a GPU with at least 12 GB of RAM for BERT Base and 64 GB for BERT Large. The Xavier NX can run inference on models like these at excellent speed. Benchmarking the BERT Base model on sequences of length 128:
results for 1000 iterations with batch size 1, sequence length 128:
Average Time: 9.03 ms, 110.69 batch/second, 110.69 sequences/second
95th Percentile: 9.59 ms, 104.22 batch/second, 104.22 sequences/second
99th Percentile: 9.85 ms, 101.54 batch/second, 101.54 sequences/second
And benchmarking the BERT Large model on sequences of length 128:
results for 1000 iterations with batch size 1, sequence length 128:
Average Time: 29.71 ms, 33.66 batch/second, 33.66 sequences/second
95th Percentile: 30.37 ms, 32.93 batch/second, 32.93 sequences/second
99th Percentile: 31.82 ms, 31.42 batch/second, 31.42 sequences/second
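Latency summaries like these are easy to reproduce for your own models. Here is a minimal sketch of timing repeated inference calls and reporting average and tail latencies; run_inference is a hypothetical stand-in for whatever TensorRT or framework call you are measuring.

```python
import time
import numpy as np

def summarize_latency(run_inference, iterations: int = 1000) -> None:
    """Time repeated inference calls and report average and tail latencies."""
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()  # hypothetical callable wrapping one batch of inference
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    for label, value in [
        ("Average Time", float(np.mean(latencies_ms))),
        ("95th Percentile", float(np.percentile(latencies_ms, 95))),
        ("99th Percentile", float(np.percentile(latencies_ms, 99))),
    ]:
        # With batch size 1, batches/second equals sequences/second.
        throughput = 1000.0 / value
        print(f"{label}: {value:.2f} ms, {throughput:.2f} sequences/second")
```

Calling summarize_latency(lambda: model(inputs)) after a few warm-up runs produces a report in the same shape as the numbers above.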
With the NX, inference speed will not be the bottleneck for many resource-hungry applications.
The Xavier NX also makes it easy to run many models simultaneously. The NVIDIA team released a multi-container demo that runs seven models seamlessly:
DeepStream container with people detection:
- ResNet-18 model with an input image size of 960x544x3, converted from TensorFlow to TensorRT.
Pose container with pose detection:
- ResNet-18 model with an input image resolution of 224x224, converted from PyTorch to TensorRT.
Gaze container with gaze detection:
- MTCNN model for face detection with an input image resolution of 260x135, converted from Caffe to TensorRT.
- NVIDIA facial landmarks model with an input resolution of 80x80 per face, converted from TensorFlow to TensorRT.
- NVIDIA gaze model with an input resolution of 224x224 per left eye, right eye, and whole face, converted from TensorFlow to TensorRT.
Voice container with speech recognition and natural language processing:
- QuartzNet-15x5 model for speech recognition, converted from PyTorch to TensorRT.
- BERT Base language model for NLP, converted from TensorFlow to TensorRT.
The Jetson Xavier NX can swiftly run models built with different frameworks simultaneously. We are currently using this demo to create a smart mirror assistant, which will require running many large models on high frame rate video and audio input. The demo runs these seven models in four Docker containers, making it a great starting point for developing an application using cloud-native tools.
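As a rough illustration of what that starting point looks like, here is a minimal sketch of launching several containers from Python. The image names are placeholders rather than the actual demo containers (which NVIDIA publishes on NGC and drives with its own scripts), and it assumes the NVIDIA container runtime that ships with JetPack.

```python
import subprocess

# Placeholder image names; the real demo uses containers published on NVIDIA NGC.
CONTAINERS = {
    "deepstream-demo": "nvcr.io/your-org/deepstream-people-detection:latest",
    "pose-demo": "nvcr.io/your-org/pose-estimation:latest",
    "gaze-demo": "nvcr.io/your-org/gaze-estimation:latest",
    "voice-demo": "nvcr.io/your-org/voice-nlp:latest",
}

def launch(name: str, image: str) -> None:
    """Start one container in the background with GPU access on Jetson."""
    subprocess.run(
        [
            "docker", "run", "-d", "--rm",
            "--runtime", "nvidia",  # expose the Jetson GPU to the container
            "--name", name,
            image,
        ],
        check=True,
    )

if __name__ == "__main__":
    for name, image in CONTAINERS.items():
        launch(name, image)
```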
Bringing Cloud-Native Computing to Edge AI Devices
Cloud-native development containers hosted on NVIDIA NGC, pretrained models downloadable from the model zoo, and optimized SDKs let developers take advantage of cloud-native development methodologies. Combined with software libraries that accelerate most, if not all, of the tasks in an AI pipeline, this makes building a proof of concept for an AI application a dream.
You can read more about our experiments with some of these tools in this post.
Powerful and Compact
Another aspect of the new Jetson Xavier NX that impressed us was its form factor and power requirements. The carrier board is as compact as the Jetson Nano's, and it adds features like on-board Wi-Fi!
Here is the full list of specs:
The Xavier NX module can also be used on its own in a more finished product. The kit is power efficient, drawing as little as 10 W and up to 15 W for maximum performance.
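For reference, that power/performance trade-off is managed with NVIDIA's nvpmodel tool. Below is a minimal sketch of querying and switching power modes from Python; the mode IDs differ between Jetson devices, so check the output of `sudo nvpmodel -q` and /etc/nvpmodel.conf on your board before assuming which ID maps to the 10 W or 15 W profile.

```python
import subprocess

def current_power_mode() -> str:
    """Query the active nvpmodel power mode (requires sudo on JetPack)."""
    result = subprocess.run(
        ["sudo", "nvpmodel", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

def set_power_mode(mode_id: int) -> None:
    """Switch power modes; mode IDs are defined in /etc/nvpmodel.conf."""
    subprocess.run(["sudo", "nvpmodel", "-m", str(mode_id)], check=True)

if __name__ == "__main__":
    print(current_power_mode())
    # Example only: uncomment with the mode ID for your board's 15 W profile.
    # set_power_mode(0)
```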
We are very excited to continue working with this newest addition to the Jetson platform. If you are developing an edge device that relies heavily on AI and machine learning, we recommend the Xavier NX for its impressive computing capabilities, supporting software, and small footprint.