Some time ago, Kacper Łęgowski and I experimented a bit with an amazing small computer produced by NVidia called Jetson Nano (you can read about some of our adventures in this post: Building a GPU-enabled Kubernetes cluster for machine learning with NVidia Jetson Nano). NVidia did an amazing job releasing Jetson Nano: in a very small form factor and with very low power consumption, we get a fully capable computer with a CUDA-enabled GPU and a very attractive price tag of $99. Jetson Nano is small and relatively powerful, enabling multiple IoT / edge computing use cases with a special focus on ML/AI workloads. Yet, as the smallest and cheapest member of NVidia’s Jetson family, Nano has its limitations. What if you like the small form factor and low power consumption, but at the same time you have a real need for speed? What if your use case requires more processing power on the CPU and/or GPU front? Well, the answer has just arrived in the form of Xavier NX (more product details: https://nvda.ws/3bqcNEx).
In a nutshell, Xavier NX is a Nano on steroids. A faster CPU (6-core ARM64 v8), a faster GPU (384 CUDA cores, 48 Tensor cores, Volta microarchitecture), more RAM (8 GB of fast LPDDR4) and improved connectivity (finally M.2 NVMe SSD support!) really open up a lot of new opportunities … and all this with a 10 W / 15 W power requirement. What is even more interesting is that Xavier NX comes in the form of a so-called SOM (System on a Module; the exact same concept is used by the Raspberry Pi Compute Module), which is pin-compatible with the Nano, meaning that you can treat it as a plug-out/plug-in replacement for your Nano setup. Just as with Nano, you can get the SOM alone (which you need to integrate with your own electronics) or a complete Dev-Kit, which includes the SOM and a carrier board turning the SOM into a fully functional microcomputer.
The Xavier NX Dev-Kit was released just a few weeks ago, and thanks to very friendly cooperation with NVidia, we had a chance to test it some weeks earlier to better understand it and evaluate its potential.
The initial impression we got just after receiving the board was super positive. Xavier NX is fast, quiet (even though it uses fan cooling instead of the big heatsink found in the Nano), stable and very well made. A lot of benchmarks of the board have been performed so far by multiple other sites (see for example: here, here, here and here). Obviously NVidia provided theirs as well, showing a massive speed increase compared to Nano (and other similar solutions), in many use cases over 20 times faster! This is impressive, taking into account that Nano was already very fast and actually outperformed many of the other popular single board computers.
Anyhow, instead of trusting the existing external benchmarks, I really wanted to dig into the actual numbers and experience the performance for myself. And so I went on with my own benchmarking. For it I used the Phoronix Test Suite, AI Benchmark and NVidia’s own torch2trt to see Xavier NX in action and compare it with the Nano. The results follow below…
Phoronix Test Suite
The Phoronix Test Suite (PTS) is an extensible and extensive general benchmarking solution for measuring all aspects of your machine. I must admit that I’m a big fan of https://www.phoronix.com/ where they publish some of the benchmarks of different machines that they run using PTS. This small benchmarking project was my first attempt to actually repeat what they typically do at Phoronix and to use their tools, which are awesome BTW! I picked a number of different tests and compared the results from Nano and Xavier NX. It took a few days (or rather weeks, taking into account my scattered schedule) to configure and run everything… but finally here they are:
Note that “HIB” stands for “Higher is Better” and “LIB” for “Lower is Better”; therefore the speed-up is calculated as the ratio of Xavier to Nano for HIB, and of Nano to Xavier for LIB.
Clearly Xavier NX comes out as the winner, yet the speed-up ranges from 1.2 to 4.6 depending on the test, and although this is strictly meaningless, the averaged speed-up is 2.21. Taking into account that the price of Xavier NX is basically 4 times higher than that of Nano, in this mix of general-purpose tests it is unclear whether this is a worthy upgrade. Yet we need to take into account that these tests did not utilise the GPU and did not really touch the AI use cases where Xavier shines (unfortunately I struggled to use the pts/machine-learning test suite of the PTS; there are compatibility issues on the AArch64 architecture). On the other hand, these results show that Xavier NX provides a measurable speed-up in basically any use case, so if the performance of Nano is a problem for you and you are willing to pay for a speed-up, it still seems that Xavier NX makes sense!
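The HIB/LIB rule described above can be written down as a tiny helper. This is only a minimal sketch: the function name speed_up and the sample numbers are made up for illustration and are not results from my test runs.

```python
def speed_up(nano: float, xavier: float, higher_is_better: bool) -> float:
    """Return how many times faster Xavier NX is than Nano for one test."""
    if nano <= 0 or xavier <= 0:
        raise ValueError("benchmark results must be positive")
    # HIB: a bigger result wins, so divide Xavier by Nano.
    # LIB: a smaller result wins, so divide Nano by Xavier.
    return xavier / nano if higher_is_better else nano / xavier


# Hypothetical numbers, only to show the direction of each ratio.
print(speed_up(nano=100.0, xavier=250.0, higher_is_better=True))   # 2.5
print(speed_up(nano=60.0, xavier=20.0, higher_is_better=False))    # 3.0
```

Either way, a value above 1 means that Xavier NX did better in that test.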
AI Benchmark is a set of benchmarks that allows you to compare the inference and training times of AI-oriented CPUs / GPUs / TPUs. Various well-known ML models are evaluated. Below is the comparison of inference measurements for Nano vs. Xavier NX. All times are given in milliseconds (ms).
Notice that the final inference score for Nano was 182, while Xavier NX scored 1058, which is 5.8 times higher! A great win for Xavier and clearly a justification of the 4x price increase over Nano. So if you are using your Jetson for ML inference, by paying 4 times the price of Nano for Xavier NX you get, on average, more than 5 times the performance. Nice!
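To make the arithmetic behind this claim explicit, here is a short worked example. It uses the two scores quoted above and the rough 4x price ratio; the variable names are my own, and "performance per price" is just a back-of-the-envelope measure, not an official metric of AI Benchmark.

```python
nano_score = 182        # AI Benchmark inference score, Jetson Nano
xavier_score = 1058     # AI Benchmark inference score, Xavier NX
price_ratio = 4.0       # Xavier NX costs roughly 4x as much as Nano

# How many times higher the Xavier NX score is.
performance_ratio = xavier_score / nano_score

# Above 1.0 means more inference performance per unit of money than Nano.
performance_per_price = performance_ratio / price_ratio

print(f"performance ratio: {performance_ratio:.2f}x")          # 5.81x
print(f"performance per price: {performance_per_price:.2f}x")  # 1.45x
```

By contrast, the 2.21 averaged speed-up from the general-purpose tests gives 2.21 / 4 ≈ 0.55, which is why the upgrade is much easier to justify for ML inference than for general computing.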
One of the things that I’ve experienced while trying to run this benchmark is that the performance depends very strongly on the utilisation of the GPU and its memory. The results shown above are from a freshly booted machine without the X11 GUI loaded (I booted JetPack in text mode). Running the same benchmark from within the GUI gives much lower marks on Xavier NX and fails on Nano due to lack of memory. Moreover, there seems to be a memory leak somewhere in the benchmark, the driver or one of the libraries, since when you re-run the same benchmark a couple of times in a row, the available GPU memory is lower in each run. So if you struggle with memory issues in TensorFlow on Nano or Xavier NX, you should examine this more deeply in your use case.
AI Benchmark also offers a training benchmark. Unfortunately it doesn’t run on Jetson Nano, which makes perfect sense, as both Nano and Xavier NX are rather meant for inference. Yet it did run on Xavier NX and here are the results:
Since we can’t get the Jetson Nano scored for training (there is too little memory and the GPU is not capable enough), to get some context for the results we can look at http://ai-benchmark.com/ranking_deeplearning.html, where a ranking of different CPUs and GPUs is given. The Device AI Score of 2103 achieved by Xavier NX corresponds to Intel i7 CPUs from the 8th and 9th generation, which isn’t bad taking into account prices and power consumption. Anyhow, the Jetson family is mostly for inference, and as we have already seen, Xavier NX gives a reasonable upgrade over Nano.
An important thing to note is that AI Benchmark uses TensorFlow, which is pretty nicely optimised for NVidia CUDA, BUT it does not use NVidia’s TensorRT for inference. TensorRT is an inference-only runtime SDK, which can transform a pre-trained model and represent it in a form which is optimal for a given NVidia device. It also comes with a library of pre-trained models that you can use in your solutions. Unfortunately, I wasn’t able to easily change the AI Benchmark code to test inference using TensorRT. Luckily, NVidia provides a separate tool, torch2trt, for converting models from PyTorch to TensorRT, and along with this tool they also give a simple testing/benchmarking script. The conversion tool itself is fine, but in the last section of this review we will concentrate on the benchmark results.
torch2trt provides a testing script conveniently named “test.sh”, which executes inference benchmarks for a number of models using pure PyTorch vs. TensorRT. For both frameworks, throughput and latency are measured. Just in case you are not familiar with these measures: we typically aim for maximal throughput and minimal latency.
First let’s look at the results obtained for Jetson Nano:
(Note: unfortunately some of the models available in the test script didn’t run on Jetson Nano because the board ran out of memory, which is strange.)
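If you want to watch for this kind of shrinking memory yourself, one simple option is to log the available memory before and after every run. The sketch below is my own illustration and was not part of the benchmark: it assumes a Linux system and relies on the fact that on Jetson boards the GPU shares physical RAM with the CPU, so the MemAvailable value in /proc/meminfo is a rough proxy for what the GPU can still allocate. The function names are hypothetical.

```python
def mem_available_mib(meminfo_text: str) -> float:
    """Extract MemAvailable (in MiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    1234567 kB"
            return int(line.split()[1]) / 1024
    raise ValueError("MemAvailable not found")


def read_mem_available_mib(path: str = "/proc/meminfo") -> float:
    """Read the current value from the running system (Linux only)."""
    with open(path) as f:
        return mem_available_mib(f.read())


# Illustrative input only, not a measurement from either board.
sample = "MemTotal:        4051048 kB\nMemAvailable:    2048000 kB\n"
print(mem_available_mib(sample))  # 2000.0
```

Calling read_mem_available_mib() before and after each benchmark run and printing the difference should make a steady decrease across runs easy to spot.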
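To make the two measures concrete, here is a minimal timing sketch. It is not the code of torch2trt’s own test script; the names measure and fake_inference are hypothetical, and fake_inference is a stand-in CPU workload used in place of a real model call.

```python
import time
from typing import Callable, Tuple


def measure(run_once: Callable[[], object],
            iterations: int = 50,
            warmup: int = 5) -> Tuple[float, float]:
    """Return (mean latency in ms, throughput in calls per second)."""
    # Warm-up calls are not timed, so one-off start-up costs are excluded.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iterations * 1000.0   # lower is better
    throughput = iterations / elapsed            # higher is better
    return latency_ms, throughput


def fake_inference() -> int:
    # Stand-in workload; replace with a call to your model.
    return sum(i * i for i in range(10_000))


latency_ms, throughput = measure(fake_inference)
print(f"latency: {latency_ms:.3f} ms, throughput: {throughput:.1f} calls/s")
```

In this simple sequential loop the two numbers are just reciprocals of each other. With a real GPU model the work is launched asynchronously, so the timed function would need to wait for the GPU to finish (for example with torch.cuda.synchronize() in PyTorch), and throughput can be much higher than 1 / latency when many requests are queued or batched.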
First of all, as we can see using TensorRT helps a LOT! Just by changing the runtime it was possible to increase the throughput multiple times on Jetson Nano. This is an important observation! If you are considering moving from Jetson Nano to Jetson Xavier NX due to the inference performance, first consider changing your code to make sure that TensorRT is used correctly! But assuming that you still need more… let’s examine the results of the same test on Jetson Xavier NX:
We can observe two important things! First of all… similarly to Jetson Nano, using TensorRT helps a LOT! If you do inference on Jetson family products — you probably should always be using TensorRT if you can!
Secondly, it is already clear that Jetson Xavier NX is again much faster than Jetson Nano. But how much faster is it? Let’s compare the results. For throughput we divide Xavier NX by Nano, and for latency we do the opposite: Nano by Xavier NX. Here are the outcomes:
As expected, Xavier NX is always faster. The speed-up in the case of TensorRT ranges from 1.5 to almost 6 times. Interestingly, when using bare PyTorch the speed-ups are higher, yet keep in mind that PyTorch is much slower than TensorRT in absolute terms.
All in all, we didn’t replicate the 20 times speed-up reported by NVidia, but we got 11.5 times, which is close enough… Xavier NX clearly can bring a substantial boost in certain workloads!
The final outcome seems relatively clear. Xavier NX is a highly performant small computer with low power usage. Yet, if you are looking for a very cheap, general purpose, single board computer — go with Jetson Nano (if you need ML inference) or Raspberry Pi (if you don’t care about ML and high performance in general). On the other hand, if your main point of concern is ML inference and you need a low power, (relatively) low cost, small dimensions device — Jetson Xavier NX is a perfect fit!
What’s more interesting is that for many projects you can start with a Jetson Nano, and as soon as performance starts to be a problem, you should be able to upgrade to Xavier NX without issues: the software is basically the same, the form factor is the same, the look-and-feel is the same and the performance is better (sometimes MUCH better). The final decision depends on your use case… I see a lot of places where Xavier NX will be a great fit and can’t wait to see what the community (and our teams at Jit Team) do with this amazing new piece of technology.
Thanks to NVidia for providing us with a test Xavier NX board and I hope that a lot of great product launches are ahead of us! Cheers!