For an autonomous vehicle (AV), fast reaction speed combined with safe behavior is what makes a good driver. Machine learning (ML) models serve as the brains of an AV, allowing it to process the world around it and make driving decisions. To run these ML models safely on the AV, they must execute with low latency and high computational precision. ML compilers are a critical component of the AV development process, compressing and optimizing trained ML models so that they run faster and more efficiently while producing the same results. Nuro has built a fully in-house compiler framework called Faster Than Light (FTL) which allows for flexible, state-of-the-art model deployments. This development aligns with Nuro’s AI-first approach to autonomous vehicle development, showcasing the company’s technical leadership as one of the foremost innovators in the field.

Motivation

At Nuro, we develop many types of models (e.g. vision, planning, prediction) across many architectures (CNNs, transformers, RNNs, etc.). We quickly learned that no single industry solution could easily support all of these model types, since each industry compiler supports only a limited set of operators and architectures.

Traditionally, compilers dedicated to a single training framework, like TensorFlow-TensorRT (TF-TRT), have delivered sufficient performance for industry model deployment at company scale. However, as the speed of AI innovation at Nuro and in the community accelerated, we wanted to leverage advanced techniques like quantization, along with new operators, frameworks, and hardware, without restricting ourselves to one framework or compiler. Thus, we laid out the requirements for our model optimization solution.

  1. Can ingest models from different training frameworks (e.g. TensorFlow, PyTorch).
  2. Highly customizable by users in order to fit a variety of models’ needs.
  3. Supports advanced features such as multi-GPU inference and quantization.

The solution

Over the last several years, we built the Faster Than Light (FTL) Compiler Framework. It allows the flexible application of multiple industry compilers in a single compilation, within a highly customizable Python environment. The solution involves a few key features, each building on the last.

Orchestrator Compiler

In order to support a wider range of ML model architectures, we built a framework that can execute multiple industry compilers and produce a single, unified compiled binary.

Fig. 1: Supporting multiple training frameworks with ONNX.

We first decided that our compiler would consume ML graphs in ONNX representation. ONNX is an open-source format, originally developed by Microsoft and Facebook, that provides a standardized representation of ML operations. Converters like tf2onnx and torch.onnx allow us to compile models from different frameworks seamlessly (see Fig. 1). From this point onwards, it was simple to implement a single converter from ONNX to the FTL intermediate representation (IR). This is FTL’s internal representation of different ML layers, built from scratch using MLIR, a leading industry IR, as inspiration.
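
As a concrete sketch, exporting a toy PyTorch model to ONNX with torch.onnx looks like the following (tf2onnx plays the analogous role for TensorFlow); the model, file name, and tensor names here are placeholders rather than anything from Nuro’s stack.

```python
import torch

# Toy model standing in for a real vision network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export the traced graph to a standardized ONNX representation.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["features"],
    opset_version=17,
)
```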

Fig. 2a: Example of how the Orchestrator Segmenter compiles relevant parts of the graph to TensorRT.

Fig. 2b: Example of incorporating multiple sub-compiler passes to produce a final stitched binary.

In order to support multiple compilers in a single compilation, the compiler groups layers into ‘islands’, or segments. Algorithmically, it’s simple: the compiler greedily collects layers in topological order, stopping whenever a layer fails a predefined criterion. As presented in Fig. 2a, this produces segments to send to the TensorRT compiler (as an example). The predefined criteria are written in Python, enabling us to easily filter out specific layers that are known to have issues with TensorRT, or are simply not implemented. They also allow us to pre-process inputs to a segment (e.g. add casts around inputs of an unsupported data type).
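
As a rough illustration of the segmentation pass (not FTL’s actual code), the greedy loop can be sketched as below, where the Node type and the supported predicate stand in for FTL’s IR and the Python-defined criteria.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Stand-in for an FTL IR node."""
    name: str
    op_type: str


def segment(nodes_in_topo_order: List[Node],
            supported: Callable[[Node], bool]) -> List[List[Node]]:
    """Greedily group consecutive supported layers into segments."""
    segments: List[List[Node]] = []
    current: List[Node] = []
    for node in nodes_in_topo_order:
        if supported(node):
            current.append(node)          # keep growing the current segment
            continue
        if current:
            segments.append(current)      # close the segment at the failing layer
            current = []
        segments.append([node])           # unsupported layer goes to a fallback segment
    if current:
        segments.append(current)
    return segments


# Example criterion: send everything except known-problematic ops to TensorRT.
trt_supported = lambda n: n.op_type not in {"ScatterReduce", "NonMaxSuppression"}
```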

Fig. 2b shows how multiple sub-compilers can be utilized to compile different portions of the graph.

Each of the industry compilers (TensorRT, ONNX Runtime, Inductor, OpenXLA) provides APIs to execute its binaries. These APIs are integrated into the FTL executor so that the stitched graph can execute as a single, unified binary.
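
As an example of the kind of API the executor wraps, running a compiled ONNX graph through ONNX Runtime’s Python interface looks roughly like this; the model path and input name are placeholders.

```python
import numpy as np
import onnxruntime as ort

# Load the compiled graph and prefer the GPU execution provider if available.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Run the graph; passing None fetches every model output.
outputs = session.run(
    None,
    {"image": np.random.rand(1, 3, 224, 224).astype(np.float32)},
)
```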

Fig. 3: Relative improvements in onboard resource utilization after general adoption of FTL.

Fig. 3 shows the relative improvements in latency and resource consumption of models after full-scale adoption of FTL. As an example, migrating a particular vision model from a legacy compiler to FTL, which internally used a mix of TensorRT and custom in-house compilation rules, resulted in a 40% latency reduction for that model.

Custom kernel injection

FTL supports injecting custom GPU kernels into the final compiled model, allowing the compiler team to pick specific implementations for specific layers. These custom kernels can be written in CUDA, Triton, or Pallas.

Fig. 4: Injecting a PyTorch GPU kernel into the final compiled graph.

An example demonstrating the flexibility of FTL’s kernel injection is how the team added support for the scatter-reduce max operation. This operation wasn’t well supported in Nuro’s previous compiler, so the team linked the PyTorch scatter-reduce GPU kernel directly into FTL, allowing that specific layer to execute using PyTorch’s highly optimized CUDA kernel.
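
For reference, the operation in question can be expressed with PyTorch’s scatter_reduce, which is backed by the CUDA kernel described above; the toy tensors below just illustrate the semantics.

```python
import torch

src = torch.tensor([1.0, 5.0, 2.0, 4.0, 3.0])
index = torch.tensor([0, 0, 1, 1, 2])
out = torch.zeros(3)

# For each position i, out[index[i]] becomes the max of all src values routed to it.
result = out.scatter_reduce(0, index, src, reduce="amax", include_self=False)
print(result)  # tensor([5., 4., 3.])
```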

Arbitrary user-defined compilation rules

Oftentimes, the conversion process from the training framework (e.g. TensorFlow, PyTorch) to ONNX can produce suboptimal or even incorrect conversions and decompositions of some layers. FTL gives users the means to configure any part of the compilation process to fit their model’s needs.

Fig. 5: Example of issues brought about by the model export / conversion process.

FTL provides flexible tools to avoid layer decomposition during model export. Fig. 5 shows an example of such a decomposition and how users can define mappings to avoid it during FTL conversion and compilation.
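
As one illustration in plain torch.onnx terms (not FTL’s API), a decomposition like this can be avoided at export time by overriding the op’s ONNX symbolic so it maps to a single custom-domain node that downstream compilation rules can recognize; the GELU example and the mydomain::Gelu name below are hypothetical.

```python
import torch
from torch.onnx import register_custom_op_symbolic


def gelu_as_single_node(g, input_tensor, approximate):
    # Emit one custom-domain node instead of the default Erf-based decomposition.
    return g.op("mydomain::Gelu", input_tensor)


# Override the symbolic used by the TorchScript-based exporter for aten::gelu.
register_custom_op_symbolic("aten::gelu", gelu_as_single_node, opset_version=13)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.GELU())
torch.onnx.export(model, torch.randn(1, 8), "gelu_preserved.onnx", opset_version=13)
```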

Beyond the export and conversion of the training graph, third-party compilers are notorious for causing silent regressions in a model’s performance.

Fig. 6: Using the FTL Segment Breaker in the exported graph to isolate and configure a subgraph to compile in FP32.

Fig. 6 shows how a user can configure a custom compilation rule for a specific part of the graph in a simple three-step process. Note that in step 3, the final model binary is stitched around any custom compilation rule overrides. The rules are written in Python and can reference any graph or node attribute.
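
The exact rule API is internal to FTL, but conceptually a rule pairs a Python predicate over node attributes with a compilation override, along these purely illustrative lines:

```python
# Hypothetical sketch: the node attributes and the "precision"/"backend" override
# keys below are illustrative, not FTL's actual configuration schema.
def needs_fp32(node) -> bool:
    """Match the accuracy-sensitive subgraph isolated by the Segment Breaker."""
    return node.name.startswith("decoder/attention")


CUSTOM_RULES = [
    # (predicate over graph/node attributes, compilation override for matches)
    (needs_fp32, {"precision": "fp32", "backend": "tensorrt"}),
]
```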

Multi-GPU inference

Over the years, our models have evolved from a model-per-sensor approach to a modernized, multi-task unified model. While a unified model uses less compute and memory than separate models, it limits model size to what a single GPU can support. Multi-GPU inference allows us to execute the model on multiple devices with just one model definition. Nuro’s approach uses pipeline parallelism, which splits the model into subgraphs that run on different GPUs, as opposed to model parallelism, which splits individual kernels across devices.

Fig. 7: How the user can configure multi-GPU inference in their model.

Manually splitting the model is cumbersome, so we designed an API that allows users to configure a model to utilize multiple GPUs. The split is configured through a cross-gpu-copy layer: the user simply inserts it wherever in the model they want to shift computation to another GPU, and FTL does the rest, as shown in Fig. 7.
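
Conceptually, the resulting execution resembles a hand-written pipeline split like the plain-PyTorch sketch below (not the FTL API), which assumes a machine with two GPUs; FTL automates this from a single cross-gpu-copy marker in the model definition.

```python
import torch
import torch.nn as nn


class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First pipeline stage lives on GPU 0, second on GPU 1.
        self.stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(512, 128).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        x = x.to("cuda:1")  # the cross-GPU copy boundary
        return self.stage1(x)
```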

Fig. 8: Nuro perception detector latency over time. Note the ~27% drop in latency after multi-GPU inference is applied.

Execution Priority Control & Early Publishing

In Nuro’s autonomy stack, certain downstream processes depend on only a small subset of a given model’s outputs. Traditionally, model graphs execute as one atomic operation, meaning those downstream processes must wait unnecessarily for the entire model to finish executing. This motivated the development of the early model output publishing feature.

With this feature, users can mark each model output with a priority. This information allows FTL to prioritize executing parts of the model so that the user-provided priorities are respected. These outputs are then made available to downstream processes using an event-based approach: a downstream process simply queries the event for the specific model output and can begin execution as soon as it’s ready, even before the model has completed its full execution.
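
A minimal sketch of the event-based pattern (not the FTL API) is shown below; output names, priorities, and timings are illustrative.

```python
import threading
import time

# Each output gets an event that is set as soon as its subgraph finishes,
# so a consumer can start before the whole model completes.
output_events = {"detections": threading.Event(), "segmentation": threading.Event()}
output_buffers = {}


def run_model():
    # The executor schedules the high-priority output's subgraph first.
    output_buffers["detections"] = "detections_result"      # stand-in for real compute
    output_events["detections"].set()                       # publish early
    time.sleep(0.1)                                          # rest of the model still running
    output_buffers["segmentation"] = "segmentation_result"
    output_events["segmentation"].set()


def downstream_consumer():
    output_events["detections"].wait()  # unblocks before the full model finishes
    print("consumed:", output_buffers["detections"])


threading.Thread(target=run_model).start()
downstream_consumer()
```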

Closing

The FTL Model Compiler Framework was a long-term investment in a custom ML model optimization sandbox built to fit Nuro’s needs. Recently, the investment has been paying off, with the autonomy stack enjoying significant reductions in CPU compute utilization, GPU compute utilization, and GPU memory consumption, and paving the way for the deployment of our unified multi-head vision model with multi-GPU inference.

Beyond the performance gains, FTL has provided Nuro with a unified optimization platform, which makes it easy to deliver performance improvements to all models with a single code change. Additionally, non-ML-infra teams (such as onboard performance) can look to one place to optimize CPU utilization and upgrade CUDA APIs, resulting in modernized and maintainable ML inference code.

The FTL Model Compiler Framework exemplifies Nuro’s AI-first approach, allowing the company to efficiently optimize and deploy cutting-edge ML models. This framework reinforces Nuro’s leadership in autonomous vehicle ML development, demonstrating the company’s commitment to creating safer, smarter, and more efficient self-driving technologies that will benefit society as a whole.

Special thanks to the team at Codeium, Timothy Chou, Shuai Shao, and Xin Liu for their contributions to FTL.

If this kind of work excites you, join our team!

By: Ali Boubezari, Muyang Yu, Nikolay Korovaiko, Hongze Zhao
