How Lyft Uses PyTorch to Power Machine Learning for Their Self-Driving Cars
Authors: Sammy Sidhu, Qiangui (Jack) Huang, Ray Gao — Lyft Level 5
Lyft’s mission is to improve people’s lives with the world’s best transportation. We believe in a future where self-driving cars make transportation safer and more accessible for everyone. That’s why Level 5, Lyft’s self-driving division, is developing a complete autonomous system for the Lyft network to provide riders access to the benefits of this technology. However, this is an incredibly complex task.
In our development, we use a large variety of machine learning algorithms to power our self-driving cars, solving problems in mapping, perception, prediction, and planning. To develop these models, we train and validate them on millions of images and dense LiDAR/RADAR point clouds, as well as many other types of inputs such as agent trajectories or video sequences. These models are then deployed on Lyft’s autonomous vehicle (AV) platform where they need to generate inferences such as bounding boxes, traffic light states, and vehicle trajectories on the order of milliseconds.
The frameworks we built early on were designed to quickly ramp up machine learning efforts in the first year of our program, allowing us to get our product on the road. But these solutions did not give us the development velocity and scale we needed to solve problems that arose as our program grew. Our machine learning framework needed to allow our team to build and train complex models in hours — not days — and enable them to validate metrics and deploy deep learning models to the AV fleet. We now have a framework that not only satisfies all our requirements to iterate fast and scale, but also unifies machine learning development for all engineers and researchers across Lyft Level 5. Read on to learn how we got there.
Building the Right ML Framework for Self-Driving
We believe that rapid iteration and adaptation is a key to success at Level 5. This principle applies both to our Machine Learning (ML) models and our ML tools. When we started Lyft Level 5 in 2017, we trained some basic computer vision models on our desktops. Just a few months later, we built our first in-house training framework to enable us to scale. With this framework, we deployed nearly a dozen models on the AV, but soon realized that we needed a paradigm shift with focus on the following key principles:
- Iterate on Models in Hours: Our first production models were taking days to train due to increases in data and model sizes. We wanted our engineers and researchers to be able to start with a new idea, implement the model, and see production-quality results in a matter of hours, even as our dataset sizes continue to grow.
- Seamless Model Deployment: Our initial deployment process was complex, and model developers had to jump through many hoops to turn a trained model into an “AV-ready” deployable model, meaning one that can run in our cars, in a C++ runtime, with low latency and jitter. We wanted to make inference speed and deployability a priority. We also needed it to be easy for engineers to add new custom layers for both training and inference.
- Unified Experimentation and Research: There is often a divide between the ideal development environment for engineers vs. researchers. At Lyft, we saw the opportunity to eliminate this divide as one of the keys to rapidly advancing our ML systems; we do not want two stacks. Things like build environments, cloud compute resources, logging, etc. should be turnkey and just work for everyone.
- Resource Optimization for Hardware: We saw from low GPU and CPU utilization that our initial training framework wasn’t able to fully utilize the hardware. As our dataset sizes and team grew, this meant longer training times for model developers and left potential cost savings on the table. We wanted to make our framework hardware-aware to squeeze out all the performance we could, for both faster training and lower cost.
After considering these principles, we decided to create a solution that incorporated PyTorch at the heart of our next-generation machine learning framework. In 6 months, we built a prototype, migrated over a dozen production AV models from our last framework, onboarded over 40 machine learning engineers, and created a unified ML framework for all of Level 5.
Building the Proof-of-Concept
Before we went all in on PyTorch, we wanted to validate that it could accommodate our use cases. To accomplish this, we would need to see one of our models implemented in PyTorch, trained on our data, and deployed to our C++ self-driving stack.
The candidate modeling task we chose to implement end-to-end was LiDAR segmentation, a task that takes in a 3D LiDAR point cloud and classifies each point into a class such as ground, car, or person (see Figure 1).
We started by writing a PyTorch Dataset class that operated on binary files of annotated point clouds. Then we had to write a custom collate function to aggregate differently sized data items into a batch so we could use a PyTorch DataLoader.
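As a rough sketch (with hypothetical names, not our production code), such a Dataset and collate function for variable-length point clouds might look like this:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PointCloudDataset(Dataset):
    """Hypothetical dataset yielding variable-length annotated point clouds."""

    def __init__(self, clouds):
        # clouds: list of (points [N_i, 3], labels [N_i]) pairs
        self.clouds = clouds

    def __len__(self):
        return len(self.clouds)

    def __getitem__(self, idx):
        return self.clouds[idx]

def collate_point_clouds(batch):
    """Concatenate variable-sized clouds, tracking which item each point came from."""
    points = torch.cat([p for p, _ in batch], dim=0)
    labels = torch.cat([l for _, l in batch], dim=0)
    batch_idx = torch.cat([
        torch.full((p.shape[0],), i, dtype=torch.long)
        for i, (p, _) in enumerate(batch)
    ])
    return points, labels, batch_idx

# Two clouds with 5 and 8 points collate into one flat batch of 13 points.
clouds = [
    (torch.randn(5, 3), torch.zeros(5, dtype=torch.long)),
    (torch.randn(8, 3), torch.ones(8, dtype=torch.long)),
]
loader = DataLoader(PointCloudDataset(clouds), batch_size=2,
                    collate_fn=collate_point_clouds)
points, labels, batch_idx = next(iter(loader))
```

Flattening the batch this way, with a per-point batch index, is one common way to handle clouds of different sizes without padding.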
Given that we now had a working data loading pipeline, we set out to implement our model. Our model at the time had the following structure:
- Take in sparse point cloud + per point metadata
- Featurize each point
- Voxelize featurized point cloud
- Run dense convolutions in voxel space
- Map voxels back to per point (deVoxelization)
- Classify each point
Some of these stages, such as (de)voxelization, were handwritten in CUDA in our prior implementation and took weeks of engineering time. We found that implementing them in native PyTorch with primitives such as scatter_add and index_select gave us similar performance without resorting to handwritten kernels, allowing us to produce the same models in days.
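To illustrate the idea (a simplified sketch, not our production kernels), mean-pooling per-point features into voxels with scatter_add, and mapping voxel features back to points with index_select, can be written as:

```python
import torch

def voxelize(features, voxel_ids, num_voxels):
    """Mean-pool per-point features into voxels using scatter_add."""
    C = features.shape[1]
    sums = torch.zeros(num_voxels, C).scatter_add_(
        0, voxel_ids.unsqueeze(1).expand(-1, C), features)
    counts = torch.zeros(num_voxels).scatter_add_(
        0, voxel_ids, torch.ones_like(voxel_ids, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

def devoxelize(voxel_features, voxel_ids):
    """Map voxel features back to each point with index_select."""
    return voxel_features.index_select(0, voxel_ids)

# Three points in two voxels: points 0 and 1 share voxel 0, point 2 is in voxel 1.
feats = torch.tensor([[1.0], [3.0], [5.0]])
ids = torch.tensor([0, 0, 1])
vox = voxelize(feats, ids, num_voxels=2)   # voxel 0 averages to 2.0, voxel 1 is 5.0
per_point = devoxelize(vox, ids)           # each point gets its voxel's feature back
```

Because both primitives are differentiable and run on the GPU, the whole pipeline stays in autograd with no custom kernels.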
For the rest of the model, we were able to leverage the modules in torch’s `nn` package, using operators like convolutions and various loss functions. Once we had a model implementation, we wrote some standard boilerplate training code and were able to converge our model on our dataset.
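The training boilerplate reduces to the standard PyTorch loop; a minimal stand-in (with a toy per-point classifier in place of our real model) looks like:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in: classify each 3D point into one of 4 classes.
model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

points = torch.randn(64, 3)
labels = torch.randint(0, 4, (64,))

losses = []
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(points), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
# the loss should trend downward as the model fits this batch
```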
Now that we had a way to produce a trained model on our dataset, the only thing left was to get it working in our C++ self-driving stack. At the time of this Proof-of-Concept (PyTorch 1.3), we found two options for how to export and run a “frozen model” in a C++ runtime environment:
- Export our Torch Model to ONNX and run the ONNX model in TensorRT or ONNX Runtime
- Build our models with TorchScript and run the saved serialized models in LibTorch
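A minimal sketch of the second option (with a hypothetical model head) is to compile the module with torch.jit.script and save the serialized archive, which LibTorch can then load from C++:

```python
import torch
from torch import nn

class TinyHead(nn.Module):
    """Hypothetical stand-in for a deployable model head."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x).softmax(dim=-1)

scripted = torch.jit.script(TinyHead())  # compile to TorchScript
scripted.save("tiny_head.pt")            # archive loadable via torch::jit::load in C++
```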
Learning from our past framework: While ONNX and TensorRT have been around longer and are more optimized for inference speed in some cases, we valued the ability to deploy models quickly and a significantly more lightweight pipeline (fewer external libraries and dependencies). This allows us to experiment quickly without locking ourselves into the various limitations of ONNX or writing custom TensorRT plugins. We know we will always need to write custom kernels and operations, but we would rather write LibTorch extensions (see the section below for more details) than add more black-box external layers. For these reasons, we decided to evaluate TorchScript for inference.
The last step of the evaluation was to integrate our TorchScript model into our self-driving stack. As a starting point, we used a PyTorch-provided shared library build of LibTorch and integrated it into our build. Then we were able to utilize the LibTorch C++ API to integrate the model into our LiDAR stack. We found the C++ API to be just as user-friendly as the Python PyTorch API.
Finally, we had to validate that the model was within our latency budget, and we found that we had actually achieved lower latency and lower jitter than the then-current production model.
Overall, we felt that our Proof-of-Concept had gone very well, and we decided to move forward with productionizing PyTorch at Lyft Level 5 for both training and deployment.
Creating a PyTorch Production Framework
As part of productionizing PyTorch and to empower our ML engineers, we created a Lyft-internal framework called Jadoo (see Figure 2). Unlike some frameworks that try to make ML “easier” for non-experts by abstracting away everything that makes PyTorch great, our goal was to provide a robust ecosystem that simplifies the runtime environment and compute infrastructure, allowing ML engineers and researchers to iterate and deploy code quickly.
Some core features of Jadoo include:
- All jobs are distributed. Jadoo is distributed from the get-go; all jobs are natively distributed jobs with the base case of one-GPU-one-node. An engineer building a model locally can then train a job in the cloud with hundreds of GPUs with a single command line argument change. Whether an experiment is running on one GPU, one node, or dozens of nodes, we maintain the same code path and exact same Docker image which allows us to avoid any surprises with local/cloud training. We also provide the tools for users to be able to quickly discover issues in the distributed setting such as GPU-utilization, network bottlenecks, and multi-node logging.
- Inference is a priority. In Jadoo, all models are profiled for runtime and made deployable to the AV. We capture operation counts for all layers, measure inference latency, and store this information so users can make Pareto-optimal speed-accuracy tradeoffs. We also ensure that every model trained can be deployed with TorchScript for both C++ and Python deployments, without requiring special pipelines to convert it to an inference model.
- Combine research iteration speed with production quality. Jadoo aims to give researchers the freedom to experiment and iterate while enforcing engineering best practices. All code checked into Jadoo goes through GPU-based CI (Continuous Integration) with unit tests, static type-checking, style linting, documentation checks, determinism checks, and profiling. To enable fast experimentation, we use code-as-configuration, eliminating large and unwieldy json/yaml files by extracting experiment parameters from the user’s code automatically. We enforce strict reproducibility for all jobs by versioning and recording data, experiment code, and artifacts. Finally, we provide an ecosystem of tools for logging, visualization, and job tracking that allows users to quickly debug and introspect their work.
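To give a flavor of the inference profiling above, a simple wall-clock latency probe for a TorchScript model (a sketch of the idea, not Jadoo’s actual profiler) might look like:

```python
import time
import torch

def profile_latency(scripted_model, example, warmup=10, iters=100):
    """Return the median wall-clock inference latency (seconds) of a TorchScript model."""
    with torch.no_grad():
        for _ in range(warmup):          # warm up caches and JIT optimizations
            scripted_model(example)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            scripted_model(example)
            times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]        # median is robust to scheduling jitter

# Toy model standing in for a deployed network.
model = torch.jit.script(torch.nn.Linear(16, 4))
median_s = profile_latency(model, torch.randn(1, 16))
param_count = sum(p.numel() for p in model.parameters())
```

Storing such measurements per checkpoint is what makes speed-accuracy tradeoff curves possible.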
We designed our distributed training environment to mimic the local environment such that users can seamlessly switch between local and distributed cloud training. The first step to achieving this is to make sure the local development environment is well controlled and containerized. We then use the same container with the environment as well as Jadoo/user code for local development, distributed cloud training, as well as Continuous Integration. For distributed training, we are able to rely quite heavily on the distributed package in PyTorch. We followed the paradigm of making each GPU its own process and wrapping our model in DistributedDataParallel.
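In outline, the one-process-per-GPU paradigm looks like the following sketch (shown with a single-process gloo group so it runs on CPU; real jobs use one rank per GPU, typically with NCCL):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group for illustration; a real launcher sets rank/world_size per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 2))  # gradients are all-reduced across processes
out = model(torch.randn(4, 8))

dist.destroy_process_group()
```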
Jadoo takes care of data sharding across nodes and workers, so the user just has to create their datasets in a way that can be easily partitioned. We also found the NVIDIA NCCL backend quite performant for both training and other operations such as distributed all-reduce, scatter, and gather. Coordination and provisioning of the compute nodes are handled by our underlying infrastructure. We also control the checkpointing of the model state to allow for node preemptions and interruptions, enabling cost-saving measures like training on spot instances.
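As an example of how such sharding works in PyTorch (Jadoo’s own sharding layer is internal), DistributedSampler gives each rank a disjoint slice of the dataset:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))

# Each rank iterates a disjoint shard; together the shards cover the dataset.
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
```

In a real job, `num_replicas` and `rank` come from the process group rather than being passed explicitly.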
Inference with LibTorch and TorchScript
With Jadoo, we want to prioritize building models that can run efficiently in the AV with a C++ runtime. We found that LibTorch allows us to easily deploy trained models via TorchScript, and the C++ API makes it really easy to use. The C++ API is especially useful when one of our deployed models requires pre- or post-processing, since it mirrors the familiar PyTorch API.
One thing to note is that, although we started with PyTorch-provided LibTorch builds for the Proof-of-Concept, we found the large statically linked library difficult to manage. To alleviate this, we compile LibTorch from source with our own dependencies, linking LibTorch’s dependencies via shared libraries. By doing this, we were able to reduce LibTorch’s binary size by an order of magnitude.
To ensure users can easily deploy their trained models, Jadoo checks whether models can be converted to TorchScript during training and, if so, periodically emits checkpoints that contain the TorchScript model as well as additional metadata that allows the model to be traced to its origins. This metadata includes information such as the train run, git SHA, username, and any other metadata the user opts in to track. In addition, Jadoo automatically profiles the latency of this TorchScript model, as well as its MACs (multiply-accumulates) and parameter count.
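One way to bundle such metadata into the same archive as the model (a sketch; Jadoo’s actual schema is internal and the field names here are illustrative) is TorchScript’s _extra_files mechanism:

```python
import json
import torch

model = torch.jit.script(torch.nn.Linear(4, 2))

# Hypothetical provenance record; field names are illustrative only.
meta = {"train_run": "run-42", "git_sha": "abc123", "user": "mle"}
torch.jit.save(model, "model.pt", _extra_files={"metadata.json": json.dumps(meta)})

# Loading the archive recovers both the model and its provenance.
extra = {"metadata.json": ""}
loaded = torch.jit.load("model.pt", _extra_files=extra)
recovered = json.loads(extra["metadata.json"])
```

Keeping provenance inside the model artifact means a deployed model can always be traced back to its train run.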
When the user is ready to deploy the model, they simply need to point to the model from the train run they want and then it can run in our C++ runtime inference built on LibTorch.
We found it is best practice for users to keep TorchScript in mind as they build their models. This avoids the headache of building a complex model only to discover at deployment time that much of the model’s API needs to change due to TorchScript-incompatible syntax. We also found that unit tests are a good way to ensure models and modules are and stay TorchScript compatible. For example, one developer might change a common module used by another developer and add syntax that is not supported by TorchScript; this is caught early on in CI (Continuous Integration) and never makes it to production.
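A unit test of this kind can be as simple as asserting that scripting succeeds; a sketch (with a hypothetical shared module):

```python
import torch

def assert_torchscript_compatible(module: torch.nn.Module) -> None:
    """Fail fast in CI if a shared module stops being scriptable."""
    torch.jit.script(module)  # raises if the module uses unsupported syntax

class SharedModule(torch.nn.Module):
    """Hypothetical module shared between developers."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

assert_torchscript_compatible(SharedModule())  # unsupported syntax would raise here
```

Running this per shared module in CI catches incompatible changes before they reach a deployment attempt.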
Combined Research and Engineering
Often in engineering organizations, there are differences between how researchers and engineers operate; researchers want the freedom to explore and iterate quickly on ideas while engineers want to build well-tested and reliable products. With Jadoo, we built a paradigm where both researchers and engineers can co-develop using the same framework, allowing us to create an iteration cycle from idea-to-production at a rapid pace. We achieve this by:
Heavily optimizing the ML developer iteration cycle
- Users can kick off a job in less than 5 seconds
- Jobs that use hundreds of GPUs start in minutes
- We heavily revamped our data loading pipelines with hardware in mind, to make sure distributed jobs are never bottlenecked by I/O, CPU, or network bandwidth.
- Median job training time is ~1 hour for typical image/LiDAR-based detection models deployed on the AV (see Figure 4).
Making experiments easily reproducible and comparable
- We set up model and experiment configuration as Python code, which is all checked into git and into our experiment tracking tools, eliminating the need to manage countless model json and yaml files.
- We also utilize code introspection to track and log constants in the code to make ablations and comparing experiment inputs and results very easy.
- We automatically log and record the datasets used in the experiment, remote git SHAs, as well as local code changes to make any experiment 100% reproducible at all times. This is critical for rigorous experimentation.
Keeping a high coding standard
- We dispelled the myth that rigorous coding practices slow down the experimentation cycle by making things like documentation, linting, type checks, and unit testing as automated as possible, so they never add more than a few minutes of developer time to merge code.
Results and Looking ahead
We believe that we have made tremendous strides in our ML efforts by adopting PyTorch and building Jadoo. In just a few months of work, we have been able to move a dozen or so models from our old framework to PyTorch and deploy them with TorchScript + LibTorch. We have also been able to reduce our median job training time for heavy production jobs, such as 2D and 3D detectors and segmenters, from days to about 1 hour (see Figure 3), while allowing users to specify how much compute and time they need. This lets our users run many iterations a day for model development, which was previously impossible. We believe we’ve also built a unique framework that truly empowers our ML engineers and researchers to do more in significantly less time, allowing them to iterate quickly on ideas and deploy them to the vehicle platform, without any attempt at democratizing and simplifying machine learning.
Looking forward, although we already train on millions of data samples, our dataset sizes are growing exponentially. To sustain training on more and more data, we need to be able to scale to thousands of GPUs for a single job, with fault tolerance. For this, we are looking into technologies such as PyTorch Elastic. We also plan to expand our tooling around inference profiling and optimization, as well as model and data introspection and visualization.
We would like to thank the PyTorch Team including Michael Suo, Dmytro Dzhulgakov, Geeta Chauhan, Woo Kim, and Soumith Chintala for their assistance and helpful suggestions over the last few months.