Wrapping up my journey scaling Software 2.0 development for AV

Clement Farabet
Nov 9, 2022 · 13 min read


I spent the past 5 years of my life dedicated to figuring out how to build the right software infrastructure to enable Software 2.0 development in complex applications like Autonomous Vehicles.

It was an incredibly challenging, yet rewarding and humbling journey, and I thank all my amazing colleagues at NVIDIA for it.

While I'm exploring what's next and digesting this amazing journey, I wanted to share some of my learnings and views, so they can maybe help some of you.

Software 2.0 is eating the world

First things first, if you're not familiar with what Software 2.0 is, please check out Andrej's great original write-up on the topic. I introduced a similar framing back then, which I used to guide our development efforts, and I'll recap it here concisely.

The first diagram shows a traditional SW (1.0) development process. The second shows a Deep Learning (DL)-based SW development process — what Andrej dubbed SW 2.0.

Looking at both pictures, one can easily draw these analogies:

  • The source code of SW 1.0 becomes the dataset in SW 2.0.
  • The compiler of SW 1.0 becomes the DL training process in SW 2.0.
  • The compiled program of SW 1.0 becomes the trained model (the network and its weights) in SW 2.0.

This is the foundation of what SW 2.0 is, and this is what has guided the entire revolution we’ve been witnessing in the past 10 years.

Essentially, anywhere sufficient data is available to capture an association between two domains (e.g. images to text, text to images, text to text, video to car actuation), we've been aggressively replacing traditional software (handwritten source + compiler) with Software 2.0 (data + deep learning).

In other words, data is the source code of AI.
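
To make that contrast concrete, here's a toy, purely illustrative sketch (the task, thresholds, and model are made up): the same mapping expressed once as hand-written rules (SW 1.0) and once as a model whose behavior is "compiled" from data (SW 2.0).

```python
# Toy, hypothetical contrast: the same mapping expressed as SW 1.0 vs SW 2.0.
import torch
import torch.nn as nn

# SW 1.0: the behavior lives in hand-written source code, compiled by a compiler.
def is_stop_sign_1_0(red_ratio: float, octagon_score: float) -> bool:
    # An engineer encodes the mapping explicitly, threshold by threshold.
    return red_ratio > 0.4 and octagon_score > 0.7

# SW 2.0: the behavior lives in the dataset; training "compiles" it into weights.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    # features: [batch, 2]; labels: [batch, 1] floats in {0., 1.}
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the first case, changing the behavior means editing code; in the second, it means changing the dataset the model is trained on.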

The rate at which this is happening is astounding. Now we're essentially feeding raw recordings of surround sensor data from a vehicle, and letting DL learn the full mapping to high-level 3D/4D symbols describing scenes. Or feeding raw associations between text and imagery, and letting models learn to map a plain English prompt to a possible image rendering.

Many believe that Software 2.0, as described above, is now eating the world — just like Software [1.0] was eating the world in 2011 (original quote by Andreessen). I certainly do.

Some believe we're missing key ingredients to get to AGI, but that's a different debate. The technology powering SW 2.0 still has a lot of runway: models keep getting better as they grow in size (# parameters) and are fed more data, and a wave of very concrete innovations is already being derived from it as it is. In other words, we have a technology that's ripe to enable amazing new products, and we're already in the middle of this cycle.

SW 2.0 in AV

Now that we've set the stage for SW 2.0, I want to zoom into Autonomous Vehicles (AV), and how SW 2.0 powers their development. Why this topic? Well, this is where I've spent most of my focus in the past few years, and though I worked to support several other domains, this was the most complex, most challenging, and most cross-functional problem I've encountered in my life.

First, some respect for the problem. Enabling vehicles to drive autonomously means enabling them to build a full understanding of their environment, spatially and temporally (present and future), and then leverage that understanding to plan and actuate the vehicle (accelerate/brake and steer).

The problem is daunting for a few reasons:

  • Building this understanding of the environment would be trivial if the environment were finite and a set of sensors existed that could map it into a reliable internal representation. BUT, of course, the environment is our physical world, with all the diversity you can imagine across roads, vehicles, weather, intersections, buildings, construction sites, traffic lights, signs, etc. And there is no sensor that can reliably transform those real-world attributes into an internal representation directly: all you have is cameras, ultrasonics, lidars, radars, microphones, etc., each giving you a clue about the physical world by transforming it into a raw vectorial representation that takes tremendous work to make sense of (that's where DL comes in, but more on that next).
  • Even if the environment were more contained (finite diversity of roads, buildings, etc.), the dimensionality of the sensory input is staggering. You end up requiring 10–30 unique sensors distributed around the vehicle to get enough input and diversity to have a chance at solving the problem. The distribution of sensors varies across the industry: on one extreme, the choice to have fewer of them (e.g. cameras only) and bet on solving the problem entirely via DL; on the other extreme, the choice to have many of them (cameras, ultrasonics, radar, lidar) and fuse them to achieve the same result with, hopefully, less data, since each sensor senses different aspects of the physical world. Either way, the input vector describing the current scene is incredibly high-dimensional, rich, and "raw" compared to the representation you need to create to enable actuation.
  • Even if the sensor input were constrained, and you had a simpler way to get a representation of the surrounding environment, the dimensionality of the state space is crazy high: think of all the actors you encounter at any one time, what their intent is (is that pedestrian going to cross the road, or read their iPhone and not move?), and the distribution of possible futures. In other words, even with perfect perception, the problem of predicting all possible futures, planning through them, and ultimately actuating, is damn hard.

There are many other aspects, but I'll focus on just those for now. To recap them concisely:

  • Dimensionality of input space (how diverse the physical world is)
  • Dimensionality of sensor space (how large and diverse the sensors are)
  • Dimensionality of state space (how many actors and possible actions)

How do we go about solving this? That’s where Deep Learning (DL) comes in, and the SW 2.0 paradigm associated with it. There simply is no way to program a computer manually (SW 1.0) to fully represent the mapping between the raw input space as presented to us via the sensors we’ve picked, and the final actuations that need to occur. Plain undoable.

Instead, we rely on DL to learn the mapping between that raw sensor data and the representation we need to enable actuation. Here's how it goes (with a minimal sketch after the list):

  • Input: data acquired by a set of sensors (cameras, radar, etc.). These are typically fairly low level, raw vectors that capture attributes of the world. In the case of cameras a 2D projection of the 3D physical world, in the visible space. In the case of lidar, a 2D projection of active laser beams projected on 3D surfaces and sensed back, giving a sense of distance to elements in the world.
  • Output: a single, unified representation of the entire surroundings of the vehicle, describing each relevant object in 3D, with temporal information (direction + velocity vector, etc.)
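
For those who like to see interfaces, here is a minimal, hypothetical sketch of what that input/output contract could look like in PyTorch. The names, shapes, camera-only input, and placeholder layers are illustrative simplifications, not a real AV stack.

```python
# Purely illustrative sketch of the {input, output} interface described above.
from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class SceneObject:
    # One actor in the unified, ego-centric scene representation (the "output").
    position_m: tuple[float, float, float]  # x, y, z in the ego frame
    size_m: tuple[float, float, float]      # length, width, height
    heading_rad: float
    velocity_mps: tuple[float, float]       # planar velocity vector
    category: str                           # e.g. "car", "pedestrian", "cone"

class SurroundPerception(nn.Module):
    """Maps raw surround sensor data to a fused scene embedding (stand-in layers)."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
        self.fuse = nn.LazyLinear(embed_dim)  # placeholder for a real fusion/detection head

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: [batch, num_cameras, 3, H, W] -> fused embedding [batch, embed_dim];
        # a real model would decode this embedding into a list of SceneObject per frame.
        b, c, ch, h, w = images.shape
        feats = self.image_encoder(images.view(b * c, ch, h, w)).flatten(1)
        return self.fuse(feats).view(b, c, -1).mean(dim=1)
```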

Learning that representation is …. A challenge :). If you're used to training DL models, you'll immediately guess what's required:

  • A great dataset representing pairs of {input, output} above to start with
  • A great infrastructure to train models off of the current dataset, and test them
  • A great infrastructure to continuously refine the dataset based on the latest model
  • A great team and culture to iterate over the two points above as many times as possible :)

Many other problems have similar attributes and needs across the industry. In other robotics applications of course, but also in the medical imaging space, in recommendation systems, etc.

New infrastructure paradigm for SW 2.0

What drove me for the past decade or so was the realization that as software moves from 1.0 to 2.0, our software development process and infrastructure need to be rethought and re-architected to enable that revolution fully.

Taken from “What is MLOps”, a diagram from Nicolas Koumchatzky.

What does it mean concretely — for AV and for anybody else getting into infusing their SW with AI, and therefore having to embrace SW 2.0 practices? At a high level, it means building an infrastructure that enables the MLOps loop depicted above.

If you look at the diagram closely, you'll see that the upper part focuses on building datasets, while the lower part focuses on building models from them. These correspond to the second and third points I listed in the previous section, and I'll expand on both now:

1. A great infrastructure to train models off of the current dataset, and test them

This is arguably the most obvious one, as most of the DL community evolved from a world where datasets were fixed, typically used as benchmarks for algorithms, and training the best model off of these datasets was the top goal. So we’ve seen a lot of progress there, from open-source frameworks to cloud services. Some of the key ingredients as I see it:

  • It should be fast to apply the latest state-of-the-art model architectures to your dataset (transfer research to prod) => the training framework itself should be optimized for community size and ease of access to the latest and greatest model architectures. Contenders there are PyTorch and Jax. Generally this is a Python world, as this is the main language for that community. Don't create a world where research uses a different framework than prod… remove all barriers.
  • It should be easy to schedule many training jobs on some compute cluster. There, you'll most likely need access to a cluster of NVIDIA GPUs as the most straightforward way to accelerate your favorite framework above. Ideally it's also easy to schedule these jobs not just interactively but as part of automated training pipelines, so they can be hooked into broader automation workflows (new dataset comes in => new model gets produced and tested; see the sketch after this list). That part is not easy to build right, and if the problem is AV, the cluster will be huge. For smaller problems, cloud service providers (CSPs) can do the heavy lifting.
  • It should be easy to diff models across runs, share models, share recipes. Plenty of good companies have emerged there, from Hugging Face, to Weights and Biases. The landscape is still evolving, but avoid reinventing the wheel, and re-use so you can focus on your problem (building the best models for your dataset).
  • It should be easy and fast to ship your trained models to production, so you can test them. In the case of AV/robots: at the edge. In the case of web applications, in your backend in A/B tests. Etc. NVIDIA has a great platform for this called Triton. This will ultimately depend on your application software, and there’s no silver bullet there.
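
To ground the "automated training jobs" and "ship to prod" bullets, here is a minimal, hypothetical sketch of a job body that a scheduler could run whenever a new dataset version lands. The dataset, model, and version tag are stand-ins; ONNX export is shown as one common path into a serving stack such as Triton.

```python
# Hedged sketch of an automated training job: new dataset version in,
# deployable model artifact out. Data, model, and paths are all stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_and_export(dataset_version: str, out_path: str = "candidate.onnx") -> None:
    # Stand-in for pulling the versioned dataset (random tensors here).
    xs, ys = torch.randn(1024, 64), torch.randint(0, 10, (1024,))
    loader = DataLoader(TensorDataset(xs, ys), batch_size=64, shuffle=True)

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(3):  # a real job would track metrics, checkpoints, etc.
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Export a deployable artifact; ONNX is one common input format for serving
    # stacks such as Triton Inference Server.
    model.eval()
    torch.onnx.export(model, torch.randn(1, 64), out_path,
                      input_names=["input"], output_names=["logits"])

if __name__ == "__main__":
    train_and_export(dataset_version="dataset-v042")  # hypothetical version tag
```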

Honestly, that’s it. There’s other more subtle things, but if you take a step back, you can rephrase all this as:

  • Ensure that the greatest findings from research can flow to your prod in no time, no conversion, no framework change.
  • Ensure you can run enough parallel experiments, interactively and in automated pipelines
  • Ensure that you use good community tooling and services to help manage models, share them, etc.
  • Have a clean fast way to ship to prod to test in the target environment

2. A great infrastructure to continuously refine the dataset based on the latest model

Now … this is the non obvious one, and the one I’ve been most passionate about. I call this the data loop. Andrej/Tesla has called this their data engine. We’ve also called this our data factory. Whatever you call it, this is the single most important thing to get right to enable a proper SW 2.0 environment. Why? Again: data is the source code of AI.

This infrastructure needs to enable developers to build their datasets like they would their source code in traditional SW 1.0 programs. Nurture them, cherish them, debug them, continuously find issues, refine them, polish them…. Love them :). If you know how a SW developer cherishes their source code, how they care about formatting, elegance, simplicity… then you should expect a SW 2.0 dev to do the exact same thing with their dataset. If they don't, and datasets are an afterthought, then you have a big red flag: your org, program, and culture are probably not set up for success.

Again: data is the source code of AI.

At this point, some companies have completely internalized this and are flying years ahead of the rest.

So let’s look at this closer. How do you get there?

Two important points to internalize:

  • Unlike source code, datasets cannot be written out of thin air. They need to be collected or synthetically generated. This is the most brutal difference with SW 1.0. This is the very first thing you need to completely internalize, and not shy away from.
  • Like source code, datasets need to be versioned, compiled (DL trained on them to produce some results), analyzed, optimized, and refined over time. When an issue gets found, the dataset needs to be patched: the faulty data removed or corrected, and the missing data plugged in (new relevant data collected and/or generated).

Read these 2 points again. They are loaded. They imply that you will need to build a considerable amount of new infrastructure to enable this effectively. Weird infrastructure. SW 2.0 infrastructure. Unlike training infrastructure, there's really not much available out of the box. Companies that get it are building great infra internally to achieve this, for their own vertical problems. We're going to need to see much more done in open source or via startups to plug this hole.
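
To make the second point concrete, here is a minimal, hypothetical sketch of what versioning and "patching" a dataset could look like. Real systems also track labels, splits, provenance, and lineage; everything here is illustrative.

```python
# Minimal, hypothetical sketch of treating a dataset like source code:
# immutable versions that get "patched" (faulty samples removed, new data plugged in).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DatasetVersion:
    version: str
    sample_ids: frozenset = field(default_factory=frozenset)

def patch(base: DatasetVersion, new_version: str,
          remove: set = frozenset(), add: set = frozenset()) -> DatasetVersion:
    # Patching = drop faulty data, add newly collected/generated data, and record
    # the result as a new immutable version (so trained models stay reproducible).
    return DatasetVersion(new_version, frozenset(base.sample_ids - remove) | frozenset(add))

v1 = DatasetVersion("v1", frozenset({"clip_001", "clip_002", "clip_003"}))
# A mislabeled clip gets removed; two newly mined night-driving clips get added.
v2 = patch(v1, "v2", remove={"clip_002"}, add={"clip_104", "clip_105"})
```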

So let’s process these 2 points, and turn them into actionable ingredients you need:

  • It should be easy to acquire/collect data from the target distribution. Say you're building a robot that will actuate on factory floors, and you have a given sensor set of a few cameras and some ultrasonics: you should be able to rapidly query and obtain full sensor recordings from a fleet of such robots, in their target environment. Some companies have unfair advantages there, as they own consumer platforms/products that already sit in that target environment. If you're not there yet, you need to find proxies or get there as fast as possible.
  • It should be easy to synthetically generate data from the target distribution. If you’re limited in your ability to collect real data, then this is a must and most likely the way you can make progress until you solve your problem of acquiring real data at scale. This is an incredibly hard problem on its own, as generating synthetic data from the target distribution not only implies having a good simulator for that data (model of the world, model of the sensors) BUT also a good model of the target distribution… which often ends up requiring access to real data anyway. I think of this as complementary to real data.
  • It should be easy to produce ground truth for your data. The data you acquired above is the input part of the {input, output} pairs; the ground truth is the output, what you actually want to predict. The most basic way to approach this is to throw human labelers at the problem, and today you have great companies/offerings like Scale.ai providing such services. But this is where you need to get way more creative. This process needs to be automated as much as possible, leveraging models to pre-label. In the case of AV, there's so much you can do using larger models or other sources of data like HD maps to pre-label data and minimize the amount of manual labeling. Each domain has its own tricks to be found, to semi-automate this step as much as possible.
  • It should be easy to analyze datasets for errors or gaps. Rapidly you'll find that a core skill your org needs is finding errors in your datasets (the equivalent of bugs in source code) and fixing them (re-label, or remove), as well as finding gaps/holes, e.g. parts of the target distribution that are under-represented. This requires a blend of methods (e.g. using multiple trained models and looking for divergence of opinions, or active learning techniques; a minimal sketch of the divergence idea follows this list), as well as infrastructure that makes this easy. My team published several such techniques [here, here, and there]. This space is large and is still under-explored, possibly because it's domain-specific, company-infra-specific, and not easy for 3rd parties/academics to explore.
  • It should be easy to mine or curate data. Finally, and in support of the above, you need a way to mine data either directly at the source/edge (in the case of AV, in the fleet), or cloud-side (say you've over-collected, and want to target within that larger set). Mining cloud-side is an easier way to get started, and in a way many web companies do that as well: log everything to start with, build a large data lake, and then have the machinery to query it later (Spark, Presto, etc.). But ultimately the ability to query at the edge/source is key, and I believe all AI applications will end up built that way, for scalability reasons and for privacy reasons.
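
As a small illustration of the "analyze datasets for errors or gaps" ingredient, here is a hedged sketch of the model-divergence idea mentioned above. It is purely illustrative: real pipelines would work on detections, tracks, and much richer signals than per-frame class logits.

```python
# Hedged sketch: flag the frames where an ensemble of trained models disagrees the
# most, and send those for (re-)labeling first. Shapes and names are illustrative.
import torch

def disagreement_score(logits_per_model: list) -> torch.Tensor:
    # Mean per-class variance of the ensemble's probabilities, per sample.
    probs = torch.stack([l.softmax(dim=-1) for l in logits_per_model])  # [M, N, C]
    return probs.var(dim=0).mean(dim=-1)                                # [N]

def select_for_labeling(logits_per_model: list, frame_ids: list, budget: int) -> list:
    # Higher score = models disagree more = more likely a labeling error or a gap.
    scores = disagreement_score(logits_per_model)
    top = torch.topk(scores, k=min(budget, len(frame_ids))).indices.tolist()
    return [frame_ids[i] for i in top]

# Hypothetical usage: 3 models, 5 frames, 4 classes.
logits = [torch.randn(5, 4) for _ in range(3)]
frames = [f"frame_{i}" for i in range(5)]
print(select_for_labeling(logits, frames, budget=2))
```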

There's more to it than that, but these are the pillars I see that have to get done right for anything else to work well.

What’s next

So is the problem solved? Are we as a community building everything that’s required to enable SW 2.0 to be developed effectively and by anyone across industries?

Certainly not. I believe we have found the right pillars, the right way to frame SW 2.0 development, the right vocabulary (MLOps as an analogy to SW1.0 DevOps), and some companies are way ahead in enabling this type of SW 2.0 development loop, process and infrastructure. These need to mature up, and then get opened up so more can benefit. We will get there in the next 5–10y.

As for the method itself, SW 2.0 is ripe, works, and if you deploy all the right ingredients, you can solve amazingly complex problems (AV, robotics, chat bots, etc.) today.

Now, having said that, I believe the next 5 years will yield new ingredients getting us closer to AGI: agents capable of leveraging big models (transformers trained on large data), as well as memory and the ability to offload, retain, and organize thoughts, upping the game one more notch. This will ultimately simplify the development of such applications further, and SW 2.0 will eat even more of the application logic.

I’m excited about that, and I’m excited about seeing the current approach fully mature up.

It’s an amazing time to be working on AI and its applications. To another great 5y!
