The End-to-End False Dichotomy

Roboticists arguing Lego vs. Playmobil

Vincent Vanhoucke
Predict · Oct 28, 2018


There is a strange controversy that has been the undercurrent of much discussion in robotics circles over the past few years: the end-to-end debate.

In one camp, the modular perspective: the robotic perception-control loop should be composed of a cascade of independent blocks, from low-level perception to planning and finally control. Many traditional robotic systems are architected this way, and that approach has vocal advocates. For many, the advent of deep learning has done little to change that picture: convolutional networks are simply incorporated as low-level perceptual feature extractors early in the pipeline.

In the other camp, the end-to-end perspective: the entire perception-control loop should be treated as a single system and optimized jointly. Much of the recent work on deep reinforcement learning falls into this category, and I myself have, on occasion, been labeled as a poster child for that approach.

The most interesting part of that controversy is how natural a dichotomy this way of slicing the problem space appears to be to many people in the field. I often get asked where I land on this ‘end-to-end business’, as if the distinction were so self-evident that anyone should obviously land on one side or the other of the debate. I am amusingly reminded of the old LEGO (modular) vs. Playmobil (end-to-end) debates of my youth, though there is no room for controversy there: I am forever a LEGO kind of guy.

I’ve always had a major issue with the end-to-end debate itself, because it is fundamentally ill-posed. Notice that the modular perspective focuses on the structure of the system: it makes a statement about system architecture, and how things should be organized. The end-to-end perspective, on the other hand, makes a statement about optimization, and how the system should be trained. Looking at the debate through this lens, these two supposedly opposing views actually talk about very different aspects of the problem, while treating them as if they were antithetical.


In my view, nothing could be further from the truth: you can, in fact, have a modular architecture optimized end-to-end. The difficulty is that both perspectives come with a lot of baggage that colors the argument. A modular system typically contains many hand-engineered, non-differentiable modules; these prevent any end-to-end optimization from happening because gradients cannot propagate between modules. On the flip side, end-to-end approaches often start from a simple monolithic neural network, only reluctantly injecting some degree of modularity into the architecture when it helps provide a more task-appropriate inductive bias.

I’d argue that modularity and end-to-end differentiability are both going to be essential to succeed in robotics. Most complex machine learning systems that go beyond simple perception end up looking like modular differentiable pipelines today. In NLP in particular, you often see a feature extraction component (e.g. Word2Vec), a sentence-level encoder, sometimes attached to an attention mechanism or to differentiable memory units, and a distinct decoder which projects the embedded representation back into a parseable form, for instance a translation, a semantic interpretation, or an agent’s action. These systems are fundamentally modular, and the modules themselves can be grounded into semantics at every level: you can look at the attention mask, interpret word embeddings geometrically, and inspect the memory. But they don’t prevent gradients from being propagated across modules, which is the key to the success of this general class of ‘differentiable programs.’ In robotics, we are starting to see differentiable planners being incorporated into end-to-end pipelines, as well as differentiable particle filters and differentiable mappers. And if your favorite piece of software remains stubbornly non-differentiable, there may be hope still.
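To make the point concrete, here is a minimal sketch (mine, not from the original post) of what such a modular differentiable pipeline might look like in PyTorch: each stage is a separate, inspectable module (embedder, encoder, attention, decoder), yet a single backward pass propagates gradients across every module boundary. All module names, dimensions, and the toy data are illustrative assumptions.

```python
# Minimal sketch (illustrative): a modular pipeline whose stages are separate,
# inspectable components, yet trained end-to-end because gradients flow across
# every module boundary during one backward pass.
import torch
import torch.nn as nn

class Embedder(nn.Module):                      # Word2Vec-like feature extraction
    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, tokens):
        return self.emb(tokens)                 # (batch, seq, dim)

class Encoder(nn.Module):                       # sentence-level encoder
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, x):
        out, _ = self.rnn(x)
        return out                              # (batch, seq, dim)

class AttentionPool(nn.Module):                 # attention mask you can inspect
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, h):
        weights = torch.softmax(self.score(h), dim=1)    # (batch, seq, 1)
        return (weights * h).sum(dim=1), weights         # pooled rep + mask

class Decoder(nn.Module):                       # projects back to a parseable form
    def __init__(self, dim, n_actions):
        super().__init__()
        self.out = nn.Linear(dim, n_actions)
    def forward(self, z):
        return self.out(z)

# Wire the modules together; one loss back-propagates through all of them.
embedder, encoder = Embedder(1000, 64), Encoder(64)
attend, decoder = AttentionPool(64), Decoder(64, 10)
params = (list(embedder.parameters()) + list(encoder.parameters()) +
          list(attend.parameters()) + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

tokens = torch.randint(0, 1000, (8, 12))        # toy batch of token ids
target = torch.randint(0, 10, (8,))             # toy action labels
pooled, mask = attend(encoder(embedder(tokens)))
loss = nn.functional.cross_entropy(decoder(pooled), target)
loss.backward()                                 # gradients cross every module boundary
opt.step()
# 'mask' remains available for inspection, module by module.
```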

Why insist on end-to-end differentiability? After all, if you could build a very good general-purpose perception system, you wouldn’t have to optimize it jointly with the rest of your robot software stack, so why not start there? At the risk of stretching the analogy way too far, it’s a bit like wishing to play LEGO with brand-name pieces, when life only throws ill-fitting knockoffs at you: it’s the subtle variability in tolerances that kills you. Here’s a story I like to tell from my days as a speech recognition researcher, because it illustrates what happens when separate modules of a pipeline are not perfectly calibrated with respect to each other. We had a very classic speech recognition system at the time, with an acoustic model, a pronunciation model, and a language model. Our colleagues from Switzerland came with a complaint: they couldn’t get Voice Search to recognize the word “Zürich”. And indeed, none of us could. We started digging: the pronunciation was there, there was a model for the syllable “zür”, we had lots of training data for it, all looked well, but the probabilities when someone said “Zür-ich” were completely off. As it turns out, there were exactly two words in our English dictionary with the phonetic syllable “zür”: “Zürich”, and “Missouri” (“Mizüri”). Yes, “Missouri” mostly sounds like “Zürich”, except if you actually are from Missouri!

What happened is that all the southern speakers saying Missouri ‘the southern way’ had completely corrupted the model for Zürich, at the expense of our Swiss colleagues. The lesson? You may think that the various modules of your system, each of which has clear input and output semantics, are independent: here, the acoustic and pronunciation models. In practice, however, these boundaries are always violated by real data: how you design your pronunciation model is affected by how you use it in conjunction with your acoustics. More generally: presuming that every other module in your stack of cards will do the right thing irrespective of how you define your own module’s semantics is hopeless. They have to be co-designed and co-optimized.

This is particularly important when trying to propagate confidence measures through your pipeline. The naive independence assumption would suggest that all you have to do is multiply probabilities together to get the final confidence, but in practice errors are strongly correlated across semantic layers, and you can’t calibrate one module without understanding the sensitivity of all the others in your chain. That sensitivity analysis is the essence of the math behind back-propagation, and why it works so well.
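As a toy illustration of why the independence assumption breaks down (my example, not the post’s): imagine two pipeline stages that each report 90% accuracy, but fail on the same hard inputs. Multiplying their marginal confidences then misestimates the pipeline’s true reliability. The numbers and setup below are invented purely for illustration.

```python
# Toy simulation (illustrative only): two stages each succeed ~90% of the
# time, but their errors are correlated because both break on the same hard
# inputs. The naive product of per-stage confidences misstates the true
# end-to-end reliability.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

hard = rng.random(n) < 0.2                         # 20% of inputs are "hard"
shared_failure = hard & (rng.random(n) < 0.5)      # both stages fail on the same cases

stage_a_ok = ~shared_failure                       # marginal accuracy ~0.90
stage_b_ok = ~shared_failure                       # marginal accuracy ~0.90

naive = stage_a_ok.mean() * stage_b_ok.mean()      # independence assumption: ~0.81
true_joint = (stage_a_ok & stage_b_ok).mean()      # actual pipeline accuracy: ~0.90

print(f"naive product of confidences: {naive:.3f}")
print(f"true end-to-end accuracy:     {true_joint:.3f}")
```

In this contrived setup the correlated failures make the naive product too pessimistic; with differently correlated errors it can just as easily be too optimistic. Either way, only a model of the joint behavior gives a calibrated answer, and end-to-end training lets downstream modules learn that joint behavior implicitly.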

For a great position paper on the pitfalls of stacking machine-learned systems on top of each other without regard for calibration between them, I highly recommend reading ‘Machine Learning: The High Interest Credit Card of Technical Debt’ by some of my colleagues. The best way to surface these interactions is to co-train your entire system end-to-end, such that any module can compensate for the limitations of another, and calibrate its own expectations about how well every other part of the pipeline behaves in any particular scenario.

Modularity has a lot going for it beyond the mere aesthetic appeal of building systems that are functionally decomposable. One important benefit is that it enables different sources of data and supervision to be plugged into the system at various levels. There is a nice paper currently under review at ICLR which exemplifies this benefit. It takes a modular, yet learned approach to the self-driving problem, and enables some parts of the training and evaluation data to be synthesized, for example to ensure coverage of unseen corner cases, while other parts come from real data that matches the ultimate use case. Another aspect also illustrated in that paper is that when one can ground intermediate representations into something that makes sense physically (in this particular instance: top-down views of the environment and vehicle trajectories), these representations make inspecting the system’s behaviors much easier. An interesting challenge for the community is to enable the combination of these explicit representations with end-to-end optimization, similar to what’s done here for instance.
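Here is a hypothetical sketch (not the ICLR paper’s actual architecture) of how a grounded intermediate representation lets supervision be plugged in at more than one level: a perception module predicts a coarse top-down grid that can be supervised directly, for example from synthetic scenes, while the full stack is still optimized end-to-end on the final control target. All module names, shapes, and data are made up for illustration.

```python
# Hypothetical sketch: a modular driving-style pipeline whose intermediate
# representation (a coarse top-down occupancy grid) is physically grounded, so
# it can receive its own supervision (e.g. from a simulator) while the whole
# stack is still optimized end-to-end on the final control target.
import torch
import torch.nn as nn

class PerceptionToTopDown(nn.Module):
    """Maps a camera image to a coarse top-down grid (illustrative shapes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, image):
        return self.net(image)                    # (batch, 1, H/4, W/4)

class TopDownToControl(nn.Module):
    """Maps the top-down grid to a steering/throttle command."""
    def __init__(self, grid_cells):
        super().__init__()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(grid_cells, 2))
    def forward(self, grid):
        return self.head(grid)

perception = PerceptionToTopDown()
control = TopDownToControl(grid_cells=16 * 16)
opt = torch.optim.Adam(list(perception.parameters()) + list(control.parameters()))

image = torch.randn(4, 3, 64, 64)                 # toy camera batch
synthetic_grid = torch.rand(4, 1, 16, 16)         # label from a simulator
expert_command = torch.randn(4, 2)                # label from real driving logs

grid = perception(image)
command = control(grid)

# Supervision plugged in at two levels: the grounded intermediate map and the
# final command. Gradients from both losses flow through the shared modules.
loss = (nn.functional.binary_cross_entropy(grid, synthetic_grid)
        + nn.functional.mse_loss(command, expert_command))
loss.backward()
opt.step()
```

The design choice being illustrated is simply that the grid is both a trainable bottleneck and an inspectable artifact: you can render it, compare it to ground truth, and still let the end task reshape it through the joint loss.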

I am certain that this debate hasn’t run its course yet, though I think (and hope) we will look back at it in a few years and merely shrug. It mirrors very similar debates I’ve seen over the years both in speech recognition and computer vision about the right way to incorporate prior information, when, all of a sudden, the vastly better and yet ever-so-unsatisfactory answer became to simply throw a neural net at the problem, to many people’s disappointment. Today, these questions have largely shifted to questions about better representation learning within the general framework of differentiable programming.

