Shifting Winds in Robot Learning Research
Reinforcement Learning is dead, long live Reinforcement Learning!
When I tell people in tech circles that I work on machine learning for robotics, it’s not uncommon for their immediate reaction to be ‘oh, reinforcement learning …’ I used to not think twice about that characterization. After all, a large fraction of the successes we’ve seen in the past few years were about framing robot manipulation as large-scale learning, turning the problem into a reinforcement learning self-improvement loop, scaling up that flywheel massively, learning lots of lessons along the way, and voilà! In the immortal words of Ilya Sutskever: ‘Success is guaranteed.’
Except … it’s complicated. Reinforcement learning (RL) is arguably a difficult beast to tame. This leads to an interesting research dynamic: if your primary goal is not the learning loop itself but, say, the representation or model architecture, supervised learning is just massively easier to work with. As a result, many research threads focus on supervised learning (a.k.a. behavior cloning, or BC, in robotics lingo) and leave RL as an exercise for the reader. Even in places where RL ought to shine, variations on random search and black-box methods give ‘classic’ RL algorithms a run for their money.
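To make that black-box aside concrete, here is a rough sketch of the kind of random-search policy optimization I have in mind, loosely in the spirit of Augmented Random Search: perturb the policy parameters, score each perturbation by its episode return, and nudge the parameters along the directions that helped. The toy rollout environment, the linear policy, and every hyperparameter below are illustrative stand-ins, not anything from a real system.

```python
import numpy as np

def rollout(theta, horizon=100, seed=0):
    """Toy stand-in for an episode: total reward of a linear policy
    a = theta @ s on a made-up dynamical system (reward: stay near 0)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(size=theta.shape[1])
    total = 0.0
    for _ in range(horizon):
        a = theta @ s                                   # linear policy
        s = 0.9 * s + 0.1 * a + 0.01 * rng.normal(size=s.shape)
        total += -float(s @ s)
    return total

def random_search(theta, iters=200, n_dirs=8, sigma=0.05, lr=0.02):
    """Basic random-search update: probe +/- parameter perturbations and
    step along the directions that improved the episode return."""
    for it in range(iters):
        deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]
        returns, step = [], np.zeros_like(theta)
        for d in deltas:
            r_plus = rollout(theta + sigma * d, seed=it)
            r_minus = rollout(theta - sigma * d, seed=it)
            returns += [r_plus, r_minus]
            step += (r_plus - r_minus) * d
        # Normalize by the spread of returns, as in Augmented Random Search.
        theta = theta + lr * step / (n_dirs * np.std(returns) + 1e-8)
    return theta

theta = random_search(np.zeros((4, 4)))    # 4-dim state, 4-dim action
print("final return:", rollout(theta))
```

No value functions, no policy gradients, no replay buffers: just perturb, evaluate, and keep what works, which is part of why these methods are so easy to get running in practice.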
And then … BC methods started to get good. Really good. So good that our best manipulation system today mostly uses BC, with a sprinkle of Q-learning on top to perform high-level action selection. Today, less than 20% of our research investment is in RL, and the research runway for BC-based methods feels more robust. Are the days when robot learning research was almost synonymous with RL over?
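One way to picture that sprinkle of Q-learning: a BC-trained policy proposes a handful of candidate high-level actions, and a learned Q-function picks among them. The sketch below is purely illustrative; bc_propose and q_value are toy stand-ins for the real learned models, not a description of any particular system.

```python
import numpy as np

# Toy stand-ins for the learned components: in a real system, bc_propose
# would be a BC-trained policy sampling candidate skills, and q_value a
# Q-function trained on logged robot experience.
def bc_propose(observation, num_candidates=5):
    """Sample several candidate high-level actions (skill embeddings)."""
    rng = np.random.default_rng(0)
    return [rng.normal(size=8) for _ in range(num_candidates)]

def q_value(observation, action):
    """Score how promising a candidate action looks in this state."""
    return float(-np.linalg.norm(action - 0.5))         # toy scoring rule

def select_high_level_action(observation):
    """BC proposes, Q-learning disposes: keep the highest-value candidate."""
    candidates = bc_propose(observation)
    scores = [q_value(observation, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

obs = np.zeros(16)                                       # placeholder observation
chosen_skill = select_high_level_action(obs)
```

The appeal of this split is that the hard perception-to-action mapping is learned with plain supervision, while RL is only asked to do the comparatively easier job of ranking a few candidates.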
As tempting as it sounds, I believe calling it quits today would be extremely problematic. The main promise of RL is autonomous exploration: scaling with experience, without any human babysitting. This has two major consequences: the opportunity to perform a lot of experience gathering in simulation, and the possibility of autonomous data collection in the real world.
With RL, you get a robot learning process that requires a fixed investment in simulation infrastructure and then scales with the number of CPUs and field-deployed robots: a great regime to be in if you have access to lots of compute. But in a BC-centric world, we end up instead in the worst local optimum from a scalability standpoint: we still need to invest in simulation, if only to perform quick experimentation and model selection, but when it comes to experience gathering we can essentially only scale with the number of humans controlling robots in a supervised environment. And once you deploy the robots autonomously, not only are human-inspired behaviors your ceiling, but closing the loop on exploration and continuous learning becomes exceedingly difficult. Sergey Levine speaks eloquently of the long-term opportunity cost that this represents here.
But the appeal of BC is hard to resist: betting against large-scale models is rarely a good idea, and if those models demand supervision instead of reinforcement, then who are we to argue? The ‘giant language model’ revolution ought to give anyone pause about focusing on devising complex training loops instead of diving head-first into the problem of collecting massive amounts of data. It’s also not impossible to imagine that, once we’ve come to terms with the large fixed cost of supervising robots, we can get them all the way to a ‘good enough’ level of performance to succeed; that is, after all, the self-driving car industry’s entire strategy. Nor is it impossible to imagine that, once we’ve found more scalable ways to unleash self-supervised learning in a real-world robotic setting, the cherry on Yann’s cake starts tasting a bit more sour.
I am not the only one to notice the changing winds in RL research. Many people in the field have set their sights on offline RL as the way to break through the autonomous data-collection ceiling. Some of the recent focus has been to make BC and RL play nice with each other, to bring scalable exploration to the supervised setting, or to make RL pretend it’s a supervised sequential decision problem in order to preserve the desirable scaling properties of large Transformers. It is a refreshing departure from the steady stream of MuJoCo studies with error bars so big they barely fit on the page (hah!). I expect a lot more tangible expressions of this healthy soul-searching process to come out in the coming months, and hopefully new insights will emerge on how to best navigate the tension between the near-term rewards of BC and the longer-term promise of RL.
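For readers who haven’t followed that last thread: ‘making RL pretend it’s a supervised sequential decision problem’ refers to return-conditioned sequence modeling in the spirit of Decision Transformer. The sketch below strips the idea to its bare core, swapping the Transformer for a small MLP and training on random data; it is meant to illustrate the objective, not to reproduce any published implementation.

```python
import torch
import torch.nn as nn

# Return-conditioned behavior cloning in miniature: instead of maximizing
# reward directly, fit a model that predicts the action taken, conditioned
# on the state and the return-to-go achieved afterwards. At test time,
# conditioning on a high target return asks the model for "good" actions.
class ReturnConditionedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, return_to_go):
        return self.net(torch.cat([state, return_to_go], dim=-1))

# Toy offline dataset: (state, action, return-to-go) triples, random here.
states = torch.randn(1024, 10)
actions = torch.randn(1024, 4)
returns_to_go = torch.randn(1024, 1)

policy = ReturnConditionedPolicy(state_dim=10, action_dim=4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(100):                       # plain supervised regression loop
    pred = policy(states, returns_to_go)
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment: condition on an optimistic return to elicit high-reward behavior.
test_state = torch.randn(1, 10)
target_return = torch.full((1, 1), 10.0)
best_guess_action = policy(test_state, target_return)
```

The whole pipeline is ordinary supervised learning, which is precisely the point: it inherits the tooling, stability, and scaling behavior of large sequence models while still being steerable toward high returns.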
With thanks to Karol Hausman for his feedback on drafts of this post. Opinions are all mine.

