How Robots Can Learn End-to-End From Data

Sergey Levine
15 min read · Apr 29, 2021


The year is 2030, and you’ve purchased a brand-new general-purpose home robot. The robot comes equipped with a few skills that it knows how to perform, but everyone’s demands will be unique. Today, you would like it to clean your bathroom. You turn on the robot, and leave for work. When you come home in the evening, the bathroom is not as clean as you might like. While you were away, the robot tried a variety of different approaches. First, that oddly-shaped sink nestled awkwardly in the corner of your apartment’s tiny bathroom gave it some trouble, and then it spilled a bucket of soapy water on the floor and spent the next two hours mopping it up. With some robotic determination, it got things into a state that was just a little bit better than when it started. The next day, it will try again, and it will get further, and the day after, it will master this task, cleaning your entire bathroom in under an hour with well-honed robotic precision. What kind of algorithms would make this possible, and what are the technologies being developed today that will lead to such capabilities in the future?

Flexible and adaptable robotic systems will have a far-reaching impact, beyond just cleaning your bathroom. Robots that can perform a wide range of tasks can supplement human labor in dangerous industrial environments, such as mining and construction, provide assistance with everyday tasks to the elderly and persons with disabilities, and assist with disaster relief. Such technology might change our society and economy in complex ways, some quite unpredictable. It will also teach us a thing or two about our own intelligence: an adaptable and flexible agent that can rapidly master various tasks in the real world would need to capture at least a tiny grain of the mental agility possessed by humans and animals. The focus of this article will be on recent technical approaches that could move us in this direction.

Data and Reinforcement

The biggest challenge with “open-world” robotic systems of the sort described above lies in the variability and complexity of the real world: we’ve known how to build robots that move with remarkable precision for decades; the hard part is when the robot unexpectedly spills soapy water on the floor and has to clean it up. People possess common sense, making us resourceful and flexible. Robots do not. One of the most powerful tools we have today for intelligently handling the complexity of the real world is machine learning: learning algorithms equipped with large datasets and high-capacity deep network models can classify never-before-seen images, translate text, and recognize human speech. But these are all passive recognition tasks, while our robots — and, indeed, any rational agent that acts intelligently in its environment — need active behavioral skills. Learning behavioral skills is the purview of reinforcement learning: algorithms that acquire behaviors through trial and error. Such algorithms do not need to be told how to do something — they figure it out on their own from reinforcement (i.e., reward and punishment, much like how you might train a dog). They can master a wide range of tasks in simulated environments, such as video games, but can struggle to generalize to the kinds of open-world settings that our robots will encounter. The problem is that we need both: the broad generalization that comes from training on large datasets, as in the case of the recognition models used to tag photos, translate text, and recognize speech, and the ability to acquire behaviors through trial-and-error reinforcement — and current methods rarely deliver both at once.

Over the past few years, a significant development in reinforcement learning has been the advent of effective offline reinforcement learning (offline RL) methods. While the basic principle behind such methods has been known for decades, their effectiveness has improved considerably over the past three years, making them a viable tool for real-world robotic learning. In its basic form, offline RL learns skills from data. Provide the algorithm with logs of transactions, costs, and profits for an inventory management task, and it will optimize for an inventory management policy that will maximize revenue. Provide the algorithm with logs of robotic interaction, and it will optimize for a better controller to solve the task more effectively. Of course, such algorithms can only learn skills based on what is present in the data: if the robot has no prior data from interacting with your particular bathroom, even the best offline RL method will, at best, only be able to make educated guesses. So what do we need to make the kind of on-the-job learning that I described above possible? I believe we need at least the following ingredients:

  1. The ability to learn from diverse prior data: Reinforcement learning algorithms that can use broad and diverse datasets collected before — in other places, for other tasks, and possibly even by other robots or by people — to provide an initial starting point, generalizable capability, and a kind of robotic common sense.
  2. The ability to rapidly master new tasks and environments: Robots must be able to efficiently explore new environments and tasks, so as to master a specific behavior in a specific context, while leveraging the prior experience — much like how the home robot masters the task of cleaning your bathroom after the first two days.
  3. The ability to do all this autonomously in real-world settings, where unexpected events, such as spilling water on the floor, necessitate a change of plans: instead of trying the same task repeatedly, a robot that can learn in real-world settings would flexibly deploy a variety of skills depending on the demands of the current situation, sometimes in service of the need to try again (mopping up the spill is good not only because it makes the floor clean, but because the robot can then make another attempt at the original task it was trying to do).
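
To make the offline RL idea above a bit more concrete, here is a minimal sketch of the basic pattern: fit a value function to a fixed dataset of previously logged transitions, with no further interaction. This is not any particular published algorithm, and the dataset format, network sizes, and discrete-action assumption are all mine for illustration.

```python
# Minimal sketch: fitted Q-iteration on a fixed, previously collected dataset.
# The dataset format and architecture are illustrative, not from any specific paper.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)  # Q-values for every discrete action

def offline_q_iteration(dataset, obs_dim, num_actions, steps=10_000, gamma=0.99):
    """dataset: dict of tensors with keys obs, action, reward, next_obs, done."""
    q_net = QNetwork(obs_dim, num_actions)
    target_net = QNetwork(obs_dim, num_actions)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)

    num_transitions = dataset["obs"].shape[0]
    for step in range(steps):
        idx = torch.randint(0, num_transitions, (256,))   # sample a minibatch of logged transitions
        obs, act = dataset["obs"][idx], dataset["action"][idx]
        rew, nxt, done = dataset["reward"][idx], dataset["next_obs"][idx], dataset["done"][idx]

        with torch.no_grad():                              # Bellman target computed from the fixed data
            target = rew + gamma * (1.0 - done) * target_net(nxt).max(dim=1).values
        q = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        loss = ((q - target) ** 2).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % 500 == 0:                                # periodically sync the target network
            target_net.load_state_dict(q_net.state_dict())
    return q_net                                           # act by taking the argmax over the Q-values
```

One well-known caveat: with purely offline data, a naive Q-learning update like this tends to overestimate values for actions the dataset never contains, which is exactly the problem that modern offline RL algorithms are designed to address.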

Next, I will discuss a few case studies from our research at UC Berkeley and Google that touch on some of these points, and provide some early indications for the kinds of machine learning methods that might enable these three capabilities in the future.

A Robotic Delivery Driver

Two recent papers authored by students & post-docs in our group at UC Berkeley (Dhruv Shah, Greg Kahn, and Nick Rhinehart from Berkeley, as well as Ben Eysenbach from CMU) examine some of these questions in the domain of robotic navigation, in the context of two robotic navigation systems: ViNG and RECON. While this setting does not require cleaning any spills in a bathroom, it provides a way to isolate a subset of the challenges that a real-world robotic system might encounter: open-world navigation requires generalizing to never-before-seen locales, handling unexpected events (such as new obstructions), and quickly learning how to navigate in new environments while leveraging prior knowledge about navigational affordances. A demo of the ViNG system delivering mail, based only on on-board camera streams and photographs of target front doors, is shown below.

ViNG delivering mail, using photographs of customer front doors as goals.

Our approach to challenge (1) in these works is to utilize prior data to learn general navigational skills that transfer across environments, while addressing (2) by dynamically constructing a short-term memory in each new environment. This can be instantiated in a simple algorithm through a combination of techniques inspired by offline RL and nonparametric models.

The particular model we employ solves a very general task: predicting the distance between the current camera observation and a goal observation, in terms of the number of time steps needed to transit between them, as well as the action the robot should take at the current time to eventually reach the goal. In RL parlance, this corresponds to an actor (the prediction of the action to take to reach the goal) and a critic (the prediction of how long the actor would take to get there). A schematic of this model is shown below.

The ViNG architecture, at a glance.
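
To give a rough sense of what such a model might look like in code, here is a simplified sketch with a shared image encoder and two output heads; the encoder, input sizes, and action parameterization are my own simplifying assumptions rather than the exact ViNG architecture.

```python
# Rough sketch of a goal-conditioned distance/action ("critic + actor") model.
# The encoder and output parameterization are simplified assumptions, not the exact ViNG model.
import torch
import torch.nn as nn

class GoalConditionedModel(nn.Module):
    def __init__(self, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared image encoder for current and goal views
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.distance_head = nn.Linear(2 * 64, 1)          # "critic": predicted time steps to reach the goal
        self.action_head = nn.Linear(2 * 64, action_dim)   # "actor": action to take now (e.g. linear/angular velocity)

    def forward(self, obs_image, goal_image):
        z = torch.cat([self.encoder(obs_image), self.encoder(goal_image)], dim=-1)
        return self.distance_head(z).squeeze(-1), self.action_head(z)

# Both heads can be trained purely from logged trajectories: for any two observations
# i < j on the same trajectory, the distance label is j - i and the action label is
# the action the robot actually took at step i.
model = GoalConditionedModel()
dist, act = model(torch.randn(1, 3, 96, 96), torch.randn(1, 3, 96, 96))
```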

The advantage of picking such a general task is that it can be trained on all data collected for any navigational task. Indeed, we were able to train this model using data collected for an entirely different project in spring 2020, and still use the model to navigate to goals through the fall and into the winter of 2021. To me, this illustrates the promise of leveraging broad and diverse datasets for RL (i.e., ingredient (1)): if we design algorithms that can use large previously collected datasets, they can learn to generalize to a wide variety of environments, without requiring either hand-designed simulators or costly and time-consuming data collection for every experiment. In the case of this navigational system, we essentially maintained a single dataset containing all driving data collected from this robot, and reused this same dataset in every experiment. Our final model was trained on a total of about 40 hours of data, including off-road navigation, driving through office parks, parking lots, and other scenes. Some examples of the environments in the training set, compared to unseen environments we use for testing, are shown below:

We can train from diverse offline data (top) and test in new environments (bottom)

Of course, even a person would not know what to do if placed in an entirely new environment and then told to reach some goal. After all, in this new environment they don’t know where things are! This is essentially challenge (2): how to leverage prior knowledge to quickly adapt a skill to a new setting. Much like the household robot in the introduction, which figures out how to clean your particular bathroom, the RECON system uses the distance/action model described above to quickly explore a new environment, construct a sort of “mental map,” and then use this mental map to quickly reach user-specified goals, which are provided as before in the form of goal images. This “mental map” is represented as a graph, where nodes correspond to landmarks that the robot has seen, and edge costs correspond to the distances between these landmarks, as predicted by the learned model. In effect, the model learned from large amounts of prior data allows the robot to understand the navigational relationships between landmarks in the new environment. An illustration of such a graph is shown below (the satellite image is not actually available to the robot, but only provided for visualization).

Illustration of the topological graph constructed by ViNG from image observations. The satellite view is only shown for visualization, and is not available to the robot — the graph is not geometric, but rather encodes connectivity between first-person observations.
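
One way to picture the resulting “mental map” is as an ordinary weighted graph over stored observations, with an edge added wherever the learned model predicts a short traversal time; planning is then just a shortest-path search over those predicted distances. The sketch below is schematic: the predict_distance helper and the edge threshold are hypothetical placeholders.

```python
# Minimal sketch of a topological "mental map": nodes are stored observations (landmarks),
# edge weights are traversal times predicted by the learned distance model.
# `predict_distance` and the threshold are hypothetical placeholders.
import networkx as nx

def build_mental_map(landmarks, predict_distance, max_edge_distance=20.0):
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(landmarks)))
    for i, obs_i in enumerate(landmarks):
        for j, obs_j in enumerate(landmarks):
            if i == j:
                continue
            d = predict_distance(obs_i, obs_j)        # learned prediction of time steps from i to j
            if d < max_edge_distance:                 # only connect landmarks the model believes are reachable
                graph.add_edge(i, j, weight=d)
    return graph

def plan(graph, start, goal):
    # Shortest path under the learned distances; each consecutive pair of landmarks
    # then becomes an intermediate goal image handed to the low-level goal-reaching policy.
    return nx.shortest_path(graph, source=start, target=goal, weight="weight")
```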

To actually explore a new environment rapidly, RECON makes one modification to the distance/action model: the goal is bottlenecked through a stochastic latent variable (using a variant of the variational information bottleneck), which allows it to be sampled from a prior distribution while exploring.

The RECON architecture, with latent goal variable.
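
As a sketch of that idea (with the latent dimension and encoder structure as illustrative assumptions, not the exact RECON model), the goal features are compressed into a stochastic latent vector that can either be inferred from a real goal image or, when no goal is known, simply sampled from the prior:

```python
# Sketch of goal compression through a stochastic latent variable, so that during
# exploration a "goal" can simply be sampled from the prior.
# Encoder sizes and the latent dimension are illustrative assumptions.
import torch
import torch.nn as nn

class LatentGoalEncoder(nn.Module):
    def __init__(self, feature_dim=64, latent_dim=16):
        super().__init__()
        self.mean = nn.Linear(feature_dim, latent_dim)
        self.log_std = nn.Linear(feature_dim, latent_dim)
        self.latent_dim = latent_dim

    def encode(self, goal_features):
        # Information bottleneck: sample the latent from a Gaussian around the encoded goal.
        mean, log_std = self.mean(goal_features), self.log_std(goal_features)
        return mean + torch.randn_like(mean) * log_std.exp()

    def sample_prior(self, batch_size=1):
        # No goal available: "imagine" one by sampling the latent from the unit Gaussian prior.
        return torch.randn(batch_size, self.latent_dim)
```

During training, a KL penalty toward the unit Gaussian keeps prior samples meaningful, and the distance and action heads condition on this latent rather than on the raw goal image.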

This effectively means that, when placed in a new environment, if RECON does not know where the goal is, it “imagines” a random goal that it can drive towards to explore, until it believes it can reach the target goal image. This allows RECON to “search” for the goal in an unknown environment, all the while building up its mental map. This process is illustrated in the animation below, where “Run 1” corresponds to searching for the goal in a never-before-seen environment, and “Run 2” corresponds to using the resulting graph (mental map) to rapidly navigate to the goal.

On “Run 1,” RECON explores a new environment, building up its “mental map.” On “Run 2” it uses this mental map to quickly navigate to a user-specified goal.

An illustration of this exploration process from an overhead view is shown below:

(Left) The goal specified by the user. (Right) The path RECON takes when exploring for the first time (cyan), and the path it takes when then revisiting the goal using the mental map (red).
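
Putting these pieces together, the search procedure can be sketched as a simple loop; every helper below (get_observation, predict_distance, drive_toward, and so on) is a hypothetical stand-in for the learned model or the robot’s low-level controller, not an actual RECON API.

```python
# High-level sketch of goal-directed exploration that grows a "mental map" as it searches.
# All callables passed in are hypothetical placeholders for the learned model and controller.
import networkx as nx

def explore_for_goal(get_observation, encode_goal, sample_prior, predict_distance,
                     drive_toward, goal_image, threshold=20.0, max_steps=1000):
    mental_map = nx.DiGraph()
    goal_latent = encode_goal(goal_image)
    for step in range(max_steps):
        obs = get_observation()
        mental_map.add_node(step, observation=obs)      # grow the mental map as new areas are visited
        if step > 0:
            mental_map.add_edge(step - 1, step)         # consecutive observations are known to be connected
        if predict_distance(obs, goal_latent) < threshold:
            drive_toward(goal_latent)                   # the model believes the real goal is reachable: head there
        else:
            drive_toward(sample_prior())                # otherwise "imagine" a goal from the prior and explore
    return mental_map
```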

RECON uses diverse prior data (challenge (1)), rapidly builds up proficiency in new environments by leveraging models built on that prior data (challenge (2)), and it does so in open-world outdoor environments. But, of course, it does not address complex manipulation tasks — for that, the robot needs dexterity and an in-depth understanding of physical interactions, which must be acquired through trial-and-error learning in the real world. I will discuss this next.

Learning to Try Again

A particularly vivid and challenging instance of this problem is dexterous manipulation with multi-fingered hands. Unlike the navigational tasks I discussed above, such settings require intricate and precise physical motions to impart just the right forces on an object using the fingers, in order to pick things up, reposition them, and accomplish the task. This in turn means that the robot must be able to practice a task many times in order to master a particular skill. Each attempt at this skill can result in failure, which in turn requires the robot to reset the environment to try again. This presents a major challenge for real-world robotic systems: just like the bathroom cleaning robot needed to mop up its spill before it could resume cleaning the bathroom, a robot performing an in-hand manipulation task, such as the one shown below, must recover when it accidentally drops an object, pick it back up, and get ready to try again.

MTRF performing a learned lifting and in-hand reorientation task using a combined hand plus arm system.

In real-world settings, generalist robots such as the home robot in the introduction will need to perform many different tasks. This presents both a challenge and an opportunity: while multi-task learning is more difficult, when many different tasks are learned simultaneously, they can effectively reset each other. When the bathroom cleaning robot makes a spill, this is an opportunity for it to learn how to use a mop. In the same way, when the robotic hand above drops the object while attempting an in-hand reorientation, it can use this as an opportunity to try to pick up the object off of the table. A set of tasks can thus form a network, where different tasks serve to meet the preconditions of other tasks.
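
One simple way to think about such a task network in code is as a set of tasks, each with a precondition on the current state; the robot always practices whichever task is currently applicable, so a failure in one task naturally queues up another. The predicates below are illustrative placeholders, not the actual MTRF task definitions or reward functions.

```python
# Schematic sketch of a reset-free task network: each task has a precondition on the
# current state, and practicing it moves the system toward some other task's precondition.
# The state keys and predicates are illustrative, not the actual MTRF task definitions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    precondition: Callable[[dict], bool]   # can this task be attempted from the current state?

TASKS = [
    Task("lift",     lambda s: s["object_on_table"] and not s["object_in_hand"]),
    Task("reorient", lambda s: s["object_in_hand"]),
    Task("recenter", lambda s: s["object_dropped"]),   # e.g. pull a dropped object back within reach
]

def choose_task(state: dict) -> Task:
    # Practice whichever task is currently applicable; a failed attempt (e.g. a drop)
    # simply changes the state so that a different task -- the "reset" -- becomes applicable.
    for task in TASKS:
        if task.precondition(state):
            return task
    return TASKS[0]
```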

In our initial prototype, which we call MTRF (Multi-Task Reset-Free Learning), developed by Abhishek Gupta, Justin Yu, Zihao Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, and Thomas Devlin, we construct this network and define reward functions for each task manually. For example, the network for the in-hand reorientation task is shown below:

Task graph for the in-hand reorientation task.

Of course, in the future, we could also imagine algorithms that automatically discover repertoires of skills in an unsupervised manner, leveraging recent advances in unsupervised reinforcement learning, and then automatically construct such a task network.

The major advantage of such a task network is that now, the entire training process for all of the tasks can be fully automated. This is very important: while many prior works study learning of individual robotic skills in the real world, they often require considerable manual effort or instrumentation to actually make real-world training possible, from manually resetting objects in the environment to devising dedicated hardware contraptions, as shown below (left to right: a motorized reel inside a door to close it after attempts at opening, a person closing doors after robots attempt to open them, and a dedicated robotic arm that puts objects back into a hand when it drops them, using a scripted controller).

Manually designed reset mechanisms in prior work. Left to right: a door with a motor inside, a person closing doors after robots attempt to open them, a second robot replacing objects into a hand to allow it to retry.

By simultaneously learning multiple tasks, the robot can train completely without any human intervention, and without any special mechanisms to reset the environment between attempts. Indeed, we can record the entire training process, which takes several days:

Excerpt from the training process: MTRF learns the in-hand reorientation task without any human intervention, using about 64 hours of autonomous interaction.

The same approach can learn other tasks, such as this connector plugging task:

Excerpt from MTRF learning the connector plugging task, full video on project website.

While this may be a far cry from the bathroom cleaning robot in the introduction, it exhibits some of the traits we might want: it possesses a (small) repertoire of tasks, automatically chooses which task to practice based on the current circumstance, and automatically recovers from failures by attempting the tasks that would be most suitable. Perhaps some day it will be the (distant) ancestor of a fully autonomous household robot that can learn tasks on its own in your home, cleaning up its own messes, trying again, and mastering whatever chore is set before it, though for now such technology is still in its infancy.

The Benefits of Multi-Task Learning at Scale

A complete multi-task robotic system should also leverage diverse offline data, master a broad repertoire of behaviors, and be able to apply those behaviors to a wide variety of different objects. While the projects discussed above address some of these facets, a more complete evaluation of a multi-task robotic manipulation system that builds toward challenges (1) and (2) can be found in some recent work from Google Research: MT-Opt and Actionable Models. In this work, we specifically studied how challenge (1) could be tackled with multi-task offline reinforcement learning from diverse data, and how challenge (2) could be tackled by finetuning a policy for a new task on top of an initialization trained on this diverse data.

Collecting diverse data for multi-task robotic learning.

The dataset for these projects was collected by six separate robots, and consists of hundreds of thousands of individual trials tackling about twelve distinct training tasks, corresponding to various rearrangement behaviors (picking up various kinds of objects, placing them on plates or bowls, etc.). Critically, the aim was not just to solve these twelve tasks, but rather to train up a general-purpose multi-task initialization that could then be finetuned to solve a much wider variety of downstream tasks from relatively modest task-specific datasets. The animation below shows some of the tasks performed by the MT-Opt system after training on the twelve training tasks.

MT-Opt performing a wide range of different tasks at test-time.
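
The underlying recipe can be sketched in a few lines: condition a single Q-function (or policy) on a task identifier, pretrain it with offline RL on the full multi-task dataset, and then continue training on a small dataset for the new task. The architecture and the train_offline_rl helper below are illustrative assumptions, not the actual MT-Opt implementation.

```python
# Schematic two-stage recipe: pretrain a task-conditioned Q-function on a large multi-task
# offline dataset, then finetune it on a small dataset for a new task.
# `train_offline_rl` stands in for any offline RL update (e.g. the earlier sketch);
# the dataset formats and task-embedding scheme are illustrative assumptions.
import torch
import torch.nn as nn

class TaskConditionedQ(nn.Module):
    def __init__(self, obs_dim, act_dim, num_tasks, hidden=256):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, 32)   # tasks are distinguished by an ID
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 32, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, task_id):
        x = torch.cat([obs, act, self.task_embedding(task_id)], dim=-1)
        return self.net(x).squeeze(-1)

def pretrain_then_finetune(q_net, multi_task_dataset, new_task_dataset, train_offline_rl):
    # Stage 1 (challenge 1): learn a broadly capable initialization from all prior tasks.
    train_offline_rl(q_net, multi_task_dataset, steps=500_000)
    # Stage 2 (challenge 2): adapt that initialization with a modest amount of new-task data.
    train_offline_rl(q_net, new_task_dataset, steps=20_000)
    return q_net
```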

Actionable Models extends this recipe significantly by dispensing with hand-specified tasks during training entirely. Instead, it uses a task definition analogous to the RECON navigation system described before, where tasks are defined by goal images. This flexible scheme for task specification can be used to directly specify desired outcomes to the robot, as shown below:

Actionable models being used to reach various task goals (lower-right corner) from raw camera images at test-time.

But perhaps more importantly, it can be used as a pretraining or joint training objective for downstream reinforcement learning problems, which could be specified with conventional reward functions. Analogously to unsupervised pretraining in computer vision or NLP, these goal-conditioned policies can utilize all available manipulation data, regardless of the task, to pretrain a general-purpose policy with offline RL. In experiments with finetuning this policy to specific tasks, we found that it could increase success rates on small-dataset semantic grasping tasks from 0–4% to around 20–27%. While this is still a long way from the bathroom cleaning example at the beginning of this article, this result suggests that general-purpose pretraining of this sort could someday provide a basic substrate of generalization and common sense on which downstream skills could be learned efficiently through real-world interaction.
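
A key mechanism that makes this kind of task-agnostic pretraining possible is goal relabeling: any logged trajectory, from any task, can supervise a goal-conditioned policy by treating a state the robot actually reached later as if it had been the commanded goal. Below is a generic sketch of that idea, not the exact relabeling scheme used in Actionable Models.

```python
# Minimal sketch of hindsight goal relabeling: a state reached later in a logged trajectory
# is treated as the commanded goal, turning task-agnostic data into goal-conditioned data.
# The reward convention here is a simplification for illustration.
import random

def relabel_with_hindsight(trajectory):
    """trajectory: list of dicts with keys 'obs' and 'action'. Returns goal-conditioned tuples."""
    relabeled = []
    for t in range(len(trajectory) - 1):
        future = random.randint(t + 1, len(trajectory) - 1)   # pick a state the robot actually reached
        goal = trajectory[future]["obs"]
        reward = 1.0 if future == t + 1 else 0.0              # e.g. reward only for arriving at the goal next step
        relabeled.append({
            "obs": trajectory[t]["obs"],
            "action": trajectory[t]["action"],
            "goal": goal,
            "reward": reward,
        })
    return relabeled
```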

Where to From Here?

While this article has focused on robotic learning systems, much of the progress in addressing challenges (1), (2), and (3) will also come from advances in core machine learning algorithms. For example, recent years have seen tremendous advances in algorithms for offline reinforcement learning — RL methods that can utilize static datasets, an essential ingredient for broad generalization. I discuss these advances, such as AWAC and CQL, in more detail in an article from 2020. Tackling challenge (2) connects closely to algorithms for meta-learning, as well as methods that can learn to explore from prior data. While the current algorithms take steps toward addressing these challenges, there is still a lot of room for improvement, from the basic stability and reliability of reinforcement learning methods to the way we handle diverse and heterogeneous robotics datasets. I expect that over the next few years, we will see robotic learning transition from a paradigm that is driven primarily by narrow-domain experiments (e.g., learning a single skill in a single setting) to one that is dominated, like most other areas of machine learning, by large standardized datasets that are broadly shared and reused. This will help us focus on the challenges associated with ingredients (1) and (2) more directly.
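
To give a flavor of what such offline RL methods add on top of ordinary Q-learning, here is a rough sketch of a CQL-style conservatism penalty for discrete actions: it pushes down Q-values on actions the dataset never took relative to the actions it did take, and is simply added to the usual Bellman error (the coefficients and the continuous-action variant differ in the actual paper).

```python
# Rough sketch of a CQL-style conservatism penalty for discrete actions, added on top of
# the ordinary Bellman error: it suppresses Q-values for out-of-dataset actions.
import torch

def cql_penalty(q_net, obs, dataset_actions, alpha=1.0):
    q_all = q_net(obs)                                          # Q-values for every action at these states
    logsumexp_q = torch.logsumexp(q_all, dim=1)                 # soft maximum over all actions
    dataset_q = q_all.gather(1, dataset_actions.long().unsqueeze(1)).squeeze(1)
    return alpha * (logsumexp_q - dataset_q).mean()             # penalize unseen actions relative to seen ones
```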

Challenge (3) has received comparatively less attention, perhaps because so far it has been less pressing (with a few notable exceptions), but I expect it too will come to the forefront as roboticists focus on increasingly larger-scale learning experiments, where manual instrumentation and human oversight are impractical.

Meaningful progress toward addressing these challenges will bring us closer to truly in-the-wild robotic learning systems that can both acquire and perform varied tasks without human involvement, and perhaps one day deliver not only a home robot that can clean your bathroom, but also fundamental progress in how we build artificial intelligence systems.

A talk that covers the material in this article can be found here. This article covers the following papers: ViNG, RECON, MTRF, MT-Opt, Actionable Models. Big thanks to Nick Rhinehart and Dhruv Shah for helpful comments on an earlier draft of the article.


Sergey Levine is a professor at UC Berkeley. His research is concerned with machine learning, decision making, and control, with applications to robotics.