Unlocking the Future of Robotic Intelligence

Michael Vogelsong
15 min read · Aug 13, 2024


The fields of robotics and machine learning are becoming more and more intertwined. It’s amazing how so many different forms of learning are coming together to address challenges we see in robotics. In the past, I’ve followed my curiosity in a few different machine learning fields like vector search, reinforcement learning for robotics, computer vision, and ML for cancer treatment. Today, at Cobot, that same curiosity fuels our mission to tackle one of the most challenging problems in robotics: foundation models for robotics. We’re particularly focused on generalized manipulation, an area that will transform the ability of robots to collaborate with humans in the world around us. We believe that the key to unlocking real-world applications lies in developing models that can generalize across a wider range of tasks.

In this blog, I’ll discuss a few ways that different forms of machine learning are being applied to robotic manipulation, including practical aspects and areas for future improvement. Whether you’re an ML researcher or just curious about the future of robotics, I hope this exploration provides insight into the developments happening at the intersection of AI and robotics.

Robotics Foundation Models

Most of us are familiar with large language models like ChatGPT or Claude by now. Their development was driven by a few factors: (a) abundant publicly available data, (b) post-training to align the models with human intent, and (c) a high degree of overlap between the training goal (predict the next word) and the interface in which the models are used. A fourth aspect is a little less appreciated, but it's important too: human language was shaped by us to be effective at communicating information between people. These factors come together to give us something powerful, and predictably improvable. We also get "emergence" of abilities as we scale up the size of the models and data, abilities like in-context learning or chain-of-thought reasoning.

So where does AI fit in with robotic manipulation? What would a robotics foundation model look like? At this point, robotics foundation models are more of an aspirational idea than a concrete existing ability. In an ideal world, we’d have a huge amount of high-quality robotics data from which to train a big model that just predicts next tokens. In this world, tokens would come from different domains: language (like a natural language description of a task, or audio feedback), vision (e.g., images and videos of robots doing useful stuff, lidar point clouds, etc.), actions (recordings of the motor joint positions or end effector states), and sensor readings (e.g., force feedback). We’d have a huge collection of this robotics-relevant data, we’d encode all of this data into a big sequence of tokens, and then train a model to predict next tokens in this dataset. Based on what we’ve seen in language scaling laws, we’d expect this approach to get better and better with more data.
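
To make the "encode everything as tokens" idea concrete, here is a minimal sketch of one way continuous robot actions could be discretized into token ids (similar in spirit to the binning used by RT-2-style models). The bin count and action ranges are illustrative assumptions, not any specific model's choices:

```python
import numpy as np

N_BINS = 256  # illustrative assumption: 256 discrete bins per action dimension

def action_to_tokens(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin id."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)            # normalize to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

# Example: a 7-dim end-effector command (xyz delta, rotation delta, gripper)
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
tokens = action_to_tokens(np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.01, 1.0]), low, high)
print(tokens)  # integer ids that could be appended to the text/image token sequence
```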

But there are a few bottlenecks that keep robotics qualitatively many steps behind language models. First, we just don’t have anywhere near as much publicly available action and sensor data. Second, we may have higher accuracy expectations on robotics model performance versus language (or image) generation. Third, robots live in the real world with messiness and dynamics and safety considerations.

So how do we tackle those issues? On the foundation models front, the most relevant work is on "Vision-Language-Action" (VLA) models like Octo, LLARVA, OpenVLA, and the RT-X family of models. These models are variations of the same idea: use a relatively simple next-action(s) modeling approach, and train on the biggest robotics dataset available. Most of these models use the Open X-Embodiment dataset, which has 2.4M episodes of robots being tele-operated to accomplish tasks, mostly collected on single-arm robots. By training on (filtered versions of) this dataset, which spans different numbers of cameras, different hardware embodiments, and so on, the authors hope to learn an underlying foundation model capable of accomplishing different robotic manipulation tasks.

The general format of the training inputs is a mixture of a short natural language description of a goal, a series of image frames from one or more cameras (usually RGB, but sometimes with depth or LIDAR), and the current and/or recent states of the robot's joints and sensor readings. Sometimes the goal can also be encoded as a demonstration image of the desired goal state. The model is then trained to predict some number of future movements of the robotic arm.
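
As a concrete (and hypothetical) illustration of that input format, a single training example might look something like the structure below. The field names and shapes are my own, not those of any particular dataset:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ManipulationExample:
    instruction: str            # e.g. "pick up the red mug"
    rgb_frames: np.ndarray      # (T, H, W, 3) recent camera frames
    proprioception: np.ndarray  # (T, D) joint positions / gripper state
    target_actions: np.ndarray  # (K, A) next K robot actions to predict

example = ManipulationExample(
    instruction="pick up the red mug",
    rgb_frames=np.zeros((2, 224, 224, 3), dtype=np.uint8),
    proprioception=np.zeros((2, 8), dtype=np.float32),
    target_actions=np.zeros((4, 7), dtype=np.float32),
)
```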

With this trained model, we can see how it performs on different robotic hardware and on different tasks. There is generally a lot of variation in performance; some tasks are much simpler than others. The ideal would be high zero-shot performance: given a new situation and a new task, the model would "just work" right out of the box. That level of generalization is hard, and while there are some intriguing hints of this ability, we're not really there yet at a large scale. The more practical approach is to use the base model as a starting point for learning new tasks: instead of training a completely new model from scratch for each task or environment, we finetune the pretrained model on the new task, with the goal that learning it, and performing well on variations of it, is easier than it would have been from scratch. OpenVLA presents some initial evidence of this fine-tuning ability.

Even though the Open X-Embodiment dataset is large, it still pales in comparison to the trillions of tokens of text and images on the internet. So an intermediate goal is to re-use language and vision-language models (pre-trained on those larger datasets) for the things they are good at, and then adapt and incorporate smaller amounts of robotics information on top of that. The hope is that (a) there's a good amount of semantic overlap between the language/vision domain and robotics, and (b) learning robotic actions can be simpler than learning world knowledge. These foundation models use pre-trained language and vision encoders for this reason.
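
A minimal sketch of that recipe, assuming a frozen, pre-trained image encoder that maps camera frames to feature vectors, with a small trainable head on top that outputs robot actions. The encoder interface, feature dimension, and action dimension here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimplePolicy(nn.Module):
    """Frozen pre-trained vision backbone + small trainable action head."""

    def __init__(self, vision_encoder, feat_dim=512, action_dim=7):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False            # keep pretrained weights frozen
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.vision_encoder(images)  # reuse what the encoder already knows
        return self.action_head(feats)           # learn only the robotics-specific part
```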

One thing to keep in mind with robotics models is that size and efficiency matter. We are highly incentivized to have fast, small models so they can run on smaller onboard compute hardware and make movement predictions at a low enough latency to handle the task. Octo, OpenVLA, and RT-X sit on different parts of this continuum: Octo is the smallest, at 20M-90M parameters; OpenVLA is much larger at 7.6B parameters; and the largest RT-2-X model has 55B parameters. Octo can run on smaller GPUs like an NVIDIA Jetson. OpenVLA needs some optimizations to run on a much bigger NVIDIA 4090 GPU, and it still runs at only a few inferences per second. RT-2-X is closed source and requires large hardware to achieve decent latency.

So robotics foundation models are not yet the winner in all dimensions. What other AI techniques are people working on to push robotic abilities forward? (Note: I'm focusing here on some of the more ML-based approaches; there are many classical, non-learning approaches too!)

Imitation Learning

This is the most common approach for ML-based robotic manipulation. The robotics foundation models above are doing a form of imitation learning (you may also hear the term "behavior cloning"). The idea is simple: take a bunch of demonstrations of a task being done well, and then learn to copy them by outputting the same movement commands when given inputs similar to those seen during the demonstrations. This method is effective and predictable, especially when the training data accurately represents the task and environment. Its main drawback is that it struggles with real-world unpredictability: robots may encounter scenarios not covered in the training data, leading to errors that quickly compound. While researchers are working on improving these models to handle more varied situations, they generally perform well only within the scope of the training data and struggle to generalize beyond it.
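
At its core, behavior cloning is supervised learning on (observation, expert action) pairs. A minimal sketch, assuming a PyTorch-style policy and a dataloader of demonstrations (the names are illustrative, not any specific paper's code):

```python
import torch
import torch.nn.functional as F

def behavior_cloning_epoch(policy, demo_loader, optimizer):
    """One pass over demonstration data: regress the expert's actions."""
    for obs, expert_action in demo_loader:
        predicted_action = policy(obs)
        loss = F.mse_loss(predicted_action, expert_action)  # "copy the expert"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Typical usage: optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
# then call behavior_cloning_epoch(...) for several epochs.
```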

In addition to the VLAs, some other relevant modern imitation learning methods are DAgger, ACT (Action Chunking with Transformers), Diffusion Policy, and PerAct.

Model-Free Reinforcement Learning

Let's say we don't have a plethora of expert demonstrations; can we still learn how to do tasks? Yes, but it usually requires a lot of interaction with the environment. In reinforcement learning, instead of trying to copy what experts have demonstrated, we use some measure of reward as the learning signal. As our robot interacts with the environment, it may receive rewards. When it does, we have to figure out which actions led (directly or indirectly) to those rewards, so we can do more of that in the future. In theory, this is a more powerful, general form of learning. But it has many drawbacks too: the rewards can be very infrequent ("sparse"), so we may not have any rewards from which to learn (and if we were already getting a lot of rewards with our model, then we've already solved the problem!). The reward function usually has to be hand-engineered for each task, which can take a lot of effort and tuning to get right. And our models have to explore the environment, which can be really inefficient, especially if we're trying to learn in the real world with its many safety considerations. For these reasons, model-free RL is more commonly applied in simulated environments.

The most common model-free RL approaches are policy gradient methods built on algorithms like PPO (Proximal Policy Optimization) and value-based methods like Q-learning.
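
For intuition, here is a minimal tabular Q-learning update on a toy problem. Real robotics applications use neural networks instead of a table, but the update has the same shape: learn action values purely from (state, action, reward, next state) interaction, with no expert demonstrations:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] toward the reward plus the best value achievable afterwards."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((10, 4))                         # toy problem: 10 states, 4 actions
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
```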

Model-Based Reinforcement Learning

In model-free RL, we don’t have access to a model of the environment, so we have to learn everything from the data. With model-based RL, we are given or learn a model of some part of the environment. Once we have a good model of the environment, then learning inside that environment becomes easier. The environment model lets us anticipate what would happen in different hypothetical situations. Once we have that, we only have to learn decisions to act in that environment to maximize reward, and we can use our model to “look ahead” and plan. This seems strictly better than model-free RL, so why not always use this? Well, you still need to obtain the model of the environment in the first place–this is easier in simulation (where you control the environment and have access to the underlying state), but much harder in the real world. And there can be times when it’s simpler to just learn the robotic policy, rather than trying to model all of the environment. This approach is more similar to classical model predictive control (MPC).
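
A minimal sketch of the "plan with a learned model" idea, using random-shooting MPC. Here `dynamics_model` and `reward_fn` are assumed to exist and are stand-ins for whatever learned model and reward you have:

```python
import numpy as np

def plan(state, dynamics_model, reward_fn, horizon=10, n_candidates=256, action_dim=7):
    """Sample candidate action sequences, imagine their outcomes with the model,
    and execute the first action of the best-scoring sequence."""
    candidates = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s = dynamics_model(s, a)      # "look ahead" using the learned model
            returns[i] += reward_fn(s)
    return candidates[np.argmax(returns), 0]   # execute only the first action, then replan
```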

AlphaGo Zero is a particularly well-known example of this approach, and Dyna is an earlier approach that inspired many subsequent developments.

Inverse Reinforcement Learning

In RL, we assume the reward function is given to us–usually we designed it specifically for a task. But what if the task(s) are more ambiguous, like autonomous driving? With inverse reinforcement learning, we are given demonstrations from an expert, and then we try to infer the reward function that the experts are working towards. Then, we can use that learned reward function as guidance for regular reinforcement learning. Again, this method could in theory be more general and more robust to variations in expert demonstrations, but it introduces an additional step in the learning process, which can make learning even harder and less stable.
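
As a rough illustration, here is a simplified, feature-matching view of IRL with a linear reward r(s) = w · φ(s): nudge the reward weights so expert trajectories score higher than the current policy's rollouts. Real methods like MaxEnt IRL add an entropy term and a proper inner RL loop; this is only a sketch of the outer reward-update step:

```python
import numpy as np

def irl_reward_step(w, expert_feature_avg, policy_feature_avg, lr=0.01):
    """Push the linear reward weights toward what the experts do and away
    from what the current policy does (feature matching)."""
    return w + lr * (expert_feature_avg - policy_feature_avg)

# Outer loop (schematically): alternate between this reward update and
# re-training the policy with RL against the current reward w . phi(s).
```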

For some foundational work in IRL, check out Maximum Entropy IRL. Some recent improvements can be seen in EvIL, for example. And IRL has been connected to other ideas as well, like framing IRL from a generative adversarial perspective with GAIL.

Simulation, Generated Data, and Sim-to-Real

Robots are expensive, difficult to build and set up, and require effort to manage. Simulated data lets us train and/or evaluate models in virtual environments. This can potentially be a lot cheaper, more scalable (simulations can run faster than real time, and with many copies), and safer, so many groups use simulated training data and evaluation environments to move more quickly. In an ideal world, we'd have perfect, fast, easy-to-create simulators of the realistic environments we want to put our robots in. However, it's not always easy to build realistic simulators, especially for fine-grained manipulation; it usually requires physics and simulation coding expertise, and can be a slow process.

An idea gaining more traction recently is to use generative models (images, videos, procedurally generated environments) instead of more traditionally-coded simulators. For example, if we could give a video generation model the current observations of the scene and some planned future actions, and ask it to show imagined renderings of what might happen next, that could be powerful for models that leverage planning to choose actions, and it could alleviate some of the data bottleneck. Will these generative models become good enough for this sort of use case? They are not there yet in terms of quality, physical coherence, and speed, but given the abundant video data available, it's not hard to believe these approaches will get better soon. Alternatively, we could use generative models to augment our existing training data, for example asking an image-editing model to create variations of scenes that help make our models more robust to the variations we'd see in the real world.
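
As a much simpler stand-in for generative scene editing, here is a minimal sketch of augmenting existing camera observations with standard torchvision transforms so the policy sees more visual variation during training (the specific transform choices are illustrative):

```python
from torchvision import transforms

# Per-frame augmentation applied to demonstration images before training.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Usage: augmented = augment(pil_image)  # apply to each frame before feeding the policy
```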

Demonstrations

We've mentioned demonstrations many times, but how do we create demonstration data? There are many approaches, and the important dimensions to think about are ease of collection and transferability to the target embodiment. The simplest form of demonstration is kinesthetic teaching: a person manually moves the robot arms through a series of movements to accomplish the task, and we record data as the robot is being led. This can be pretty slow and unnatural, but it is in the same embodiment we are targeting. A similar approach is leader-follower tele-operation: instead of moving the target arm directly, we move a "leader" arm through a series of motions, and a follower arm reproduces the demonstrated actions; this is what is done with leader-follower ALOHA arms. Another form of demonstration is a hand-coded heuristic policy: a piece of code that a human (or LLM…) has written to accomplish a task.
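
A minimal sketch of logging a tele-operation episode. Here `robot` and `camera` are hypothetical interfaces standing in for whatever SDK your hardware provides; the point is only to record synchronized observations and commands at a fixed rate:

```python
import time

def record_episode(robot, camera, duration_s=30.0, hz=10):
    """Log synchronized frames, joint states, and commands during a demonstration."""
    episode = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        episode.append({
            "timestamp": time.time(),
            "image": camera.get_frame(),                      # hypothetical camera call
            "joint_positions": robot.get_joint_positions(),   # hypothetical robot call
            "commanded_action": robot.get_last_command(),     # hypothetical robot call
        })
        time.sleep(1.0 / hz)
    return episode
```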

We can also use AR/VR headsets with controllers to tele-operate the robot. These demonstrations involve a mapping from the controller's movement space to the robot's end-effector space. Or we can build handheld end effectors similar to those on the robot arms and track their movements. A particularly interesting implementation is UMI (Universal Manipulation Interface) from Shuran Song's lab at Stanford, where they 3D-print handheld grippers to mimic the end effector on a robot.

Finally, we can also think about passive video demonstrations. These could be videos of robots doing tasks (same robot or different robots). Or they could be videos of humans doing tasks, and then we try to learn a mapping between the human’s movements and the corresponding target robot movements. This approach has the most data already available, but it involves more complicated processing to transfer the learning to a target robot (and the data quality can be pretty bad). HumanPlus is a recent example with very impressive results when transferred to a humanoid robot.

Curriculum Learning

In machine learning, we use data examples and measures of success as feedback to adjust our model in the direction of improvement, so the order in which we present data to our training algorithms matters. Curriculum learning focuses on ordering the training data in a way that most efficiently leads to model improvement. For example, if you give a completely untrained model the hardest, most complicated tasks right off the bat, the learning algorithm may struggle to deal with all of the complexity. But if you instead start with simplified versions of your problem and gradually increase the complexity and difficulty, you can steer the model to first learn the easier fundamentals and then build on top of them more effectively. This can make learning faster, or it can help the model reach even greater performance. We can imagine curriculums guided by heuristics, or we can think about ways to automatically design the curriculum.
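
One simple way to implement a heuristic curriculum is to pre-sort tasks from easy to hard and grow the sampling pool as training progresses. The schedule below is an illustrative assumption, not a recommendation:

```python
import random

def sample_task(tasks_easy_to_hard, progress):
    """Sample a task; `progress` in [0, 1] controls how much of the curriculum is unlocked."""
    n_available = max(1, int(len(tasks_easy_to_hard) * progress))
    return random.choice(tasks_easy_to_hard[:n_available])

tasks = ["reach", "push", "pick", "pick-and-place", "stack"]
task = sample_task(tasks, progress=0.4)   # early in training: only the easier tasks
```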

Along with curriculums, we can also look at how different parts of the learning problem are set up, and add levels of automation. Usually, humans design rewards, analyze model results, tweak algorithms and data, etc. The ML process involves human researchers in the loop. But as LLMs, VLMs, etc. get better, can we use those models to provide supervision signals or adjust aspects of the model, environment, or data pipeline on the fly? If we can automate parts of the ML labeling (e.g., see the auto-validation with VLMs in Manipulate Anything), then we can have compounding effects.

Vision-Language Models

Language models help us capture internet-scale text-based knowledge; vision-language models attempt to bring visual understanding into the fold as well. One large class of models is based on embedding language and images into a shared space. This includes models like CLIP and its variants (SigLIP): they embed short language descriptions and image encodings in the same space, so the description "a photo of a dog" and an actual picture of a dog end up with similar representations. Another class is open-vocabulary object detection or segmentation models, like OWL-ViT or Grounded-SAM.
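
A minimal sketch of that shared embedding space, using the public CLIP checkpoint on Hugging Face (the image path is a placeholder):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")                      # placeholder image path
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # similarity of the image to each caption
print(probs)
```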

And then there are more LLM-esque vision-language models that take a "tokenize-the-world" approach. In these models, we treat images and text as a single sequence of tokens: we break images apart into smaller patches, learn representations of those patches, and feed them into the model as part of the sequence. An approach like LLaVA uses an open-source LLM as the base model for processing and producing outputs, and maps the images into the "language token space." Training uses mixtures of data, like interleaved images and text from web articles, alt-text captions of images on the internet, and so on. These VLMs tend to have variable performance depending on the task: they can do decently well on high-level image descriptions and simpler visual questions, but they can struggle with finer image details or spatial reasoning. GPT-4V, Gemini, and Claude provide APIs, and there are open-source options like LLaVA, Chameleon, CogVLM, and Cambrian.
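
For intuition, here is a minimal sketch of the "break the image into patches" step used by ViT-style models, before each patch is embedded and appended to the token sequence:

```python
import torch

def patchify(image, patch_size=16):
    """Split a (C, H, W) image into flattened non-overlapping patches."""
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    return patches            # (num_patches, patch_dim)

img = torch.randn(3, 224, 224)
print(patchify(img).shape)    # torch.Size([196, 768]): 196 "image tokens"
```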

What’s different between robotic manipulation and other applications of ML?

One of the biggest differences is the difficulty and reproducibility of evaluations. In other types of ML–like image classification, for example–we have “test sets.” These are datasets of inputs and known correct outputs, and they can be easily run by anyone (and can be run quickly). Manipulation is different–to really reproduce an evaluation, another group needs to (a) pay for and set up the same hardware, (b) understand in a very detailed way the steps the original author did to set up the environment, and (c) run the rollouts in the real world. This difficulty of evaluation slows down the field’s progress. The amount of time, effort, and cost needed to conclusively say method A is better than method B is significantly higher.

The second reason robotic manipulation can be much harder is the sequential nature of the problem: the model takes actions that change the environment. In image classification, we make one prediction per example, and we don't change anything based on our predictions ("IID", independent and identically distributed). With robotic manipulation, we make many predictions, one after another. This sequence of actions means that errors can compound, and the robot can take actions that change the environment and the nature of the problem. These are two big sources of variation that make the problem harder to solve.

Effectively Benchmarking New Research

The ML field is growing, and new papers and demos are coming out every day. But these new methods can be of vastly different quality. We have to come up with ways of quickly understanding what a method or demo is really showing, and what questions to ask ourselves when projecting how it might continue to progress or apply in our domain. When a new paper, demo, or announcement comes out around robotic manipulation, it’s worth thinking through the ways the approach is an advancement, and where it may have difficulties. We want to celebrate progress, but also be realistic in how it can be applied. Here are a few questions to ask:

  • What dimensions are being controlled?
  • What input is being fed to the model?
  • How much variety is accomplished by the same model, versus different models for different tasks?
  • How much calibration and tweaking was done on the target embodiment that would have to be rerun on a new robot?
  • How were the tasks chosen, and where could sampling biases creep in?

Conclusion

The convergence of AI and robotics holds immense potential, particularly in the field of robotic manipulation. Foundation models, with their ability to generalize across a wide range of tasks, are paving the way for more versatile and capable robots. While challenges remain — such as data scarcity, the complexity of real-world environments, and the need for efficient model deployment — the progress we’ve seen so far is promising.

At Cobot, we are committed to pushing the boundaries of what’s possible in robotic intelligence. By integrating insights from different fields and leveraging foundation models, we are developing robots that can better understand and assist humans across a variety of industries. It’s an exciting time to work in this space!

Michael Vogelsong is Head of Foundation Models AI for Cobot, a Sequoia-backed robotics company leading the deployment of robotic intelligence at scale. Prior to joining Cobot, Michael was Chief ML Engineer for Groundlight AI. Before that, he led ML projects at Amazon in areas including vector search and robotic AI, and he has also done work in ML for cancer treatment. Michael holds a degree in biomedical engineering from Duke University.
