Web Data to Real-World Action: Enabling Robots to Master Unseen Tasks
For robot manipulators to assist with everyday activities in cluttered environments such as living rooms, offices, and kitchens, robot policies must be able to generalize to new tasks in unfamiliar settings.
In the new paper Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation, a research team from Google DeepMind, Carnegie Mellon University, and Stanford University presents Gen2Act, a novel language-conditioned robot manipulation framework. The system generalizes to unseen tasks by drawing on publicly available web data, eliminating the need to collect task-specific robot data for every new task.
The core idea behind Gen2Act is to use zero-shot human video generation, learned from web data, as a generalizable prior over how a task should unfold. Building on recent advances in video generation models, the researchers design a robot policy that is conditioned on these generated videos, enabling the robot to perform tasks that never appear in its own interaction data.
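To make the two-stage idea concrete, here is a minimal Python sketch of such a pipeline: a web-trained video model first produces a human video of the task from the current scene and a language instruction, and a video-conditioned policy then maps the generated clip plus the robot's current observation to an action. All names, shapes, and the simple feature pooling below are illustrative assumptions for exposition, not the paper's actual models or API.

```python
import numpy as np

# --- Stubs standing in for the two learned components (names are illustrative) ---

def generate_human_video(scene_image: np.ndarray, task_text: str,
                         num_frames: int = 16) -> np.ndarray:
    """Placeholder for a web-trained video generation model: given the current
    scene and a language instruction, it would return a human video of the task.
    Here we just return random frames of the right shape."""
    h, w, c = scene_image.shape
    return np.random.rand(num_frames, h, w, c).astype(np.float32)


class VideoConditionedPolicy:
    """Toy policy conditioned on the generated human video. A real system would
    use learned visual encoders; here we use mean-pooled pixel features and a
    random linear head purely to show the data flow."""

    def __init__(self, feature_dim: int = 64, action_dim: int = 7, seed: int = 0):
        rng = np.random.default_rng(seed)
        # 2 * feature_dim: pooled video features concatenated with observation features.
        self.head = rng.normal(scale=0.01, size=(2 * feature_dim, action_dim))
        self.feature_dim = feature_dim

    def _encode(self, image: np.ndarray) -> np.ndarray:
        # Flatten and truncate/pad to feature_dim (stand-in for a visual encoder).
        flat = image.reshape(-1)[: self.feature_dim]
        return np.pad(flat, (0, max(0, self.feature_dim - flat.size)))

    def act(self, generated_video: np.ndarray, observation: np.ndarray) -> np.ndarray:
        video_feat = self._encode(generated_video.mean(axis=0))  # condition on the whole clip
        obs_feat = self._encode(observation)                     # current robot camera frame
        return np.concatenate([video_feat, obs_feat]) @ self.head  # e.g. a 7-DoF action


# --- Putting the two stages together for one task ---
scene = np.random.rand(128, 128, 3).astype(np.float32)   # current camera image
video = generate_human_video(scene, "open the drawer")    # stage 1: zero-shot human video
policy = VideoConditionedPolicy()
action = policy.act(video, scene)                          # stage 2: video-conditioned control
print(action.shape)  # (7,)
```

The point of the sketch is the separation of concerns: the video generator never needs robot data and can improve with web-scale training, while the policy only has to learn to track a generated human demonstration, which is what lets the system reach tasks absent from the robot dataset.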