Robotic Marvels: Conquering San Francisco’s Streets Through Next Token Prediction

Synced · SyncedReview · 3 min read · Mar 30, 2024

In recent years, there has been a remarkable surge in the effectiveness of large transformer models trained through generative modeling on extensive language datasets sourced from the Internet. These models have demonstrated impressive capabilities across diverse domains. By predicting subsequent words, they glean an intricate understanding of language, which can then be applied to various tasks through multi-task learning and efficient few-shot learning techniques.

This success has led researchers to ponder: can we replicate this approach to develop robust models for sensory and motor representation? While there have been encouraging signs of progress in learning sensorimotor representations within manipulation contexts, this realm remains predominantly unexplored.

In a new paper, Humanoid Locomotion as Next Token Prediction, a research team from the University of California, Berkeley presents a causal transformer model trained via autoregressive prediction of sensorimotor trajectories, culminating in the remarkable feat of enabling a full-sized humanoid to navigate the streets of San Francisco in a zero-shot manner.

The team conceptualizes humanoid control as akin to modeling vast collections of sensorimotor trajectories. Analogous to language processing, they train a general transformer model to predict upcoming sequences of inputs in an autoregressive manner. Recognizing the inherent complexity of robotic systems characterized by high dimensionality and multiple input modalities, they tokenize the input trajectories and employ a causal transformer model to predict subsequent tokens. Crucially, they predict complete sequences encompassing both sensory and motor components, thus modeling the joint data distribution rather than merely the conditional action distribution.
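The idea of treating sensory and motor data as one token stream can be illustrated with a minimal Python sketch. The interleaving scheme, token ids, and helper names below are illustrative assumptions, not the paper's exact implementation:

```python
# Flatten a sensorimotor trajectory into one token stream and build the
# causal attention mask a decoder-style transformer would use over it.

def interleave_trajectory(observations, actions):
    """Interleave per-step observation and action tokens into one sequence,
    so the model predicts every next token, whether sensory or motor."""
    tokens = []
    for obs, act in zip(observations, actions):
        tokens.extend(obs)   # sensory tokens for this timestep
        tokens.extend(act)   # motor tokens for this timestep
    return tokens

def causal_mask(seq_len):
    """mask[i][j] is True iff position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Example: two timesteps, two observation tokens and one action token each.
obs = [[11, 12], [13, 14]]
act = [[7], [8]]
seq = interleave_trajectory(obs, act)   # -> [11, 12, 7, 13, 14, 8]
mask = causal_mask(len(seq))
```

Because sensory and motor tokens share one autoregressive sequence, the transformer's next-token objective covers the joint distribution rather than only actions conditioned on observations.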

This design choice offers several advantages. First, the trained network captures more nuanced information, yielding a richer understanding of the environment. Second, the framework accommodates noisy or imperfect trajectories, including suboptimal actions, which enhances robustness. Third, it generalizes to learning from trajectories with missing data, further expanding its applicability.
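One common way to learn from trajectories with missing entries is to substitute a placeholder token and zero out its loss weight, so the model still conditions on the surrounding context. The sketch below assumes this scheme; the `MASK` id and weighting are hypothetical, not taken from the paper:

```python
# Prepare a trajectory with gaps for next-token training: missing entries
# (None) become a placeholder token and get zero loss weight, while
# observed tokens keep full weight.

MASK = -1  # hypothetical placeholder id for a missing token

def mask_missing(tokens):
    """Return (inputs, loss_weights) for a token list that may contain None."""
    inputs, weights = [], []
    for t in tokens:
        if t is None:            # modality absent at this step
            inputs.append(MASK)
            weights.append(0.0)  # excluded from the training loss
        else:
            inputs.append(t)
            weights.append(1.0)
    return inputs, weights

inputs, weights = mask_missing([11, None, 7, 13, None, 8])
# inputs  -> [11, -1, 7, 13, -1, 8]
# weights -> [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
```

With per-token loss weights, the same objective and architecture apply unchanged whether a trajectory is complete or partial.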

To validate their proposed model, the team deployed their policy across various locations in San Francisco. Impressively, they found that their autoregressive policies trained solely on offline data perform comparably to state-of-the-art approaches employing reinforcement learning. Additionally, their model exhibits an ability to effectively utilize incomplete trajectories and demonstrates favorable scalability characteristics.

These findings underscore a promising avenue for addressing complex real-world robot control tasks through the generative modeling of extensive sensorimotor trajectory datasets. By leveraging the principles of autoregressive prediction within a causal transformer framework, this research opens new horizons for advancing robotic capabilities in navigating and interacting with dynamic environments.

The demo is available on the project’s website. The paper Humanoid Locomotion as Next Token Prediction is on arXiv.

