Recent Advancements in Language-Capable Robotics

Jacob Zietek
MLPurdue
8 min read · Jan 30, 2023


Introduction

A mobile manipulator from Everyday Robots

Language-capable robots are the love child of exciting recent developments in Large Language Models (LLMs), Natural Language Processing (NLP), and robotics. This interdisciplinary field enables robots to interact with the world around them while understanding natural language prompts, opening exciting doors for robot-human interaction that was previously not possible.

Introduction to LLMs

LLMs are deep learning models that process and generate human language with a high degree of fluency. They are trained on massive datasets of text from the internet, where they learn the complexities of language and the ways words and phrases are used in context [5]. These models don’t simply replicate or recall text verbatim; they encode semantic and common-sense knowledge about the world.

This ability to understand and generate human-like text has made them useful in a wide range of natural language processing tasks such as language translation, summarization, question answering, and text generation. Consumer-facing large language models like ChatGPT and GitHub Copilot have generated a lot of hype for their impressive conversational and programming abilities, respectively.

Applying LLMs to different fields

The potential for automating tasks and creating new opportunities in different fields is an active area of research, and companies like Everyday Robots and Google Robotics have been at the forefront of this movement, pushing the boundaries of what’s possible with language-capable robots. The robotics research labs under Alphabet have been making impressive contributions to the field for the last year or so, and their progress has been further enhanced by recent developments in natural language processing and understanding.

Unlocking communication with robots through natural language allows for a more natural and intuitive medium for communication between humans and robots. It eliminates the need for specialized interfaces and allows for flexible and dynamic interactions. This can make robots more accessible and user-friendly in the long run, bringing us one step closer to a world where robots are seamlessly integrated into our everyday lives.

Below are a few papers I found especially interesting. Each utilizes an LLM’s understanding of semantics and the world around us for complex reasoning tasks.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

As mentioned previously, LLMs understand a wealth of semantic and common-sense knowledge about our world. They are able to understand how different tasks are commonly done. The goal of this research is to utilize an LLM’s understanding of the world to act on high-level instructions given through natural language and to use the LLM’s reasoning skills to control a robot. But how do we prompt an LLM and break its responses into instructions a robot can understand?

If you ask a normal LLM how to put an apple on a table, you might get a wide range of different answers…

  • Grab the apple with your hand and place it on the table.
  • Pick it up and place it on the table.
  • First, go to the grocery store and buy an apple. Then drive back home and…

These answers, though correct, are not grounded in reality. They lack the context of what the asker (in this case, a robot) is actually able to do and instead give generic answers. The resulting plans are not feasible for a robot to execute in its current environment.

The main idea of “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances” (SayCan) is to limit the vocabulary of the LLM to tasks the robot can perform, rather than executing whatever free-form response the LLM produces.

The robots are already able to perform 101 diverse sub-tasks in a kitchen environment, and each is labeled with a natural language description. These are tasks like… Find an apple, Find a coke, Go to the table, Go to the counter.

The authors use these natural language descriptions and prompt the LLM to create a list of sub-tasks that would satisfy a long-term task. This is done with some clever prompt engineering.

First, you ask your question… “How would you put an apple on the table?” And you structure the response of the LLM to fit a list. In the paper, the authors fit the responses to “I would: 1. (Subtask 1) 2. (Subtask 2)…”

This would generate responses similar to…

“How would you put an apple on the table?”

“I would: 1. Go to the counter 2. Find an apple 3. Grab an apple 4. Go to table 5. Place apple on the table”

Once the list is generated, you can run the sub-tasks from the list in sequence to control your robot!
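Here’s a minimal Python sketch of that prompt-and-execute idea. Everything in it (the query_llm helper, the SKILLS dictionary, the skill names) is a hypothetical placeholder rather than SayCan’s actual code; a real system would call a language model API and dispatch to learned low-level policies.

```python
import re

# Hypothetical skill library: natural language descriptions mapped to
# low-level policies. In SayCan these are separately trained robot skills.
SKILLS = {
    "go to the counter": lambda: print("navigating to counter"),
    "find an apple": lambda: print("searching for apple"),
    "grab an apple": lambda: print("grasping apple"),
    "go to the table": lambda: print("navigating to table"),
    "place the apple on the table": lambda: print("placing apple"),
}

def query_llm(prompt: str) -> str:
    # Placeholder LLM call; returns a completion like the example above.
    return ("go to the counter 2. find an apple 3. grab an apple "
            "4. go to the table 5. place the apple on the table")

def plan(instruction: str) -> list[str]:
    # Structure the response as a numbered list, as described above.
    prompt = f"How would you {instruction}?\nI would: 1. "
    completion = "1. " + query_llm(prompt)
    steps = re.split(r"\s*\d+\.\s*", completion)
    return [s.strip().lower() for s in steps if s.strip()]

# Naive open-loop execution: run each sub-task in sequence.
for step in plan("put an apple on the table"):
    SKILLS.get(step, lambda: print(f"unknown sub-task: {step}"))()
```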

Now we’re getting somewhere!

However, this approach still lacks context about the environment around the robot. If the robot is already at the counter, or already has an apple in its hand, this response and the commands generated would be nonsensical.

The authors introduce a Task Affordance Function to remedy this issue. It’s a separate function that scores the feasibility of completing each sub-task at that moment. If the robot already has an apple in its hand, the function would give a low score for the sub-task “Grab an apple” or any other grabbing task, for instance. If the robot has a clear path to move toward a counter, the sub-task “Go to the counter” would be scored highly.
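As a toy illustration, a Task Affordance Function could be as simple as the hand-written rules below. SayCan itself uses learned value functions to produce these scores; the rules here only make the idea of state-dependent feasibility concrete, and the state keys are made up for the example.

```python
def affordance_score(subtask: str, state: dict) -> float:
    """Return how feasible `subtask` is right now (0 = impossible, 1 = easy)."""
    if subtask.startswith("grab") and state.get("holding"):
        return 0.05  # the gripper is already full; grabbing makes no sense
    if subtask == "go to the counter" and state.get("path_to_counter_clear"):
        return 0.9   # a clear path means this is easy to execute
    return 0.5       # otherwise: plausible, but uncertain

state = {"holding": "apple", "path_to_counter_clear": True}
print(affordance_score("grab an apple", state))      # 0.05
print(affordance_score("go to the counter", state))  # 0.9
```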

Combining this additional Task Affordance Function with the potential LLM outputs grounds the available sub-tasks to the real world. This system creates a powerful robot-reasoning loop that you see below.

A video of their robots performing a 16-step task using this reasoning loop can be found here.

To recap, given a high-level instruction, SayCan combines the scores from an LLM with the scores from a Task Affordance Function to select a sub-task to perform. This selects a task that is both reasonable for the robot to do and useful for completing the high-level instruction. This process is repeated every time the robot completes a sub-task until the high-level instruction is done.
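In compressed form, that loop might look like the sketch below. The llm_score and affordance_score placeholders stand in for the paper’s LLM likelihoods and learned value functions, and the “done” option ends the episode; none of this is SayCan’s actual code.

```python
def llm_score(instruction: str, history: list[str], subtask: str) -> float:
    # Placeholder: in SayCan this is the likelihood the LLM assigns to
    # `subtask` as the next numbered step toward `instruction`.
    return 1.0

def affordance_score(subtask: str, state: dict) -> float:
    # Placeholder: see the hand-written example above; SayCan uses learned
    # value functions here.
    return 1.0

def execute_skill(subtask: str, state: dict) -> None:
    # Placeholder: run the low-level policy and refresh the state estimate.
    print("executing:", subtask)

def saycan(instruction: str, subtasks: list[str], state: dict, max_steps: int = 20):
    history = []
    for _ in range(max_steps):
        options = subtasks + ["done"]
        # Combine "useful" (LLM) with "possible" (affordance) for every option.
        scores = {s: llm_score(instruction, history, s) * affordance_score(s, state)
                  for s in options}
        best = max(scores, key=scores.get)
        if best == "done":
            break
        execute_skill(best, state)
        history.append(best)
    return history
```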

This system achieved a planning success rate of 84% and an execution rate of 74% in real-world testing.

SayCan revolutionizes the way robots understand and execute high-level natural language instructions. By limiting the LLM’s vocabulary to tasks the robot can perform and using a Task Affordance Function to score the feasibility of each sub-task, SayCan grounds the available sub-tasks to the real world, creating a powerful robot-reasoning loop. This system achieved a high success rate in real-world testing and has sparked exciting research in language-capable robotics.

Other papers

Inner Monologue: Embodied Reasoning through Planning with Language Models

Inner Monologue is a natural extension of SayCan. SayCan can only receive feedback about its surroundings through its Task Affordance Function while choosing a subtask, but if a skill fails or the environment dramatically changes, this feedback may not be available. This paper explores how adding success detection, object recognition, scene description, and human interaction improves high-level instruction completion.

Inner Monologue closed-loop feedback
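In rough Python, the closed loop could look like the sketch below. Every helper here is a hypothetical stand-in for the paper’s success detectors, scene describers, and skill policies; the point is simply that feedback gets folded back into the prompt as plain text before the next action is chosen.

```python
def query_llm(prompt: str) -> str:
    return "done"  # placeholder LLM call

def execute_skill(action: str) -> bool:
    return True    # placeholder: run the skill and report success or failure

def describe_scene() -> str:
    return "I see an apple on the counter."  # placeholder scene description

def inner_monologue(instruction: str, max_steps: int = 20) -> None:
    # Feedback is injected back into the prompt as text after every action,
    # so the LLM can replan when a skill fails or the scene changes.
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        action = query_llm(prompt + "Robot: ")
        if "done" in action.lower():
            break
        success = execute_skill(action)
        prompt += (f"Robot: {action}\n"
                   f"Success: {success}\n"
                   f"Scene: {describe_scene()}\n")

inner_monologue("put the apple on the table")
```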

Code as Policies: Language Model Programs for Embodied Control

Code as Policies explores using LLMs to generate code given a natural language instruction. The authors demonstrate how code-writing LLMs can be repurposed to write complex robot policies.

Code as Policies code generation example
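The rough recipe, sketched below with made-up API and function names, is to show the LLM a small robot API plus a few instruction-to-code examples, ask it to complete the code for a new instruction, and then execute the generated program against that API. This is a simplified sketch of the idea, not the paper’s actual prompts or interfaces.

```python
ROBOT_API_DOC = """# Available functions:
# get_obj_pos(name) -> (x, y)
# put_first_on_second(obj_a, obj_b)
"""

FEW_SHOT = '''# put the block on the bowl.
put_first_on_second("block", "bowl")
'''

def query_llm(prompt: str) -> str:
    # Placeholder: a real code-writing LLM would complete the prompt.
    return 'put_first_on_second("apple", "plate")'

def put_first_on_second(obj_a: str, obj_b: str) -> None:
    print(f"placing {obj_a} on {obj_b}")  # placeholder motion primitive

def run_instruction(instruction: str) -> None:
    prompt = ROBOT_API_DOC + FEW_SHOT + f"\n# {instruction}\n"
    code = query_llm(prompt)
    # Execute the generated program with the robot API in scope.
    exec(code, {"put_first_on_second": put_first_on_second})

run_instruction("put the apple on the plate")
```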

Interactive Language: Talking to Robots in Real Time

Interactive Language explores building an interactive, real-time, natural-language-instructable robot in the real world. The authors open-source their code and a massive dataset of nearly 600,000 language-labeled robot trajectories for the community to use, and propose infrastructure for training multi-modal, language-conditioned policies that map (image, natural language instruction) => actions.

Interactive Language interactive guidance loop
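The sketch below shows what a policy with that (image, instruction) => action signature could look like as a generic PyTorch module. It is not the architecture from the paper, just the shape of the interface: encode the image, encode the instruction, fuse the two, and predict an action.

```python
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, action_dim=2):
        super().__init__()
        # Tiny image encoder (stand-in for the paper's vision backbone).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # Tiny text encoder (stand-in for the paper's language encoder).
        self.text = nn.EmbeddingBag(vocab_size, embed_dim)
        # Fuse both modalities and predict a continuous action.
        self.head = nn.Sequential(nn.Linear(2 * embed_dim, 128), nn.ReLU(),
                                  nn.Linear(128, action_dim))

    def forward(self, image, token_ids):
        fused = torch.cat([self.vision(image), self.text(token_ids)], dim=-1)
        return self.head(fused)

policy = LanguageConditionedPolicy()
action = policy(torch.rand(1, 3, 64, 64), torch.randint(0, 10_000, (1, 6)))
print(action.shape)  # torch.Size([1, 2])
```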

Conclusion

The field of language-capable robotics is rapidly advancing due to recent developments in LLMs and NLP. These advancements allow robots to interact with humans and understand the world better, opening up exciting new opportunities for human-robot interaction and robot reasoning.

The paper Do As I Can, Not As I Say introduces the concept of grounding language in robotic affordances to limit the responses of the LLM to feasible actions for the robot, and demonstrates the potential of LLMs for complex robot reasoning tasks. Inner Monologue expands this work to explore how adding additional feedback improves long-term task success rates. Code as Policies explores utilizing code-writing LLMs to generate robot policies from natural language prompts. And Interactive Language explores building an interactive, real-time, natural-language-instructable robot in the real world.

Unlocking communication with robots through natural language allows for a more natural and intuitive medium for communication between humans and robots, making robots more accessible and user-friendly. As the field expands, we can expect to see even more exciting developments in the future!

Sources

[1] Lynch, Corey, et al. “Interactive Language: Talking to Robots in Real Time.” arXiv preprint arXiv:2210.06407 (2022).

[2] Ahn, Michael, et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” arXiv preprint arXiv:2204.01691 (2022).

[3] Huang, Wenlong, et al. “Inner Monologue: Embodied Reasoning through Planning with Language Models.” arXiv preprint arXiv:2207.05608 (2022).

[4] Liang, Jacky, et al. “Code as Policies: Language Model Programs for Embodied Control.” arXiv preprint arXiv:2209.07753 (2022).

[5] “What Are Large Language Models Used For?” NVIDIA Blog (2023). https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for
