Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Jasperora
3 min read · Aug 22, 2022


Large language models (LLMs) are trained on colossal amounts of data, and previous work shows that they internalize rich world knowledge. A further advantage of autoregressive LLMs is in-context learning: they can solve tasks from contextual information alone, without traditional gradient updates. It would be valuable if we could use such pre-trained models to generate the knowledge we need.

This paper asks whether we can use that knowledge directly to make decisions. The authors show that, without any additional training, LLMs can generate goal-driven action plans, though such plans are often not executable in interactive environments.

screenshot from paper referenced, steps to generate actions given a task. The pre-trained causal LLM (Planning LM) decomposes high-level tasks into mid-level action plans, while the pre-trained masked LLM (Translation LM) translates these plans into admissible actions. A prompt gives the Planning LM its contextual information. Note that all models are frozen; no additional training is performed.

The paper uses VirtualHome, which models complex human activities in a virtual household setting, as the evaluation environment. In VirtualHome, activities are expressed as programs, and a program is a sequence of action steps.

screenshot from paper referenced, format of an action step. action refers to one of 42 atomic actions provided by the VirtualHome environment; arg is the object to interact with; idx is the unique id of arg, which identifies the corresponding node in the environment graph.
screenshot from paper referenced, an example program for the task “Relax on sofa”.
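For concreteness, each step follows the pattern [action] <arg> (idx). The steps below are only an illustrative rendering of that syntax; the exact program shown in the paper’s figure may differ:

```
Task: Relax on sofa
[Walk] <living_room> (1)
[Walk] <couch> (1)
[Find] <couch> (1)
[Sit] <couch> (1)
```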

The pseudo-code in the following picture illustrates how such action plans are generated.

screenshot from paper referenced, algorithm for generating action plans from Pre-Trained Language Models

To query the Planning LM, they prepend an example from a demonstration set to the prompt; an example consists of a high-level task name and its annotated action plan. Given the example and the query task, the Planning LM is expected to output an action plan. To improve generation quality, they sample multiple outputs for each query and choose the sample with the highest mean log probability.

screenshot from paper referenced, X denotes a sample consisting of n tokens.
screenshot from paper referenced, choosing the sample with the highest mean log probability, i.e. the X that maximizes (1/n) Σ_i log p(x_i | x_<i).
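As a rough sketch of this selection step (GPT-2 stands in for the paper’s much larger Planning LM, and the prompt below is only a plausible rendering of their “Task / Step” format, not copied from the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small stand-in; the paper uses larger causal LMs (e.g. GPT-3, Codex).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_log_prob(prompt: str, continuation: str) -> float:
    """Mean per-token log-probability of `continuation` given `prompt`.
    Tokenization at the prompt/continuation boundary is treated approximately."""
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    token_lps = [
        log_probs[0, i - 1, full_ids[0, i]].item()
        for i in range(n_prompt, full_ids.shape[1])
    ]
    return sum(token_lps) / max(len(token_lps), 1)

# A plausible prompt: one annotated example task, then the query task.
prompt = (
    "Task: Throw away paper\n"
    "Step 1: Walk to home office\n"
    "Step 2: Walk to wastebasket\n"
    "Step 3: Put paper on wastebasket\n\n"
    "Task: Relax on sofa\n"
)
samples = ["Step 1: Walk to living room", "Step 1: Fly to the moon"]
best = max(samples, key=lambda s: mean_log_prob(prompt, s))
```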

Free-form text often contains ambiguous actions or objects, so it is often not executable. Instead of hand-crafting rules to transform free-form text into admissible action steps, the method embeds each predicted action phrase and every admissible environment action, measures their semantic distance via cosine similarity, and translates the prediction to the closest admissible action.

screenshot from paper referenced, cosine similarity between embeddings, where â denotes the predicted action phrase and a_e an admissible environment action.
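A minimal sketch of this translation step, assuming a Sentence-BERT-style embedding model as the Translation LM (the checkpoint name and the list of admissible actions here are placeholders, not the paper’s exact choices):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model can play the Translation LM role.
translation_lm = SentenceTransformer("stsb-roberta-large")

# Placeholder list; in practice this comes from the environment.
admissible_actions = ["walk to living room", "sit on couch", "lie on couch"]
action_embs = translation_lm.encode(admissible_actions, convert_to_tensor=True)

def translate(predicted_phrase: str) -> tuple[str, float]:
    """Map a free-form phrase to the closest admissible action (and its similarity)."""
    query_emb = translation_lm.encode(predicted_phrase, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, action_embs)[0]
    best = int(sims.argmax())
    return admissible_actions[best], float(sims[best])

action, sim = translate("take a seat on the sofa")  # likely -> "sit on couch"
```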

Moreover, synthesizing the whole program first and only translating it afterwards, step by step, ignores whether each individual step is achievable and can lead to compounding errors. To avoid this, they instead query the Planning LM one step at a time, generate k samples for the next action, and pick the one that maximizes both semantic soundness and achievability.

screenshot from paper referenced, finding the action that maximizes semantic soundness and achievability.

Furthermore, the Translation LM can detect actions that are outside the robot’s abilities and terminate the program early; this is achieved by setting a threshold ε on the combined score.
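Putting the pieces together, a rough sketch of the per-step loop, reusing the mean_log_prob and translate helpers (and the tokenizer/model) from the sketches above. β weighs log probability against similarity; the values of K, β, and ε below are placeholders, not the paper’s settings:

```python
K, BETA, EPSILON = 10, 0.3, 0.5  # placeholder hyperparameters

def sample_next_steps(prompt: str, k: int) -> list[str]:
    """Draw k candidate next-step phrases from the Planning LM."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(
        ids, do_sample=True, top_p=0.9, max_new_tokens=20,
        num_return_sequences=k, pad_token_id=tokenizer.eos_token_id,
    )
    return [
        tokenizer.decode(seq[ids.shape[1]:], skip_special_tokens=True).split("\n")[0]
        for seq in out
    ]

def plan(task: str, example_prompt: str, max_steps: int = 20) -> list[str]:
    prompt = example_prompt + f"Task: {task}\n"
    program = []
    for step in range(1, max_steps + 1):
        best_action, best_score = None, float("-inf")
        for phrase in sample_next_steps(prompt + f"Step {step}:", K):
            action, sim = translate(phrase)
            score = sim + BETA * mean_log_prob(prompt, f"Step {step}:{phrase}")
            if score > best_score:
                best_action, best_score = action, score
        if best_score < EPSILON:  # nothing achievable enough: terminate early
            break
        program.append(best_action)
        # Condition later steps on the *translated* action, not the raw sample.
        prompt += f"Step {step}: {best_action}\n"
    return program
```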

The two main metrics are executability and correctness:

Executability:

It measures whether an action plan can be correctly parsed and satisfies the common-sense constraints of the environment.

Correctness:

Because of the ambiguity and multimodal nature of natural-language task specifications, it is impractical to obtain a gold-standard measure of correctness. They therefore use human evaluation as the main method, complemented by a match-based metric that measures how similar a generated program is to human annotations: the length of the longest common subsequence (LCS) between the two programs, normalized by the maximum of their lengths.
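The match-based metric is straightforward to compute; here is a small self-contained sketch that treats a program as a list of action steps:

```python
def lcs_score(program_a: list[str], program_b: list[str]) -> float:
    """Normalized longest common subsequence between two programs."""
    m, n = len(program_a), len(program_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if program_a[i - 1] == program_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n) if max(m, n) else 0.0

# Two steps out of three match -> 2/3.
a = ["walk to couch", "sit on couch", "watch tv"]
b = ["walk to couch", "sit on couch", "read book"]
assert abs(lcs_score(a, b) - 2 / 3) < 1e-9
```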

screenshot from paper referenced, human-evaluated correctness and executability results in VirtualHome. Although the action plans generated by LLMs can match or even surpass human-written plans in correctness, they are often not executable.

Reference:

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. “Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents.” ICML 2022. arXiv:2201.07207.
