Generative AI from the digital to the physical realm with Multi-Modal Large Language Models (MLLMs)

Realizing Agile Human Robot Teaming with MLLMs

Alpana Dubey
Labs Notebook
11 min read · Feb 13, 2024


Large Language Model (LLM)-based generative AI (GenAI) systems have excited almost everyone through their ability to perform a broad range of complex language tasks, including summarization, reasoning, designing, coding, etc. The possibilities GenAI has unlocked are primarily attributed to its ability to communicate with users via natural language prompts. Moreover, the vast amount of text and the size of the resulting models, achieved through scalable self-supervised learning, make them adept at handling new situations and scenarios.

However, so far, LLMs have mostly been used for tasks that are digital in nature. For example, we can instruct an LLM to generate digital content or media, analyze documents and images, or write source code. We are now beginning to see interaction with LLMs go to the next level, where we use them to perform physical activities. A physical activity implies an action that makes a physical change in the environment: for example, instructing a robot, through an LLM, to move an object, manipulate it, arrange a set of artifacts in a particular fashion, or send a command to start or stop a conveyor belt. For the sake of clarity, we use the example of a robot performing the physical activity throughout this article.

Imagine being able to instruct an assistive robot to prepare a sandwich. Executing this task requires the robot to plan and execute the multiple steps involved in sandwich preparation, including: 1) fetching and analyzing the recipe from the internet, 2) getting all the ingredients, such as vegetables and spreads, from the refrigerator, 3) cutting vegetables and placing them on the bread, and 4) grilling the sandwich. Typically, assistive robots are pre-programmed to support users on a finite set of daily living tasks. They are not agile enough to handle new and unfamiliar tasks for which they were not programmed. For instance, it would be hard to use the robot to prepare a samosa (a popular Indian snack) if it has not been programmed for it.

With multi-modal LLMs (MLLMs), we are now witnessing efforts to bring such agility to robots. For example, Google’s PaLM-E [1] allows users to instruct a robot to perform simple pick-and-place tasks and supports navigation and manipulation together. Google DeepMind has introduced Robotic Transformer 2 (RT-2) [2], a vision-language-action (VLA) model that leverages both web and robotics data to generate generalized instructions for robotic control, learning to produce a vector of motion commands for the robot.

Researchers at Stanford University have created VIMA [3], an embodied agent that handles multimodal prompts by integrating text and image embeddings. The system predicts motor commands and uses them to control a robotic arm, allowing it to successfully execute tasks. UC Berkeley researchers introduced LM-Nav [5], an annotation-free system for robotic navigation using textual instructions. The system combines three pre-trained models: a large language model (LLM) parses instructions into landmarks, a vision-language model (VLM) estimates observation probabilities over a mental map, and a visual navigation model (VNM) assesses navigational affordances and determines robot actions based on distances between landmarks.

MLLMs provide a multi-modal interface through which users can convey instructions via multiple modalities. The most common forms through which users interact with MLLMs at present are text and images. By adding modalities such as video input to the MLLM, we can achieve a huge leap, especially when it comes to explaining a physical task. For example, an MLLM can allow users to demonstrate to a robot how to perform a certain physical step in samosa preparation, e.g., peeling potatoes.

Difference between MLLMs for physical tasks and digital tasks

To understand the difference between MLLMs for digital tasks and physical tasks, we show two interaction models: 1) the conventional interaction model for digital tasks (Fig 1(a)) and 2) a new interaction model for interacting with an MLLM for physical tasks (Fig 1(b)).

In the conventional model, users communicate with the MLLM through natural language instructions and images, which form a set of prompts to the MLLM. These prompts are structured using a variety of strategies to get the desired outcome. For example, one can instruct the MLLM to assume a role, instruct it to think step by step (chain-of-thought), or give a few examples (few-shot prompting) to explain how the task needs to be done. Based on that, the MLLM generates its output.
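
To make this concrete, here is a minimal sketch of how such a prompt could be assembled for a text-and-image MLLM. The chat-style message schema and the placeholder send_to_mllm() call are assumptions for illustration, not a specific vendor API.

```python
# A minimal sketch of the prompting strategies mentioned above: a role
# instruction, a chain-of-thought cue, and a few-shot example, assembled
# into a chat-style message list. The schema and send_to_mllm() are
# placeholders, not a specific product API.

def build_prompt(task_image_url: str) -> list[dict]:
    system = {
        "role": "system",
        "content": "You are a kitchen assistant. Think step by step "
                   "before giving your final answer.",          # role + chain-of-thought cue
    }
    few_shot = {
        "role": "user",
        "content": "Example: to make a cheese toast, the steps are: "
                   "1) slice bread, 2) add cheese, 3) grill.",   # few-shot example
    }
    query = {
        "role": "user",
        "content": [
            {"type": "text", "text": "List the steps to prepare the dish shown."},
            {"type": "image_url", "image_url": {"url": task_image_url}},  # image modality
        ],
    }
    return [system, few_shot, query]

messages = build_prompt("https://example.com/sandwich.jpg")
# response = send_to_mllm(messages)   # placeholder for an actual MLLM call
```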

In the new interaction model, users interact with the MLLM through multi-modal prompts that include video, audio, and text. For example, a prompt can include a video explaining how vegetables need to be arranged. It may also include a video demonstration showing how vegetables need to be cut or peeled. These can be treated as few-shot examples in the audio-visual prompt. With these prompts, the MLLM generates instructions that are interpretable and executable by machines. For example, the generated instruction can be to start the conveyor belt, or to make the robot perform a few physical steps. One of the key distinctions between the two interaction models lies in how we make use of the MLLM output. Another distinction lies in the way input is provided to the MLLM. In the new model, input includes not only the audio-visual prompt but also information about the operating environment / world, which is continuously being sensed. The operating environment provides the necessary context for the MLLM to reason, plan, and generate an execution plan for physical activities.
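
As a rough illustration of this flow, the sketch below packages a sensed environment snapshot together with an audio-visual prompt and parses the MLLM’s output into machine-executable commands. The RobotCommand structure, the JSON output format, and the sense_environment() helper are hypothetical and only serve to show the shape of the interaction.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobotCommand:
    action: str              # e.g. "start_conveyor", "pick", "place"
    target: Optional[str]    # the object or device the action applies to

def sense_environment() -> dict:
    """Placeholder for the continuously sensed operating environment."""
    return {"objects": ["tomato", "bread", "knife"], "conveyor": "stopped"}

def build_physical_prompt(demo_video_path: str, instruction_text: str) -> dict:
    # Input = audio-visual prompt + operating-environment context
    return {
        "video": demo_video_path,
        "text": instruction_text,
        "environment": sense_environment(),
    }

def parse_mllm_output(raw: str) -> list:
    # Output = instructions interpretable and executable by machines, assumed
    # here to arrive as a JSON list of {"action": ..., "target": ...} objects.
    return [RobotCommand(**c) for c in json.loads(raw)]

prompt = build_physical_prompt("demos/arrange_vegetables.mp4",
                               "Arrange the vegetables as shown in the video.")
print(prompt["environment"])

raw_output = '[{"action": "pick", "target": "tomato"}, {"action": "start_conveyor", "target": "belt_1"}]'
for cmd in parse_mllm_output(raw_output):
    print(cmd)
```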

Leveraging MLLMs for physical tasks

To realize interaction between humans and robots via an MLLM, we need innovation at three levels:

  1. Ways in which users can interact with MLLMs, i.e., a visual language vocabulary and taxonomy, and the design of a visual dialogue system,
  2. Curation, scraping, or preparation of the data on which MLLMs are trained,
  3. Design of output types from MLLMs to support multi-turn interaction between humans and robots.

In addition, the MLLM needs to be aware of the capabilities of the physical machines it needs to actuate. For example, robots come with varying capabilities, such as different degrees of freedom, types of grasping, motion constraints, etc. Therefore, the MLLM needs to be aware of a robot’s capabilities and other constraints while generating the plan for the robot. The prompting strategy needs to include a description of these constraints while communicating with the MLLM (a minimal sketch of this is shown below). In some cases, the MLLM can even orchestrate and distribute tasks to humans and robots to form an effective human-robot team.
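
One simple way to convey these constraints is to serialize a capability description and inject it into the prompt, as in the minimal sketch below; the field names are illustrative and do not follow any standard robot description format.

```python
import json

# A hedged sketch: describing a robot's capabilities and constraints so they
# can be injected into the MLLM prompt before plan generation. Field names
# are illustrative, not a standard capability schema.
robot_capabilities = {
    "model": "UR5e",
    "degrees_of_freedom": 6,
    "gripper": "parallel-jaw",
    "max_payload_kg": 5.0,
    "reach_mm": 850,
    "unsupported_actions": ["peeling", "pouring_liquids"],
}

constraint_prompt = (
    "Generate a step-by-step plan using only actions this robot can perform.\n"
    "Robot capabilities:\n" + json.dumps(robot_capabilities, indent=2)
)
print(constraint_prompt)
```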

In summary, the MLLM, in its new avatar, needs to act as an orchestrator, a knowledge source, and an interpreter for effective human-robot teaming.

Agile Human-Robot Teaming with MLLM

The extended capability of MLLMs gives birth to a new form of human-robot teaming that we call Agile Human-Robot Teaming. In essence, it is an agile team structure in which humans and robots team up and complement each other in the most dynamic way. We can imagine multiple forms in which humans could team up with robots, with the MLLM acting as the core engine to achieve this:

  1. MLLM as an interpreter: Humans can communicate with robots through the MLLM via audio-visual prompts, and the MLLM can act as an interpreter that translates the prompt into robotic instructions.
  2. MLLM as a knowledge source: Humans can leverage the MLLM’s vast knowledge to solve new or unfamiliar tasks.
  3. MLLM as an orchestrator: The MLLM can orchestrate humans working alongside robots on a common task based on their respective strengths and limitations (a minimal sketch of this role follows the list).
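
The orchestrator role is perhaps the least familiar of the three, so here is a minimal sketch of what task allocation between a human and a robot might look like, assuming the MLLM has already labeled each step with who is better suited to perform it; the plan, the labels, and the allocate() helper are illustrative assumptions.

```python
# A minimal sketch of the orchestrator role: splitting a recipe plan between
# the human and the robot based on (assumed) suitability labels produced by
# the MLLM. The step list and labels are invented for illustration.
plan = [
    {"step": "fetch potatoes from pantry", "suited_for": "robot"},
    {"step": "peel potatoes",              "suited_for": "human"},   # fine manipulation
    {"step": "boil potatoes",              "suited_for": "robot"},
    {"step": "taste and adjust spices",    "suited_for": "human"},   # judgment call
]

def allocate(plan: list) -> tuple:
    robot_tasks = [p["step"] for p in plan if p["suited_for"] == "robot"]
    human_tasks = [p["step"] for p in plan if p["suited_for"] == "human"]
    return robot_tasks, human_tasks

robot_tasks, human_tasks = allocate(plan)
print("Robot:", robot_tasks)
print("Human:", human_tasks)
```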

Example:
The role of an MLLM in agile human-robot teaming can be better understood with an example in which an assistive robot learns a new task and interacts with a human to perform it (Figure 2). In this example, a human requests an assistive robot to prepare a new recipe. As the robot identifies that the recipe is new, it fetches the instructions through the MLLM (step 2). Next, the MLLM analyzes the cooking instructions. While analyzing the instructions, the MLLM finds a few steps for which the robot is not programmed, such as peeling potatoes. For such situations, the MLLM asks the user to explain more about the step and how to perform it (step 5). At this stage, the human demonstrates the step to be performed (steps 6 and 7). The human demonstration, which is in the form of a video, is further analyzed, and an appropriate motion and manipulation plan is generated by the MLLM (steps 8 and 9). For the steps the robot is already pre-programmed for (e.g., fetching potatoes, putting them to boil, etc.), the motion and manipulation plan is generated by the MLLM without any further follow-up interaction (steps 3 and 4).
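
A rough control-flow sketch of this interaction is shown below. Every helper in it is a stub standing in for an MLLM call or a robot API; it is meant only to make the sequence of steps concrete, not to suggest a concrete implementation.

```python
# A rough control-flow sketch of the interaction in Figure 2. All helpers
# are stubs; the recipe steps and plan strings are invented for illustration.

def fetch_recipe(name: str) -> list:
    # step 2: the MLLM fetches and analyzes the cooking instructions
    return ["fetch potatoes", "boil potatoes", "peel potatoes", "fill and fold pastry"]

def plan_from_text(step: str) -> str:
    # steps 3-4: plan generation for pre-programmed steps
    return f"<motion plan for '{step}' generated from text>"

def request_demonstration(question: str) -> str:
    # steps 5-7: the robot asks for clarification and the human demonstrates
    print("Robot asks:", question)
    return "demo_video.mp4"

def plan_from_video(video: str) -> str:
    # steps 8-9: the MLLM derives a motion and manipulation plan from the demo
    return f"<motion plan derived from {video}>"

def execute(plan: str) -> None:
    print("Executing:", plan)

def prepare_new_recipe(recipe: str, robot_skills: set) -> None:
    for step in fetch_recipe(recipe):
        if step in robot_skills:
            plan = plan_from_text(step)
        else:
            plan = plan_from_video(request_demonstration(
                f"I am not programmed to '{step}'. Can you show me how?"))
        execute(plan)

prepare_new_recipe("samosa", {"fetch potatoes", "boil potatoes"})
```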

What would be required to achieve Agile Human-Robot Teaming?

The example given above shows a promising possibility of MLLMs being used for agile human-robot teaming. However, we are still at the early stages of realizing the full potential of LLMs/MLLMs to achieve truly agile human-robot teaming.
To fully leverage the power of LLMs/MLLMs, we need to do the following:

  1. Train models with multi-modal content: we need to train the models on videos of human demonstrations. In other words, we need to extend the LLMs so that they can support audio-visual prompts.
  2. Include the environmental description in training: we need to encode the environmental description in the input while training models so that trained MLLMs are capable of reasoning about a task plan in the context of a given environment (a sketch of such a training example follows this list).
  3. Develop conversational models for audio-visual communication: a conversational model needs to evolve to enable multi-turn interaction between human and robot via an audio-visual communication mode. The envisioned interaction will be drastically different from the traditional multi-turn interaction that has so far been developed for human-chatbot interaction over textual interfaces. This requires rethinking how humans would converse with robots to communicate physical tasks and creating a comprehensive list of potential collaboration / interaction modes. An appropriate audio-visual language vocabulary needs to be defined and used for training. For example, how does a robot rephrase a question when it seeks clarity on a task, how should the robot communicate with the human to confirm that the task it is about to perform is the right one, and how would it ensure that it can execute the task safely? These questions need to be taken into account while designing such a system.
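
To make points 1 and 2 more concrete, here is a sketch of what a single multi-modal training example might look like, combining a demonstration video, an encoded environment description, and the target robot instructions; the schema and the instruction strings are assumptions, not an existing dataset format.

```python
from dataclasses import dataclass, field

# A hedged sketch of one multi-modal training example: a human-demonstration
# video, its narration, the sensed environment at recording time, and the
# target robot instruction sequence used as supervision. Illustrative only.
@dataclass
class TrainingExample:
    video_path: str                       # human demonstration clip
    narration: str                        # accompanying audio transcript
    environment: dict                     # encoded environment description
    target_instructions: list = field(default_factory=list)

example = TrainingExample(
    video_path="demos/peel_potato_017.mp4",
    narration="Hold the potato firmly and peel away from your hand.",
    environment={"objects": ["potato", "peeler", "bowl"], "surface": "cutting_board"},
    target_instructions=["grasp(potato)", "grasp(peeler)", "peel(potato, strokes=12)"],
)
print(example.target_instructions)
```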

Accenture’s Agile Human-Robot Teaming effort

At Accenture Labs, we are re-imagining models of interaction between humans and robots via MLLMs with advanced multi-modal capabilities, taking them beyond text and images to video.

One of our early works, “Robotic Assembly Planning from Video Demonstration” [8], analyzes user actions in a video demonstration of furniture assembly and generates an assembly plan graph. The generated assembly plan graph is further used to generate robotic instructions that are executed by an arm robot (UR5e) to assemble other furniture. A snapshot is shown below. This significantly broadens what the robot can do, as it can now learn new tasks through an audio-visual prompt (i.e., a video demonstration).
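
To give a flavor of how an assembly plan graph can drive robot instructions, here is a hedged sketch in which nodes are assembly actions, edges encode precedence, and a topological sort yields an executable ordering; the actions and the instruction strings are invented for illustration and do not reflect the representation used in the paper.

```python
import networkx as nx

# A hedged illustration of an assembly plan graph: nodes are assembly actions
# detected from the video, edges encode precedence, and a topological sort
# yields an executable ordering for the robot arm.
plan = nx.DiGraph()
plan.add_edge("insert_dowels", "attach_leg_1")   # dowels must go in before legs
plan.add_edge("insert_dowels", "attach_leg_2")
plan.add_edge("attach_leg_1", "flip_tabletop")   # legs before flipping the tabletop
plan.add_edge("attach_leg_2", "flip_tabletop")

for action in nx.topological_sort(plan):
    # In a real system each node would map to parameterized UR5e motions;
    # here we just print a placeholder instruction.
    print(f"robot.execute('{action}')")
```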

Fig 3: Manipulation graph generation from video demonstration

As an extension of this work, we further looked at leveraging a state-of-the-art LLM to train robots via video demonstration. In our experiments, we extended the capability of the CodeT5+ LLM from Salesforce [4] so that it can interpret video inputs and generate robotic instructions directly from the video demonstration. To achieve this, CodeT5+ was trained on a curated dataset of video snippets of humans demonstrating a physical task (such as assembling furniture) and the corresponding robotic instructions that perform the equivalent physical task. The extended CodeT5+ model, namely Act2Code [7], was able to accurately generate instructions from audio-visual prompts.
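
The general pattern behind such an extension can be sketched as follows: pre-extracted video-frame features are projected into the language model’s hidden space and decoded into instruction tokens with a sequence-to-sequence model. This is a generic illustration built on our own simplifying assumptions, not the Act2Code implementation.

```python
import torch
import torch.nn as nn

# A generic sketch (not the Act2Code implementation): video frames are encoded
# into a sequence of embeddings, projected into the language model's hidden
# size, and decoded into robotic-instruction tokens with a seq2seq transformer.
class VideoToInstruction(nn.Module):
    def __init__(self, frame_dim=512, hidden=256, vocab_size=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)      # video features -> LM space
        self.seq2seq = nn.Transformer(d_model=hidden, batch_first=True)
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, target_tokens):
        src = self.frame_proj(frame_feats)                  # (B, T_frames, hidden)
        tgt = self.token_embed(target_tokens)               # (B, T_tokens, hidden)
        dec = self.seq2seq(src, tgt)
        return self.lm_head(dec)                            # logits over instruction vocab

model = VideoToInstruction()
frames = torch.randn(1, 16, 512)          # 16 pre-extracted frame features
tokens = torch.randint(0, 1000, (1, 8))   # teacher-forced instruction tokens
logits = model(frames, tokens)
print(logits.shape)                       # torch.Size([1, 8, 1000])
```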

Conclusions

Although in the above work we have only looked at one aspect of human-robot interaction, i.e., the human playing the role of a trainer for the robot, from a broader perspective we can imagine multiple types of interaction between humans and robots in an agile human-robot team. For instance, a robot can ask follow-up questions to the human to understand the nuances of each step, it may seek approval from the human to ensure the task it is performing is safe, or it may want to explain to the human how it will perform a step so that the human can guide it further.

We need to reimagine all types of interaction scenarios between humans and robots to redefine the role of the MLLM for agile human-robot teaming. The good news is that efforts are underway to support this vision. Multi-robot datasets such as Open X-Embodiment, together with the associated RT-X Vision-Language-Action (VLA) models [6], aim at training models that support different robot embodiments across different kinds of tasks. These are early times, and we believe that soon we will have MLLMs that support all aspects of robotics, such as motion, manipulation, navigation, and grasping. Soon, humans will be able to interact and collaborate with robots via a richer multi-modal interface. We believe that, as MLLMs embrace more modalities and audio-visual communication, it will be possible to take the MLLM beyond the digital realm.

Contacts: Dr. Alpana Dubey , Abhinav Upadhyay , Dr. Alex Kass , Dr. Shubhashis Sengupta

References:

  1. Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., … & Florence, P. (2023). PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  2. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., … & Zitkovich, B. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818.
  3. Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., … & Fan, L. (2023). VIMA: Robot Manipulation with Multimodal Prompts.
  4. Wang, Y., Wang, W., Joty, S., & Hoi, S. C. (2021, November). CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8696–8708).
  5. Shah, D., Osiński, B., & Levine, S. (2023, March). LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning (pp. 492–504). PMLR.
  6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Available at https://robotics-transformer-x.github.io/. Last accessed Jan 30, 2024.
  7. Gautam Reddy, Abhinav Upadhyay, Alpana Dubey, Shubhashis Sengupta, and Piyush Goenka. 2024. Act2Code: Generating Sequential Robotic Instructions from Video Demonstration. In Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD) (CODS-COMAD ‘24). Association for Computing Machinery, New York, NY, USA, 578–579. https://doi.org/10.1145/3632410.3632482
  8. Abhinav Upadhyay, Priyanshu Barua, Alpana Dubey, Shubhashis Sengupta, Piyush Goenka and Suma Mani Kuriakose. Robotic Assembly Planning from Video Demonstration, IEEE International Robotics Computing conference (IRC 2023)

Alpana Dubey is a computer science researcher with a broad range of expertise in software engineering, human-computer interaction, and AI for product design.