LAMs: Large Action Models with Neuro-Symbolic AI

As LLMs approach their functional limits, LAMs could represent the next practical step in enhancing technology interaction, potentially offering broader capabilities in understanding and executing tasks.

Zakaria Laabsi
8 min read · Jan 19, 2024
Image generated by DALL·E 3

Introduction

In a recent paper titled Learning Human Actions on Computer Applications, published on December 3, 2023 by the research team at Rabbit, a new AI company, a new paradigm in AI and human-computer interaction is introduced: the Large Action Model (LAM).

This article aims to demystify the complexities and scientific innovations behind LAMs, offering insights into their potential to revolutionize the way we interact with technology. We will look at what makes LAMs so strong: the neuro-symbolic approach. Finally, we will examine the role of Computer Vision in LAMs and how it can be used to improve human-machine interaction in human-oriented environments.

Keywords: Large Language Models, LLM, Large Action Models, LAM, neuro-symbolic, symbolic, logic, Rabbit, Rabbit R1, artificial intelligence, explainability, AI agents, GPT-4, ChatGPT, NLP, NLU, natural language processing, natural language understanding, human-oriented interface

What is a Large Action Model?

LAM represents a significant shift from traditional Large Language Models (LLMs) like GPT-4 or BERT. While LLMs focus on understanding and generating human language, LAMs extend this concept to comprehend and execute human actions on computer interfaces. Essentially, a LAM is designed to interpret human intentions and translate them into actionable steps that computer systems can understand and execute in real time. [1]

Neuro-Symbolic AI Explained

At their core, LAMs leverage the latest advancements in neuro-symbolic programming. Unlike traditional AI models, which typically rely on either purely symbolic or purely neural network approaches, LAMs combine these two paradigms.

The neuro-symbolic approach in AI combines symbolic AI, which excels in reasoning and rule-based processing, with neural networks, known for their pattern recognition and predictive capabilities. [2][3]

NLU processes unstructured textual input to generate outputs that can be logically inferred, while NLG converts structured data into comprehensible natural language responses. Source: Hamilton, K., Nayak, A., Božić, B., Longo, L. (2022). Is neuro-symbolic AI meeting its promises in natural language processing? A structured review.

The goal is to create actions on applications that are regular, minimalistic, stable, and explainable. This philosophy aligns with Occam’s razor principle, emphasizing simplicity and effectiveness. Mathematically, we can translate it as follows:

  • Symbolic Logic Function (S): This function processes input x using a set of predefined rules or logic. Represented as S : X → Y, where X is the input space and Y is the symbolic representation space.
  • Neural Network Function (N): This function processes the symbolic representation y generated by S. Represented as N : Y → Z, where Z is the output space corresponding to the final action or decision.

The neuro-symbolic function (NS) is the composition of S and N, formulated as NS(x) = N(S(x)), where:

  • x ∈ X is the initial input (e.g., a natural language instruction)
  • y = S(x) ∈ Y is the symbolic representation (e.g., structured commands)
  • NS(x) ∈ Z is the final output (e.g., the executed action).
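
To make the composition concrete, here is a minimal illustrative sketch in Python (not Rabbit’s implementation; all function and action names are hypothetical): a rule-based function S parses a natural language instruction into a structured command, and a stand-in for the neural function N grounds that command in a final action.

```python
from dataclasses import dataclass

@dataclass
class Command:
    """Symbolic representation y = S(x): a structured command."""
    verb: str
    target: str

def S(x: str) -> Command:
    """Symbolic logic function S : X -> Y.
    Parses a natural language instruction with simple hand-written rules."""
    tokens = x.lower().split()
    verb = "play" if "play" in tokens else "open"
    target = tokens[-1]  # naive rule: the last word is the target
    return Command(verb=verb, target=target)

def N(y: Command) -> str:
    """Stand-in for the neural network function N : Y -> Z.
    A real system would use a learned model to ground the command in
    concrete UI actions; here we simply format an action string."""
    return f"CLICK(search_box); TYPE('{y.target}'); CLICK('{y.verb}_button')"

def NS(x: str) -> str:
    """Neuro-symbolic composition NS(x) = N(S(x))."""
    return N(S(x))

print(NS("Play Bohemian Rhapsody"))
# -> CLICK(search_box); TYPE('rhapsody'); CLICK('play_button')
```
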
Neural architectures. Convolutional Neural Networks (CNN) apply convolutions and pooling to inputs for classification. Graph Neural Networks (GNN) use node connections for data aggregation. Sequence-to-Sequence (Seq2Seq) models convert sequences via encoders and decoders, while Transformers rely on attention mechanisms. Neuroevolution (NE) evolves network structures through genetic algorithm-inspired operations. Source: Hamilton, K., Nayak, A., Božić, B., Longo, L. (2022). Is neuro-symbolic AI meeting its promises in natural language processing? A structured review.

While symbolic techniques are highly explainable but limited in handling complex systems, neural network methods are scalable but lack explainability. LAMs aim to harness the strengths of both, offering a scalable, understandable, and reliable solution in AI research.

Neuro-symbolic architectures. Logic Tensor Networks (LTN) merge logic with neural learning. Recursive Neural Knowledge Networks (RNKN) structure data hierarchically for knowledge extraction. Tensor Product Representation (TPR) encodes relationships, exemplified by “John loves Mary.” Logic Neural Networks (LNN) simulate logical reasoning within a neural framework, such as classifying “Cat & Dog” as “Pet.” Source: Hamilton, K., Nayak, A., Božić, B., Longo, L. (2022). Is neuro-symbolic AI meeting its promises in natural language processing? A structured review.

How Do Large Action Models Work?

LAMs are designed to operate human-oriented interfaces across all mobile and desktop environments. They observe how humans interact with an interface and form a “conceptual blueprint” of the service behind it, then carry out the underlying intentions. This allows a LAM to act as a virtual helper that understands how to interact with applications to achieve objectives in a human-like way. [1][2]

Rabbit R1

For example, Rabbit has developed a device called “R1” that uses LAM as its operating system. The device can learn how users interact with devices and can theoretically carry out tasks such as booking flights or editing documents after being taught how to do so. The LAM system is enabled by recent advances in neuro-symbolic programming, allowing for the direct modeling of the structure of various applications and user actions performed on them without a transitory representation, such as text. [1]

The Rabbit R1 is a handheld AI assistant device that runs on its own operating system called Rabbit OS. The device is designed to handle various digital tasks, such as booking rides and completing transactions, without the need for custom integrations. It is aimed at providing an app-free online experience by using a large action model to perform tasks on its own.

The neuro-symbolic approach to LAMs involves overcoming both research challenges and engineering complexities, from real-time communication to virtual network computing technologies. The goal is to shape the next generation of natural language-driven consumer experiences.

The MiniWoB++ library offers a suite of more than 100 environments for web interactions, complete with interfaces in both JavaScript and Python for automated interactions. Its Python interface adheres to the Gymnasium API standards and utilizes Selenium WebDriver for executing actions within a web browser.
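
For illustration, a minimal usage sketch based on the library’s documented Gymnasium-style interface is shown below; the exact environment ID, observation fields, and action helpers should be checked against the current MiniWoB++ documentation.

```python
import gymnasium
from miniwob.action import ActionTypes  # importing miniwob registers its environments

# A simple clicking task; environment IDs follow the "miniwob/<task>-v1" pattern.
env = gymnasium.make("miniwob/click-test-2-v1", render_mode="human")

try:
    # Standard Gymnasium API: reset() returns (observation, info).
    obs, info = env.reset(seed=42)
    print(obs["utterance"])  # task instruction, e.g. "Click button ONE."

    # Find the DOM element whose text matches the target and click it.
    target = next(e for e in obs["dom_elements"] if e["text"] == "ONE")
    action = env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=target["ref"])

    obs, reward, terminated, truncated, info = env.step(action)
    print(reward, terminated)
finally:
    env.close()  # shuts down the Selenium-driven browser session
```
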

LAMs learn through observation, mimicking human interaction with interfaces to execute tasks directly and transparently, allowing for easy inspection and understanding of their processes. Over time, LAMs develop a comprehensive grasp of an application’s interface, serving as a conduit between users and services.

Three screenshots of a music streaming application interface, each showing a different task performed as part of a user interaction sequence for managing playlists and playing songs, such as adding a song to a favorites list and playing a specific track. These scenarios are likely used to train LAMs to automate interactions within the app based on natural language instructions or user behavior patterns. Source: https://rabbit.tech/research

Computer Vision for Improved Human Interaction

As we’ve seen, the aim of LAMs is to perform invisible, fully automated scraping. Most sites and applications (web apps) do not offer API access, so it is difficult for LLMs to navigate them and take actions. LAMs’ strength lies in labeling actions and understanding user patterns. But how can Computer Vision be used by LAMs, and what differentiates a LAM from a Computer Vision-enabled LLM, commonly known as a Visual Language Model (VLM)?

Source: A Dive into Vision-Language Models

Symbolic Reasoning with Computer Vision

Symbolic reasoning combined with Computer Vision is crucial for enabling these models not only to recognize visual elements but also to understand their context and how they relate to one another. This allows LAMs to develop a common-sense understanding of the environment, leading to more intuitive and effective interactions.

Figure 1 illustrates the process of human reasoning in visual contexts. It shows how abstract knowledge derived from visual perception can be applied to logical reasoning, facilitating precise and comprehensive understanding in complex visual scenes. Source: Neural-Symbolic Visual Question Answering (NS-VQA)

One relevant example of integrating symbolic reasoning with Computer Vision in AI is the Neural-Symbolic Visual Question Answering (NS-VQA) system. This system combines deep representation learning for visual recognition with symbolic program execution for reasoning. It begins by extracting a structural scene representation from an image and a program trace from a question. Then, it executes the program on the scene representation to obtain an answer.
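
The symbolic execution step can be sketched as follows (a simplified toy example in Python, not the original NS-VQA code): the scene is a structured list of objects produced by a neural scene parser, the question parser emits a small program, and the executor runs that program step by step, which keeps the reasoning fully inspectable.

```python
# A toy symbolic executor in the spirit of NS-VQA (illustrative only).
# In the real system, "scene" comes from a neural scene parser and
# "program" from a neural question parser; execution itself is symbolic.

scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

# Program for the question: "How many blue cubes are there?"
program = [
    ("filter_color", "blue"),
    ("filter_shape", "cube"),
    ("count", None),
]

def execute(program, scene):
    """Run each symbolic operation over the structured scene representation."""
    state = scene
    for op, arg in program:
        if op == "filter_color":
            state = [o for o in state if o["color"] == arg]
        elif op == "filter_shape":
            state = [o for o in state if o["shape"] == arg]
        elif op == "count":
            state = len(state)
    return state

print(execute(program, scene))  # -> 1
```
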

Figure 2 depicts a three-part model for visual question answering. The model includes a scene parser that segments an image, a question parser that generates a program from natural language, and a program executor that applies the program to the scene’s structural representation to deduce an answer. Source: Neural-Symbolic Visual Question Answering (NS-VQA)

This approach demonstrates robustness in complex reasoning tasks and efficiency in data and memory usage, while also offering full transparency in the reasoning process. This example highlights the potential of combining symbolic structures with visual data for enhanced AI capabilities. For more details, you can refer to the research published by MIT-IBM Watson AI Lab.

Computer Vision in LAMs’ Human-Oriented Environments

Computer Vision can be used in Large Action Models (LAMs) to enable them to perceive and understand the visual aspects of human-oriented environments. By incorporating Computer Vision, LAMs can interpret and respond to visual information, thereby enhancing their ability to interact with users and carry out tasks in such environments.

Here are some ways in which Computer Vision can be utilized in LAMs’ human-oriented environments:

  • Enhancing Human-Machine Interaction: Vision-based gesture recognition and action recognition are essential for enhancing human-machine interaction, enabling LAMs to understand and respond to users’ natural behaviors and actions [4][5].
A robotic arm performing tasks based on textual commands, like moving a red cylinder to specified locations on a board, highlighting the translation of language instructions into physical actions. Source: https://blog.research.google/2023/02/google-research-2022-beyond-robotics.html
  • Visual Information: Computer vision allows LAMs to perceive and interpret visual information, such as user interfaces, human actions, and environmental cues, which is essential for understanding and responding to user requests in human-oriented environments [6] (a minimal code sketch follows this list).
AI policy code interacting with perception and control APIs to execute a user command to stack blocks on a bowl, demonstrated in a robotic setup. Source: https://blog.research.google/2023/02/google-research-2022-beyond-robotics.html
  • Digital Twins and Human-Centric Graphs: Computer vision can be leveraged to create human-centric digital twins, capturing the interactive nature of humans and their surrounding environments, which can be valuable for LAMs operating in such contexts [7].
A crowdsourced data collection framework that monitors a user’s comfort, heart rate, noise, and sun exposure, incorporating visual perception for semantic segmentation and object detection. Source: Towards Human-centric Digital Twins: Leveraging Computer Vision and Graph Models to Predict Outdoor Comfort
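
As a deliberately simplified illustration of the visual-information point above, the sketch below uses OpenCV template matching to locate a button in a screenshot and PyAutoGUI to click it; a real LAM would rely on far more robust perception, and the image file names here are placeholders.

```python
import cv2
import pyautogui

# Load a full-screen screenshot and a cropped template of the target button.
# (File names are placeholders for illustration.)
screen = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
button = cv2.imread("play_button.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the screenshot and keep the best match.
result = cv2.matchTemplate(screen, button, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.8:  # only act on a confident match
    h, w = button.shape
    center_x = max_loc[0] + w // 2
    center_y = max_loc[1] + h // 2
    pyautogui.click(center_x, center_y)  # translate perception into an action
else:
    print("Target element not found on screen.")
```
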

Implementing Computer Vision in a human-oriented environment presents challenges related to variability in human behavior, adaptation to interface changes, real-time processing, interpreting unstructured visual data, safety and reliability, and integration with other AI components. Addressing these challenges is essential for the successful deployment of computer vision systems in LAMs and similar human-centric AI applications.

LAM, a new paradigm or just another buzzword?

Visual Language Models (VLMs) like GPT-4-Vision have distinct capabilities compared to Large Action Models (LAMs). While LAMs are designed to observe and replicate human interactions with interfaces, VLMs are focused on understanding and generating responses based on visual and textual inputs. For instance, GPT-4-Vision is a large multimodal model that accepts image and text inputs and emits text outputs.

This observation-based, neuro-symbolic approach sets LAMs apart from traditional black-box models, allowing technically trained individuals to inspect and reason about their inner workings and thereby achieve explainability.

The combination of VLMs with autonomous AI agents may have similarities to LAMs in terms of their ability to understand and execute tasks, but the specific approach of LAMs, rooted in learning from human interactions, sets them apart.

Conclusion

In summary, the neuro-symbolic approach to LAMs integrates neural and symbolic AI architectures to enhance the model’s capabilities, with the aim of moving toward artificial general intelligence and addressing complex AI tasks such as understanding and acting upon human intentions in user interfaces and human-oriented environments. This approach is seen as a promising pathway to providing an organic experience to users.

Stay updated! 🔔

  • Subscribe to my Medium Newsletter to receive my latest content
  • Your thoughts and questions are always welcome in the comment section!
  • You can also find me on LinkedIn!


Zakaria Laabsi

I'm Zak, a Software & ML Engineer. I want to share my passion for the latest developments in AI. Happy to connect : https://www.linkedin.com/in/zakaria-laabsi/