LAMs: Large Language Models with Neuro-Symbolic AI
As LLMs approach their functional limits, LAMs could represent the next practical step in enhancing technology interaction, potentially offering broader capabilities in understanding and executing tasks.
--
Introduction
In a recent paper titled “Learning Human Actions on Computer Applications”, published on December 3, 2023 by the research team at Rabbit, a new AI company, a new paradigm in the field of AI and human-computer interaction is introduced: the Large Action Model (LAM).
This article aims to demystify the complexities and scientific innovations behind LAM, offering insights into its potential to revolutionize the way we interact with technology. We will look at what makes LAMs so strong: the neuro-symbolic approach. Finally, we will examine the role of Computer Vision in LAMs and how it can be used to improve human-machine interaction in human-oriented environments.
Keywords: Large Language Models, LLM, Large Action Models, LAM, neuro-symbolic, symbolic, logic, Rabbit, Rabbit R1, artificial intelligence, explainability, AI agents, GPT-4, ChatGPT, NLP, NLU, natural language processing, natural language understanding, human-oriented interface
What is a Large Action Model?
LAM represents a significant shift from traditional Large Language Models (LLMs) like GPT-4 or BERT. While LLMs focus on understanding and generating human language, LAMs extend this concept to comprehend and execute human actions on computer interfaces. Essentially, a LAM is designed to interpret human intentions and translate them into actionable steps that computer systems can understand and execute in real time. [1]
Neuro-Symbolic AI Explained
At their core, LAMs leverage the latest advancements in neuro-symbolic programming. Unlike traditional AI models, which often depend on either purely symbolic or purely neural network approaches, LAMs combine these two paradigms.
The neuro-symbolic approach in AI combines symbolic AI, which excels in reasoning and rule-based processing, with neural networks, known for their pattern recognition and predictive capabilities. [2][3]
The goal is to create actions on applications that are regular, minimalistic, stable, and explainable. This philosophy aligns with Occam’s razor principle, emphasizing simplicity and effectiveness. Mathematically, we can translate it as follows:
- Symbolic Logic Function (S): This function processes input x using a set of predefined rules or logic. Represented as S: X → Y, where X is the input space and Y is the symbolic representation space.
- Neural Network Function (N): This function processes the symbolic representation y generated by S. Represented as N: Y → Z, where Z is the output space corresponding to the final action or decision.
The neuro-symbolic function (NS) is the composition of S and N, formulated as NS(x) = N(S(x)), where:
- x ∈ X is the initial input (e.g., a natural language instruction)
- y = S(x) ∈ Y is the symbolic representation (e.g., structured commands)
- NS(x) ∈ Z is the final output (e.g., executed action).
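The composition above can be sketched in a few lines of Python. This is purely illustrative: the rule set and the function bodies are hypothetical stand-ins (a real LAM would use a learned neural model for N, not a lookup), but the structure mirrors NS(x) = N(S(x)).

```python
# Minimal, hypothetical sketch of the neuro-symbolic composition NS(x) = N(S(x)).
# The rules and names here are illustrative, not any vendor's implementation.

def S(x: str) -> dict:
    """Symbolic step: map a natural-language instruction to a structured command."""
    rules = {
        "book a flight": {"action": "book", "object": "flight"},
        "edit document": {"action": "edit", "object": "document"},
    }
    for phrase, command in rules.items():
        if phrase in x.lower():
            return command
    return {"action": "unknown", "object": None}

def N(y: dict) -> str:
    """'Neural' step, stubbed here: map the symbolic representation to a final action.
    In a real LAM this would be a trained model rather than string formatting."""
    return f"execute:{y['action']}:{y['object']}"

def NS(x: str) -> str:
    """Neuro-symbolic composition: NS(x) = N(S(x))."""
    return N(S(x))
```

Because the intermediate value y = S(x) is an explicit, structured object, every step of the pipeline can be inspected, which is precisely the explainability property the neuro-symbolic approach is after.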
While symbolic techniques are highly explainable but limited in handling complex systems, neural network methods are scalable but lack explainability. LAMs aim to harness the strengths of both, offering a scalable, understandable, and reliable solution in AI research.
How Do Large Action Models Work?
LAMs are designed to operate on human-oriented interfaces across mobile and desktop environments. They observe human interaction with the interface and form a “conceptual blueprint” of the service behind it, carrying out the underlying intentions. This allows LAMs to act as a virtual helper, understanding how to interact with applications to achieve certain objectives in a human-like way. [1][2]
Rabbit R1
For example, Rabbit has developed a device called “R1” that uses LAM as its operating system. The device can learn how users interact with devices and can theoretically carry out tasks such as booking flights or editing documents after being taught how to do so. The LAM system is enabled by recent advances in neuro-symbolic programming, allowing for the direct modeling of the structure of various applications and user actions performed on them without a transitory representation, such as text. [1]
The neuro-symbolic approach to LAMs involves overcoming both research challenges and engineering complexities, from real-time communication to virtual network computing technologies. The goal is to shape the next generation of natural language-driven consumer experiences.
LAMs learn through observation, mimicking human interaction with interfaces to execute tasks directly and transparently, allowing for easy inspection and understanding of their processes. Over time, LAMs develop a comprehensive grasp of an application’s interface, serving as a conduit between users and services.
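This learning-by-observation idea can be sketched as a recorder that stores each observed UI interaction as a symbolic, inspectable trace. The `UIAction` and `Demonstration` classes below are hypothetical, invented for illustration; they only show why a symbolic trace is easy to inspect before being replayed.

```python
# Hypothetical sketch of learning by demonstration: observed UI interactions
# are recorded as a symbolic trace that can be inspected and replayed.
from dataclasses import dataclass, field

@dataclass
class UIAction:
    kind: str       # e.g. "click", "type"
    target: str     # e.g. a button label or input-field name
    value: str = ""

@dataclass
class Demonstration:
    task: str
    trace: list = field(default_factory=list)

    def record(self, action: UIAction) -> None:
        """Append one observed interaction to the trace."""
        self.trace.append(action)

    def replay(self) -> list:
        """Render the trace as human-readable steps; each one can be
        inspected before execution, which keeps the process transparent."""
        return [f"{a.kind}({a.target!r}, {a.value!r})" for a in self.trace]

demo = Demonstration(task="book flight")
demo.record(UIAction("click", "Search flights"))
demo.record(UIAction("type", "Destination", "Paris"))
print(demo.replay())
```

Contrast this with an end-to-end neural policy: there the "blueprint" would be buried in model weights, whereas here each step stays readable.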
Computer Vision for Improved Human Interaction
As we’ve seen, the aim of LAMs is to perform invisible, fully automated scraping. Most sites and applications (web apps) don’t offer API access, making it difficult for LLMs to navigate them and take actions. LAMs’ strength lies in labeling actions and understanding user patterns. But how can Computer Vision be used by LAMs, and what differentiates a LAM from a Computer Vision-enabled LLM, commonly known as a Visual Language Model (VLM)?
Symbolic Reasoning with Computer Vision
Symbolic reasoning is crucial for enabling these models not only to recognize visual elements but also to understand their context and how they relate to one another. This allows LAMs to develop a common-sense understanding of the environment, leading to more intuitive and effective interactions.
One relevant example of integrating symbolic reasoning with Computer Vision in AI is the Neural-Symbolic Visual Question Answering (NS-VQA) system. This system combines deep representation learning for visual recognition with symbolic program execution for reasoning. It begins by extracting a structural scene representation from an image and a program trace from a question. Then, it executes the program on the scene representation to obtain an answer.
This approach demonstrates robustness in complex reasoning tasks and efficiency in data and memory usage, while also offering full transparency in the reasoning process. This example highlights the potential of combining symbolic structures with visual data for enhanced AI capabilities. For more details, you can refer to the research published by MIT-IBM Watson AI Lab.
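The NS-VQA pipeline described above can be sketched in miniature. In the snippet below, the scene representation and the tiny program vocabulary are simplified stand-ins for the outputs of NS-VQA's learned perception and question-parsing modules; only the symbolic execution step is shown.

```python
# Illustrative sketch of NS-VQA-style symbolic program execution.
# The scene list and program vocabulary are simplified stand-ins for the
# outputs of the learned perception and question-parsing modules.

scene = [  # structural scene representation extracted from an image
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "blue"},
    {"shape": "cube", "color": "blue"},
]

# Program trace parsed from the question "How many blue cubes are there?"
program = [("filter_color", "blue"), ("filter_shape", "cube"), ("count", None)]

def execute(program, scene):
    """Run the symbolic program step by step over the scene representation."""
    result = scene
    for op, arg in program:
        if op == "filter_color":
            result = [obj for obj in result if obj["color"] == arg]
        elif op == "filter_shape":
            result = [obj for obj in result if obj["shape"] == arg]
        elif op == "count":
            result = len(result)
    return result

print(execute(program, scene))
```

Because every intermediate result is an ordinary list of objects, the full reasoning chain can be traced, which is the transparency property the NS-VQA authors highlight.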
Computer Vision in LAMs’ Human-Oriented Environments
Computer Vision can be used in Large Action Models (LAMs) to enable them to perceive and understand the visual aspects of human-oriented environments. By incorporating Computer Vision, LAMs can interpret and respond to visual information, thereby enhancing their ability to interact with users and carry out tasks in such environments.
Here are some ways in which Computer Vision can be utilized in LAMs’ human-oriented environments:
- Enhancing Human-Machine Interaction: Vision-based gesture recognition and action recognition are essential for enhancing human-machine interaction, enabling LAMs to understand and respond to users’ natural behaviors and actions [4][5].
- Visual Information: Computer Vision allows LAMs to perceive and interpret visual information, such as user interfaces, human actions, and environmental cues, which is essential for understanding and responding to user requests in human-oriented environments [6].
- Digital Twins and Human-Centric Graphs: Computer Vision can be leveraged to create human-centric digital twins, capturing the interactive nature of humans and their surrounding environments, which can be valuable for LAMs operating in such contexts [7].
Implementing Computer Vision in a human-oriented environment presents challenges related to variability in human behavior, adaptation to interface changes, real-time processing, interpreting unstructured visual data, safety and reliability, and integration with other AI components. Addressing these challenges is essential for the successful deployment of computer vision systems in LAMs and similar human-centric AI applications.
LAM, a new paradigm or just another buzzword?
Visual Language Models (VLMs) like GPT-4-Vision have distinct capabilities compared to Large Action Models (LAMs). While LAMs are designed to observe and replicate human interactions with interfaces, VLMs are focused on understanding and generating responses based on visual and textual inputs. For instance, GPT-4-Vision is a large multimodal model that accepts image and text inputs and emits text outputs.
LAMs’ learning-by-demonstration approach sets them apart from traditional black-box models, allowing technically trained individuals to understand and reason about their inner workings, thereby achieving explainability.
The combination of VLMs with autonomous AI agents may have similarities to LAMs in terms of their ability to understand and execute tasks, but the specific approach of LAMs, rooted in learning from human interactions, sets them apart.
Conclusion
In summary, the neuro-symbolic approach to LAMs integrates neural and symbolic AI architectures to enhance the model’s capabilities, with the aim of addressing complex AI tasks, such as understanding and acting upon human intentions in user interfaces and human-oriented environments, on the path toward artificial general intelligence. This approach is seen as a promising pathway to provide an organic experience to users.
References
[1] Rabbit Tech. (n.d.). Research.
[2] Hitzler, P. (n.d.). Neural-Symbolic Learning and Reasoning: A Survey and Interpretation.
[3] Semantic Web Journal. (n.d.). Semantic Web.
[4] SpringerLink. (2023). Computer vision-based hand gesture recognition for human-robot interaction: a review.
[5] MDPI. (2023). Sensors.
[6] Journal of Young Investigators. (2023, August). Physics Meets Data: A New Approach to Computer Vision.
[7] ScienceDirect. (2023). Towards Human-centric Digital Twins: Leveraging Computer Vision and Graph Models to Predict Outdoor Comfort.
Stay updated! 🔔
- Subscribe to my Medium Newsletter to receive my latest content
- Your thoughts and questions are always welcome in the comment section!
- You can also find me on LinkedIn !