Jarvis: A Framework for Building Powerful Collaborative Systems with Language Models

Rohit Vincent
Version 1
May 9, 2023 · 5 min read
Generated using Midjourney

In a world where time is precious and information overload is the norm, finding efficient solutions to complex tasks has become a necessity. Microsoft recently introduced Jarvis, a name that seems to be inspired by the fictional AI assistant from the Marvel cinematic universe.

What is Jarvis?

Jarvis is an implementation of a collaborative system that combines the power of an LLM controller with multiple expert models from HuggingFace Hub. In simpler terms, it’s a system designed to streamline tasks and optimize workflows, with a four-stage process that includes task planning, model selection, task execution, and response generation.

Jarvis implements a framework introduced in a recent paper, “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face”. The philosophy behind the framework rests on the remarkable recent abilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, which suggest that an LLM can serve as a universal interface connecting various AI models and domains.

The framework draws on the reasoning abilities of LLMs, letting the LLM act as a brain/manager that decides which existing model to use for the task at hand.

By leveraging the strengths of LLMs and using language as a bridge between different AI models, we can unlock new possibilities for advanced artificial intelligence.

HuggingGPT: A quick look at the framework

A high-level overview of the HuggingGPT Framework (Source)

As shown in the image above, the framework follows four stages to understand a user prompt and produce the relevant output. The stages are:

  • Task Planning: Analysing user requests with the LLM to understand their intention and break them down into solvable tasks. The original paper used ChatGPT as the default LLM, but it can be replaced with an LLM of your choice.
  • Model Selection: Choosing expert models from Hugging Face, based on their descriptions, to solve the planned tasks. This is highly customisable, allowing a user to add custom models to the list for specific tasks.
  • Task Execution: Invoking and executing each selected model and returning the results to ChatGPT/the LLM.
  • Response Generation: Integrating the predictions of all models with ChatGPT to generate the final response.

At the end of the four stages, you would receive the output of your prompt as shown in the response in the image above.
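The four stages above can be sketched as a minimal orchestration loop. Everything here is illustrative, not the actual Jarvis code: the real system uses an LLM for planning, selection, and response generation, which are stubbed below with simple rules.

```python
# Toy "model zoo": model id -> description, as on the Hugging Face Hub.
MODEL_ZOO = {
    "facebook/detr-resnet-101": "object detection in images",
    "facebook/bart-large-cnn": "summarization of long text",
}

def plan_tasks(user_request: str) -> list[str]:
    """Stage 1: task planning. A real system asks the LLM to decompose
    the request; a keyword rule stands in for it here."""
    tasks = []
    if "detect" in user_request.lower():
        tasks.append("object detection")
    if "summar" in user_request.lower():
        tasks.append("summarization")
    return tasks

def select_model(task: str) -> str:
    """Stage 2: model selection by matching task words to descriptions."""
    for model_id, description in MODEL_ZOO.items():
        if task.split()[0] in description:
            return model_id
    raise LookupError(f"no model found for task: {task}")

def execute(model_id: str, task: str) -> str:
    """Stage 3: task execution. Stubbed; a real system runs inference."""
    return f"result of {task} from {model_id}"

def respond(results: list[str]) -> str:
    """Stage 4: response generation. A real system asks the LLM to
    weave the results into prose; here we simply join them."""
    return "; ".join(results)

def run(user_request: str) -> str:
    tasks = plan_tasks(user_request)
    results = [execute(select_model(t), t) for t in tasks]
    return respond(results)
```

The key design point survives even in this toy form: the controller never needs task-specific code, only model descriptions, so adding a capability means adding an entry to the zoo.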

Examples

Here are a few examples of what we tried using HuggingGPT.

Object Detection and Classification

a.jpg is an image of a cat next to a plant. The image below is the prompt passed to the HuggingGPT system.

Input into the HuggingGPT System

Here is the output I got: it detected both the plant and the cat, as shown in the sentence below. It also reports which models were used for each step; this technical detail is useful for validation but could be hidden from a basic user who does not need it.

Output from the Prompt

Here is the image of objects detected by the AI.
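For reference, detection models like DETR return a list of labels with confidence scores and bounding boxes, and the response-generation stage turns those records into a sentence like the one above. A hypothetical sketch of that last step (in the real system, the LLM writes the sentence):

```python
def describe_detections(detections: list[dict]) -> str:
    """Turn raw detection records (label, score, box) into a sentence,
    keeping only confident predictions. Illustrative only."""
    labels = [d["label"] for d in detections if d["score"] >= 0.9]
    if not labels:
        return "No objects were detected with high confidence."
    return "The image contains: " + ", ".join(sorted(set(labels))) + "."

# Example records shaped like typical object-detection output.
detections = [
    {"label": "cat", "score": 0.998, "box": [13, 52, 314, 470]},
    {"label": "potted plant", "score": 0.97, "box": [250, 25, 500, 400]},
]
print(describe_detections(detections))  # → The image contains: cat, potted plant.
```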

Summarisation of a webpage

We asked the system to summarise a webpage describing the Carbon Emissions Savings Calculator we developed at Version 1.

Input Prompt to Summarise Carbon Emissions Savings Calculator

Here is the summary of what our Calculator does:

Summary Generated using HuggingGPT from Carbon Emissions Savings Calculator

The output is an acceptable summary of the webpage, though compared with the object-detection example above the result is less impressive. Outputs like this could be improved by plugging better, state-of-the-art models into the framework.
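Because model selection is driven by descriptions, swapping in a stronger model is, in principle, just a registry change. A hypothetical sketch of such a registry, where registering a newer summarisation model at a higher priority makes it the preferred choice (model names and the priority scheme are illustrative, not part of Jarvis):

```python
# Hypothetical task registry: task -> list of (priority, model_id).
registry: dict[str, list[tuple[int, str]]] = {
    "summarization": [(1, "facebook/bart-large-cnn")],
}

def register(task: str, model_id: str, priority: int) -> None:
    """Add a model as a candidate for a task."""
    registry.setdefault(task, []).append((priority, model_id))

def best_model(task: str) -> str:
    """Pick the highest-priority candidate for the task."""
    return max(registry[task])[1]

# Plug in a (hypothetically better) newer model at higher priority.
register("summarization", "google/pegasus-xsum", priority=2)
print(best_model("summarization"))  # → google/pegasus-xsum
```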

So what does this all mean?

Even though ChatGPT or other LLMs might not be the best model for every scenario, they open the door to systems that edge towards artificial general intelligence, performing multiple kinds of tasks based on what you give them access to. Although the framework has limitations like hallucination and token limits, which are inherent in all LLMs, its capabilities, including its reasoning ability, will improve as LLMs advance.

Generated using Midjourney

The capability of LLMs linked with any private or public ML community such as GitHub, Hugging Face, or Azure allows organisations to scale up their AI infrastructure for complex tasks through one interface: an LLM that decides which model works best for what you want. Currently, the open-source project offers a broad spectrum of tasks across different modalities, such as language, image, audio, and video. The codebase for the project can be found here. You can customise it to add your own models to the framework, or even modify the code to use your own LLM as the orchestrator.

Can you run it on your system at home? Yes, but only if you remove the major models that require a GPU or better infrastructure. You can, however, test out the system using Hugging Face's publicly available inference endpoints. A demo of the model can be found here.
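As a sketch of what using those hosted endpoints involves, the Hugging Face Inference API accepts a POST to `https://api-inference.huggingface.co/models/<model_id>` with a bearer token and a JSON payload. The helper names below are my own, and you would substitute your actual Hugging Face token:

```python
import json
import urllib.request

API_ROOT = "https://api-inference.huggingface.co/models"

def build_request(model_id: str, token: str, text: str):
    """Assemble the URL, headers, and JSON payload for the hosted
    Inference API. The payload shape {"inputs": ...} follows the API docs."""
    url = f"{API_ROOT}/{model_id}"
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    payload = {"inputs": text}
    return url, headers, payload

def query(model_id: str, token: str, text: str):
    """Send the request; requires network access and a valid token."""
    url, headers, payload = build_request(model_id, token, text)
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

# Example (needs a real token):
# query("facebook/bart-large-cnn", "hf_...", "Long article text to summarise...")
```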

In conclusion, Jarvis is a collaborative system that combines the power of LLM controllers with multiple expert models from HuggingFace Hub to streamline tasks and optimize workflows. By leveraging the strengths of LLMs and using language as a bridge between different AI models, Jarvis unlocks new possibilities for advanced artificial intelligence.

About the Author

Rohit Vincent is a Data Scientist at the Version 1 Innovation Labs.
