Visual ChatGPT: Paper and Code Review

Powering ChatGPT with Visual Foundation Models

Building Blocks
9 min read · Mar 12, 2023

Toolformer showed that Large Language Models (LLMs) can easily learn to use tools such as APIs to solve tasks. Some of the tools that the model learned to use were calculators, search engines, calendars, etc.

A team at Microsoft has recently released work showing how different Visual Foundation Models, such as ControlNet and Stable Diffusion, can be leveraged by ChatGPT as tools to solve tasks that involve images.

In today’s article, we’ll review the paper Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. All code snippets are taken from the GitHub repository linked here and lightly edited where necessary.

Image from authors of Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.

The authors also use the LangChain library to make it easier to interact with ChatGPT.

The authors create a robust framework with different components that ensure ChatGPT learns to use different vision models. Introducing information from a new modality into a deep learning system is by no means a trivial endeavor.

Visual ChatGPT is not a monolithic model; rather, it should be considered an agent that acts in an environment. The environment it exists in accepts textual and visual information, and the agent can utilize different tools, like ChatGPT and Visual Foundation Models, to act in that environment.

Per LangChain:
Agents are systems that use a language model to interact with other tools. These can be used to do more grounded question/answering, interact with APIs, or even take actions.
https://langchain.readthedocs.io/en/latest/use_cases/agents.html

Plenty of problems arise when we try to introduce a new modality of information into an LM, such as how ChatGPT can understand the outputs of a visual model that produces images. Each section of this article details the solution to one of these subproblems.

How does Visual ChatGPT accept Images as Input?

Tasks such as image editing, visual question answering, etc. need one or more images as input so that a deep learning model can process them and provide a response. However, ChatGPT operates solely in the textual domain and cannot accept any other forms of input.

To overcome this hurdle, the authors create an artificial conversational turn: as soon as a user uploads an image, the file is saved with a unique ID, the file’s name is sent as input to ChatGPT, and ChatGPT’s response is set to Received.

This turn is retained in the conversational history of ChatGPT so that it always knows which file to use/refer to. An example is shown in the image below:

Image from authors of Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.

The code for the same is shown below:

with gr.Column(scale=0.15, min_width=0):
    btn = gr.UploadButton("Upload", file_types=["image"])

# When a file is uploaded, Gradio invokes bot.run_image with the uploaded file,
# the chat state, and the current textbox contents.
btn.upload(bot.run_image, [btn, state, txt], [chatbot, state, txt])

def run_image(self, image, state, txt):
    # Save the upload under a short uuid-based name inside the image/ folder.
    image_filename = os.path.join('image', str(uuid.uuid4())[0:8] + ".png")
    print("======>Auto Resize Image...")
    img = Image.open(image.name)
    width, height = img.size
    ratio = min(512 / width, 512 / height)
    width_new, height_new = (round(width * ratio), round(height * ratio))
    img = img.resize((width_new, height_new))
    img = img.convert('RGB')
    img.save(image_filename, "PNG")
    print(f"Resize image from {width}x{height} to {width_new}x{height_new}")
    # Caption the image with BLIP so ChatGPT gets a textual description of it.
    description = self.i2t.inference(image_filename)
    # Build the artificial conversational turn and append it to the agent's memory.
    Human_prompt = "\nHuman: provide a figure named {}. The description is: {}. This information helps you to understand this image, but you should use tools to finish following tasks, " \
                   "rather than directly imagine from my description. If you understand, say \"Received\". \n".format(image_filename, description)
    AI_prompt = "Received. "
    self.agent.memory.buffer = self.agent.memory.buffer + Human_prompt + 'AI: ' + AI_prompt
    print("======>Current memory:\n %s" % self.agent.memory)
    state = state + [(f"![](/file={image_filename})*{image_filename}*", AI_prompt)]
    print("Outputs:", state)
    return state, state, txt + ' ' + image_filename + ' '

As seen above, once an image is uploaded, the run_image function is invoked. This function creates a new image name via a uuid, does some image pre-processing, and then creates the artificial turn that is added to the memory buffer.

It can also be seen that a description of the image is included in the initial input along with the file’s name. This description is generated by the image-captioning model BLIP.

from transformers import BlipProcessor, BlipForConditionalGeneration

class ImageCaptioning:
    def __init__(self, device):
        print("Initializing ImageCaptioning to %s" % device)
        self.device = device
        self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(self.device)

self.i2t = ImageCaptioning(device="cuda:4")
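
The inference method invoked in run_image is not part of the snippet above. Under the standard transformers BLIP API (and with PIL’s Image imported as in run_image), it could look roughly like the following sketch; the generation settings are assumptions, not necessarily the repo’s exact code:

    def inference(self, image_path):
        # Sketch: load the saved file, run it through BLIP, and decode the caption.
        raw_image = Image.open(image_path).convert('RGB')
        inputs = self.processor(raw_image, return_tensors="pt").to(self.device)
        out = self.model.generate(**inputs)
        caption = self.processor.decode(out[0], skip_special_tokens=True)
        return caption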

How to ensure that ChatGPT leverages the Visual Foundation Models (VFM)?

As seen in the Human_prompt variable declared above, the phrase but you should use tools to finish following tasks rather than directly imagine from my description sets the tone for ChatGPT to leverage VFMs instead of arbitrarily giving an answer.

Human_prompt = "\nHuman: provide a figure named {}. The description is: {}. This information helps you to understand this image, but you should use tools to finish following tasks, " \
               "rather than directly imagine from my description. If you understand, say \"Received\". \n".format(image_filename, description)

Besides the prompt injected when an image is uploaded, each query also has a prefix and a suffix that further ensure the model doesn’t behave in an ad-hoc manner. Some of the key instructions provided in the prefix are:

  • As a language model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different visual tasks. Each image will have a file name formed as “image/xxx.png”, and Visual ChatGPT can invoke different tools to indirectly understand pictures.
  • When talking about images, Visual ChatGPT is very strict to the file name and will never fabricate nonexistent files.
  • Visual ChatGPT is able to use tools in a sequence and is loyal to the tool observation outputs rather than faking the image content and image file name. It will remember to provide the file name from the last tool observation, if a new image is generated.
  • Visual ChatGPT has access to the following tools:

These statements prime Visual ChatGPT to leverage the available visual tools, and tell it how to handle file names and how to communicate with a user about an image generated by one of the VFMs.

How does Visual ChatGPT know which VFM to use?

The agent is provided with a list of all the tools, i.e. VFMs in this case, that can be leveraged. Each tool has a description detailing its capabilities, for example:

Tool(name="Generate Image From User Input Text", func=self.t2i.inference,
     description="useful when you want to generate an image from a user input text and save "
                 "it to a file. like: generate an image of an object or something, or "
                 "generate an image that includes some objects. "
                 "The input to this tool should be a string, representing the text used to generate image. "),

One of the tools is a VFM that converts text to images. As seen above, the agent is given a name that summarizes what the tool does, the function to be called, and a description detailing the utility, inputs, and outputs of the tool.
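
In the repository, every wrapped VFM contributes one or more such Tool entries, which are collected into a single list before the agent is created. A condensed sketch of that list is shown below; the attribute names (self.image2canny in particular) and the abbreviated descriptions are illustrative rather than verbatim:

from langchain.agents import Tool

self.tools = [
    Tool(name="Get Photo Description", func=self.i2t.inference,
         description="useful when you want to know what is inside the photo. "
                     "The input to this tool should be a string, representing the image_path."),
    Tool(name="Generate Image From User Input Text", func=self.t2i.inference,
         description="useful when you want to generate an image from a user input text. "
                     "The input to this tool should be a string, representing the text used to generate image."),
    Tool(name="Edge Detection On Image", func=self.image2canny.inference,
         description="useful when you want to detect the edges of an image. "
                     "The input to this tool should be a string, representing the image_path."),
]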

The agent then uses the description of the tool and the conversational history so far to decide which tool to use next. Decisions are made using the ReAct framework.

self.agent = initialize_agent(
    self.tools,
    self.llm,
    agent="conversational-react-description",
    verbose=True,
    memory=self.memory,
    return_intermediate_steps=True,
    agent_kwargs={'prefix': VISUAL_CHATGPT_PREFIX,
                  'format_instructions': VISUAL_CHATGPT_FORMAT_INSTRUCTIONS,
                  'suffix': VISUAL_CHATGPT_SUFFIX})
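
The llm and memory objects passed to initialize_agent are ordinary LangChain components. A minimal sketch of how they might be set up follows; the model choice, temperature, and memory settings are assumptions rather than the repo’s exact values:

from langchain.llms import OpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory

# A deterministic completion model so tool names and file names are reproduced faithfully.
self.llm = OpenAI(temperature=0)

# The conversation history, including the artificial "provide a figure ..." turns,
# accumulates in a plain buffer memory keyed by "chat_history".
self.memory = ConversationBufferMemory(memory_key="chat_history", output_key='output')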

The ReAct (Reasoning + Action) Paradigm

ReAct can be thought of as an extension of Chain of Thought (CoT) reasoning. CoT lets an LM generate a chain of reasoning to solve a task, thereby reducing the chances of hallucination; ReAct additionally interleaves that reasoning with actions taken in the environment.

ReAct ensures that every response of an LM is composed of three steps:

  1. Thought corresponds to reasoning.
  2. Action is where the agent chooses an action based on the thought it generated.
  3. Observation is where the agent observes the results of its actions and can decide what to do next.

To ensure that ChatGPT responds in this format, the following is included in its prompt:

VISUAL_CHATGPT_FORMAT_INSTRUCTIONS = """To use a tool, please use the following format:

```Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```Thought: Do I need to use a tool? No
{ai_prefix}: [your response here]```"""

It is important to note that the outputs of the Thought, Action, and Observation steps are not displayed to the end user. All of that info is hidden away to make sure that the end user doesn’t get overwhelmed with all the intermediate responses that don’t directly solve the query of the user.

Instead, the only piece of generated text displayed to the user is the [your response here] field when the LM thinks it has either obtained a final answer or wants to ask the user a question.

Another nice effect of the ReAct paradigm is that multiple tools can be chained together: after seeing an observation, ChatGPT defaults to asking itself whether it needs to use a tool. Essentially, Do I need to use a tool? is the suffix added to every query and intermediate answer generated by ChatGPT.
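
To make the chaining concrete, here is a hypothetical trace (the tool names and file names are illustrative, not taken from a real run) for a request like “detect the edges of my image and turn them into a watercolor painting”; only the final AI line is shown to the user:

Thought: Do I need to use a tool? Yes
Action: Edge Detection On Image
Action Input: image/3c9a7e21.png
Observation: image/a1b2_edge_3c9a7e21_3c9a7e21.png
Thought: Do I need to use a tool? Yes
Action: Generate Image Condition On Canny Image
Action Input: image/a1b2_edge_3c9a7e21_3c9a7e21.png, a watercolor painting
Observation: image/c3d4_canny2image_a1b2_3c9a7e21.png
Thought: Do I need to use a tool? No
AI: Here is the watercolor painting you asked for: image/c3d4_canny2image_a1b2_3c9a7e21.png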

How does Visual ChatGPT send inputs to the VFM models?

As can be seen from the prompt above, the response format makes ChatGPT choose one of the tools from the available list; the expected input format of a tool comes from the tool description we saw earlier; and the output of the VFM can be parsed out of the Observation field.

The parsing of the action and action input via the LangChain library can be seen below:

import re
from typing import Optional, Tuple

def _extract_tool_and_input(self, llm_output: str) -> Optional[Tuple[str, str]]:
    # If the LLM has already produced a final answer, return it directly.
    if f"{self.ai_prefix}:" in llm_output:
        return self.ai_prefix, llm_output.split(f"{self.ai_prefix}:")[-1].strip()
    # Otherwise, pull the tool name and its input out of the Action block.
    regex = r"Action: (.*?)[\n]*Action Input: (.*)"
    match = re.search(regex, llm_output)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    action = match.group(1)
    action_input = match.group(2)
    return action.strip(), action_input.strip(" ").strip('"')

Once the tool and its input have been extracted, a call is made to execute the tool.
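
As a quick illustration of what this parsing yields, consider a hypothetical LLM output containing an Action block:

import re

llm_output = """Thought: Do I need to use a tool? Yes
Action: Generate Image From User Input Text
Action Input: a red bird sitting on a branch"""

# Same regex as in _extract_tool_and_input above.
match = re.search(r"Action: (.*?)[\n]*Action Input: (.*)", llm_output)
action, action_input = match.group(1).strip(), match.group(2).strip()
# action       -> "Generate Image From User Input Text"  (the tool to call)
# action_input -> "a red bird sitting on a branch"        (passed to the tool's func)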

How are the outputs from VFMs handled?

The outputs of each model are saved as a file name with the following format:

{Name}_{Operation}_{Prev Name}_{Org Name}.

  • Name is a unique uuid.
  • Operation corresponds to the name of the tool.
  • Prev Name corresponds to the uuid of the input image used to generate the new image.
  • Org Name corresponds to the original input image provided by the user.

By following this naming convention ChatGPT can easily derive information about the newly generated image.

def get_new_image_name(org_img_name, func_name="update"):
    head_tail = os.path.split(org_img_name)
    head = head_tail[0]
    tail = head_tail[1]
    name_split = tail.split('.')[0].split('_')
    this_new_uuid = str(uuid.uuid4())[0:4]
    if len(name_split) == 1:
        # The input is an original upload: it serves as both the previous
        # and the original file name.
        most_org_file_name = name_split[0]
        recent_prev_file_name = name_split[0]
        new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
    else:
        # The input was itself generated by a tool: keep the original name
        # and record the input's uuid as the previous name.
        assert len(name_split) == 4
        most_org_file_name = name_split[3]
        recent_prev_file_name = name_split[0]
        new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
    return os.path.join(head, new_file_name)
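
For example, applying the function first to an original upload and then to an image generated from it (the 4-character prefix is a random uuid slice, so the exact value will differ on each call):

# Original upload: Prev Name and Org Name both fall back to the upload's uuid.
get_new_image_name("image/3c9a7e21.png", func_name="edge")
# -> "image/7f2b_edge_3c9a7e21_3c9a7e21.png"

# An already-generated image: Prev Name is updated, Org Name is preserved.
get_new_image_name("image/7f2b_edge_3c9a7e21_3c9a7e21.png", func_name="canny2image")
# -> "image/9a4d_canny2image_7f2b_3c9a7e21.png"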

Conclusion

Finally, all the moving parts are combined to hold a conversation with Visual ChatGPT that can leverage visual information.

Image from authors of Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.

This work is a perfect example of the importance of prompt engineering. The prompt allows the agent to deal with visual information using file names, and to create thought -> action -> observation chains that help determine which VFM to use and how to handle its outputs.

To abstract away the complexity of the solution, the intermediary responses containing the thought, action, and observation statements are hidden from the user; only the final response generated by the LM is displayed, once ChatGPT believes it no longer needs to use a VFM.

We hope you’ve learned from this code and paper review, let us know if you like this approach to reviewing research papers.

Until the next time, take care and be kind.

