Microsoft Open-Sources Multimodal Chatbot Visual ChatGPT

Microsoft Research recently open-sourced Visual ChatGPT, a chatbot system that can generate and manipulate images in response to human textual prompts. The system combines OpenAI's ChatGPT with 22 different visual foundation models (VFMs) to support multimodal interactions.

The system is described in a paper published on arXiv. Users can interact with the bot either by typing text or by uploading images. The bot can also generate images, either from scratch based on a textual prompt or by manipulating previous images in the chat history. The key module in the bot is a Prompt Manager, which converts raw text from the user into a "chain of thought" prompt that helps ChatGPT determine whether a VFM tool is needed to perform an image task. According to the Microsoft team, Visual ChatGPT is:

an open system incorporating different VFMs and enabling users to interact with ChatGPT beyond language format. To build such a system, we meticulously design a series of prompts to help inject the visual information into ChatGPT, which thus can solve the complex visual questions step-by-step.

ChatGPT and other large language models (LLMs) have shown remarkable natural language processing capabilities; however, they are trained to handle only one mode of input: text. Instead of training a new model to handle multimodal input, the Microsoft team designed a Prompt Manager that produces text inputs for ChatGPT. ChatGPT's resulting text outputs can then invoke VFMs such as CLIP or Stable Diffusion to perform computer vision tasks.

[Figure: Visual ChatGPT architecture]

The Prompt Manager is based on a LangChain Agent, and the VFMs are defined as LangChain agent Tools. To determine whether a tool is required, the agent incorporates input from the user's prompt and from conversation history, which includes image filenames, then applies prompt prefixes and suffixes. The prefix includes the text:

Visual ChatGPT can not directly read images, but it has a list of tools to finish different visual tasks. Each image will have a filename formed as "image/xxx.png", and Visual ChatGPT can invoke different tools to indirectly understand pictures.
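To make the prefix-and-suffix idea concrete, here is a minimal sketch in plain Python of how such a prompt might be assembled. It does not use LangChain or the Visual ChatGPT code; the `Tool` class, the `build_prompt` function, the suffix wording, the tool names and descriptions, and the example filename are all assumptions made for illustration. Only the opening sentence of the prefix is taken from the text quoted above.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Hypothetical stand-in for a VFM exposed to the agent as a tool."""
    name: str
    description: str


# Opening sentence of the prefix as quoted in the article; the rest is omitted.
PREFIX = (
    "Visual ChatGPT can not directly read images, but it has a list of tools "
    "to finish different visual tasks."
)

# Hypothetical suffix wording, written for this sketch only.
SUFFIX = "Previous conversation:\n{history}\n\nNew input: {user_input}"


def build_prompt(tools, history, user_input):
    """Join prefix, tool list, chat history, and the new input into one prompt."""
    tool_lines = "\n".join(f"> {t.name}: {t.description}" for t in tools)
    suffix = SUFFIX.format(history="\n".join(history), user_input=user_input)
    return f"{PREFIX}\n\nTOOLS:\n{tool_lines}\n\n{suffix}"


# Assumed tool names and an assumed image filename, for illustration only.
tools = [
    Tool("Generate Image From Text", "input: a text description of the image"),
    Tool("Get Photo Description", "input: an image filename"),
]
history = ["Human: provided image/abc123.png", "AI: Received."]
prompt = build_prompt(tools, history, "What is in this picture?")
print(prompt)
```

Because the image appears in the history only as a filename, the language model never sees pixels; it can only pass that filename on to a tool.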

Additional text in the prefix guides ChatGPT to ask itself "Do I need to use a tool?" to handle the user's desired task. If the answer is yes, ChatGPT should output the name of the tool along with its required inputs, such as an image filename or a text description of an image to generate. The agent iteratively invokes VFM tools, sending each resulting image to the chat, until it no longer needs to use a tool. At that point, the last generated text output is sent to the chat.
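The loop described above can be sketched as follows. This is a simplified illustration, not the project's implementation: the `Action:` / `Action Input:` / `AI:` output format, the `run_agent` function, the scripted stand-in for ChatGPT, and the fake image tool are all assumptions chosen so the example runs without any model or API key.

```python
import re

# Assumed output format for a tool request; hypothetical, for this sketch only.
ACTION_RE = re.compile(r"Action: (.+)\nAction Input: (.+)")


def run_agent(llm, tools, user_input, max_steps=5):
    """Invoke tools until the model stops asking for one.

    Returns the final text reply and the list of tool outputs produced.
    """
    scratchpad = f"Human: {user_input}\n"
    tool_outputs = []
    for _ in range(max_steps):
        output = llm(scratchpad)
        match = ACTION_RE.search(output)
        if "Do I need to use a tool? Yes" in output and match:
            name, tool_input = match.group(1).strip(), match.group(2).strip()
            observation = tools[name](tool_input)  # e.g. a new image filename
            tool_outputs.append(observation)
            # Feed the tool result back so the model can decide the next step.
            scratchpad += f"{output}\nObservation: {observation}\n"
        else:
            # No tool requested: the text after "AI:" is the final reply.
            return output.split("AI:", 1)[-1].strip(), tool_outputs
    raise RuntimeError("agent did not finish within max_steps")


def scripted_llm(scratchpad):
    """Stand-in for ChatGPT: requests one tool call, then answers."""
    if "Observation:" not in scratchpad:
        return (
            "Thought: Do I need to use a tool? Yes\n"
            "Action: Generate Image From Text\n"
            "Action Input: a cat on a sofa"
        )
    return "Thought: Do I need to use a tool? No\nAI: Here is your image."


def fake_text2image(description):
    """Stand-in for a text-to-image VFM; returns a made-up filename."""
    return "image/0001.png"


reply, outputs = run_agent(
    scripted_llm,
    {"Generate Image From Text": fake_text2image},
    "Draw a cat on a sofa",
)
print(reply, outputs)
```

The `max_steps` cap is a design choice of this sketch: it keeps a model that never answers "No" from looping indefinitely.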

In a Hacker News thread about the work, one user noted that the VFMs use much less memory than language models, and wondered why. Another user replied:

Image models can be very off and still produce a satisfying result. Consider that I could literally vary all the pixels in an image randomly by 10% and you'd just see it as a bit low quality but otherwise perfectly cohesive image. Language models have no such luck, the problem they're trying to solve is way "sharper", it's very easy for their results to be strictly wrong if they're off even a little bit. So you need a much larger model to get a sufficient level of "sharpness" for text.

The Visual ChatGPT source code is available on GitHub.
