A recent paper by researchers at Zhejiang University and Microsoft Research Asia explores the use of large language models (LLMs) as a controller to manage existing AI models available in communities like Hugging Face.
The key idea behind the research is leveraging existing AI models available for different domains and connecting them using the advanced language understanding and generation capabilities shown by LLMs such as ChatGPT.
Specifically, their system, called HuggingGPT, uses ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results.
According to the researchers, their approach makes it possible to solve sophisticated AI tasks in language, vision, speech, and other domains.
To establish the connection between ChatGPT and Hugging Face models, HuggingGPT uses the model descriptions from the Hugging Face library and fuses them into ChatGPT prompts.
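As a rough illustration of what fusing model descriptions into a prompt could look like, here is a minimal Python sketch. The `ModelCard` type, the `build_selection_prompt` function, the prompt wording, and the model IDs are all hypothetical and are not taken from the HuggingGPT code; the sketch only shows the general idea of filtering candidate models by task and listing their descriptions for the LLM to choose from.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelCard:
    """Hypothetical stand-in for the metadata a model hub exposes."""
    model_id: str
    task: str
    description: str


def build_selection_prompt(task: str, candidates: List[ModelCard],
                           max_models: int = 3) -> str:
    """Fuse the descriptions of models matching `task` into one prompt string."""
    # Keep only models for the requested task, capped to limit prompt length.
    matching = [m for m in candidates if m.task == task][:max_models]
    lines = [f"- {m.model_id}: {m.description}" for m in matching]
    return (
        f"Select the most suitable model for the task '{task}'.\n"
        "Candidate models:\n" + "\n".join(lines)
    )


# Placeholder model IDs, invented for this example.
cards = [
    ModelCard("example/detector-a", "object-detection", "Detects common objects."),
    ModelCard("example/captioner-b", "image-to-text", "Generates image captions."),
    ModelCard("example/detector-c", "object-detection", "Detects small objects."),
]
prompt = build_selection_prompt("object-detection", cards)
print(prompt)
```

Capping the number of candidates is one simple way to keep the prompt within the token budget of the LLM, which matters given the context-length limitation the authors mention.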
The first stage in the process is task planning, where ChatGPT analyzes the user request and decomposes it into tasks that can be solved using models from the library. The second stage consists of selecting the models that can best solve the planned tasks. The third stage is executing the tasks and returning the results to ChatGPT. Finally, ChatGPT generates the response by integrating the predictions of all models.
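The four stages can be sketched as a simple control loop. This is only a schematic under stated assumptions: `run_pipeline` and the four callables passed to it are hypothetical names, and the stubs below stand in for real calls to an LLM and to hosted models, which the sketch does not attempt to reproduce.

```python
from typing import Callable, Dict, List


def run_pipeline(
    request: str,
    plan_tasks: Callable[[str], List[str]],
    select_model: Callable[[str], str],
    execute_task: Callable[[str, str, str], str],
    summarize: Callable[[str, Dict[str, str]], str],
) -> str:
    """Run the plan, select, execute, summarize stages in order."""
    tasks = plan_tasks(request)                      # stage 1: task planning
    results: Dict[str, str] = {}
    for task in tasks:
        model = select_model(task)                   # stage 2: model selection
        results[task] = execute_task(model, task, request)  # stage 3: execution
    return summarize(request, results)               # stage 4: response generation


# Stub implementations (assumptions) so the sketch runs without any network access.
def fake_plan(request: str) -> List[str]:
    return ["image-to-text", "object-detection"]


def fake_select(task: str) -> str:
    return f"example/{task}-model"


def fake_execute(model: str, task: str, request: str) -> str:
    return f"output of {model}"


def fake_summarize(request: str, results: Dict[str, str]) -> str:
    return "; ".join(f"{task}: {out}" for task, out in results.items())


answer = run_pipeline("Describe /exp2.jpg", fake_plan, fake_select,
                      fake_execute, fake_summarize)
print(answer)
```

Passing the stages in as callables keeps the controller logic separate from any particular LLM or model host, so each stage can be swapped or tested in isolation.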
For the task planning stage, HuggingGPT uses task specifications and demonstrations. A task specification has four slots: an ID; the task type (e.g., video or audio); dependencies, which define prerequisite tasks; and task arguments. Demonstrations associate user requests with a sequence of task specifications. For example, the user request "In image /exp2.jpg, what is the animal and what is it doing?" is associated with a sequence of four tasks: image to text, image classification, object detection, and a final question answering task.
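To make the four-slot specification more concrete, the following Python sketch models it as a small data class and derives an execution order that respects dependencies. The field names (`id`, `task`, `dep`, `args`), the task-type strings, and in particular the dependency edges are assumptions made for illustration; the article does not say which of the four tasks depend on each other. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List


@dataclass
class TaskSpec:
    """One planned task with the four slots described above (names assumed)."""
    id: int
    task: str
    dep: List[int] = field(default_factory=list)   # IDs of prerequisite tasks
    args: Dict[str, str] = field(default_factory=dict)


def execution_order(specs: List[TaskSpec]) -> List[int]:
    """Return task IDs in an order where every prerequisite comes first."""
    graph = {spec.id: set(spec.dep) for spec in specs}
    return list(TopologicalSorter(graph).static_order())


# The example request from the text, with invented dependency edges:
# here the final question answering task waits for the other three.
plan = [
    TaskSpec(0, "image-to-text", [], {"image": "/exp2.jpg"}),
    TaskSpec(1, "image-classification", [], {"image": "/exp2.jpg"}),
    TaskSpec(2, "object-detection", [], {"image": "/exp2.jpg"}),
    TaskSpec(3, "question-answering", [0, 1, 2],
             {"text": "what is the animal and what is it doing?"}),
]
order = execution_order(plan)
print(order)
```

Tasks without mutual dependencies can appear in any relative order, so only the position of the dependent task is guaranteed. A cyclic plan makes `static_order()` raise `graphlib.CycleError`, which is one way a controller could reject a malformed plan.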
The paper's six authors say they used HuggingGPT in a number of experiments, covering both simple tasks and complex tasks involving multiple sub-tasks.
HuggingGPT has integrated hundreds of models on Hugging Face around ChatGPT, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate the capabilities of HuggingGPT in processing multimodal information and complicated AI tasks.
According to its creators, HuggingGPT still suffers from some limitations. These include efficiency and latency, mostly related to the need to interact with a large language model at least once for each stage; context length, related to the maximum number of tokens an LLM can accept; and system stability, which can be reduced by the possibility of an LLM occasionally failing to conform to instructions, as well as by the possibility that one of the models controlled by the LLM may fail.