LLaVA-CoT Shows How to Achieve Structured, Autonomous Reasoning in Vision Language Models

Researchers from several Chinese institutions fine-tuned Llama-3.2-11B-Vision-Instruct to improve its ability to solve multimodal reasoning problems by going beyond direct-response or chain-of-thought (CoT) approaches to reason step by step in a structured way. Named LLaVA-CoT, the new model outperforms its base model and beats larger models, including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct, on a number of benchmarks.

According to the researchers, one reason why vision language models (VLMs) often hallucinate or produce errors is the lack of systematic and structured reasoning:

Specifically, by systematic, we mean that the model does not generate a direct reasoning chain but instead engages in multistage reasoning. Structured, on the other hand, refers to the model’s ability to clearly identify the reasoning stage it is in and understand the primary task to be addressed at each stage.

The researchers' approach consists of designing LLaVA-CoT so that it reasons through four stages: summary, where the model summarizes the task at hand; caption, where it describes the relevant parts of the image; reasoning, where it analyzes the question; and conclusion, where it provides a final response based on the reasoning stage. In other words, the model first organizes the problem and the known information, then carries out a detailed thought process, and finally derives a conclusion.
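
To make the structure concrete, here is a minimal Python sketch of what a tag-delimited, four-stage response could look like and how it might be parsed. The stage tag names and the parsing helper are illustrative assumptions, not necessarily the exact format LLaVA-CoT was trained to emit.

```python
import re

# Illustrative example of a tag-delimited, four-stage response such as a
# structured-reasoning VLM could produce. The tag names are assumptions
# made for illustration only.
example_output = """
<SUMMARY>The question asks which animal in the image is larger.</SUMMARY>
<CAPTION>The image shows an elephant standing next to a small dog.</CAPTION>
<REASONING>The elephant visibly towers over the dog, and elephants are far
heavier and taller than dogs in general.</REASONING>
<CONCLUSION>The elephant is the larger animal.</CONCLUSION>
"""

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(text: str) -> dict[str, str]:
    """Split a tag-delimited response into its four reasoning stages."""
    stages = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        stages[stage] = match.group(1).strip() if match else ""
    return stages

for name, content in parse_stages(example_output).items():
    print(f"{name}: {content}")
```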

To make this possible, the researchers constructed a dedicated dataset, LLaVA-o1-100k, by using GPT-4o to generate responses stage by stage. The dataset includes data from both general-purpose visual question answering (VQA) datasets and science-targeted VQA datasets. They then used the generated dataset to perform a supervised, full-parameter fine-tuning of Llama-3.2-11B-Vision-Instruct.
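
As an illustration of what a single training example in such a dataset might look like, the sketch below assembles per-stage teacher outputs into one supervised fine-tuning record. The field names, tag names, and JSON layout are assumptions made for this example and may differ from the actual LLaVA-o1-100k format.

```python
import json

def build_record(image_path: str, question: str, stages: dict[str, str]) -> dict:
    """Combine per-stage teacher outputs (e.g. from GPT-4o) into one record
    whose target answer is the full staged response."""
    target = "".join(
        f"<{name}>{stages[name]}</{name}>"
        for name in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")
    )
    return {
        "image": image_path,                   # hypothetical field name
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": target},  # staged answer used as the label
        ],
    }

record = build_record(
    "images/0001.jpg",
    "How many apples are on the table?",
    {
        "SUMMARY": "The task is to count the apples visible on the table.",
        "CAPTION": "The image shows a wooden table with three red apples.",
        "REASONING": "Counting the distinct apples from left to right gives three.",
        "CONCLUSION": "There are three apples on the table.",
    },
)
print(json.dumps(record, indent=2))
```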

Additionally, LLaVA-CoT uses a novel approach to efficient inference time scaling. Instead of applying beam search at the sentence level, it applies it at the stage level, generating multiple candidate results for each stage. The most promising candidate is then selected to continue the generation process at the next stage. According to the authors, inference time scaling makes it possible for the model to arrive at a concrete answer during the reasoning process and retain it for the final stage. Without it, the model might have to guess at the final stage, possibly leading to incorrect results.

Stage-level beam search, which is made possible by the structured output design of [LLaVA-CoT], is an effective and powerful approach for inference time scaling.
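
In schematic form, stage-level beam search can be sketched as follows: sample several candidates for the current stage, keep the best one, and only then move to the next stage. The `generate_stage` and `score_candidate` functions below are placeholders standing in for the actual model call and selection criterion, which this sketch does not reproduce.

```python
import random

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def generate_stage(context: str, stage: str) -> str:
    """Placeholder: one sampled candidate for the given stage."""
    return f"<{stage}>candidate text {random.random():.3f}</{stage}>"

def score_candidate(context: str, candidate: str) -> float:
    """Placeholder: replace with a model-based comparison of candidates."""
    return random.random()

def stage_level_beam_search(question: str, num_candidates: int = 4) -> str:
    """Generate a structured answer stage by stage, committing to the best
    of several sampled candidates before moving on to the next stage."""
    context = question
    for stage in STAGES:
        candidates = [generate_stage(context, stage) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: score_candidate(context, c))
        context += "\n" + best
    return context

print(stage_level_beam_search("What is shown in the image?"))
```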

To assess their approach, the researchers compared LLaVA-CoT's performance to both its base model and other models. They found LLaVA-CoT provides notable improvements over its base model across general VQA, mathematical reasoning, scientific VQA, and hallucination-control tasks. Additionally, LLaVA-CoT appears to outperform many open-source models of similar or even larger size, such as InternVL2-8B, Ovis1.5-Gemma2-9B, MiniCPM-V2.6-8B, Llama-3.2-90B-Vision-Instruct, and VILA-1.5-40B, as well as closed-source models such as GPT-4o-mini and Gemini-1.5-pro.

LLaVA-CoT is available on Hugging Face, and the authors say the LLaVA-o1-100k dataset will be made public in the future. A web demo is also available that allows users to upload an image and start chatting about it.
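
For readers who want to experiment with the model, the following is a hedged sketch of how a Llama-3.2-Vision-based checkpoint such as LLaVA-CoT could be loaded with the Hugging Face transformers library. The repository id is a placeholder to be replaced with the one published by the authors, and the code follows the standard Llama-3.2-Vision usage rather than anything specific to LLaVA-CoT.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Placeholder repository id: substitute the actual LLaVA-CoT repo on the Hub.
model_id = "org/llava-cot-11b"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this picture?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```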
