Google has added agentic vision to Gemini 3 Flash, combining visual reasoning with code execution to "ground answers in visual evidence". According to Google, this not only improves accuracy, but more importantly unlocks entirely new AI-driven behaviors.
Briefly, rather than analyzing an image in a single pass, Gemini 3 Flash now approaches vision as an agent‑like investigation: planning steps, manipulating the image, and using code to verify details before answering.
This produces a "think → act → observe" loop: the model first analyzes the prompt and the image to plan a multi-step approach; it then generates and executes Python code to manipulate the image and extract additional information from it, for example by cropping, zooming, annotating, or calculating; finally, it appends the transformed image to its context before producing an updated answer.
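The code generated during the "act" step is ordinary image-manipulation Python. As a rough illustration only (the file name, coordinates, and helper function below are hypothetical, not Google's actual generated code), a zoom step might look like this:

```python
# A minimal sketch of the kind of Python the "act" step might produce:
# crop a region of interest and upscale it so small details become legible
# when the image is fed back into the model's context.
from PIL import Image

def crop_and_zoom(image_path, box, scale=4):
    """Crop the region `box` (left, upper, right, lower) and enlarge it."""
    image = Image.open(image_path)
    region = image.crop(box)
    return region.resize(
        (region.width * scale, region.height * scale),
        resample=Image.Resampling.LANCZOS,
    )

# Hypothetical example: zoom into a suspected serial-number area before re-reading it.
detail = crop_and_zoom("receipt.jpg", box=(420, 980, 640, 1040))
detail.save("receipt_detail.png")  # appended to the context for a second look
```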
According to Google, this approach yields a 5-10% accuracy improvement across most vision benchmarks, driven by two major factors.
First, code execution enables fine-grained inspection of details in an image by zooming into smaller visual elements, such as tiny text, rather than relying on guesses. Gemini can also annotate images by drawing bounding boxes and labels to strengthen its visual reasoning, for example by correctly counting objects. Using such annotations, Google claims to have solved the notoriously "hard problem" of counting the digits on a hand.
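Annotation works in a similar fashion: the model can draw its intermediate findings onto the image and then re-inspect the result. Below is a minimal sketch, assuming hypothetical detection coordinates, of how bounding boxes and labels could be burned into an image with Pillow to support counting:

```python
from PIL import Image, ImageDraw

def annotate_detections(image_path, detections, out_path="annotated.png"):
    """Draw a numbered bounding box for each detection and save the result."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for index, (left, top, right, bottom) in enumerate(detections, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=3)
        draw.text((left, max(0, top - 14)), str(index), fill="red")
    image.save(out_path)
    return len(detections)  # the count reported after re-observing the image

# Hypothetical finger boxes for the "count the digits on a hand" example.
boxes = [(50, 30, 90, 200), (100, 10, 140, 200), (150, 20, 190, 200),
         (200, 40, 240, 200), (250, 90, 310, 180)]
print(annotate_detections("hand.jpg", boxes))  # -> 5
```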
Second, visual arithmetic and data visualization can be offloaded to deterministic code written in Python using Matplotlib, reducing hallucinations in complex, image‑based math.
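The sketch below illustrates the idea with assumed chart values: once the numbers have been read off an image, the totals and a verification plot come from deterministic Matplotlib code rather than from the model's own arithmetic.

```python
# Offloading "visual math" to deterministic code: exact sums and a
# re-rendered chart, instead of the model estimating figures from pixels.
import matplotlib
matplotlib.use("Agg")  # headless rendering, as in a sandboxed code tool
import matplotlib.pyplot as plt

# Values the model believes it read from the original chart (assumed).
labels = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.2, 5.1, 3.8, 6.4]

total = sum(revenue)                 # exact arithmetic, not a guess
share_q4 = revenue[-1] / total * 100

fig, ax = plt.subplots()
ax.bar(labels, revenue)
ax.set_title(f"Total = {total:.1f}, Q4 share = {share_q4:.1f}%")
fig.savefig("recomputed_chart.png")  # appended back into the model's context
```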
Reacting to Google's announcement, X user Kanika observed:
Reading this makes earlier vision tools feel incomplete in hindsight. So many edge cases existed simply because models couldn’t intervene or verify visually. Agentic Vision feels like the direction everyone will eventually adopt.
The implications of this are massive. Essentially they've unlocked visual reasoning for AI to be implemented in actual physical robots. Robots will have tons more context awareness and agentic capabilities.
Other redditors noted that ChatGPT has employed a similar approach for quite some time via Code Interpreter; nevertheless, it still appears unable to reliably count the digits on a hand.
Google's roadmap for agentic vision includes more implicit behavior, such as automatically triggering zooming, rotation, and other actions without explicit prompts; adding new tools such as web and reverse image search to enhance the evidence available to the model; and extending support to other models in the Gemini family beyond Flash.
Agentic Vision is accessible through the Gemini API in Google AI Studio and Vertex AI, and is starting to roll out in the Gemini app in Thinking mode.
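For developers, this builds on the code-execution tool already exposed by the Gemini API. The following is a minimal sketch using the google-genai Python SDK; the model identifier, prompt, and image are assumptions, so the current documentation should be consulted for exact names:

```python
import pathlib
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

image = types.Part.from_bytes(
    data=pathlib.Path("shelf.jpg").read_bytes(),
    mime_type="image/jpeg",
)

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed identifier; check the current model list
    contents=[image, "How many bottles are on the top shelf? Verify by zooming in."],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

# The response interleaves text, the Python the model ran, and its output.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```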