Position: The Systemic Lack of Agency in Visual Reasoning

Strong Semantic Recognition Does Not Equate to Active Visual Exploration

Yizhao Huang1*, Haoyang Chen1,2*, Shiqin Wang1*, Pohson Huang1*, Jiayuan Li2,3, Haoyuan Du1, Yandong Shi1, Zheng Wang1,2, Zhixiang Wang4
1 Wuhan University, 2 Zhongguancun Academy, 3 Beijing Institute of Technology, 4 Shanda AI Research Tokyo
*Equal Contribution Corresponding Author
ICML 2026

Overview

Abstract

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration.

As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs.

The Problem

The Visual Agency Gap

[Figure 1 panels: Active Agency vs. Passive Capacity]
Human (Active Agency): Coin is 2.5 cm in diameter -> bottle is 8 coins high -> 20 cm. Correct.
VLM (Passive Capacity): Looks like a bottle -> most bottles are 15 cm. Wrong.

Figure 1: Comparison of Active Agency vs. Passive Capacity. While humans actively seek out implicit visual cues (e.g., the size of a coin) to solve problems, VLMs often ignore context and rely on priors.
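
The human reasoning in Figure 1 amounts to scaling by a reference object of known size. The snippet below is a minimal sketch of that arithmetic using only the numbers shown in the figure. The function name `estimate_height_cm` and its parameters are illustrative assumptions and are not part of V-IRD or the paper.

```python
def estimate_height_cm(reference_size_cm: float, reference_count: float) -> float:
    """Estimate a target's height from a reference object of known size.

    reference_size_cm: real-world size of the reference (e.g., a coin's diameter).
    reference_count: how many reference lengths span the target in the image.
    """
    if reference_size_cm <= 0 or reference_count <= 0:
        raise ValueError("reference size and count must be positive")
    return reference_size_cm * reference_count


# Values from Figure 1: a 2.5 cm coin, and a bottle about 8 coins high.
active_estimate = estimate_height_cm(2.5, 8)

# The passive answer in Figure 1 ignores the coin and falls back on a prior.
passive_estimate = 15.0  # "most bottles are 15 cm"

print(f"active: {active_estimate} cm, passive: {passive_estimate} cm")
```

The calculation itself is trivial. The hard part, and what the figure attributes to agency, is noticing that the coin is usable evidence in the first place.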

The Data

Task Distribution & Taxonomy

Figure 3: Statistics of the 4 categories and 10 tasks in the V-IRD benchmark.

The V-IRD benchmark is structured into a hierarchical taxonomy spanning four core domains to ensure comprehensive cognitive coverage:

  • Spatial Geometry (41%): Focuses on precise metrology tasks such as Length, Distance, Volume, and Area.
  • Contextual Inference (29%): Challenges the model to deduce abstract information such as Environment and Remarks.
  • Physical Properties (21%): Covers Temperature (e.g., inferring from steam) and Weight estimation.
  • Physical Logic (9%): Involves abstract reasoning such as Electricity (Ohm's Law) and Kinematics.
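
For readers who want to work with the taxonomy programmatically, the four categories and ten tasks listed above can be written as a small lookup table. This is only a sketch: the category shares and task names come from the list, while the dictionary layout and the name `V_IRD_TAXONOMY` are assumptions and do not reflect the benchmark's actual file format.

```python
# Shares and task names are taken from the list above; the structure is hypothetical.
V_IRD_TAXONOMY = {
    "Spatial Geometry": {"share_pct": 41, "tasks": ["Length", "Distance", "Volume", "Area"]},
    "Contextual Inference": {"share_pct": 29, "tasks": ["Environment", "Remarks"]},
    "Physical Properties": {"share_pct": 21, "tasks": ["Temperature", "Weight"]},
    "Physical Logic": {"share_pct": 9, "tasks": ["Electricity", "Kinematics"]},
}

# Sanity checks: the shares should cover the whole benchmark,
# and the task count should match the 10 tasks reported in Figure 3.
total_share = sum(cat["share_pct"] for cat in V_IRD_TAXONOMY.values())
total_tasks = sum(len(cat["tasks"]) for cat in V_IRD_TAXONOMY.values())

print(total_share, total_tasks)
```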

Examples

V-IRD Benchmark Cases

Representative instances requiring autonomous visual discovery across domains.

Figure 4: The questions are designed to be "Target-Exclusive": they ask for the final answer without explicitly pointing to the visual evidence required to derive it.

Evaluation

Main Benchmark Results

We evaluated a wide range of VLMs, from open-source 2B models to proprietary SOTA models like GPT-5.2 and Gemini-3-Pro.

Main Benchmark Results Table

Key Observations

  • Severe Collapse in Spatial Geometry: Most models struggle significantly in spatial tasks (e.g., measuring length using a reference).
  • Proprietary vs. Open Source: Closed-source models generally outperform open-source models, exhibiting stronger robustness in active retrieval.
  • Agency Deficit is Universal: Despite differences in capacity, the deficit in visual agency remains a bottleneck across all evaluated architectures relative to human performance.

BibTeX


@article{huang2026VIRD,
  title={Position: The Systemic Lack of Agency in Visual Reasoning},
  author={Huang, Yizhao and Chen, Haoyang and Wang, Shiqin and Huang, Pohsun and Li, Jiayuan and Du, Haoyuan and Shi, Yandong and Wang, Zheng and Wang, Zhixiang},
  journal={arXiv preprint arXiv:2606.14795},
  year={2026}
}