Position: The Systemic Lack of Agency in Visual Reasoning

Strong Semantic Recognition Does Not Equate to Active Visual Exploration

Yizhao Huang¹* Haoyang Chen¹,²* Shiqin Wang¹* Pohson Huang¹* Jiayuan Li²,³ Haoyuan Du¹ Yandong Shi¹

Zheng Wang¹,² Zhixiang Wang⁴

¹Wuhan University ²Zhongguancun Academy ³Beijing Institute of Technology ⁴Shanda AI Research Tokyo

*Equal Contribution Corresponding Author

ICML 2026

Overview

Abstract

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration.

As a result, most existing benchmarks primarily assess Passive Capacity, leaving active, self-directed visual reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to use reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, and this reveals a critical gap in current VLMs.

The Problem

The Visual Agency Gap

Human (Active Agency): Coin is 2.5 cm in diameter -> bottle is 8 coins high -> 20 cm. Correct.

VLM (Passive Capacity): Looks like a bottle -> most bottles are 15 cm. Wrong.

Figure 1: Comparison of Active Agency vs. Passive Capacity. Humans actively seek out implicit visual cues (e.g., the size of a coin) to solve a problem, whereas VLMs often ignore this context and fall back on priors.
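
The reference-object reasoning in Figure 1 comes down to a simple proportion. The sketch below is only an illustration of that arithmetic, not the benchmark's evaluation code: the function name and the pixel measurements are hypothetical, and it assumes the coin and the bottle lie at roughly the same depth so that perspective distortion can be ignored.

```python
def estimate_length_cm(reference_size_cm: float,
                       target_px: float,
                       reference_px: float) -> float:
    """Scale a target's pixel extent by a reference object of known real size.

    Assumes both objects sit in about the same image plane (hypothetical setup).
    """
    if reference_px <= 0:
        raise ValueError("reference_px must be positive")
    return reference_size_cm * (target_px / reference_px)


# Values from Figure 1: the coin is 2.5 cm wide and the bottle is 8 coins high.
coin_diameter_cm = 2.5
coin_px = 50.0     # hypothetical pixel width of the coin
bottle_px = 400.0  # hypothetical pixel height of the bottle (8 coin widths)

bottle_height_cm = estimate_length_cm(coin_diameter_cm, bottle_px, coin_px)
print(bottle_height_cm)  # 20.0
```

The point of the example is that nothing in the question names the coin. A solver has to notice it, recall its real size, and apply the ratio without being told to.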

The Data

Task Distribution & Taxonomy

Figure 3: Statistics of the 4 categories and 10 tasks in the V-IRD Benchmark.

The V-IRD benchmark is structured into a hierarchical taxonomy spanning four core domains to ensure comprehensive cognitive coverage:

Spatial Geometry (41%): Precise metrology tasks such as Length, Distance, Volume, and Area.
Contextual Inference (29%): Deducing abstract information such as Environment and Remarks.
Physical Properties (21%): Estimating Temperature (e.g., inferring it from steam) and Weight.
Physical Logic (9%): Abstract reasoning such as Electricity (Ohm's Law) and Kinematics.
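
To make the Physical Logic category concrete, here is a minimal sketch of the kind of calculation an Electricity task might call for once the relevant quantities have been read off the image. The scenario, the function name, and the numbers are all hypothetical and are not taken from the benchmark; only Ohm's Law itself (I = V / R) is given.

```python
def current_amperes(voltage_v: float, resistance_ohm: float) -> float:
    """Apply Ohm's Law, I = V / R, to values read from the scene."""
    if resistance_ohm <= 0:
        raise ValueError("resistance_ohm must be positive")
    return voltage_v / resistance_ohm


# Hypothetical visual evidence: a battery labelled 9 V and a 450-ohm resistor.
battery_voltage_v = 9.0
resistor_ohm = 450.0

current_a = current_amperes(battery_voltage_v, resistor_ohm)
print(current_a)  # 0.02
```

As with the metrology tasks, the arithmetic is trivial. The hard part is finding the label and the component value in the image without being pointed to them.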

Examples

V-IRD Benchmark Cases

Representative instances requiring autonomous visual discovery across domains.

Figure 4: The questions are designed to be "Target-Exclusive", meaning they ask for the final answer without explicitly pointing to the visual evidence required to solve it.

Evaluation

Main Benchmark Results

We evaluated a wide range of VLMs, from open-source 2B models to proprietary SOTA models like GPT-5.2 and Gemini-3-Pro.

Key Observations

Severe Collapse in Spatial Geometry: Most models struggle significantly in spatial tasks (e.g., measuring length using a reference).
Proprietary vs. Open Source: Closed-source models generally outperform open-source models, exhibiting stronger robustness in active retrieval.
Agency Deficit is Universal: Regardless of model capacity, all architectures fall short of human performance, and the deficit in visual agency remains the common bottleneck.

BibTeX


@article{huang2026VIRD,
  title={Position: The Systemic Lack of Agency in Visual Reasoning},
  author={Huang, Yizhao and Chen, Haoyang and Wang, Shiqin and Huang, Pohsun and Li, Jiayuan and Du, Haoyuan and Shi, Yandong and Wang, Zheng and Wang, Zhixiang},
  journal={arXiv preprint arXiv:2606.14795},
  year={2026}
}