| Main Authors: | Ding, Hao; Yang, Zhichuan; Ge, Weijie; Gao, Ziqin; Lu, Chaoyi; Zhao, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2602.14482 |
| _version_ | 1866911505548574720 |
|---|---|
| author | Ding, Hao; Yang, Zhichuan; Ge, Weijie; Gao, Ziqin; Lu, Chaoyi; Zhao, Lei |
| author_facet | Ding, Hao; Yang, Zhichuan; Ge, Weijie; Gao, Ziqin; Lu, Chaoyi; Zhao, Lei |
| contents | Fine-grained visual reasoning in multimodal large language models (MLLMs) is bottlenecked by single-pass global image encoding: key evidence often lies in tiny objects, cluttered regions, subtle markings, or dense charts. We present TikArt (Thinking Aperture), an aperture-guided agent that formulates multimodal reasoning as sequential evidence acquisition over regions of interest. TikArt follows a Think-Aperture-Observe (TAO) loop that interleaves language reasoning with two aperture actions: Zoom, which extracts rectangular crops, and Segment, which invokes an off-the-shelf segmenter to produce object-centric mask-based views for irregular targets. A mandatory Observation step after every aperture action writes local evidence back into text, yielding interpretable aperture trajectories and persistent linguistic memory. Built on Qwen3-VL-8B, TikArt is trained with GRPO-style reinforcement learning under a two-stage curriculum. To stabilize long-horizon tool-integrated learning, we introduce Relative Uncertainty Reduction (RUR), a dense reward computed by a frozen evaluator that favors evidence-building trajectories and mitigates degenerate tool use. Experiments on high-resolution reasoning, general multimodal understanding, and both referring and reasoning-oriented segmentation show consistent gains over the backbone, demonstrating that aperture-guided observation improves fine-grained visual reasoning and transfers naturally to pixel-level grounding. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_14482 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence |
| url | https://arxiv.org/abs/2602.14482 |