| Main Authors: | Marsili, Damiano, Agrawal, Rohun, Yue, Yisong, Gkioxari, Georgia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.06787 |
Similar Items
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
by: Marsili, Damiano, et al.
Published: (2025)
Same or Not? Enhancing Visual Perception in Vision-Language Models
by: Marsili, Damiano, et al.
Published: (2025)
Find Any Part in 3D
by: Ma, Ziqi, et al.
Published: (2024)
Feedforward 3D Editing via Text-Steerable Image-to-3D
by: Ma, Ziqi, et al.
Published: (2025)
Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
by: Demler, Ilona, et al.
Published: (2025)
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
by: Sahoo, Aadarsh, et al.
Published: (2026)
Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models
by: Kang, Raphi, et al.
Published: (2026)
Aligning Text, Images, and 3D Structure Token-by-Token
by: Sahoo, Aadarsh, et al.
Published: (2025)
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
by: Ma, Ziqi, et al.
Published: (2026)
Is CLIP ideal? No. Can we fix it? Yes!
by: Kang, Raphi, et al.
Published: (2025)
Reconstructing Hand-Held Objects in 3D from Images and Videos
by: Wu, Jane, et al.
Published: (2024)
MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation
by: Zuo, Xingxing, et al.
Published: (2025)
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
by: Agrawal, Palaash, et al.
Published: (2023)
How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
by: Yang, Qian, et al.
Published: (2026)
Caltech Aerial RGB-Thermal Dataset in the Wild
by: Lee, Connor, et al.
Published: (2024)
NitroGen: An Open Foundation Model for Generalist Gaming Agents
by: Magne, Loïc, et al.
Published: (2026)
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
by: Wang, Yikun, et al.
Published: (2025)
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
by: Li, Yian, et al.
Published: (2026)
Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
by: Luo, Liqin, et al.
Published: (2025)
SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
by: Sun, Peiwen, et al.
Published: (2025)
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
by: Jain, Jitesh, et al.
Published: (2025)
Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models
by: Deng, Wei, et al.
Published: (2026)
Self-Evolving Visual Concept Library using Vision-Language Critics
by: Sehgal, Atharva, et al.
Published: (2025)
RadFabric: Agentic AI System with Reasoning Capability for Radiology
by: Chen, Wenting, et al.
Published: (2025)
From Web to Pixels: Bringing Agentic Search into Visual Perception
by: Yang, Bokang, et al.
Published: (2026)
Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
by: Yoon, Lauren Hyoseo, et al.
Published: (2025)
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
by: Luo, Zhanpeng, et al.
Published: (2026)
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
by: Ranasinghe, Kanchana, et al.
Published: (2024)
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
by: Wang, Zhaozhi, et al.
Published: (2025)
Visual-Semantic Graph Matching Net for Zero-Shot Learning
by: Duan, Bowen, et al.
Published: (2024)
Act2See: Emergent Active Visual Perception for Video Reasoning
by: Ma, Martin Q., et al.
Published: (2026)
Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
by: Wang, Austin, et al.
Published: (2026)
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
by: Ding, Shengyuan, et al.
Published: (2025)
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
by: Liu, Shifeng, et al.
Published: (2026)
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
by: Jeon, Byungwoo, et al.
Published: (2026)
Learning GUI Grounding with Spatial Reasoning from Visual Feedback
by: Zhao, Yu, et al.
Published: (2025)
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
by: Wang, Haoming, et al.
Published: (2025)
Unsupervised Representation Learning from Sparse Transformation Analysis
by: Song, Yue, et al.
Published: (2024)
MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation
by: Wang, Haoming, et al.
Published: (2026)
Enhancing Spatial Reasoning through Visual and Textual Thinking
by: Liang, Xun, et al.
Published: (2025)