| Main Authors: | Marsili, Damiano, Mehta, Aditya, Lin, Ryan Y., Gkioxari, Georgia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.23592 |
Similar Items
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
by: Marsili, Damiano, et al.
Published: (2025)
Visual Agentic AI for Spatial Reasoning with a Dynamic API
by: Marsili, Damiano, et al.
Published: (2025)
Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models
by: Kang, Raphi, et al.
Published: (2026)
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
by: Sahoo, Aadarsh, et al.
Published: (2026)
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
by: Ma, Ziqi, et al.
Published: (2026)
Aligning Text, Images, and 3D Structure Token-by-Token
by: Sahoo, Aadarsh, et al.
Published: (2025)
Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
by: Demler, Ilona, et al.
Published: (2025)
Find Any Part in 3D
by: Ma, Ziqi, et al.
Published: (2024)
Reconstructing Hand-Held Objects in 3D from Images and Videos
by: Wu, Jane, et al.
Published: (2024)
MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation
by: Zuo, Xingxing, et al.
Published: (2025)
Feedforward 3D Editing via Text-Steerable Image-to-3D
by: Ma, Ziqi, et al.
Published: (2025)
Is CLIP ideal? No. Can we fix it? Yes!
by: Kang, Raphi, et al.
Published: (2025)
Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
by: Shourya, Aditya, et al.
Published: (2025)
LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation
by: Mehta, Dwij, et al.
Published: (2024)
Caltech Aerial RGB-Thermal Dataset in the Wild
by: Lee, Connor, et al.
Published: (2024)
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
by: Wen, Yuxin, et al.
Published: (2024)
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)
Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)
Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
by: Quan, Rong, et al.
Published: (2026)
Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
by: Waseda, Futa, et al.
Published: (2025)
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
by: Ma, Martin Q., et al.
Published: (2026)
Do Vision-Language Foundational models show Robust Visual Perception?
by: Chandhok, Shivam, et al.
Published: (2024)
Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
by: Kanade, Aditya, et al.
Published: (2025)
Visual Perception by Large Language Model's Weights
by: Ma, Feipeng, et al.
Published: (2024)
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
by: Liu, Yang, et al.
Published: (2025)
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
by: Zhou, Xirui, et al.
Published: (2025)
Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision
by: Natalie, Rosiana, et al.
Published: (2025)
Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment
by: Li, Yuan, et al.
Published: (2025)
EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
by: Shan, Haozhe, et al.
Published: (2026)
ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles
by: Sahu, Nishad, et al.
Published: (2025)
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
by: Zhou, Yikang, et al.
Published: (2025)
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
by: Kamoi, Ryo, et al.
Published: (2024)
AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models
by: Ganj, Ashkan, et al.
Published: (2025)
MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models
by: Chiu, Ming-Chang, et al.
Published: (2024)
Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
by: Li, Yuchen, et al.
Published: (2026)
PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
by: Qi, Yukun, et al.
Published: (2026)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
by: Wang, Peng, et al.
Published: (2024)
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
by: Jian, Pu, et al.
Published: (2025)