:: Library Catalog

Bibliographic Details
Main Authors:	Marsili, Damiano; Mehta, Aditya; Lin, Ryan Y.; Gkioxari, Georgia
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.23592

Similar Items

No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
by: Marsili, Damiano, et al.
Published: (2025)

Visual Agentic AI for Spatial Reasoning with a Dynamic API
by: Marsili, Damiano, et al.
Published: (2025)

Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models
by: Kang, Raphi, et al.
Published: (2026)

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
by: Sahoo, Aadarsh, et al.
Published: (2026)

Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
by: Ma, Ziqi, et al.
Published: (2026)

Aligning Text, Images, and 3D Structure Token-by-Token
by: Sahoo, Aadarsh, et al.
Published: (2025)

Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
by: Demler, Ilona, et al.
Published: (2025)

Find Any Part in 3D
by: Ma, Ziqi, et al.
Published: (2024)

Reconstructing Hand-Held Objects in 3D from Images and Videos
by: Wu, Jane, et al.
Published: (2024)

MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation
by: Zuo, Xingxing, et al.
Published: (2025)

Feedforward 3D Editing via Text-Steerable Image-to-3D
by: Ma, Ziqi, et al.
Published: (2025)

Is CLIP ideal? No. Can we fix it? Yes!
by: Kang, Raphi, et al.
Published: (2025)

Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
by: Shourya, Aditya, et al.
Published: (2025)

LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation
by: Mehta, Dwij, et al.
Published: (2024)

Caltech Aerial RGB-Thermal Dataset in the Wild
by: Lee, Connor, et al.
Published: (2024)

Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
by: Wen, Yuxin, et al.
Published: (2024)

ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)

Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
by: Quan, Rong, et al.
Published: (2026)

Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
by: Waseda, Futa, et al.
Published: (2025)

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
by: Ma, Martin Q., et al.
Published: (2026)

Do Vision-Language Foundational models show Robust Visual Perception?
by: Chandhok, Shivam, et al.
Published: (2024)

Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
by: Kanade, Aditya, et al.
Published: (2025)

Visual Perception by Large Language Model's Weights
by: Ma, Feipeng, et al.
Published: (2024)

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
by: Liu, Yang, et al.
Published: (2025)

DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
by: Zhou, Xirui, et al.
Published: (2025)

Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision
by: Natalie, Rosiana, et al.
Published: (2025)

Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment
by: Li, Yuan, et al.
Published: (2025)

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
by: Shan, Haozhe, et al.
Published: (2026)

ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles
by: Sahu, Nishad, et al.
Published: (2025)

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
by: Zhou, Yikang, et al.
Published: (2025)

Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
by: Kamoi, Ryo, et al.
Published: (2024)

AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models
by: Ganj, Ashkan, et al.
Published: (2025)

MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models
by: Chiu, Ming-Chang, et al.
Published: (2024)

Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
by: Li, Yuchen, et al.
Published: (2026)

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
by: Qi, Yukun, et al.
Published: (2026)

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
by: Wang, Peng, et al.
Published: (2024)

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
by: Jian, Pu, et al.
Published: (2025)