Library Catalog

Bibliographic Details
Main Author:	Lan, HaoTian
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.05080

Similar Items

Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
by: Lan, HaoTian
Published: (2025)

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
by: Zhuo, Xinxin, et al.
Published: (2026)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
by: Liu, Peng, et al.
Published: (2025)

Perception-R1: Pioneering Perception Policy with Reinforcement Learning
by: Yu, En, et al.
Published: (2025)

Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)

The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
by: Ghosh, Samrajnee, et al.
Published: (2025)

pLitterStreet: Street Level Plastic Litter Detection and Mapping
by: Mandhati, Sriram Reddy, et al.
Published: (2024)

Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
by: Li, Yunxin, et al.
Published: (2025)

On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)

Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)

Eyes on the Streets: Leveraging Street-Level Imaging to Model Urban Crime Dynamics
by: Qi, Zhixuan, et al.
Published: (2024)

LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning
by: Che, Haoxuan, et al.
Published: (2025)

Unleashing Perception-Time Scaling to Multimodal Reasoning Models
by: Li, Yifan, et al.
Published: (2025)

Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
by: Sun, Yizheng, et al.
Published: (2025)

Mitigating Object Hallucination via Robust Local Perception Search
by: Gao, Zixian, et al.
Published: (2025)

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
by: Liang, Jiafeng, et al.
Published: (2026)

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
by: Zhu, Wenxin, et al.
Published: (2025)

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
by: Huang, Haoyu, et al.
Published: (2026)

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
by: Luo, Run, et al.
Published: (2024)

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
by: Liu, Xunzhuo, et al.
Published: (2026)

Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)

Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method
by: Sun, Yao, et al.
Published: (2025)

Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
by: Hayashi, Kazuki, et al.
Published: (2025)

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)

An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)

ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving
by: Ma, Yunsheng, et al.
Published: (2025)

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception
by: Huang, Yipo, et al.
Published: (2024)

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
by: Diao, Xingjian, et al.
Published: (2026)

AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
by: Chaturvedi, Saket S., et al.
Published: (2025)

Token-Level Entropy Reveals Demographic Disparities in Language Models
by: Lee, Messi H. J.
Published: (2025)

Pixel-Level Reasoning Segmentation via Multi-turn Conversations
by: Cai, Dexian, et al.
Published: (2025)

Multi-Level Correlation Network For Few-Shot Image Classification
by: Dang, Yunkai, et al.
Published: (2024)

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
by: Wu, Juncheng, et al.
Published: (2026)

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
by: Li, Shaoxuan, et al.
Published: (2026)