| Main Author: | Lan, HaoTian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.05080 |
Similar Items
Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
by: Lan, HaoTian
Published: (2025)
Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
by: Zhuo, Xinxin, et al.
Published: (2026)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
by: Liu, Peng, et al.
Published: (2025)
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
by: Yu, En, et al.
Published: (2025)
Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)
The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
by: Ghosh, Samrajnee, et al.
Published: (2025)
pLitterStreet: Street Level Plastic Litter Detection and Mapping
by: Mandhati, Sriram Reddy, et al.
Published: (2024)
Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
by: Schumann, Raphael, et al.
Published: (2023)
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
by: Li, Yunxin, et al.
Published: (2025)
On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)
Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)
Eyes on the Streets: Leveraging Street-Level Imaging to Model Urban Crime Dynamics
by: Qi, Zhixuan, et al.
Published: (2024)
LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning
by: Che, Haoxuan, et al.
Published: (2025)
Unleashing Perception-Time Scaling to Multimodal Reasoning Models
by: Li, Yifan, et al.
Published: (2025)
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
by: Sun, Yizheng, et al.
Published: (2025)
Mitigating Object Hallucination via Robust Local Perception Search
by: Gao, Zixian, et al.
Published: (2025)
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
by: Liang, Jiafeng, et al.
Published: (2026)
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
by: Zhu, Wenxin, et al.
Published: (2025)
Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
by: Huang, Haoyu, et al.
Published: (2026)
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
by: Luo, Run, et al.
Published: (2024)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)
Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
by: Liu, Xunzhuo, et al.
Published: (2026)
Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)
Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method
by: Sun, Yao, et al.
Published: (2025)
Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
by: Hayashi, Kazuki, et al.
Published: (2025)
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)
An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)
ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving
by: Ma, Yunsheng, et al.
Published: (2025)
AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception
by: Huang, Yipo, et al.
Published: (2024)
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
by: Diao, Xingjian, et al.
Published: (2026)
AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
by: Chaturvedi, Saket S., et al.
Published: (2025)
Token-Level Entropy Reveals Demographic Disparities in Language Models
by: Lee, Messi H. J.
Published: (2025)
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
by: Cai, Dexian, et al.
Published: (2025)
Multi-Level Correlation Network For Few-Shot Image Classification
by: Dang, Yunkai, et al.
Published: (2024)
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
by: Wu, Juncheng, et al.
Published: (2026)
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
by: Li, Shaoxuan, et al.
Published: (2026)