:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Xunzhuo; He, Bowei; Liu, Xue; Luo, Andy; Zhang, Haichen; Chen, Huamin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2603.14707

Similar Items

Adaptive Vision-Language Model Routing for Computer Use Agents
by: Liu, Xunzhuo, et al.
Published: (2026)

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
by: Liu, Xunzhuo, et al.
Published: (2026)

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
by: Liu, Xunzhuo, et al.
Published: (2026)

98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
by: Liu, Xunzhuo, et al.
Published: (2026)

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
by: Liu, Xunzhuo, et al.
Published: (2026)

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
by: Wu, Shu, et al.
Published: (2026)

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

Vision Language Models are Confused Tourists
by: Irawan, Patrick Amadeus, et al.
Published: (2025)

Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)

Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)

VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference
by: Song, Sicheng, et al.
Published: (2025)

A Computational Approach to Visual Metonymy
by: Ghosh, Saptarshi, et al.
Published: (2026)

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
by: Zhou, Andy, et al.
Published: (2024)

How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations?
by: Lo, Leo Yu-Ho, et al.
Published: (2024)

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
by: Luo, Run, et al.
Published: (2024)

GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph
by: Liu, Shaonan, et al.
Published: (2024)

On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
by: Li, Yifan, et al.
Published: (2024)

Unleashing Perception-Time Scaling to Multimodal Reasoning Models
by: Li, Yifan, et al.
Published: (2025)

Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
by: Dong, Qihua, et al.
Published: (2026)

Delve into Base-Novel Confusion: Redundancy Exploration for Few-Shot Class-Incremental Learning
by: Zhou, Haichen, et al.
Published: (2024)

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
by: Zhou, Guanyu, et al.
Published: (2026)

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
by: Rose, Daniel, et al.
Published: (2023)

Latent Visual Reasoning
by: Li, Bangzheng, et al.
Published: (2025)

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
by: Bannur, Shruthi, et al.
Published: (2023)

ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving
by: Ma, Yunsheng, et al.
Published: (2025)

AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
by: Zhang, Yuqi, et al.
Published: (2025)

Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
by: Tian, Yuanhe, et al.
Published: (2025)

Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
by: Nooralahzadeh, Farhad, et al.
Published: (2026)

Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
by: Su, Junhao, et al.
Published: (2025)

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
by: Liu, Xiao, et al.
Published: (2024)

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
by: Li, Yunxin, et al.
Published: (2025)

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
by: Chen, Chen, et al.
Published: (2024)

Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)