| Main Authors: | Liu, Xunzhuo, He, Bowei, Liu, Xue, Luo, Andy, Zhang, Haichen, Chen, Huamin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.14707 |
Similar Items
Adaptive Vision-Language Model Routing for Computer Use Agents
by: Liu, Xunzhuo, et al.
Published: (2026)
Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
by: Liu, Xunzhuo, et al.
Published: (2026)
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
by: Liu, Xunzhuo, et al.
Published: (2026)
98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
by: Liu, Xunzhuo, et al.
Published: (2026)
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
by: Liu, Xunzhuo, et al.
Published: (2026)
Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
by: Wu, Shu, et al.
Published: (2026)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
by: Wang, Junyang, et al.
Published: (2024)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
Vision Language Models are Confused Tourists
by: Irawan, Patrick Amadeus, et al.
Published: (2025)
Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)
Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)
VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference
by: Song, Sicheng, et al.
Published: (2025)
A Computational Approach to Visual Metonymy
by: Ghosh, Saptarshi, et al.
Published: (2026)
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
by: Zhou, Andy, et al.
Published: (2024)
How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations?
by: Lo, Leo Yu-Ho, et al.
Published: (2024)
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
by: Luo, Run, et al.
Published: (2024)
GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph
by: Liu, Shaonan, et al.
Published: (2024)
On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
by: Li, Yifan, et al.
Published: (2024)
Unleashing Perception-Time Scaling to Multimodal Reasoning Models
by: Li, Yifan, et al.
Published: (2025)
Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
by: Dong, Qihua, et al.
Published: (2026)
Delve into Base-Novel Confusion: Redundancy Exploration for Few-Shot Class-Incremental Learning
by: Zhou, Haichen, et al.
Published: (2024)
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
by: Zhou, Guanyu, et al.
Published: (2026)
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
by: Rose, Daniel, et al.
Published: (2023)
Latent Visual Reasoning
by: Li, Bangzheng, et al.
Published: (2025)
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
by: Bannur, Shruthi, et al.
Published: (2023)
ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving
by: Ma, Yunsheng, et al.
Published: (2025)
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
by: Zhang, Yuqi, et al.
Published: (2025)
Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
by: Tian, Yuanhe, et al.
Published: (2025)
Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
by: Nooralahzadeh, Farhad, et al.
Published: (2026)
Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
by: Su, Junhao, et al.
Published: (2025)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
by: Liu, Xiao, et al.
Published: (2024)
Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
by: Li, Yunxin, et al.
Published: (2025)
CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge
by: Chen, Chen, et al.
Published: (2024)
Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)