| Main Authors: | Ghatkesar, Aarti, Venkatesh, Ganesh |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.05626 |
Similar Items
SCRAMBLe : Enhancing Multimodal LLM Compositionality with Synthetic Preference Data
by: Mishra, Samarth, et al.
Published: (2025)
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model
by: She, Yifei, et al.
Published: (2025)
Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
by: Wang, Hengyi, et al.
Published: (2026)
BLINK: Multimodal Large Language Models Can See but Not Perceive
by: Fu, Xingyu, et al.
Published: (2024)
Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
by: Jung, Mingi, et al.
Published: (2025)
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
by: Yeh, Chun-Hsiao, et al.
Published: (2026)
PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval
by: Pan, Jiancheng, et al.
Published: (2024)
Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension
by: Lu, Kaixuan, et al.
Published: (2024)
DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice
by: Meng, Zijie, et al.
Published: (2025)
Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
by: Theodoridis, Nikos, et al.
Published: (2026)
See What You Are Told: Visual Attention Sink in Large Multimodal Models
by: Kang, Seil, et al.
Published: (2025)
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
by: Chae, Hyunsik, et al.
Published: (2025)
Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
by: Yu, Mingyu, et al.
Published: (2026)
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
by: Bigverdi, Mahtab, et al.
Published: (2024)
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
by: Liao, Haicheng, et al.
Published: (2023)
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
by: Zhang, Qizhe, et al.
Published: (2024)
Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
by: Jiang, Yifan, et al.
Published: (2025)
Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
by: Caffagni, Davide, et al.
Published: (2025)
True Multimodal In-Context Learning Needs Attention to the Visual Context
by: Chen, Shuo, et al.
Published: (2025)
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
by: Xing, Shuo, et al.
Published: (2025)
First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
by: Zhang, Enming, et al.
Published: (2024)
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
by: Liu, Ting, et al.
Published: (2024)
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models
by: Villa, Andrés, et al.
Published: (2025)
Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
by: Babaiee, Zahra, et al.
Published: (2025)
Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)
Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
by: Góral, Gracjan, et al.
Published: (2025)
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
by: Ge, Yuyao, et al.
Published: (2025)
Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection
by: Aggarwal, Sajal, et al.
Published: (2024)
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
by: Jiang, Xi, et al.
Published: (2024)
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
by: Jung, Woojun, et al.
Published: (2025)
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
by: Li, Hengzhuang, et al.
Published: (2025)
A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
by: Rehman, Mohammad Zia Ur, et al.
Published: (2025)
CLDTracker: A Comprehensive Language Description for Visual Tracking
by: Alansari, Mohamad, et al.
Published: (2025)
Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models
by: Apedo, Yvon, et al.
Published: (2026)
Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
by: Ma, Longhui, et al.
Published: (2026)
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
by: Zhang, Xinyu, et al.
Published: (2026)
Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning
by: Chen, Junkai, et al.
Published: (2026)
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
by: Lin, Zhihang, et al.
Published: (2024)
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
by: Dong, Xinpeng, et al.
Published: (2026)
Feature Recalibration Based Olfactory-Visual Multimodal Model for Enhanced Rice Deterioration Detection
by: Zhao, Rongqiang, et al.
Published: (2026)