:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Wei-Yao; Wang, Zhao; Suzuki, Helen; Kobayashi, Yoshiyuki
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.02597

Similar Items

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)

See What You Are Told: Visual Attention Sink in Large Multimodal Models
by: Kang, Seil, et al.
Published: (2025)

Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
by: Ma, Longhui, et al.
Published: (2026)

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
by: Fei, Jiajun, et al.
Published: (2024)

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
by: Liu, Zhining, et al.
Published: (2025)

CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
by: Kim, Jiwan, et al.
Published: (2025)

Multimodal Fusion SLAM with Fourier Attention
by: Zhou, Youjie, et al.
Published: (2025)

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout
by: Yu, Jun, et al.
Published: (2026)

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
by: Liao, Haicheng, et al.
Published: (2023)

Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
by: Wang, Wenxuan, et al.
Published: (2025)

Plots Unlock Time-Series Understanding in Multimodal Models
by: Daswani, Mayank, et al.
Published: (2024)

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
by: Peng, Tianhao, et al.
Published: (2025)

DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
by: Shen, Xuan, et al.
Published: (2025)

AID: Attention Interpolation of Text-to-Image Diffusion
by: He, Qiyuan, et al.
Published: (2024)

GazeLLM: Multimodal LLMs incorporating Human Visual Attention
by: Rekimoto, Jun
Published: (2025)

Local and Global Feature Attention Fusion Network for Face Recognition
by: Yu, Wang, et al.
Published: (2024)

CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration
by: Zhang, Dongxu, et al.
Published: (2026)

MAMI: Multi-Attentional Mutual-Information for Long Sequence Neuron Captioning
by: Fauzulhaq, Alfirsa Damasyifa, et al.
Published: (2024)

Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs
by: Zhang, Xiaofeng, et al.
Published: (2024)

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
by: Li, Jiaming, et al.
Published: (2025)

Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
by: Li, Zhiyuan, et al.
Published: (2024)

Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs
by: Yu, Liu, et al.
Published: (2025)

Rethinking Causal Mask Attention for Vision-Language Inference
by: Pei, Xiaohuan, et al.
Published: (2025)

Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis
by: Song, Tianyi, et al.
Published: (2023)

Region-Level Context-Aware Multimodal Understanding
by: Wei, Hongliang, et al.
Published: (2025)

Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
by: Li, Chengzhi, et al.
Published: (2025)

Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
by: Lai, Zhengzhao, et al.
Published: (2025)

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
by: Hong, Jack, et al.
Published: (2025)

MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding
by: Cao, Shiwen, et al.
Published: (2025)

Complementary Information Mutual Learning for Multimodality Medical Image Segmentation
by: Shen, Chuyun, et al.
Published: (2024)

Attention Mechanism based Cognition-level Scene Understanding
by: Tang, Xuejiao, et al.
Published: (2022)

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
by: Guo, Xingang, et al.
Published: (2025)

Point Cloud Understanding via Attention-Driven Contrastive Learning
by: Wang, Yi, et al.
Published: (2024)

DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
by: Zhang, Hanling, et al.
Published: (2025)

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)

AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
by: Zhang, Junyang, et al.
Published: (2025)

True Multimodal In-Context Learning Needs Attention to the Visual Context
by: Chen, Shuo, et al.
Published: (2025)

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection
by: Elkhodary, Hoda Osama, et al.
Published: (2026)

Seeing the Big Picture: Evaluating Multimodal LLMs' Ability to Interpret and Grade Handwritten Student Work
by: Henkel, Owen, et al.
Published: (2025)

Enhancing Multimodal Unified Representations for Cross Modal Generalization
by: Huang, Hai, et al.
Published: (2024)