:: Library Catalog

Bibliographic Details
Main Authors:	Agrawal, Aviral; Lezcano, Carlos Mateo Samudio; Heredia-Marin, Iqui Balam; Sethi, Prabhdeep Singh
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2404.13530

Similar Items

No Training Wheels: Steering Vectors for Bias Correction at Inference Time
by: Gupta, Aviral, et al.
Published: (2025)

StyleSplat: 3D Object Style Transfer with Gaussian Splatting
by: Jain, Sahil, et al.
Published: (2024)

VidLA: Video-Language Alignment at Scale
by: Rizve, Mamshad Nayeem, et al.
Published: (2024)

Cross-modal Causal Relation Alignment for Video Question Grounding
by: Chen, Weixing, et al.
Published: (2025)

Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models
by: Dumpala, Sri Harsha, et al.
Published: (2024)

RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability
by: Park, Jonggwon, et al.
Published: (2025)

Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models
by: Góral, Gracjan, et al.
Published: (2024)

VisMin: Visual Minimal-Change Understanding
by: Awal, Rabiul, et al.
Published: (2024)

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
by: Xiao, Xin, et al.
Published: (2024)

Reinforced Attention Learning
by: Li, Bangzheng, et al.
Published: (2026)

RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
by: Biswas, Subrata, et al.
Published: (2025)

An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
by: Ahmadi, Saba, et al.
Published: (2023)

Unifying Specialized Visual Encoders for Video Language Models
by: Chung, Jihoon, et al.
Published: (2025)

Can Visual Encoder Learn to See Arrows?
by: Terashita, Naoyuki, et al.
Published: (2025)

Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)

Phrase-Instance Alignment for Generalized Referring Segmentation
by: Nguyen, E-Ro, et al.
Published: (2024)

Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
by: Li, Jingru, et al.
Published: (2026)

EMMA: Efficient Visual Alignment in Multi-Modal LLMs
by: Ghazanfari, Sara, et al.
Published: (2024)

Linear Alignment of Vision-language Models for Image Captioning
by: Paischer, Fabian, et al.
Published: (2023)

Attribute Diversity Determines the Systematicity Gap in VQA
by: Berlot-Attwell, Ian, et al.
Published: (2023)

Transformer with Controlled Attention for Synchronous Motion Captioning
by: Radouane, Karim, et al.
Published: (2024)

X-VILA: Cross-Modality Alignment for Large Language Model
by: Ye, Hanrong, et al.
Published: (2024)

Data Alignment for Zero-Shot Concept Generation in Dermatology AI
by: Gadgil, Soham, et al.
Published: (2024)

Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
by: Shukor, Mustafa, et al.
Published: (2024)

Omnimodal Dataset Distillation via High-order Proxy Alignment
by: Gao, Yuxuan, et al.
Published: (2026)

Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
by: Khorrami, Khazar, et al.
Published: (2021)

Improving Automatic VQA Evaluation Using Large Language Models
by: Mañas, Oscar, et al.
Published: (2023)

Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention
by: Gupta, Shreyam, et al.
Published: (2025)

Head Pursuit: Probing Attention Specialization in Multimodal Transformers
by: Basile, Lorenzo, et al.
Published: (2025)

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers
by: Yamada, Yoshihiro
Published: (2025)

Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization
by: Song, Yuhang, et al.
Published: (2024)

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
by: Piergiovanni, AJ, et al.
Published: (2024)

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
by: Luo, Jiayun, et al.
Published: (2024)

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
by: Le, Quang-Hung, et al.
Published: (2024)

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
by: Yoon, Eunseop, et al.
Published: (2025)

DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning
by: Du, Mengfei, et al.
Published: (2024)

VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
by: Chen, Menglan, et al.
Published: (2025)

Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity
by: Shuai, Zitao, et al.
Published: (2024)

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)