Bibliographic Details
Main Authors: Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2508.00945
author Wang, Yifan
Ai, Hongfeng
Liu, Quangao
Jiang, Maowei
Kang, Ruiyuan
Li, Ruiqi
Dong, Jiahua
Xiao, Mengting
Jiang, Cheng
Li, Chenzhong
contents Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch-wise and layer-wise embeddings, and Progressive Attention Integration (PAI), which systematically coordinates the LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from the semantic to the regional level while preventing attention drift and maximizing the benefit of each attention mechanism. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.
format Preprint
id arxiv_https___arxiv_org_abs_2508_00945
institution arXiv
publishDate 2025
record_format arxiv
title Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2508.00945