Staff View: Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2508.00945

_version_	1866916877280739328
author	Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong
author_facet	Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong
contents	Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_00945
institution	arXiv
publishDate	2025
record_format	arxiv
title	Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2508.00945
