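| Main Authors: | Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong |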
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2508.00945 |
| _version_ | 1866916877280739328 |
|---|---|
| author | Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong |
| author_facet | Wang, Yifan; Ai, Hongfeng; Liu, Quangao; Jiang, Maowei; Kang, Ruiyuan; Li, Ruiqi; Dong, Jiahua; Xiao, Mengting; Jiang, Cheng; Li, Chenzhong |
| contents | Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch-wise and layer-wise embeddings, and Progressive Attention Integration (PAI), which systematically coordinates the LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from the semantic to the regional level while preventing attention drift and maximizing the benefit of each attention mechanism. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2508_00945 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2508.00945 |