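| Main Authors: | Wang, An-Lan; Shan, Bin; Shi, Wei; Lin, Kun-Yu; Fei, Xiang; Tang, Guozhi; Liao, Lei; Huang, Can; Tang, Jingqun; Zheng, Wei-Shi |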
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2408.12928 |
| _version_ | 1866913735309787136 |
|---|---|
| author | Wang, An-Lan; Shan, Bin; Shi, Wei; Lin, Kun-Yu; Fei, Xiang; Tang, Guozhi; Liao, Lei; Huang, Can; Tang, Jingqun; Zheng, Wei-Shi |
| contents | This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale, detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo, highlighting its superiority in aligning vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2408_12928 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | ParGo: Bridging Vision-Language with Partial and Global Views |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2408.12928 |