| Main authors: | Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Artificial Intelligence |
| Online access: | https://arxiv.org/abs/2503.18533 |

| Field | Value |
|---|---|
| _version_ | 1866912290352136192 |
| author | Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua |
| author_facet | Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua |
| contents | Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and prompt engineering will be released publicly. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_18533 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2503.18533 |