Enregistré dans:
Détails bibliographiques
Auteurs principaux: Yan, Dawei, Li, Yang, Chen, Qing-Guo, Luo, Weihua, Wang, Peng, Zhang, Haokui, Shen, Chunhua
Format: Preprint
Publié: 2025
Sujets:
Accès en ligne:https://arxiv.org/abs/2503.18533
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866912290352136192
author Yan, Dawei
Li, Yang
Chen, Qing-Guo
Luo, Weihua
Wang, Peng
Zhang, Haokui
Shen, Chunhua
author_facet Yan, Dawei
Li, Yang
Chen, Qing-Guo
Luo, Weihua
Wang, Peng
Zhang, Haokui
Shen, Chunhua
contents Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2\% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1\% on AI2D, +1.2\% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.
format Preprint
id arxiv_https___arxiv_org_abs_2503_18533
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
Yan, Dawei
Li, Yang
Chen, Qing-Guo
Luo, Weihua
Wang, Peng
Zhang, Haokui
Shen, Chunhua
Artificial Intelligence
Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2\% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1\% on AI2D, +1.2\% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.
title MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
topic Artificial Intelligence
url https://arxiv.org/abs/2503.18533