MARC View: :: Library Catalog


Bibliographic Details
Main Authors:	Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.18533

_version_	1866912290352136192
author	Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua
author_facet	Yan, Dawei; Li, Yang; Chen, Qing-Guo; Luo, Weihua; Wang, Peng; Zhang, Haokui; Shen, Chunhua
contents	Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_18533
institution	arXiv
publishDate	2025
record_format	arxiv
title	MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
topic	Artificial Intelligence
url	https://arxiv.org/abs/2503.18533
