Bibliographic Details
Main Authors: Huang, Peizhou, Zhong, Zixuan, Wan, Zhongwei, Zhou, Donghao, Alam, Samiul, Wang, Xin, Li, Zexin, Dou, Zhihao, Zhu, Li, Xiong, Jing, Tao, Chaofan, Xu, Yan, Dimitriadis, Dimitrios, Zhang, Tuo, Zhang, Mi
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.12346
author Huang, Peizhou
Zhong, Zixuan
Wan, Zhongwei
Zhou, Donghao
Alam, Samiul
Wang, Xin
Li, Zexin
Dou, Zhihao
Zhu, Li
Xiong, Jing
Tao, Chaofan
Xu, Yan
Dimitriadis, Dimitrios
Zhang, Tuo
Zhang, Mi
contents Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
format Preprint
id arxiv_https___arxiv_org_abs_2601_12346
institution arXiv
publishDate 2026
record_format arxiv
title MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.12346