Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Xuanyu; Dong, Yuhao; Wang, Rundong; Shi, Yang; Wu, Zhipeng; Peng, Yinlun; Zhang, YiFan; Lou, Yihang; Zhang, Yuanxing; Liu, Ziwei; Bai, Yan; Zhou, Yuan
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.15030

_version_	1866911530256171008
author	Zhu, Xuanyu; Dong, Yuhao; Wang, Rundong; Shi, Yang; Wu, Zhipeng; Peng, Yinlun; Zhang, YiFan; Lou, Yihang; Zhang, Yuanxing; Liu, Ziwei; Bai, Yan; Zhou, Yuan
author_facet	Zhu, Xuanyu; Dong, Yuhao; Wang, Rundong; Shi, Yang; Wu, Zhipeng; Peng, Yinlun; Zhang, YiFan; Lou, Yihang; Zhang, Yuanxing; Liu, Ziwei; Bai, Yan; Zhou, Yuan
contents	Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain persistent bottlenecks. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model, Gemini-3.0-Pro, achieving only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_15030
institution	arXiv
publishDate	2026
record_format	arxiv
title	VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.15030

Similar Items