Staff View: ToolFG: Towards Well-Grounded Fine-Grained Image Classification :: Library Catalog

Bibliographic Details
Main Authors:	Xue, Yu; Qu, Haoxuan; Li, Zhuoling; Lou, Yihang; Bai, Yan; Rahmani, Hossein; Liu, Jun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2606.02518

_version_	1866911742323326976
author	Xue, Yu; Qu, Haoxuan; Li, Zhuoling; Lou, Yihang; Bai, Yan; Rahmani, Hossein; Liu, Jun
author_facet	Xue, Yu; Qu, Haoxuan; Li, Zhuoling; Lou, Yihang; Bai, Yan; Rahmani, Hossein; Liu, Jun
contents	Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_02518
institution	arXiv
publishDate	2026
record_format	arxiv
title	ToolFG: Towards Well-Grounded Fine-Grained Image Classification
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2606.02518

Similar Items