Staff View: MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Wei; Du, Chaoqun; Gu, Feng; He, Wei; Li, Qizhen; Liu, Zide; Pan, Xuhao; Ren, Chang; Rao, Xudong; Wang, Chenfeng; Wei, Tao; Yu, Chengjun; Yu, Pengfei; Zheng, Yufei; Zhou, Chunpeng; Zhou, Pan; Zhu, Xuhan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.02895

_version_	1866912745591406592
author	Chen, Wei; Du, Chaoqun; Gu, Feng; He, Wei; Li, Qizhen; Liu, Zide; Pan, Xuhao; Ren, Chang; Rao, Xudong; Wang, Chenfeng; Wei, Tao; Yu, Chengjun; Yu, Pengfei; Zheng, Yufei; Zhou, Chunpeng; Zhou, Pan; Zhu, Xuhan
author_facet	Chen, Wei; Du, Chaoqun; Gu, Feng; He, Wei; Li, Qizhen; Liu, Zide; Pan, Xuhao; Ren, Chang; Rao, Xudong; Wang, Chenfeng; Wei, Tao; Yu, Chengjun; Yu, Pengfei; Zheng, Yufei; Zhou, Chunpeng; Zhou, Pan; Zhu, Xuhan
contents	We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, enhancing both the foundational capabilities and the generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization goals such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization, to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_02895
institution	arXiv
publishDate	2025
record_format	arxiv
title	MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.02895

Similar Items