Staff View: Test-Time Scaling with Reflective Generative Model :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2507.01951

_version_	1866909680796696576
author	Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao
author_facet	Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao
contents	We introduce our first reflective generative model, MetaStone-S1, which reaches the performance of OpenAI o3-mini via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward models: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can learn high-quality reasoning trajectory selection directly from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_01951
institution	arXiv
publishDate	2025
record_format	arxiv
title	Test-Time Scaling with Reflective Generative Model
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2507.01951