Bibliographic Details
Main Authors: Wang, Zixiao, Wang, Yuxin, Wang, Xiaorui, Xing, Mengting, Gao, Jie, Xu, Jianjun, Liu, Guangcan, Jin, Chenhui, Wang, Zhuo, Zhang, Shengzhuo, Xie, Hongtao
Format: Preprint
Published: 2025
Subjects: Machine Learning, Computation and Language
Online Access: https://arxiv.org/abs/2507.01951
author Wang, Zixiao
Wang, Yuxin
Wang, Xiaorui
Xing, Mengting
Gao, Jie
Xu, Jianjun
Liu, Guangcan
Jin, Chenhui
Wang, Zhuo
Zhang, Shengzhuo
Xie, Hongtao
contents We introduce our first reflective generative model, MetaStone-S1, which achieves performance comparable to OpenAI o3-mini via a new Reflective Generative Form. The new form focuses on selecting high-quality reasoning trajectories and contains two novelties: 1) A unified interface for the policy and process reward models: we share the backbone network and use task-specific heads for predicting and scoring reasoning trajectories, respectively, which introduces only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model that learns to select high-quality reasoning trajectories directly from the outcome reward. Equipped with the Reflective Generative Form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
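The trajectory selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the geometric-mean aggregation, the function names, and the toy scorer below are assumptions made for illustration only. In the described model the per-step scores would come from the scoring head that shares its backbone with the policy; here a lookup table stands in for it.

```python
import math
from typing import Callable, List, Sequence

Trajectory = List[str]  # a reasoning trajectory as a list of step strings


def aggregate_step_scores(step_scores: Sequence[float]) -> float:
    """Geometric mean of per-step scores in (0, 1].

    Assumption: the abstract does not state how step scores are combined.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one scored step")
    # Clamp to avoid log(0) when a step receives a zero score.
    logs = [math.log(max(s, 1e-12)) for s in step_scores]
    return math.exp(sum(logs) / len(logs))


def select_trajectory(
    candidates: Sequence[Trajectory],
    score_steps: Callable[[Trajectory], Sequence[float]],
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("no candidate trajectories to select from")
    scores = [aggregate_step_scores(score_steps(t)) for t in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical stand-in for the scoring head: a fixed score per step.
TOY_SCORES = {"a": 0.9, "b": 0.4, "c": 0.8, "d": 0.7, "e": 0.5}


def toy_score_steps(trajectory: Trajectory) -> List[float]:
    return [TOY_SCORES[step] for step in trajectory]


if __name__ == "__main__":
    sampled = [["a", "b"], ["c", "d"], ["e"]]
    print(select_trajectory(sampled, toy_score_steps))
```

Sampling more candidates before selection is one way such a scheme can spend more test-time compute; how the low, medium, and high modes map to concrete settings is not stated in this record.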
format Preprint
id arxiv_https___arxiv_org_abs_2507_01951
institution arXiv
publishDate 2025
record_format arxiv
title Test-Time Scaling with Reflective Generative Model
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2507.01951