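| Main Authors: | Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao |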
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Computation and Language |
| Online Access: | https://arxiv.org/abs/2507.01951 |
| _version_ | 1866909680796696576 |
|---|---|
| author | Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao |
| author_facet | Wang, Zixiao; Wang, Yuxin; Wang, Xiaorui; Xing, Mengting; Gao, Jie; Xu, Jianjun; Liu, Guangcan; Jin, Chenhui; Wang, Zhuo; Zhang, Shengzhuo; Xie, Hongtao |
| contents | We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_01951 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Test-Time Scaling with Reflective Generative Model |
| topic | Machine Learning; Computation and Language |
| url | https://arxiv.org/abs/2507.01951 |
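
The abstract in the `contents` field describes selecting a high-quality reasoning trajectory by scoring candidates with a process reward head that shares the policy's backbone. The following is a minimal sketch of that selection step only, not the authors' implementation. The names `Trajectory`, `trajectory_score`, and `select_best` are hypothetical, the per-step scorer is a stand-in for the scoring head, and aggregating step scores by geometric mean is an assumption made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    """One sampled reasoning trajectory (hypothetical container)."""
    steps: List[str]   # intermediate reasoning steps
    answer: str        # final answer produced by the policy


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in (0, 1] into one trajectory score.

    The geometric mean is an assumed aggregation rule: a single weak
    step pulls the whole trajectory down.
    """
    if not step_scores:
        return 0.0
    eps = 1e-12  # guard against log(0)
    log_sum = sum(math.log(max(s, eps)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(
    candidates: Sequence[Trajectory],
    score_step: Callable[[str], float],
) -> Trajectory:
    """Return the candidate whose steps score highest overall.

    `score_step` stands in for the scoring head on the shared backbone.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(
        candidates,
        key=lambda t: trajectory_score([score_step(s) for s in t.steps]),
    )
```

In this sketch, sampling more candidates before calling `select_best` is the test-time scaling knob. How the three reasoning effort modes map to sampling budgets is not stated in this record.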