Staff View: STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Shiqi; He, Zeyu; Zhan, Guojian; Tao, Letian; Zheng, Zhilong; Wu, Jiang; Wang, Yinuo; Guan, Yang; Sheng, Kehua; Zhang, Bo; Li, Keqiang; Duan, Jingliang; Li, Shengbo Eben
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.15620

_version_	1866910252997279744
author	Liu, Shiqi; He, Zeyu; Zhan, Guojian; Tao, Letian; Zheng, Zhilong; Wu, Jiang; Wang, Yinuo; Guan, Yang; Sheng, Kehua; Zhang, Bo; Li, Keqiang; Duan, Jingliang; Li, Shengbo Eben
author_facet	Liu, Shiqi; He, Zeyu; Zhan, Guojian; Tao, Letian; Zheng, Zhilong; Wu, Jiang; Wang, Yinuo; Guan, Yang; Sheng, Kehua; Zhang, Bo; Li, Keqiang; Duan, Jingliang; Li, Shengbo Eben
contents	Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($ρ_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($ρ_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_15620
institution	arXiv
publishDate	2026
record_format	arxiv
title	STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2602.15620
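
Illustrative note on the abstract: the record's contents field describes tokens that inherit the full sequence-level reward, and a mechanism that silences a small set of spurious tokens inside a group-based objective. The sketch below is only a rough illustration of that idea and is not the authors' implementation. The function names, the probability threshold, and the rule used to flag a token as spurious are all hypothetical stand-ins, since the abstract does not state the actual criterion.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize each response's reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_spurious(prob, advantage, prob_threshold=0.01):
    """HYPOTHETICAL flagging rule (an assumption, not taken from the paper):
    a token the policy considered very unlikely, inside a response whose
    advantage is positive."""
    return advantage > 0 and prob < prob_threshold


def token_weights(token_probs, rewards, silence=True):
    """Per-token loss weights for one group of sampled responses.

    Every token inherits its sequence's advantage. When silence is True,
    tokens flagged by is_spurious get weight 0.0, so they would contribute
    no gradient to a policy-gradient loss that multiplies by this weight.
    """
    advantages = group_advantages(rewards)
    weights = []
    for probs, adv in zip(token_probs, advantages):
        row = [0.0 if silence and is_spurious(p, adv) else adv for p in probs]
        weights.append(row)
    return weights


# Two sampled responses: the first is rewarded and contains one very
# unlikely token (probability 0.005); the second is not rewarded.
example_probs = [[0.9, 0.005, 0.8], [0.7, 0.6]]
example_rewards = [1.0, 0.0]

silenced = token_weights(example_probs, example_rewards, silence=True)
plain = token_weights(example_probs, example_rewards, silence=False)
```

In this toy group, `plain` gives the unlikely token the same positive weight as every other token in the rewarded response, while `silenced` sets only that token's weight to zero and leaves the rest unchanged.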

Similar Items