Bibliographic Details
Main Authors: Liu, Shiqi, He, Zeyu, Zhan, Guojian, Tao, Letian, Zheng, Zhilong, Wu, Jiang, Wang, Yinuo, Guan, Yang, Sheng, Kehua, Zhang, Bo, Li, Keqiang, Duan, Jingliang, Li, Shengbo Eben
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.15620
author Liu, Shiqi
He, Zeyu
Zhan, Guojian
Tao, Letian
Zheng, Zhilong
Wu, Jiang
Wang, Yinuo
Guan, Yang
Sheng, Kehua
Zhang, Bo
Li, Keqiang
Duan, Jingliang
Li, Shengbo Eben
contents Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}} = 1.0$, top-p = 1.0) and 3.73% ($\rho_{\mathrm{T}} = 0.7$, top-p = 0.9) over GRPO, 20-Entropy, and JustRL.
format Preprint
id arxiv_https___arxiv_org_abs_2602_15620
institution arXiv
publishDate 2026
record_format arxiv
title STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2602.15620