Staff View: Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Qinan; Tartaglini, Alexa; Hase, Peter; Guestrin, Carlos; Potts, Christopher
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.22074

_version_	1866914503526973440
author	Yu, Qinan; Tartaglini, Alexa; Hase, Peter; Guestrin, Carlos; Potts, Christopher
author_facet	Yu, Qinan; Tartaglini, Alexa; Hase, Peter; Guestrin, Carlos; Potts, Christopher
contents	Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_22074
institution	arXiv
publishDate	2026
record_format	arxiv
title	Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
topic	Computation and Language
url	https://arxiv.org/abs/2604.22074

Similar Items