Staff View: DARL: Encouraging Diverse Answers for General Reasoning without Verifiers :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Chongxuan; Lin, Lei; Shi, Xiaodong; Hu, Wenping; Tang, Ruiming
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.14700

_version_	1866911389198581760
author	Huang, Chongxuan; Lin, Lei; Shi, Xiaodong; Hu, Wenping; Tang, Ruiming
author_facet	Huang, Chongxuan; Lin, Lei; Shi, Xiaodong; Hu, Wenping; Tang, Ruiming
contents	Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_14700
institution	arXiv
publishDate	2026
record_format	arxiv
title	DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
topic	Computation and Language
url	https://arxiv.org/abs/2601.14700

Similar Items