| Main Authors: | Aggarwal, Pranjal, Ghazvininejad, Marjan, Kim, Seungone, Kulikov, Ilia, Lanchantin, Jack, Li, Xian, Li, Tianjian, Liu, Bo, Neubig, Graham, Ovalle, Anaelia, Saha, Swarnadeep, Sukhbaatar, Sainbayar, Welleck, Sean, Weston, Jason, Whitehouse, Chenxi, Williams, Adina, Xu, Jing, Yu, Ping, Yuan, Weizhe, Zhang, Jingyu, Zhao, Wenting |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.18886 |
Similar Items
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
by: Aggarwal, Pranjal, et al.
Published: (2025)
SPICE: Self-Play In Corpus Environments Improves Reasoning
by: Liu, Bo, et al.
Published: (2025)
The Majority is not always right: RL training for solution aggregation
by: Zhao, Wenting, et al.
Published: (2025)
Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)
Diverse Preference Optimization
by: Lanchantin, Jack, et al.
Published: (2025)
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)
Bridging Offline and Online Reinforcement Learning for LLMs
by: Lanchantin, Jack, et al.
Published: (2025)
Following Length Constraints in Instructions
by: Yuan, Weizhe, et al.
Published: (2024)
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
by: Whitehouse, Chenxi, et al.
Published: (2025)
Gym-Anything: Turn any Software into an Agent Environment
by: Aggarwal, Pranjal, et al.
Published: (2026)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
StepWiser: Stepwise Generative Judges for Wiser Reasoning
by: Xiong, Wei, et al.
Published: (2025)
Thinking LLMs: General Instruction Following with Thought Generation
by: Wu, Tianhao, et al.
Published: (2024)
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
by: Aggarwal, Pranjal, et al.
Published: (2025)
Programming with Pixels: Can Computer-Use Agents do Software Engineering?
by: Aggarwal, Pranjal, et al.
Published: (2025)
Self-Improving Pretraining: using post-trained models to pretrain better models
by: Tan, Ellen Xiaoqing, et al.
Published: (2026)
Iterative Reasoning Preference Optimization
by: Pang, Richard Yuanzhe, et al.
Published: (2024)
Jointly Reinforcing Diversity and Quality in Language Model Generations
by: Li, Tianjian, et al.
Published: (2025)
Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)
Contextual Position Encoding: Learning to Count What's Important
by: Golovneva, Olga, et al.
Published: (2024)
Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss
by: Xu, Jing, et al.
Published: (2023)
R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)
Reverse Training to Nurse the Reversal Curse
by: Golovneva, Olga, et al.
Published: (2024)
Self-Rewarding Language Models
by: Yuan, Weizhe, et al.
Published: (2024)
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
by: Aggarwal, Pranjal, et al.
Published: (2024)
Self-Challenging Language Model Agents
by: Zhou, Yifei, et al.
Published: (2025)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)
Agentic-R1: Distilled Dual-Strategy Reasoning
by: Du, Weihua, et al.
Published: (2025)
NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
by: Li, Yang, et al.
Published: (2025)
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
by: Welleck, Sean, et al.
Published: (2024)
Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
by: Agarwal, Anmol, et al.
Published: (2026)
Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
by: Tao, Leitian, et al.
Published: (2025)
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
by: Zhou, Yifei, et al.
Published: (2025)
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
by: Yasunaga, Michihiro, et al.
Published: (2025)
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
by: Ovalle, Anaelia, et al.
Published: (2025)
Distilling System 2 into System 1
by: Yu, Ping, et al.
Published: (2024)
Training Large Language Models to Reason in a Continuous Latent Space
by: Hao, Shibo, et al.
Published: (2024)
LLM Pretraining with Continuous Concepts
by: Tack, Jihoon, et al.
Published: (2025)
ALMA: Alignment with Minimal Annotation
by: Yasunaga, Michihiro, et al.
Published: (2024)