Bibliographic Details
Main Authors:	Wang, Tianlu; Kulikov, Ilia; Golovneva, Olga; Yu, Ping; Yuan, Weizhe; Dwivedi-Yu, Jane; Pang, Richard Yuanzhe; Fazel-Zarandi, Maryam; Weston, Jason; Li, Xian
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.02666

Similar Items

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)

Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)

Contextual Position Encoding: Learning to Count What's Important
by: Golovneva, Olga, et al.
Published: (2024)

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
by: Whitehouse, Chenxi, et al.
Published: (2025)

R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)

Self-Rewarding Language Models
by: Yuan, Weizhe, et al.
Published: (2024)

Distilling System 2 into System 1
by: Yu, Ping, et al.
Published: (2024)

Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)

Self-Improving Pretraining: using post-trained models to pretrain better models
by: Tan, Ellen Xiaoqing, et al.
Published: (2026)

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)

StepWiser: Stepwise Generative Judges for Wiser Reasoning
by: Xiong, Wei, et al.
Published: (2025)

Iterative Reasoning Preference Optimization
by: Pang, Richard Yuanzhe, et al.
Published: (2024)

Reverse Training to Nurse the Reversal Curse
by: Golovneva, Olga, et al.
Published: (2024)

Following Length Constraints in Instructions
by: Yuan, Weizhe, et al.
Published: (2024)

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)

Bridging Offline and Online Reinforcement Learning for LLMs
by: Lanchantin, Jack, et al.
Published: (2025)

Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning
by: Sclar, Melanie, et al.
Published: (2024)

System-Level Natural Language Feedback
by: Yuan, Weizhe, et al.
Published: (2023)

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
by: Tao, Leitian, et al.
Published: (2025)

SPICE: Self-Play In Corpus Environments Improves Reasoning
by: Liu, Bo, et al.
Published: (2025)

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
by: Yuan, Weizhe, et al.
Published: (2025)

Efficient Tool Use with Chain-of-Abstraction Reasoning
by: Gao, Silin, et al.
Published: (2024)

The Era of Real-World Human Interaction: RL from User Conversations
by: Jin, Chuanyang, et al.
Published: (2025)

Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)

Diverse Preference Optimization
by: Lanchantin, Jack, et al.
Published: (2025)

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
by: Sukhbaatar, Sainbayar, et al.
Published: (2024)

Thinking LLMs: General Instruction Following with Thought Generation
by: Wu, Tianhao, et al.
Published: (2024)

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
by: Lupidi, Alisia, et al.
Published: (2024)

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
by: Aggarwal, Pranjal, et al.
Published: (2026)

Self-Challenging Language Model Agents
by: Zhou, Yifei, et al.
Published: (2025)

NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
by: Li, Yang, et al.
Published: (2025)

LLM Pretraining with Continuous Concepts
by: Tack, Jihoon, et al.
Published: (2025)

Self-Taught Agentic Long Context Understanding
by: Zhuang, Yufan, et al.
Published: (2025)

Leveraging Implicit Feedback from Deployment Data in Dialogue
by: Pang, Richard Yuanzhe, et al.
Published: (2023)

OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
by: Aggarwal, Pranjal, et al.
Published: (2025)

TOOLVERIFIER: Generalization to New Tools via Self-Verification
by: Mekala, Dheeraj, et al.
Published: (2024)

The Majority is not always right: RL training for solution aggregation
by: Zhao, Wenting, et al.
Published: (2025)

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
by: Nguyen, Thao, et al.
Published: (2025)

RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
by: Yu, Zhaoning, et al.
Published: (2025)

V-STaR: Training Verifiers for Self-Taught Reasoners
by: Hosseini, Arian, et al.
Published: (2024)