| Main Authors: | Ankner, Zachary, Paul, Mansheej, Cui, Brandon, Chang, Jonathan D., Ammanabrolu, Prithviraj |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2408.11791 |
Similar Items
Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
by: Shen, Yiran, et al.
Published: (2025)
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
by: Ankner, Zachary, et al.
Published: (2024)
Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
by: Kim, Bosung, et al.
Published: (2025)
A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
by: Wang, Ruiyi, et al.
Published: (2025)
How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
by: Dionisopoulos, Lucas, et al.
Published: (2026)
Scaling Laws for Precision
by: Kumar, Tanishq, et al.
Published: (2024)
Preference-Based Learning in Audio Applications: A Systematic Analysis
by: Broukhim, Aaron, et al.
Published: (2025)
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
by: Ankner, Zachary, et al.
Published: (2024)
Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages
by: Cui, Brandon, et al.
Published: (2026)
$μ$nit Scaling: Simple and Scalable FP8 LLM Training
by: Narayan, Saaketh, et al.
Published: (2025)
Does your data spark joy? Performance gains from domain upsampling at the end of training
by: Blakeney, Cody, et al.
Published: (2024)
Soup to go: mitigating forgetting during continual learning with model averaging
by: Kleiman, Anat, et al.
Published: (2025)
In-context Ranking Preference Optimization
by: Wu, Junda, et al.
Published: (2025)
Self-Generated Critiques Boost Reward Modeling for Language Models
by: Yu, Yue, et al.
Published: (2024)
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
by: Surana, Rohan, et al.
Published: (2026)
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
by: Cui, Christopher Z., et al.
Published: (2026)
Silent Tokens, Loud Effects: Padding in LLMs
by: Himelstein, Rom, et al.
Published: (2025)
Decoding the Critique Mechanism in Large Reasoning Models
by: Phan, Hoang, et al.
Published: (2026)
BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
by: Lee, Sangyun, et al.
Published: (2025)
Noise Injection Systemically Degrades Large Language Model Safety Guardrails
by: Shahani, Prithviraj Singh, et al.
Published: (2025)
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
by: Oertell, Owen, et al.
Published: (2024)
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)
Provably Sample-Efficient Robust Reinforcement Learning with Average Reward
by: Roch, Zachary, et al.
Published: (2025)
Explanation through Reward Model Reconciliation using POMDP Tree Search
by: Kraske, Benjamin D., et al.
Published: (2023)
Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
by: AlKhamissi, Badr, et al.
Published: (2024)
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)
LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won't Fail)
by: Kim, Junsu, et al.
Published: (2025)
Policy Learning from Large Vision-Language Model Feedback without Reward Modeling
by: Luu, Tung M., et al.
Published: (2025)
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
by: Kim, Bosung, et al.
Published: (2026)
Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models
by: Williams, Jonathan, et al.
Published: (2026)
RewardBench: Evaluating Reward Models for Language Modeling
by: Lambert, Nathan, et al.
Published: (2024)
Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning
by: Lee, Younghwan, et al.
Published: (2025)
TALES: Text Adventure Learning Environment Suite
by: Cui, Christopher Zhang, et al.
Published: (2025)
EvilGenie: A Reward Hacking Benchmark
by: Gabor, Jonathan, et al.
Published: (2025)
Noise Contrastive Alignment of Language Models with Explicit Rewards
by: Chen, Huayu, et al.
Published: (2024)
LoRA Learns Less and Forgets Less
by: Biderman, Dan, et al.
Published: (2024)
Activation Reward Models for Few-Shot Model Alignment
by: Chai, Tianning, et al.
Published: (2025)
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
by: Zheng, Qinqing, et al.
Published: (2024)
Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
by: Nikulkov, Alex
Published: (2026)