:: Library Catalog

Bibliographic Details
Main Authors:	Ankner, Zachary; Paul, Mansheej; Cui, Brandon; Chang, Jonathan D.; Ammanabrolu, Prithviraj
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2408.11791

Similar Items

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
by: Shen, Yiran, et al.
Published: (2025)

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
by: Ankner, Zachary, et al.
Published: (2024)

Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
by: Kim, Bosung, et al.
Published: (2025)

A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
by: Wang, Ruiyi, et al.
Published: (2025)

How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
by: Dionisopoulos, Lucas, et al.
Published: (2026)

Scaling Laws for Precision
by: Kumar, Tanishq, et al.
Published: (2024)

Preference-Based Learning in Audio Applications: A Systematic Analysis
by: Broukhim, Aaron, et al.
Published: (2025)

Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
by: Ankner, Zachary, et al.
Published: (2024)

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages
by: Cui, Brandon, et al.
Published: (2026)

$μ$nit Scaling: Simple and Scalable FP8 LLM Training
by: Narayan, Saaketh, et al.
Published: (2025)

Does your data spark joy? Performance gains from domain upsampling at the end of training
by: Blakeney, Cody, et al.
Published: (2024)

Soup to go: mitigating forgetting during continual learning with model averaging
by: Kleiman, Anat, et al.
Published: (2025)

In-context Ranking Preference Optimization
by: Wu, Junda, et al.
Published: (2025)

Self-Generated Critiques Boost Reward Modeling for Language Models
by: Yu, Yue, et al.
Published: (2024)

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
by: Surana, Rohan, et al.
Published: (2026)

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
by: Cui, Christopher Z., et al.
Published: (2026)

Silent Tokens, Loud Effects: Padding in LLMs
by: Himelstein, Rom, et al.
Published: (2025)

Decoding the Critique Mechanism in Large Reasoning Models
by: Phan, Hoang, et al.
Published: (2026)

BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
by: Lee, Sangyun, et al.
Published: (2025)

Noise Injection Systemically Degrades Large Language Model Safety Guardrails
by: Shahani, Prithviraj Singh, et al.
Published: (2025)

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
by: Oertell, Owen, et al.
Published: (2024)

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)

Provably Sample-Efficient Robust Reinforcement Learning with Average Reward
by: Roch, Zachary, et al.
Published: (2025)

Explanation through Reward Model Reconciliation using POMDP Tree Search
by: Kraske, Benjamin D., et al.
Published: (2023)

Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
by: AlKhamissi, Badr, et al.
Published: (2024)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)

LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won't Fail)
by: Kim, Junsu, et al.
Published: (2025)

Policy Learning from Large Vision-Language Model Feedback without Reward Modeling
by: Luu, Tung M., et al.
Published: (2025)

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
by: Kim, Bosung, et al.
Published: (2026)

Critiques of World Models
by: Xing, Eric, et al.
Published: (2025)

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models
by: Williams, Jonathan, et al.
Published: (2026)

RewardBench: Evaluating Reward Models for Language Modeling
by: Lambert, Nathan, et al.
Published: (2024)

Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning
by: Lee, Younghwan, et al.
Published: (2025)

TALES: Text Adventure Learning Environment Suite
by: Cui, Christopher Zhang, et al.
Published: (2025)

EvilGenie: A Reward Hacking Benchmark
by: Gabor, Jonathan, et al.
Published: (2025)

Noise Contrastive Alignment of Language Models with Explicit Rewards
by: Chen, Huayu, et al.
Published: (2024)

LoRA Learns Less and Forgets Less
by: Biderman, Dan, et al.
Published: (2024)

Activation Reward Models for Few-Shot Model Alignment
by: Chai, Tianning, et al.
Published: (2025)

Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
by: Zheng, Qinqing, et al.
Published: (2024)

Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
by: Nikulkov, Alex
Published: (2026)