| Main Authors: | Ahmed, Ahmed M., Rafailov, Rafael, Sharkov, Stepan, Li, Xuechen, Koyejo, Sanmi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.01013 |
Similar Items
Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)
Extracting books from production language models
by: Ahmed, Ahmed, et al.
Published: (2026)
Discovering Implicit Large Language Model Alignment Objectives
by: Chen, Edward, et al.
Published: (2026)
Reasoning Models Don't Just Think Longer, They Move Differently
by: Gjølbye, Anders, et al.
Published: (2026)
General Preference Reinforcement Learning
by: Umer, Muhammad, et al.
Published: (2026)
Why Do Safety Guardrails Degrade Across Languages?
by: Zhang, Max, et al.
Published: (2026)
Logits are All We Need to Adapt Closed Models
by: Hiranandani, Gaurush, et al.
Published: (2025)
Quantifying the Effect of Test Set Contamination on Generative Evaluations
by: Schaeffer, Rylan, et al.
Published: (2026)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by: Rafailov, Rafael, et al.
Published: (2023)
Reliable and Efficient Amortized Model-based Evaluation
by: Truong, Sang, et al.
Published: (2025)
Scaling Laws for Downstream Task Performance of Large Language Models
by: Isik, Berivan, et al.
Published: (2024)
Disentangling Length from Quality in Direct Preference Optimization
by: Park, Ryan, et al.
Published: (2024)
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
by: Rafailov, Rafael, et al.
Published: (2024)
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
by: Zhou, Zhanke, et al.
Published: (2025)
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
by: Gerstgrasser, Matthias, et al.
Published: (2024)
Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization
by: Chan, Willy, et al.
Published: (2025)
Is Pre-training Truly Better Than Meta-Learning?
by: Miranda, Brando, et al.
Published: (2023)
SpecEval: Evaluating Model Adherence to Behavior Specifications
by: Ahmed, Ahmed, et al.
Published: (2025)
Quantifying the Importance of Data Alignment in Downstream Model Performance
by: Chawla, Krrish, et al.
Published: (2025)
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
by: Obbad, Elyas, et al.
Published: (2024)
Investigating Data Contamination for Pre-training Language Models
by: Jiang, Minhao, et al.
Published: (2024)
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
by: Zhou, Yiyang, et al.
Published: (2024)
Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data
by: Miranda, Brando, et al.
Published: (2023)
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
by: Kazdan, Joshua, et al.
Published: (2024)
Diagnosing and Mitigating System Bias in Self-Rewarding RL
by: Tan, Chuyi, et al.
Published: (2025)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
by: Schaeffer, Rylan, et al.
Published: (2024)
CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs
by: Vo, Truong, et al.
Published: (2025)
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
by: Deng, Wenlong, et al.
Published: (2026)
Causally Inspired Regularization Enables Domain General Representations
by: Salaudeen, Olawale, et al.
Published: (2024)
BanglaASTE: A Novel Framework for Aspect-Sentiment-Opinion Extraction in Bangla E-commerce Reviews Using Ensemble Deep Learning
by: Islam, Ariful, et al.
Published: (2025)
Best-of-N Jailbreaking
by: Hughes, John, et al.
Published: (2024)
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
by: Wei, Anjiang, et al.
Published: (2025)
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
by: Liu, Ken Ziyu, et al.
Published: (2025)
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
by: Salaudeen, Olawale, et al.
Published: (2025)
HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance
by: Zhu, Junzhe, et al.
Published: (2023)
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
by: Lu, Keming, et al.
Published: (2024)
Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes
by: Robertson, Zachary, et al.
Published: (2025)
KGGen: Extracting Knowledge Graphs from Plain Text with Language Models
by: Mo, Belinda, et al.
Published: (2025)
Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
by: Kadhim, Ahmed K., et al.
Published: (2025)