| Main Authors: | Che, Zora, Casper, Stephen, Kirk, Robert, Satheesh, Anirudh, Slocum, Stewart, McKinney, Lev E, Gandikota, Rohit, Ewart, Aidan, Rosati, Domenic, Wu, Zichu, Cai, Zikui, Chughtai, Bilal, Gal, Yarin, Huang, Furong, Hadfield-Menell, Dylan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.05209 |
Similar Items
Eight Methods to Evaluate Robust Unlearning in LLMs
by: Lynch, Aengus, et al.
Published: (2024)
Diverse Preference Learning for Capabilities and Alignment
by: Slocum, Stewart, et al.
Published: (2025)
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
by: Hahm, Dongyoon, et al.
Published: (2026)
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
by: Khan, Ariba, et al.
Published: (2025)
Pitfalls of Evidence-Based AI Policy
by: Casper, Stephen, et al.
Published: (2025)
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
by: O'Brien, Kyle, et al.
Published: (2025)
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
by: Casper, Stephen, et al.
Published: (2024)
GULPS: Two-Qubit Gate Synthesis via Linear Programming for Heterogeneous Instruction Sets
by: McKinney, Evan, et al.
Published: (2025)
Compositional Adversarial Training for Robust Visual Watermarking
by: Satheesh, Anirudh, et al.
Published: (2026)
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
by: Sheshadri, Abhay, et al.
Published: (2024)
Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization
by: Satheesh, Anirudh, et al.
Published: (2026)
EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
by: Agrawal, Aakriti, et al.
Published: (2024)
EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles
by: Agrawal, Aakriti, et al.
Published: (2025)
SAFLEX: Self-Adaptive Augmentation via Feature Label Extrapolation
by: Ding, Mucong, et al.
Published: (2024)
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
by: Cai, Zikui, et al.
Published: (2025)
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
by: Ma, Rachel, et al.
Published: (2026)
CALMA: A Process for Deriving Context-aligned Axes for Language Model Alignment
by: Soni, Prajna, et al.
Published: (2025)
Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in Large Language Models
by: Sankaranarayanan, Aruna, et al.
Published: (2025)
Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
by: Siththaranjan, Anand, et al.
Published: (2023)
Prompt Injection as Role Confusion
by: Ye, Charles, et al.
Published: (2026)
Postcolonialism and Migration in French Comics
by: McKinney, Mark
Published: (2025)
Grandmothering While Black: A Twenty‐First‐Century Story of Love, Coercion, and Survival. By Lashawnda L. Pittman. University of California Press, Oakland, California, 2023. 336 pp. $92.04 (hardcover). ISBN: 978‐0‐52‐038995‐3; $29.95 (paperback). ISBN: 978‐0‐52‐038996‐0; $29.95 (ebook). ISBN: 978‐0‐52‐038997‐7
by: Elliana McKinney
Published: (2025)
Schools Inquiring About Seven-Day School Rerecording of Public and Instructional Television Programs.
by: McKinney, Eleanor
Published: (1975)
Distilling Diversity and Control in Diffusion Models
by: Gandikota, Rohit, et al.
Published: (2025)
Water conservation surveys of New South Wales
by: McKinney, Hugh Giffen
Published: (1896)
Evolution of erect marine bryozoan faunas : repeated success of unilaminate species
by: McKinney, F.K
Published: (1986)
Created from NAFTA : the structure, function, and significance of the treaty's related institutions / Joseph A. McKinney
by: McKinney, Joseph A
Media Utilization in the Classroom.
by: Bowie, Melvin McKinney
Published: (1985)
Conceptual and Practical Matters: The Challenges and Benefits of Conducting Educational Research Using Historical Data. Sage Research Methods Cases Part 2
by: Stephen J. McKinney
Published: (2017)
The Contribution of Iona and Peter Opie to Children's Literature.
by: McKinney, Barbara J.
Published: (1996)
Another Degree? What For?
by: McKinney, Eleanor R.
Published: (1969)
A Constrained Multi-Agent Reinforcement Learning Approach to Autonomous Traffic Signal Control
by: Satheesh, Anirudh, et al.
Published: (2025)
Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization
by: Satheesh, Anirudh, et al.
Published: (2026)
Cooperative Inverse Reinforcement Learning
by: Hadfield-Menell, Dylan, et al.
Published: (2016)
Flexible Agent Alignment with Goal Inference from Open-Ended Dialog
by: Ma, Rachel, et al.
Published: (2025)
Layered Unlearning for Adversarial Relearning
by: Qian, Timothy, et al.
Published: (2025)
Goal Inference from Open-Ended Dialog
by: Ma, Rachel, et al.
Published: (2024)
Activation Steering via Generative Causal Mediation
by: Sankaranarayanan, Aruna, et al.
Published: (2026)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)
Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
by: Agrawal, Aakriti, et al.
Published: (2025)