| Main Author: | Sahoo, Subramanyam |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.13016 |
Similar Items
Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
by: Sahoo, Subramanyam
Published: (2026)
The Horcrux: Mechanistically Interpretable Task Decomposition for Detecting and Mitigating Reward Hacking in Embodied AI Systems
by: Sahoo, Subramanyam, et al.
Published: (2025)
The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
by: Sahoo, Subramanyam
Published: (2025)
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
by: Tuan, Yi-Lin, et al.
Published: (2024)
The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds
by: Sahoo, Subramanyam, et al.
Published: (2025)
The Reasoning Trap -- Logical Reasoning as a Mechanistic Pathway to Situational Awareness
by: Sahoo, Subramanyam, et al.
Published: (2026)
Ambient Diffusion Omni: Training Good Models with Bad Data
by: Daras, Giannis, et al.
Published: (2025)
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning
by: Sahoo, Subramanyam, et al.
Published: (2026)
Boardwalk Empire: How Generative AI is Revolutionizing Economic Paradigms
by: Sahoo, Subramanyam, et al.
Published: (2024)
Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown
by: Anand, Emile, et al.
Published: (2025)
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
by: Sahoo, Subramanyam, et al.
Published: (2025)
The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?
by: Kao, Ching-Chia, et al.
Published: (2024)
When Bad Data Leads to Good Models
by: Li, Kenneth, et al.
Published: (2025)
Good Allocations from Bad Estimates
by: Casacuberta, Sílvia, et al.
Published: (2026)
Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification
by: Sinshaw, Natnael Tilahun, et al.
Published: (2025)
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
by: Sahoo, Subramanyam, et al.
Published: (2026)
SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
by: Sahoo, Subramanyam, et al.
Published: (2026)
From Rattle to Roar: Optimizer Showdown for MambaStock on S&P 500
by: Chan, Alena, et al.
Published: (2025)
BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
by: Duan, Kaiwen, et al.
Published: (2025)
Vertical Federated Learning in Practice: The Good, the Bad, and the Ugly
by: Wu, Zhaomin, et al.
Published: (2025)
Sequence-Aware Inline Measurement Attribution for Good-Bad Wafer Diagnosis
by: Miyaguchi, Kohei, et al.
Published: (2025)
Agent Performing Autonomous Stock Trading under Good and Bad Situations
by: Luo, Yunfei, et al.
Published: (2023)
Simulation, Modelling and Classification of Wiki Contributors: Spotting The Good, The Bad, and The Ugly
by: Méndez, Silvia García, et al.
Published: (2024)
GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
by: Wang, Chenglong, et al.
Published: (2025)
FADE: Why Bad Descriptions Happen to Good Features
by: Puri, Bruno, et al.
Published: (2025)
Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning
by: Hoang, Huy, et al.
Published: (2023)
Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
by: Banerjee, Debangshu, et al.
Published: (2023)
Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models
by: Yang, Shidong, et al.
Published: (2026)
Adversarial Training of Reward Models
by: Bukharin, Alexander, et al.
Published: (2025)
Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining
by: Subramanyam, Anirudh, et al.
Published: (2025)
Good Actions Succeed, Bad Actions Generalize: A Case Study on Why RL Generalizes Better
by: Song, Meng
Published: (2025)
A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
by: Li, Mengqi, et al.
Published: (2025)
The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification
by: Baharav, Tavor Z., et al.
Published: (2025)
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
by: Wang, Haozhe, et al.
Published: (2026)
Keypoint Aware Masked Image Modelling
by: Krishna, Madhava, et al.
Published: (2024)
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
by: Mazumder, Aritra, et al.
Published: (2026)
Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect
by: Tang, Kaihua, et al.
Published: (2020)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
by: Bansal, Hritik, et al.
Published: (2024)
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
by: Liu, Zihan, et al.
Published: (2024)
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
by: Ho, Sy-Tuyen, et al.
Published: (2026)