| Main Authors: | Serrano, Alex; Xing, Wen; Lindner, David; Jenner, Erik |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.02202 |
Similar Items
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
by: Gupta, Rohan, et al.
Published: (2025)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
by: Zolkowski, Artur, et al.
Published: (2025)
MISR: Measuring Instrumental Self-Reasoning in Frontier Models
by: Fronsdal, Kai, et al.
Published: (2024)
Evaluating Frontier Models for Stealth and Situational Awareness
by: Phuong, Mary, et al.
Published: (2025)
Obfuscated Activations Bypass LLM Latent-Space Defenses
by: Bailey, Luke, et al.
Published: (2024)
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
by: McGuinness, Max, et al.
Published: (2025)
Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)
Evaluating Frontier Models for Dangerous Capabilities
by: Phuong, Mary, et al.
Published: (2024)
Can Machines Learn the True Probabilities?
by: Kim, Jinsook
Published: (2024)
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
by: Lang, Leon, et al.
Published: (2024)
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
by: Jenner, Erik, et al.
Published: (2024)
Predicting Fault-Ride-Through Probability of Inverter-Dominated Power Grids using Machine Learning
by: Nauck, Christian, et al.
Published: (2024)
STARC: A General Framework For Quantifying Differences Between Reward Functions
by: Skalse, Joar, et al.
Published: (2023)
Pragmatist Intelligence: Where the Principle of Usefulness Can Take ANNs
by: Bikić, Antonio, et al.
Published: (2024)
Does Spatial Cognition Emerge in Frontier Models?
by: Ramakrishnan, Santhosh Kumar, et al.
Published: (2024)
An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces
by: Beeson, Alex, et al.
Published: (2024)
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
by: Lee, Heekyung, et al.
Published: (2025)
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
by: Farquhar, Sebastian, et al.
Published: (2025)
Learning to Represent Surroundings, Anticipate Motion and Take Informed Actions in Unstructured Environments
by: Zhi, Weiming
Published: (2024)
Analysis of Value Iteration Through Absolute Probability Sequences
by: Mustafin, Arsenii, et al.
Published: (2025)
PPGF: Probability Pattern-Guided Time Series Forecasting
by: Sun, Yanru, et al.
Published: (2025)
Adversaries Can Misuse Combinations of Safe Models
by: Jones, Erik, et al.
Published: (2024)
It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
by: Wu, Jun, et al.
Published: (2025)
Probability-density-aware Semi-supervised Learning
by: Liu, Shuyang, et al.
Published: (2024)
Exploration Hacking: Can LLMs Learn to Resist RL Training?
by: Jang, Eyon, et al.
Published: (2026)
Taking the GP Out of the Loop
by: Bafna, Mehul, et al.
Published: (2025)
Unveiling High-Probability Generalization in Decentralized SGD
by: Wang, Jiahuan, et al.
Published: (2026)
Stabilized Inverse Probability Weighting via Isotonic Calibration
by: van der Laan, Lars, et al.
Published: (2024)
Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning
by: Zhu, Yuanyang, et al.
Published: (2024)
Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
by: Fei, Jingru, et al.
Published: (2026)
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
by: Rocamonde, Juan, et al.
Published: (2023)
Joint Bayesian Parameter and Model Order Estimation for Low-Rank Probability Mass Tensors
by: Chege, Joseph K., et al.
Published: (2024)
Density-Informed VAE (DiVAE): Reliable Log-Prior Probability via Density Alignment Regularization
by: Alessi, Michele, et al.
Published: (2025)
Probably Approximately Correct Causal Discovery
by: Wei, Mian, et al.
Published: (2025)
Gram: Assessing sabotage propensities via automated alignment auditing
by: Lindner, David, et al.
Published: (2026)
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
by: Panfilov, Alexander, et al.
Published: (2025)
Sabotage Evaluations for Frontier Models
by: Benton, Joe, et al.
Published: (2024)
Multiclass Calibration Assessment and Recalibration of Probability Predictions via the Linear Log Odds Calibration Function
by: Vennos, Amy, et al.
Published: (2026)
Visual Exploration of Stopword Probabilities in Topic Models
by: Xue, Shuangjiang, et al.
Published: (2025)
Interpretable Probability Estimation with LLMs via Shapley Reconstruction
by: Nan, Yang, et al.
Published: (2026)