
Bibliographic Details
Main Authors:	Serrano, Alex; Xing, Wen; Lindner, David; Jenner, Erik
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.02202

Similar Items

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
by: Gupta, Rohan, et al.
Published: (2025)

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
by: Zolkowski, Artur, et al.
Published: (2025)

MISR: Measuring Instrumental Self-Reasoning in Frontier Models
by: Fronsdal, Kai, et al.
Published: (2024)

Evaluating Frontier Models for Stealth and Situational Awareness
by: Phuong, Mary, et al.
Published: (2025)

Obfuscated Activations Bypass LLM Latent-Space Defenses
by: Bailey, Luke, et al.
Published: (2024)

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
by: McGuinness, Max, et al.
Published: (2025)

Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)

Evaluating Frontier Models for Dangerous Capabilities
by: Phuong, Mary, et al.
Published: (2024)

Can Machines Learn the True Probabilities?
by: Kim, Jinsook
Published: (2024)

When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
by: Lang, Leon, et al.
Published: (2024)

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
by: Jenner, Erik, et al.
Published: (2024)

Predicting Fault-Ride-Through Probability of Inverter-Dominated Power Grids using Machine Learning
by: Nauck, Christian, et al.
Published: (2024)

STARC: A General Framework For Quantifying Differences Between Reward Functions
by: Skalse, Joar, et al.
Published: (2023)

Pragmatist Intelligence: Where the Principle of Usefulness Can Take ANNs
by: Bikić, Antonio, et al.
Published: (2024)

Does Spatial Cognition Emerge in Frontier Models?
by: Ramakrishnan, Santhosh Kumar, et al.
Published: (2024)

An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces
by: Beeson, Alex, et al.
Published: (2024)

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
by: Lee, Heekyung, et al.
Published: (2025)

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
by: Farquhar, Sebastian, et al.
Published: (2025)

Learning to Represent Surroundings, Anticipate Motion and Take Informed Actions in Unstructured Environments
by: Zhi, Weiming
Published: (2024)

Analysis of Value Iteration Through Absolute Probability Sequences
by: Mustafin, Arsenii, et al.
Published: (2025)

PPGF: Probability Pattern-Guided Time Series Forecasting
by: Sun, Yanru, et al.
Published: (2025)

Adversaries Can Misuse Combinations of Safe Models
by: Jones, Erik, et al.
Published: (2024)

It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
by: Wu, Jun, et al.
Published: (2025)

Probability-density-aware Semi-supervised Learning
by: Liu, Shuyang, et al.
Published: (2024)

Exploration Hacking: Can LLMs Learn to Resist RL Training?
by: Jang, Eyon, et al.
Published: (2026)

Taking the GP Out of the Loop
by: Bafna, Mehul, et al.
Published: (2025)

Unveiling High-Probability Generalization in Decentralized SGD
by: Wang, Jiahuan, et al.
Published: (2026)

Stabilized Inverse Probability Weighting via Isotonic Calibration
by: van der Laan, Lars, et al.
Published: (2024)

Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning
by: Zhu, Yuanyang, et al.
Published: (2024)

Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
by: Fei, Jingru, et al.
Published: (2026)

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
by: Rocamonde, Juan, et al.
Published: (2023)

Joint Bayesian Parameter and Model Order Estimation for Low-Rank Probability Mass Tensors
by: Chege, Joseph K., et al.
Published: (2024)

Density-Informed VAE (DiVAE): Reliable Log-Prior Probability via Density Alignment Regularization
by: Alessi, Michele, et al.
Published: (2025)

Probably Approximately Correct Causal Discovery
by: Wei, Mian, et al.
Published: (2025)

Gram: Assessing sabotage propensities via automated alignment auditing
by: Lindner, David, et al.
Published: (2026)

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
by: Panfilov, Alexander, et al.
Published: (2025)

Sabotage Evaluations for Frontier Models
by: Benton, Joe, et al.
Published: (2024)

Multiclass Calibration Assessment and Recalibration of Probability Predictions via the Linear Log Odds Calibration Function
by: Vennos, Amy, et al.
Published: (2026)

Visual Exploration of Stopword Probabilities in Topic Models
by: Xue, Shuangjiang, et al.
Published: (2025)

Interpretable Probability Estimation with LLMs via Shapley Reconstruction
by: Nan, Yang, et al.
Published: (2026)