:: Library Catalog

Bibliographic Details
Main Authors:	Jahan, Sohely; Sun, Ruimin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2512.09403

Similar Items

Safety Game: Inference-Time Alignment of Black-Box LLMs via Constrained Optimization
by: Nguyen, Tuan, et al.
Published: (2025)

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
by: Hasan, Munawar
Published: (2026)

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
by: Rayhan, Naheed, et al.
Published: (2026)

Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models!
by: Kotyan, Shashank, et al.
Published: (2024)

Turning Black Box into White Box: Dataset Distillation Leaks
by: Chen, Huajie, et al.
Published: (2026)

Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
by: Li, Changhao, et al.
Published: (2024)

Boundary Point Jailbreaking of Black-Box LLMs
by: Davies, Xander, et al.
Published: (2026)

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
by: Xiao, Yuxin, et al.
Published: (2025)

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
by: Fu, Yu, et al.
Published: (2026)

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
by: Bakman, Yavuz, et al.
Published: (2026)

BridgePure: Limited Protection Leakage Can Break Black-Box Data Protection
by: Wang, Yihan, et al.
Published: (2024)

Safety Filters for Black-Box Dynamical Systems by Learning Discriminating Hyperplanes
by: Lavanakul, Will, et al.
Published: (2024)

Enabling Fine-Grained Operating Points for Black-Box LLMs
by: Beyazit, Ege, et al.
Published: (2025)

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
by: Springer, Max, et al.
Published: (2026)

SODA: Semi On-Policy Black-Box Distillation for Large Language Models
by: Chen, Xiwen, et al.
Published: (2026)

Multilingual Safety Alignment via Self-Distillation
by: Qin, Ruiyang, et al.
Published: (2026)

Bayesian Safety Validation for Failure Probability Estimation of Black-Box Systems
by: Moss, Robert J., et al.
Published: (2023)

Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning
by: Sani, Numair, et al.
Published: (2020)

PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs
by: Chu, Jaewon, et al.
Published: (2025)

Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence
by: Hong, Hanbin, et al.
Published: (2023)

Does Alignment Tuning Really Break LLMs' Internal Confidence?
by: Oh, Hongseok, et al.
Published: (2024)

GLiRA: Black-Box Membership Inference Attack via Knowledge Distillation
by: Galichin, Andrey V., et al.
Published: (2024)

Beyond the Black Box: Interpretability of LLMs in Finance
by: Tatsat, Hariom, et al.
Published: (2025)

Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
by: Ma, Jing, et al.
Published: (2022)

Black-Box Forgetting
by: Kuwana, Yusuke, et al.
Published: (2024)

Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
by: Vega, Jason, et al.
Published: (2024)

Smoothing the Black-Box: Signed-Distance Supervision for Black-Box Model Copying
by: Jiménez, Rubén, et al.
Published: (2026)

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
by: Cho, Dongkyu Derek, et al.
Published: (2025)

Breaking the Reasoning Horizon in Entity Alignment Foundation Models
by: Cui, Yuanning, et al.
Published: (2026)

PCS: Perceived Confidence Scoring of Black Box LLMs with Metamorphic Relations
by: Salimian, Sina, et al.
Published: (2025)

SafePassage: High-Fidelity Information Extraction with Black Box LLMs
by: Barrow, Joe, et al.
Published: (2025)

ExecTune: Effective Steering of Black-Box LLMs with Guide Models
by: Lingam, Vijay, et al.
Published: (2026)

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
by: Agarwal, Krishiv, et al.
Published: (2026)

Auto-Tuning Safety Guardrails for Black-Box Large Language Models
by: Abdulkadir, Perry
Published: (2025)

FedAL: Black-Box Federated Knowledge Distillation Enabled by Adversarial Learning
by: Han, Pengchao, et al.
Published: (2023)

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)

In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)

Breaking the Black Box: Inherently Interpretable Physics-Constrained Machine Learning With Weighted Mixed-Effects for Imbalanced Seismic Data
by: Sreenath, Vemula, et al.
Published: (2025)

Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies
by: Deproost, Senne, et al.
Published: (2026)