| Main Authors: | Jahan, Sohely, Sun, Ruimin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.09403 |
Similar Items
Safety Game: Inference-Time Alignment of Black-Box LLMs via Constrained Optimization
by: Nguyen, Tuan, et al.
Published: (2025)
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
by: Hasan, Munawar
Published: (2026)
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
by: Rayhan, Naheed, et al.
Published: (2026)
Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models!
by: Kotyan, Shashank, et al.
Published: (2024)
Turning Black Box into White Box: Dataset Distillation Leaks
by: Chen, Huajie, et al.
Published: (2026)
Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
by: Li, Changhao, et al.
Published: (2024)
Boundary Point Jailbreaking of Black-Box LLMs
by: Davies, Xander, et al.
Published: (2026)
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
by: Xiao, Yuxin, et al.
Published: (2025)
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
by: Fu, Yu, et al.
Published: (2026)
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
by: Bakman, Yavuz, et al.
Published: (2026)
BridgePure: Limited Protection Leakage Can Break Black-Box Data Protection
by: Wang, Yihan, et al.
Published: (2024)
Safety Filters for Black-Box Dynamical Systems by Learning Discriminating Hyperplanes
by: Lavanakul, Will, et al.
Published: (2024)
Enabling Fine-Grained Operating Points for Black-Box LLMs
by: Beyazit, Ege, et al.
Published: (2025)
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
by: Springer, Max, et al.
Published: (2026)
SODA: Semi On-Policy Black-Box Distillation for Large Language Models
by: Chen, Xiwen, et al.
Published: (2026)
Multilingual Safety Alignment via Self-Distillation
by: Qin, Ruiyang, et al.
Published: (2026)
Bayesian Safety Validation for Failure Probability Estimation of Black-Box Systems
by: Moss, Robert J., et al.
Published: (2023)
Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning
by: Sani, Numair, et al.
Published: (2020)
PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs
by: Chu, Jaewon, et al.
Published: (2025)
Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence
by: Hong, Hanbin, et al.
Published: (2023)
Does Alignment Tuning Really Break LLMs' Internal Confidence?
by: Oh, Hongseok, et al.
Published: (2024)
GLiRA: Black-Box Membership Inference Attack via Knowledge Distillation
by: Galichin, Andrey V., et al.
Published: (2024)
Beyond the Black Box: Interpretability of LLMs in Finance
by: Tatsat, Hariom, et al.
Published: (2025)
Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
by: Ma, Jing, et al.
Published: (2022)
Black-Box Forgetting
by: Kuwana, Yusuke, et al.
Published: (2024)
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
by: Vega, Jason, et al.
Published: (2024)
Smoothing the Black-Box: Signed-Distance Supervision for Black-Box Model Copying
by: Jiménez, Rubén, et al.
Published: (2026)
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
by: Cho, Dongkyu Derek, et al.
Published: (2025)
Breaking the Reasoning Horizon in Entity Alignment Foundation Models
by: Cui, Yuanning, et al.
Published: (2026)
PCS: Perceived Confidence Scoring of Black Box LLMs with Metamorphic Relations
by: Salimian, Sina, et al.
Published: (2025)
SafePassage: High-Fidelity Information Extraction with Black Box LLMs
by: Barrow, Joe, et al.
Published: (2025)
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
by: Lingam, Vijay, et al.
Published: (2026)
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
by: Agarwal, Krishiv, et al.
Published: (2026)
Auto-Tuning Safety Guardrails for Black-Box Large Language Models
by: Abdulkadir, Perry
Published: (2025)
FedAL: Black-Box Federated Knowledge Distillation Enabled by Adversarial Learning
by: Han, Pengchao, et al.
Published: (2023)
Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)
In-Context Explainers: Harnessing LLMs for Explaining Black Box Models
by: Kroeger, Nicholas, et al.
Published: (2023)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)
Breaking the Black Box: Inherently Interpretable Physics-Constrained Machine Learning With Weighted Mixed-Effects for Imbalanced Seismic Data
by: Sreenath, Vemula, et al.
Published: (2025)
Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies
by: Deproost, Senne, et al.
Published: (2026)