:: Library Catalog

Bibliographic Details
Main Authors:	Kezins, Nikita, Ekka, Urbas, Berrang, Pascal, Arnaboldi, Luca
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.10901

Similar Items

3S-Attack: Spatial, Spectral and Semantic Invisible Backdoor Attack Against DNN Models
by: Yin, Jianyao, et al.
Published: (2025)

Automatic LLM Red Teaming
by: Belaire, Roman, et al.
Published: (2025)

PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers
by: Bastian, Julien, et al.
Published: (2026)

Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions
by: Arnaboldi, Luca, et al.
Published: (2024)

Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD
by: Arnaboldi, Luca, et al.
Published: (2023)

Link Stealing Attacks Against Inductive Graph Neural Networks
by: Wu, Yixin, et al.
Published: (2024)

Safe LLM-Controlled Robots with Formal Guarantees via Reachability Analysis
by: Hafez, Ahmad, et al.
Published: (2025)

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
by: Sharma, Mrinank, et al.
Published: (2025)

Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents
by: He, Pengfei, et al.
Published: (2026)

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs
by: Arnaboldi, Luca, et al.
Published: (2024)

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents
by: Dandi, Yatin, et al.
Published: (2024)

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
by: Pan, Jiazhen, et al.
Published: (2025)

Guardrails in Logit Space: Safety Token Regularization for LLM Alignment
by: Bach, Thong, et al.
Published: (2026)

Adaptive Instruction Composition for Automated LLM Red-Teaming
by: Zymet, Jesse, et al.
Published: (2026)

LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"
by: Sagar, Som, et al.
Published: (2024)

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
by: Li, Xinyu, et al.
Published: (2026)

MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
by: Schoepf, Stefan, et al.
Published: (2025)

Capability-Based Scaling Trends for LLM-Based Red-Teaming
by: Panfilov, Alexander, et al.
Published: (2025)

Abstractive Red-Teaming of Language Model Character
by: Rahn, Nate, et al.
Published: (2026)

Adversarial Robustness Guarantees for Quantum Classifiers
by: Dowling, Neil, et al.
Published: (2024)

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
by: Kwon, Minchan, et al.
Published: (2026)

Geometric Red-Teaming for Robotic Manipulation
by: Goel, Divyam, et al.
Published: (2025)

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods
by: Taheri, Hossein, et al.
Published: (2024)

From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
by: Dandi, Yatin, et al.
Published: (2026)

Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks
by: Arnaboldi, Luca, et al.
Published: (2025)

RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming
by: Zheng, Xiang, et al.
Published: (2025)

Explainable Clustering Beyond Worst-Case Guarantees
by: Fleissner, Maximilian, et al.
Published: (2024)

Interpretability Guarantees with Merlin-Arthur Classifiers
by: Wäldchen, Stephan, et al.
Published: (2022)

Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees
by: Hadad, Itamar, et al.
Published: (2026)

The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee
by: Dantas, Pierre, et al.
Published: (2025)

Quantifying Multimodal Capabilities: Formal Generalization Guarantees in Pairwise Metric Learning
by: Zhou, Richeng, et al.
Published: (2026)

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
by: Deng, Yihe, et al.
Published: (2025)

Red-Teaming Segment Anything Model
by: Jankowski, Krzysztof, et al.
Published: (2024)

ReactionTeam: Teaming Experts for Divergent Thinking Beyond Typical Reaction Patterns
by: Guo, Taicheng, et al.
Published: (2023)

Embodied Red Teaming for Auditing Robotic Foundation Models
by: Karnik, Sathwik, et al.
Published: (2024)

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
by: Pavlova, Maya, et al.
Published: (2024)

Red-Teaming for Inducing Societal Bias in Large Language Models
by: Luo, Chu Fei, et al.
Published: (2024)

UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning
by: Zhang, Jiawei, et al.
Published: (2025)

Efficient Evaluation of LLM Performance with Statistical Guarantees
by: Wu, Skyler, et al.
Published: (2026)