:: Library Catalog

Bibliographic Details
Main Authors:	Rahn, Nate; Qi, Allison; Griffin, Avery; Michala, Jonathan; Sleight, Henry; Jones, Erik
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.12318

Similar Items

Persona Vectors: Monitoring and Controlling Character Traits in Language Models
by: Chen, Runjin, et al.
Published: (2025)

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation
by: Canavan, Callum, et al.
Published: (2026)

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
by: Guo, Shiyuan, et al.
Published: (2025)

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
by: Ensign, Danielle, et al.
Published: (2025)

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
by: Nöther, Jonathan, et al.
Published: (2025)

Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
by: Rahn, Nate, et al.
Published: (2023)

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
by: Shilov, Igor, et al.
Published: (2025)

Red-Teaming for Inducing Societal Bias in Large Language Models
by: Luo, Chu Fei, et al.
Published: (2024)

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
by: Kaunismaa, Jackson, et al.
Published: (2026)

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
by: Li, Chunxiao, et al.
Published: (2026)

RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
by: Ding, Jiale, et al.
Published: (2025)

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
by: Pavlova, Maya, et al.
Published: (2024)

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
by: Youstra, Jack, et al.
Published: (2025)

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
by: Pan, Jiazhen, et al.
Published: (2025)

Automatic LLM Red Teaming
by: Belaire, Roman, et al.
Published: (2025)

Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets
by: Vourganas, Ioannis J., et al.
Published: (2026)

Best-of-N Jailbreaking
by: Hughes, John, et al.
Published: (2024)

Red-Teaming Segment Anything Model
by: Jankowski, Krzysztof, et al.
Published: (2024)

Embodied Red Teaming for Auditing Robotic Foundation Models
by: Karnik, Sathwik, et al.
Published: (2024)

Large Language Models Are Zero-Shot Time Series Forecasters
by: Gruver, Nate, et al.
Published: (2023)

LOCOST: State-Space Models for Long Document Abstractive Summarization
by: Bronnec, Florian Le, et al.
Published: (2024)

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
by: Hu, Kai, et al.
Published: (2025)

Geometric Red-Teaming for Robotic Manipulation
by: Goel, Divyam, et al.
Published: (2025)

From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)

BRIDO: Bringing Democratic Order to Abstractive Summarization
by: Lee, Junhyun, et al.
Published: (2025)

Recursive Abstractive Processing for Retrieval in Dynamic Datasets
by: Chucri, Charbel, et al.
Published: (2024)

RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming
by: Zheng, Xiang, et al.
Published: (2025)

LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"
by: Sagar, Som, et al.
Published: (2024)

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
by: Kezins, Nikita, et al.
Published: (2026)

Beyond Win Rates: A Clustering-Based Approach to Character Balance Analysis in Team-Based Games
by: Zhou, Haokun
Published: (2025)

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
by: Sarthi, Parth, et al.
Published: (2024)

Stress-Testing Model Specs Reveals Character Differences among Language Models
by: Zhang, Jifan, et al.
Published: (2025)

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
by: Tedeschi, Simone, et al.
Published: (2024)

From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL
by: Admoni, Sahar, et al.
Published: (2025)

Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
by: Liu, Yi, et al.
Published: (2024)

Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
by: Wang, Ren-Jian, et al.
Published: (2025)

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks
by: Chen, Zhaorun, et al.
Published: (2025)

Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
by: Rawat, Ambrish, et al.
Published: (2024)

Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor
by: Zakazov, Ivan, et al.
Published: (2025)

Abstractive summarization from Audio Transcription
by: Derkach, Ilia
Published: (2024)