| Main Authors: | Rahn, Nate; Qi, Allison; Griffin, Avery; Michala, Jonathan; Sleight, Henry; Jones, Erik |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.12318 |
Similar Items
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
by: Chen, Runjin, et al.
Published: (2025)
Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation
by: Canavan, Callum, et al.
Published: (2026)
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
by: Guo, Shiyuan, et al.
Published: (2025)
The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
by: Ensign, Danielle, et al.
Published: (2025)
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
by: Nöther, Jonathan, et al.
Published: (2025)
Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
by: Rahn, Nate, et al.
Published: (2023)
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
by: Shilov, Igor, et al.
Published: (2025)
Red-Teaming for Inducing Societal Bias in Large Language Models
by: Luo, Chu Fei, et al.
Published: (2024)
Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
by: Kaunismaa, Jackson, et al.
Published: (2026)
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
by: Li, Chunxiao, et al.
Published: (2026)
RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
by: Ding, Jiale, et al.
Published: (2025)
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
by: Pavlova, Maya, et al.
Published: (2024)
Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
by: Youstra, Jack, et al.
Published: (2025)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
by: Pan, Jiazhen, et al.
Published: (2025)
Automatic LLM Red Teaming
by: Belaire, Roman, et al.
Published: (2025)
Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets
by: Vourganas, Ioannis J., et al.
Published: (2026)
Best-of-N Jailbreaking
by: Hughes, John, et al.
Published: (2024)
Red-Teaming Segment Anything Model
by: Jankowski, Krzysztof, et al.
Published: (2024)
Embodied Red Teaming for Auditing Robotic Foundation Models
by: Karnik, Sathwik, et al.
Published: (2024)
Large Language Models Are Zero-Shot Time Series Forecasters
by: Gruver, Nate, et al.
Published: (2023)
LOCOST: State-Space Models for Long Document Abstractive Summarization
by: Bronnec, Florian Le, et al.
Published: (2024)
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
by: Hu, Kai, et al.
Published: (2025)
Geometric Red-Teaming for Robotic Manipulation
by: Goel, Divyam, et al.
Published: (2025)
From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)
BRIDO: Bringing Democratic Order to Abstractive Summarization
by: Lee, Junhyun, et al.
Published: (2025)
Recursive Abstractive Processing for Retrieval in Dynamic Datasets
by: Chucri, Charbel, et al.
Published: (2024)
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming
by: Zheng, Xiang, et al.
Published: (2025)
LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"
by: Sagar, Som, et al.
Published: (2024)
Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
by: Kezins, Nikita, et al.
Published: (2026)
Beyond Win Rates: A Clustering-Based Approach to Character Balance Analysis in Team-Based Games
by: Zhou, Haokun
Published: (2025)
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
by: Sarthi, Parth, et al.
Published: (2024)
Stress-Testing Model Specs Reveals Character Differences among Language Models
by: Zhang, Jifan, et al.
Published: (2025)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
by: Tedeschi, Simone, et al.
Published: (2024)
From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL
by: Admoni, Sahar, et al.
Published: (2025)
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
by: Liu, Yi, et al.
Published: (2024)
Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
by: Wang, Ren-Jian, et al.
Published: (2025)
ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks
by: Chen, Zhaorun, et al.
Published: (2025)
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
by: Rawat, Ambrish, et al.
Published: (2024)
Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor
by: Zakazov, Ivan, et al.
Published: (2025)
Abstractive summarization from Audio Transcription
by: Derkach, Ilia
Published: (2024)