:: Library Catalog

Bibliographic Details
Main Authors:	Northen, Trent R, Wang, Mingxun
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.09154

Similar Items

Evaluating Alignment of Behavioral Dispositions in LLMs
by: Taubenfeld, Amir, et al.
Published: (2026)

PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
by: Huo, Jiahao, et al.
Published: (2025)

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
by: Laurent, Jon M, et al.
Published: (2026)

TorchOpera: A Compound AI System for LLM Safety
by: Han, Shanshan, et al.
Published: (2024)

SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
by: Zhu, Xiangyang, et al.
Published: (2025)

Narrative Landscape: Mapping Narrative Dispositions Across LLMs
by: Jung, Donghoon, et al.
Published: (2026)

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
by: Ren, Richard, et al.
Published: (2024)

AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
by: Suzuki, Hisami, et al.
Published: (2025)

The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
by: Yong, Zheng-Xin, et al.
Published: (2025)

Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)

When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
by: Anonto, Riad Ahmed, et al.
Published: (2025)

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
by: Zhang, Genghan, et al.
Published: (2025)

Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
by: Li, Haonan, et al.
Published: (2024)

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
by: Elganayni, Mohamed Hesham, et al.
Published: (2026)

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety
by: Rios-Sialer, Ian
Published: (2026)

Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
by: Lee, Seongmin, et al.
Published: (2025)

LLM-Detector: Improving AI-Generated Chinese Text Detection with Open-Source LLM Instruction Tuning
by: Wang, Rongsheng, et al.
Published: (2024)

Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
by: Hsu, Po-Chun, et al.
Published: (2026)

SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems
by: Hazra, Rima, et al.
Published: (2026)

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
by: Bonagiri, Vamshi Krishna, et al.
Published: (2025)

DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
by: Liu, Xinyi, et al.
Published: (2025)

Mic Drop or Data Flop? Evaluating the Fitness for Purpose of AI Voice Interviewers for Data Collection within Quantitative & Qualitative Research Contexts
by: Tirumala, Shreyas, et al.
Published: (2025)

LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators
by: Shah, Cheril, et al.
Published: (2025)

Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLMs Overestimate Their Own Effectiveness
by: Feroz, Faisal, et al.
Published: (2026)

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
by: Ghosh, Shaona, et al.
Published: (2024)

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
by: Schwinn, Leo, et al.
Published: (2026)

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
by: Zhao, Weixiang, et al.
Published: (2025)

AI Safety in Generative AI Large Language Models: A Survey
by: Chua, Jaymari, et al.
Published: (2024)

Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
by: Su, Tiancheng, et al.
Published: (2025)

Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement
by: Cheng, Zihao, et al.
Published: (2024)

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
by: Wu, Tong, et al.
Published: (2024)

Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
by: Wu, Wanxing, et al.
Published: (2026)

Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
by: Ghosh, Shaona, et al.
Published: (2025)

RAM: Towards an Ever-Improving Memory System by Learning from Communications
by: Li, Jiaqi, et al.
Published: (2024)

Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
by: Brito, Iago Alves, et al.
Published: (2026)

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
by: Hou, Yutao, et al.
Published: (2026)

SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
by: Li, Jing-Jing, et al.
Published: (2024)

Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
by: Goloburda, Maiya, et al.
Published: (2025)

Measuring Political Preferences in AI Systems: An Integrative Approach
by: Rozado, David
Published: (2025)