Enregistré dans:
Détails bibliographiques
Auteurs principaux: Choi, Dasol, Kim, Eugenia, Noh, Jaewon, Seo, Sang, Kim, Eunmi, Oh, Myunggyo, Park, Yunjin, Kartono, Brigitta Jesica, Pichlmeier, Josef, Berndt, Helena, Mendu, Sai Krishna, Tungka, Glenn Johannes, Gökçe, Özlem, Gehlot, Suresh, Pratt, Katherine, Minnich, Amanda, Park, Haon
Format: Preprint
Publié: 2026
Sujets:
Accès en ligne:https://arxiv.org/abs/2605.05662
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
Table des matières:
  • Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench. a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.