Guardado en:
| Autores principales: | , , , , , |
|---|---|
| Formato: | Preprint |
| Publicado: |
2026
|
| Materias: | |
| Acceso en línea: | https://arxiv.org/abs/2602.12318 |
| Etiquetas: |
Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
|
| _version_ | 1866912901817696256 |
|---|---|
| author | Rahn, Nate Qi, Allison Griffin, Avery Michala, Jonathan Sleight, Henry Jones, Erik |
| author_facet | Rahn, Nate Qi, Allison Griffin, Avery Michala, Jonathan Sleight, Henry Jones, Erik |
| contents | We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2602_12318 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Abstractive Red-Teaming of Language Model Character Rahn, Nate Qi, Allison Griffin, Avery Michala, Jonathan Sleight, Henry Jones, Erik Machine Learning We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character. |
| title | Abstractive Red-Teaming of Language Model Character |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2602.12318 |