Bibliographic Details
Main Authors: Zhao, Jiahao; Dong, Liwei
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2508.08243
_version_ 1866912551106772992
author Zhao, Jiahao
Dong, Liwei
author_facet Zhao, Jiahao
Dong, Liwei
contents Unlimited, or so-called helpful-only, language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to those of an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
format Preprint
id arxiv_https___arxiv_org_abs_2508_08243
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Jinx: Unlimited LLMs for Probing Alignment Failures
Zhao, Jiahao
Dong, Liwei
Computation and Language
title Jinx: Unlimited LLMs for Probing Alignment Failures
topic Computation and Language
url https://arxiv.org/abs/2508.08243