MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Joglekar, Manas, Chen, Jeremy, Wu, Gabriel, Yosinski, Jason, Wang, Jasmine, Barak, Boaz, Glaese, Amelia
Natura:	Preprint
Pubblicazione:	2025
Soggetti:	Machine Learning Artificial Intelligence
Accesso online:	https://arxiv.org/abs/2512.08093
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866908729592512512
author	Joglekar, Manas Chen, Jeremy Wu, Gabriel Yosinski, Jason Wang, Jasmine Barak, Boaz Glaese, Amelia
author_facet	Joglekar, Manas Chen, Jeremy Wu, Gabriel Yosinski, Jason Wang, Jasmine Barak, Boaz Glaese, Amelia
contents	Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported confession. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_08093
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Training LLMs for Honesty via Confessions Joglekar, Manas Chen, Jeremy Wu, Gabriel Yosinski, Jason Wang, Jasmine Barak, Boaz Glaese, Amelia Machine Learning Artificial Intelligence Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported confession. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.
title	Training LLMs for Honesty via Confessions
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2512.08093

Documenti analoghi