Staff View: Because we have LLMs, we Can and Should Pursue Agentic Interpretability :: Library Catalog

Bibliographic Details
Main Authors:	Kim, Been; Hewitt, John; Nanda, Neel; Fiedel, Noah; Tafjord, Oyvind
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.12152

_version_	1866912429803307008
author	Kim, Been; Hewitt, John; Nanda, Neel; Fiedel, Noah; Tafjord, Oyvind
author_facet	Kim, Been; Hewitt, John; Nanda, Neel; Fiedel, Noah; Tafjord, Oyvind
contents	The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional 'inspective' interpretability methods (opening the black box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans' mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its 'human-entangled-in-the-loop' nature (human responses are an integral part of the algorithm), which makes design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_12152
institution	arXiv
publishDate	2025
record_format	arxiv
title	Because we have LLMs, we Can and Should Pursue Agentic Interpretability
topic	Artificial Intelligence
url	https://arxiv.org/abs/2506.12152

Similar Items