Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Rane, Sunayana, Lake, Brenden M., Griffiths, Thomas L.
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Artificial Intelligence
Online-Zugang:	https://arxiv.org/abs/2605.21683
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866913151093571584
author	Rane, Sunayana Lake, Brenden M. Griffiths, Thomas L.
author_facet	Rane, Sunayana Lake, Brenden M. Griffiths, Thomas L.
contents	Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which which models differ in meaningful and surprising ways from humans, including treating "words" as belonging to categories like "vehicles" and "clothing," identifying several "vegetable" category members as "fruit," and assigning exemplars from non-weapon categories to the "weapons" category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_21683
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Investigating Concept Alignment Using Implausible Category Members Rane, Sunayana Lake, Brenden M. Griffiths, Thomas L. Artificial Intelligence Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which which models differ in meaningful and surprising ways from humans, including treating "words" as belonging to categories like "vehicles" and "clothing," identifying several "vegetable" category members as "fruit," and assigning exemplars from non-weapon categories to the "weapons" category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.
title	Investigating Concept Alignment Using Implausible Category Members
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.21683

Ähnliche Einträge