Staff View: Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models :: Library Catalog

Bibliographic Details
Main Authors:	Leonard, Bridget; Murray, Scott O.
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Neurons and Cognition
Online Access:	https://arxiv.org/abs/2601.16378

_version_	1866918301421010944
author	Leonard, Bridget; Murray, Scott O.
author_facet	Leonard, Bridget; Murray, Scott O.
contents	Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B improves performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_16378
institution	arXiv
publishDate	2026
record_format	arxiv
title	Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Neurons and Cognition
url	https://arxiv.org/abs/2601.16378