Bibliographic Details
Main Authors: Leonard, Bridget; Murray, Scott O.
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence; Neurons and Cognition
Online Access: https://arxiv.org/abs/2601.16378
author Leonard, Bridget
Murray, Scott O.
author_facet Leonard, Bridget
Murray, Scott O.
contents Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2601_16378
institution arXiv
publishDate 2026
record_format arxiv
title Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Neurons and Cognition
url https://arxiv.org/abs/2601.16378