Staff View: Mull-Tokens: Modality-Agnostic Latent Thinking :: Library Catalog

Bibliographic Details
Main Authors:	Ray, Arijit; Abdelkader, Ahmed; Mao, Chengzhi; Plummer, Bryan A.; Saenko, Kate; Krishna, Ranjay; Guibas, Leonidas; Chu, Wen-Sheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.10941

_version_	1866913077642919936
author	Ray, Arijit; Abdelkader, Ahmed; Mao, Chengzhi; Plummer, Bryan A.; Saenko, Kate; Krishna, Ranjay; Guibas, Leonidas; Chu, Wen-Sheng
author_facet	Ray, Arijit; Abdelkader, Ahmed; Mao, Chengzhi; Plummer, Bryan A.; Saenko, Kate; Krishna, Ranjay; Guibas, Leonidas; Chu, Wen-Sheng
contents	Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple solution to abstractly think in multiple modalities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_10941
institution	arXiv
publishDate	2025
record_format	arxiv
title	Mull-Tokens: Modality-Agnostic Latent Thinking
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2512.10941
