Staff View: MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence :: Library Catalog

Bibliographic Details
Main Authors:	Raj, Hilton; AV, Vishnuram
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2606.02463

_version_	1866917556248379392
author	Raj, Hilton; AV, Vishnuram
author_facet	Raj, Hilton; AV, Vishnuram
contents	In 3D environments, embodied agents answer spatial questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps and camera poses. Existing vision-language models (VLMs) are fine-tuned on a single modality. This ignores the question semantics, which may favor a different modality than the fine-tuned one. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five modality adapters on a shared VLM backbone and learns a neural routing policy that selects the best adapter for each question at inference time. We encode each question with a frozen sentence transformer and pass the embedding through a small multilayer perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate on the Open3D-VQA benchmark and find that no single modality is universally optimal: point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_02463
institution	arXiv
publishDate	2026
record_format	arxiv
title	MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2606.02463

Similar Items