Bibliographic Details
Main Authors: Raj, Hilton; AV, Vishnuram
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2606.02463
_version_ 1866917556248379392
author Raj, Hilton
AV, Vishnuram
author_facet Raj, Hilton
AV, Vishnuram
contents In 3D environments, embodied agents answer spatial questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps, and camera poses. Existing vision-language models (VLMs) are fine-tuned on a single modality, which ignores question semantics that may favor a different modality from the one used for fine-tuning. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five modality adapters on a shared VLM backbone and learns a neural routing policy that selects the best adapter for each question at inference time. We encode each question with a frozen sentence transformer and pass the embedding through a small multi-layer perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our method on the Open3D-VQA benchmark, and the results show that no single modality is universally optimal: point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a random-forest ablation (43.5%), while making only a single adapter call per question.
format Preprint
id arxiv_https___arxiv_org_abs_2606_02463
institution arXiv
publishDate 2026
record_format arxiv
title MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2606.02463
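The routing step described in the abstract (question embedding, small MLP, one adapter chosen per question) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the adapter names, the embedding size of 384, the hidden size of 64, and the helper names init_router, route, and oracle_agreement are all hypothetical, the weights are random rather than trained on oracle labels, and random vectors stand in for the output of a frozen sentence encoder.

```python
import numpy as np

# Hypothetical adapter names: the abstract says there are five adapters
# but does not list them.
ADAPTERS = ["text", "rgb", "point_cloud", "depth", "camera_pose"]


def init_router(embed_dim, hidden_dim, n_adapters, seed=0):
    """Create an untrained two-layer MLP (random weights, zero biases)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, n_adapters)),
        "b2": np.zeros(n_adapters),
    }


def route(params, question_embeddings):
    """Return one adapter index per question (argmax over the MLP logits)."""
    hidden = np.maximum(0.0, question_embeddings @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    return logits.argmax(axis=1)


def oracle_agreement(predicted, oracle):
    """Fraction of questions where the router picks the oracle-best adapter."""
    return float(np.mean(np.asarray(predicted) == np.asarray(oracle)))


# Stand-in embeddings; a real pipeline would obtain these from a frozen
# sentence encoder (assumed dimension: 384).
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(4, 384))

params = init_router(embed_dim=384, hidden_dim=64, n_adapters=len(ADAPTERS))
choices = route(params, embeddings)
selected = [ADAPTERS[i] for i in choices]  # one adapter call per question
```

In a setup like this, training would fit the MLP so that its argmax matches the adapter that an oracle comparison found most accurate for each question, and oracle_agreement would then report how often the two coincide.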