Bibliographic Details
Main Authors: Zeng, Changyu, Wang, Yifan, Wang, Zimu, Wang, Wei, Yang, Zhengni, Bao, Muyi, Xiao, Jiming, Nguyen, Anh, Yue, Yutao
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2509.16656
_version_ 1866908792885608448
author Zeng, Changyu
Wang, Yifan
Wang, Zimu
Wang, Wei
Yang, Zhengni
Bao, Muyi
Xiao, Jiming
Nguyen, Anh
Yue, Yutao
contents Recent advances in 2D multimodal large language models (MLLMs) have significantly improved performance on vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Moreover, existing 3D benchmarks often lack fine-grained annotations for numerical reasoning tasks, which limits MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities, designed to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and diverse question-answer pairs generated with NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate various state-of-the-art LLMs on NUMINA following the Chat-Scene framework and show that current LLMs struggle with multimodal numerical reasoning, particularly precise computations such as distance and volume estimation, which highlights the need for further advances in 3D models. The dataset and source code are available at https://github.com/fengshun124/NUMINA.
format Preprint
id arxiv_https___arxiv_org_abs_2509_16656
institution arXiv
publishDate 2025
record_format arxiv
title NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities
topic Artificial Intelligence
url https://arxiv.org/abs/2509.16656