Tabla de Contenidos: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Yao, Yang, Li, Lingyu, Song, Jiaxin, Chen, Chiyu, He, Zhenqi, Wang, Yixu, Wang, Xin, Gu, Tianle, Li, Jie, Teng, Yan, Wang, Yingchun
Formato:	Preprint
Publicado:	2025
Materias:	Computer Vision and Pattern Recognition Artificial Intelligence Computation and Language Machine Learning Multimedia
Acceso en línea:	https://arxiv.org/abs/2506.14805
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

Tabla de Contenidos:

As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.

Ejemplares similares