:: Library Catalog

Bibliographic Details
Main Authors:	Yi, Zhihang; Zhao, Jian; Lv, Jiancheng; Wang, Tao
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.10138

Similar Items

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
by: Kondic, Jovana, et al.
Published: (2026)

ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding
by: Xu, Zhengzhuo, et al.
Published: (2024)

On Pre-training of Multimodal Language Models Customized for Chart Understanding
by: Fan, Wan-Cyuan, et al.
Published: (2024)

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
by: Huang, Kung-Hsiang, et al.
Published: (2024)

In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding
by: Fan, Wan-Cyuan, et al.
Published: (2025)

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
by: Yilmaz, Nilay, et al.
Published: (2025)

ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering
by: Wu, Yifan, et al.
Published: (2024)

PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
by: Huang, Kui, et al.
Published: (2025)

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding
by: Wang, Xingqi, et al.
Published: (2025)

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
by: Ni, Feng, et al.
Published: (2025)

Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)

AskChart: Universal Chart Understanding through Textual Enhancement
by: Yang, Xudong, et al.
Published: (2024)

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
by: Fu, Chaoyou, et al.
Published: (2024)

VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)

ChartCap: Mitigating Hallucination of Dense Chart Captioning
by: Lim, Junyoung, et al.
Published: (2025)

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
by: Zhang, Jiarui, et al.
Published: (2025)

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
by: Masry, Ahmed, et al.
Published: (2024)

Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
by: Huang, Kung-Hsiang, et al.
Published: (2025)

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
by: Zhu, Yinglun, et al.
Published: (2025)

Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing
by: Ashqar, Huthaifa I., et al.
Published: (2024)

A Survey on Agentic Multimodal Large Language Models
by: Yao, Huanjin, et al.
Published: (2025)

VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
by: Palaskar, Shruti, et al.
Published: (2025)

A Survey on Benchmarks of Multimodal Large Language Models
by: Li, Jian, et al.
Published: (2024)

GRIT: Teaching MLLMs to Think with Images
by: Fan, Yue, et al.
Published: (2025)

MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
by: Wang, Hao, et al.
Published: (2025)

Large Multimodal Agents: A Survey
by: Xie, Junlin, et al.
Published: (2024)

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)

MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
by: Zhong, Ziqi, et al.
Published: (2025)

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
by: Chen, Liang, et al.
Published: (2024)

SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards
by: Hong, Jixiang, et al.
Published: (2025)

On the Limits of Token Reduction for Efficient Unified Vision Language Training
by: Chen, Siyi, et al.
Published: (2026)

Real-Time Multimodal Cognitive Assistant for Emergency Medical Services
by: Weerasinghe, Keshara, et al.
Published: (2024)

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)

Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
by: Anand, Dhruv, et al.
Published: (2025)

AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)

LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?
by: Yu, Zhuang, et al.
Published: (2026)

SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
by: Kim, Wonjoong, et al.
Published: (2024)

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
by: Ma, Yiyang, et al.
Published: (2024)