:: Library Catalog

Bibliographic Details
Main Authors:	Dreyer, Florian; Kolos, Ekaterina; Matiash, Daria
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.01064

Similar Items

SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
by: Kil, Jihyung, et al.
Published: (2024)

Generative Universal Verifier as Multimodal Meta-Reasoner
by: Zhang, Xinchen, et al.
Published: (2025)

Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
by: Jiang, Ruixiang, et al.
Published: (2025)

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
by: Chiu, Bo-Cheng, et al.
Published: (2025)

Cross-modal Information Flow in Multimodal Large Language Models
by: Zhang, Zhi, et al.
Published: (2024)

ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
by: Zhang, Leixin, et al.
Published: (2024)

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
by: Zhou, Shijie, et al.
Published: (2025)

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
by: Zhao, Xiangyu, et al.
Published: (2023)

PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
by: Saxena, Rohit, et al.
Published: (2025)

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
by: Pramanick, Shraman, et al.
Published: (2024)

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
by: Song, Tingyu, et al.
Published: (2025)

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
by: Fu, Xingyu, et al.
Published: (2025)

How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
by: Yu, Zhuoran, et al.
Published: (2025)

MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
by: Li, Zekun, et al.
Published: (2024)

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)

Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
by: Li, Yunxin, et al.
Published: (2023)

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
by: Leng, Jixuan, et al.
Published: (2025)

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
by: Caffagni, Davide, et al.
Published: (2024)

Multimodal Chain-of-Thought Reasoning in Language Models
by: Zhang, Zhuosheng, et al.
Published: (2023)

Multimodal Fact-Level Attribution for Verifiable Reasoning
by: Wan, David, et al.
Published: (2026)

Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
by: Khayatan, Pegah, et al.
Published: (2025)

SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
by: Guo, Ziyu, et al.
Published: (2025)

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
by: Papi, Sara, et al.
Published: (2025)

LLMs Meet Multimodal Generation and Editing: A Survey
by: He, Yingqing, et al.
Published: (2024)

Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging
by: Cai, Zhenyang, et al.
Published: (2024)

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
by: Cai, Zhenyang, et al.
Published: (2025)

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
by: Liao, Huanxuan, et al.
Published: (2026)

Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
by: Saxena, Rohit, et al.
Published: (2025)

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
by: Fu, Chaoyou, et al.
Published: (2024)

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
by: Batra, Hunar, et al.
Published: (2025)

MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
by: Jiang, Yulun, et al.
Published: (2025)

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
by: Wang, Yuqing, et al.
Published: (2023)

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
by: Burgess, James, et al.
Published: (2025)

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
by: Sarto, Sara, et al.
Published: (2025)

Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring
by: Onsu, Murat Arda, et al.
Published: (2025)

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
by: Patil, Vaidehi, et al.
Published: (2025)

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
by: Zhang, Jiarui, et al.
Published: (2024)