:: Library Catalog

Bibliographic Details
Main Authors:	Imam, Mohamed Fazli; Lyu, Chenyang; Aji, Alham Fikri
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2501.10674

Similar Items

CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
by: Imam, Mohamed Fazli, et al.
Published: (2024)

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
by: Irawan, Patrick Amadeus, et al.
Published: (2026)

Vision Language Models are Confused Tourists
by: Irawan, Patrick Amadeus, et al.
Published: (2025)

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
by: Lyu, Chenyang, et al.
Published: (2024)

Maya: An Instruction Finetuned Multilingual Multimodal Model
by: Alam, Nahid, et al.
Published: (2024)

LLMs Can Compensate for Deficiencies in Visual Representations
by: Takishita, Sho, et al.
Published: (2025)

Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
by: Azadani, Mozhgan Nasr, et al.
Published: (2025)

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)

Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
by: Jiang, Ruixiang, et al.
Published: (2025)

LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
by: Aji, Alham Fikri, et al.
Published: (2025)

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)

Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
by: Lai, Zhengzhao, et al.
Published: (2025)

Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)

Understanding Alignment in Multimodal LLMs: A Comprehensive Study
by: Amirloo, Elmira, et al.
Published: (2024)

Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
by: Gan, Woody Haosheng, et al.
Published: (2025)

Reinforcing Multimodal Reasoning Against Visual Degradation
by: Liu, Rui, et al.
Published: (2026)

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
by: Chiu, Bo-Cheng, et al.
Published: (2025)

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
by: Li, Chaoyu, et al.
Published: (2024)

Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
by: Attia, Ahmed, et al.
Published: (2026)

Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
by: Chung, Jiwan, et al.
Published: (2024)

The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
by: Ghosh, Samrajnee, et al.
Published: (2025)

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
by: Wang, Zirui, et al.
Published: (2024)

Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
by: Zhou, Li, et al.
Published: (2025)

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)

Behind Maya: Building a Multilingual Vision Language Model
by: Alam, Nahid, et al.
Published: (2025)

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
by: Cheng, Zihui, et al.
Published: (2025)

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
by: Zhang, Jiarui, et al.
Published: (2023)

Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
by: Zhang, Huanyu, et al.
Published: (2025)

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
by: Shi, Weikang, et al.
Published: (2025)

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
by: Chung, Jiwan, et al.
Published: (2025)

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
by: Wang, Hongyu, et al.
Published: (2024)

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
by: Gan, Ziliang, et al.
Published: (2024)

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
by: Gong, Kaixiong, et al.
Published: (2024)

ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
by: Guo, Zichun, et al.
Published: (2026)

Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
by: Chevi, Rendi, et al.
Published: (2024)

Scientific Reasoning: Assessment of Multimodal Generative LLMs
by: Dreyer, Florian, et al.
Published: (2025)