| Main Authors: | Imam, Mohamed Fazli, Lyu, Chenyang, Aji, Alham Fikri |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.10674 |
Similar Items
CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
by: Imam, Mohamed Fazli, et al.
Published: (2024)
LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
by: Irawan, Patrick Amadeus, et al.
Published: (2026)
Vision Language Models are Confused Tourists
by: Irawan, Patrick Amadeus, et al.
Published: (2025)
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
by: Lyu, Chenyang, et al.
Published: (2024)
Maya: An Instruction Finetuned Multilingual Multimodal Model
by: Alam, Nahid, et al.
Published: (2024)
LLMs Can Compensate for Deficiencies in Visual Representations
by: Takishita, Sho, et al.
Published: (2025)
Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs
by: Azadani, Mozhgan Nasr, et al.
Published: (2025)
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
by: Jiang, Ruixiang, et al.
Published: (2025)
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
by: Aji, Alham Fikri, et al.
Published: (2025)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)
Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
by: Lai, Zhengzhao, et al.
Published: (2025)
Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
by: Amirloo, Elmira, et al.
Published: (2024)
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
by: Gan, Woody Haosheng, et al.
Published: (2025)
Reinforcing Multimodal Reasoning Against Visual Degradation
by: Liu, Rui, et al.
Published: (2026)
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
by: Chiu, Bo-Cheng, et al.
Published: (2025)
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
by: Li, Chaoyu, et al.
Published: (2024)
Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
by: Attia, Ahmed, et al.
Published: (2026)
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
by: Chung, Jiwan, et al.
Published: (2024)
The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
by: Ghosh, Samrajnee, et al.
Published: (2025)
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
by: Wang, Zirui, et al.
Published: (2024)
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
by: Zhou, Li, et al.
Published: (2025)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)
Behind Maya: Building a Multilingual Vision Language Model
by: Alam, Nahid, et al.
Published: (2025)
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
by: Cheng, Zihui, et al.
Published: (2025)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
by: Zhang, Jiarui, et al.
Published: (2023)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
by: Zhang, Huanyu, et al.
Published: (2025)
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
by: Shi, Weikang, et al.
Published: (2025)
v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
by: Chung, Jiwan, et al.
Published: (2025)
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
by: Wang, Hongyu, et al.
Published: (2024)
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
by: Gan, Ziliang, et al.
Published: (2024)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
by: Gong, Kaixiong, et al.
Published: (2024)
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
by: Guo, Zichun, et al.
Published: (2026)
Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
by: Chevi, Rendi, et al.
Published: (2024)
Scientific Reasoning: Assessment of Multimodal Generative LLMs
by: Dreyer, Florian, et al.
Published: (2025)