:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Jun; Meng, Desen; Zhang, Zhengming; Huang, Zhenpeng; Wu, Tao; Wang, Limin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2412.04449

Similar Items

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
by: Luo, Yaxin, et al.
Published: (2024)

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
by: Wu, Shiwei, et al.
Published: (2024)

MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
by: Chaubey, Ashutosh, et al.
Published: (2026)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
by: Huang, Yushi, et al.
Published: (2025)

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
by: Chen, Zeren, et al.
Published: (2023)

MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction
by: Zhou, Heng, et al.
Published: (2024)

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
by: Bao, Zhijie, et al.
Published: (2026)

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
by: Huang, Zhenpeng, et al.
Published: (2026)

CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
by: Fang, I-Sheng, et al.
Published: (2025)

Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
by: Zhang, Huanyu, et al.
Published: (2025)

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)

Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
by: Shi, Baorong, et al.
Published: (2026)

Unhackable Temporal Rewarding for Scalable Video MLLMs
by: Yu, En, et al.
Published: (2025)

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)

MoPD: Mixture-of-Prompts Distillation for Vision-Language Models
by: Chen, Yang, et al.
Published: (2024)

Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
by: Yu, Tianyu, et al.
Published: (2023)

The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning
by: Chen, Renmiao, et al.
Published: (2026)

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
by: Jiang, Songtao, et al.
Published: (2024)

MoDification: Mixture of Depths Made Easy
by: Zhang, Chen, et al.
Published: (2024)

3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow
by: Ma, Yueen, et al.
Published: (2025)

Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
by: Ma, Tianyi, et al.
Published: (2025)

D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
by: Chang, Shuochen, et al.
Published: (2025)

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
by: Wang, Junyang, et al.
Published: (2023)

Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
by: Huang, Jen-Tse, et al.
Published: (2025)

MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
by: Su, Zhenpeng, et al.
Published: (2024)

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)

CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
by: Zhang, Xi, et al.
Published: (2025)

The Instinctive Bias: Spurious Images lead to Illusion in MLLMs
by: Han, Tianyang, et al.
Published: (2024)

SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)

MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers
by: Li, Sijia, et al.
Published: (2023)

NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
by: Yang, Enneng, et al.
Published: (2024)

Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)

MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
by: Liang, Yiqing, et al.
Published: (2025)